
Showing papers presented at "American Medical Informatics Association Annual Symposium in 2012"


Proceedings Article
03 Nov 2012
TL;DR: A novel approach for ICU patient risk stratification is proposed by combining the learned "topic" structure of clinical concepts extracted from the unstructured nursing notes with physiologic data (from SAPS-I) for hospital mortality prediction.
Abstract: We propose a novel approach for ICU patient risk stratification by combining the learned "topic" structure of clinical concepts (represented by UMLS codes) extracted from the unstructured nursing notes with physiologic data (from SAPS-I) for hospital mortality prediction. We used Hierarchical Dirichlet Processes (HDP), a non-parametric topic modeling technique, to automatically discover "topics" as shared groups of co-occurring UMLS clinical concepts. We evaluated the potential utility of the inferred topic structure in predicting hospital mortality using the nursing notes of 14,739 adult ICU patients (mortality 14.6%) from the MIMIC II database. Our results indicate that learned topic structure from the first 24-hour ICU nursing notes significantly improved the performance of the SAPS-I algorithm for hospital mortality prediction. The AUC for predicting hospital mortality from the first 24 hours of physiologic data and nursing text notes was 0.82. Using the physiologic data alone with the SAPS-I algorithm, an AUC of 0.72 was achieved. Thus, the clinical topics that were extracted and used to augment the SAPS-I algorithm significantly improved the performance of the baseline algorithm.

103 citations
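
A minimal sketch of the pipeline this abstract describes, assuming gensim's HdpModel and scikit-learn's LogisticRegression; the UMLS codes, SAPS-I scores, and outcome labels below are invented stand-ins, not the study's MIMIC II data:

```python
# Toy sketch: HDP topics over UMLS-coded notes, concatenated with a
# severity score, then logistic regression for mortality. All data invented.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import HdpModel
from sklearn.linear_model import LogisticRegression

notes = [  # each patient's first-24h nursing notes reduced to UMLS codes
    ["C0020538", "C0011849", "C0032285"],
    ["C0032285", "C0242379", "C0020538"],
    ["C0011849", "C0011849", "C0032285"],
    ["C0242379", "C0032285", "C0020538"],
]
saps = np.array([[14.0], [22.0], [9.0], [30.0]])  # SAPS-I scores (invented)
died = np.array([0, 1, 0, 1])                     # hospital mortality labels

dictionary = Dictionary(notes)
corpus = [dictionary.doc2bow(doc) for doc in notes]
hdp = HdpModel(corpus, id2word=dictionary)  # topics discovered, not fixed a priori

K = 10  # truncate the unbounded HDP topic space to a fixed-width feature vector
def topic_vector(bow):
    vec = np.zeros(K)
    for topic_id, prob in hdp[bow]:
        if topic_id < K:
            vec[topic_id] = prob
    return vec

X = np.hstack([np.array([topic_vector(b) for b in corpus]), saps])
model = LogisticRegression().fit(X, died)  # topic features augment the severity score
print(model.predict_proba(X)[:, 1])        # in-sample risk estimates
```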


Proceedings Article
03 Nov 2012
TL;DR: Diabetes risk forecasting using data from EMR is innovative and has the potential to identify, automatically, high-risk populations for early intervention with life style modifications such as diet and exercise to prevent or delay the development of type 2 diabetes.
Abstract: Objective: To test the feasibility of using data collected in electronic medical records for development of effective models for diabetes risk forecasting.

98 citations


Proceedings Article
01 Jan 2012
TL;DR: The proposed systematic framework can identify complementary risk factors that are not in the existing known factors and can better predict the onset of HF, and those additional risk factors are confirmed to be clinically meaningful by a cardiologist.
Abstract: Background: The ability to identify the risk factors related to an adverse condition, e.g., heart failure (HF) diagnosis, is very important for improving care quality and reducing cost. Existing approaches for risk factor identification are either knowledge driven (from guidelines or literature) or data driven (from observational data). No existing method provides a model to effectively combine expert knowledge with data-driven insight for risk factor identification.

87 citations


Proceedings Article
Adam Perer1, Jimeng Sun1
03 Nov 2012
TL;DR: MatrixFlow as mentioned in this paper is a visual analytic system that takes clinical event sequences of patients as input, constructs time-evolving networks and visualizes them as a temporal flow of matrices.
Abstract: OBJECTIVE To develop a visual analytic system to help medical professionals improve disease diagnosis by providing insights for understanding disease progression. METHODS We develop MatrixFlow, a visual analytic system that takes clinical event sequences of patients as input, constructs time-evolving networks and visualizes them as a temporal flow of matrices. MatrixFlow provides several interactive features for analysis: 1) one can sort the events based on the similarity in order to accentuate underlying cluster patterns among those events; 2) one can compare co-occurrence events over time and across cohorts through additional line graph visualization. RESULTS MatrixFlow is applied to visualize heart failure (HF) symptom events extracted from a large cohort of HF cases and controls (n=50,625), which allows medical experts to reach insights involving temporal patterns and clusters of interest, and compare cohorts in novel ways that may lead to improved disease diagnoses. CONCLUSIONS MatrixFlow is an interactive visual analytic system that allows users to quickly discover patterns in clinical event sequences. By unearthing the patterns hidden within and displaying them to medical experts, users become empowered to make decisions influenced by historical patterns.

72 citations
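
The data structure MatrixFlow renders, as described above, is a sequence of per-window co-occurrence matrices. A small NumPy sketch with invented symptom events (not the authors' code):

```python
# Sketch of MatrixFlow's input structure: one symptom co-occurrence matrix
# per time window, forming a "temporal flow" of matrices. Events invented.
import numpy as np

events = {
    "p1": [(1, "dyspnea"), (1, "edema"), (30, "fatigue"), (31, "edema")],
    "p2": [(2, "fatigue"), (2, "dyspnea"), (32, "dyspnea"), (33, "edema")],
}
symptoms = ["dyspnea", "edema", "fatigue"]
idx = {s: i for i, s in enumerate(symptoms)}
windows = [(0, 15), (15, 45)]  # each time slice yields one matrix

flow = []
for lo, hi in windows:
    m = np.zeros((len(symptoms), len(symptoms)))
    for evs in events.values():
        present = {s for day, s in evs if lo <= day < hi}
        for a in present:
            for b in present:
                if a != b:
                    m[idx[a], idx[b]] += 1  # count within-window co-occurrence
    flow.append(m)

for (lo, hi), m in zip(windows, flow):
    print(f"days [{lo}, {hi}):\n{m}")
```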


Proceedings Article
03 Nov 2012
TL;DR: It is suggested that accurate identification of clinical abbreviations is a challenging task and that more advanced abbreviation recognition modules might improve existing clinical NLP systems.
Abstract: Clinical Natural Language Processing (NLP) systems extract clinical information from narrative clinical texts in many settings. Previous research mentions the challenges of handling abbreviations in clinical texts, but provides little insight into how well current NLP systems correctly recognize and interpret abbreviations. In this paper, we compared the performance of three existing clinical NLP systems in handling abbreviations: MetaMap, MedLEE, and cTAKES. The evaluation used an expert-annotated gold standard set of clinical documents (derived from 32 de-identified patient discharge summaries) containing 1,112 abbreviations. The existing NLP systems achieved suboptimal performance in abbreviation identification, with F-scores ranging from 0.165 to 0.601. MedLEE achieved the best F-score of 0.601 for all abbreviations and 0.705 for clinically relevant abbreviations. This study suggested that accurate identification of clinical abbreviations is a challenging task and that more advanced abbreviation recognition modules might improve existing clinical NLP systems.

71 citations


Proceedings Article
01 Jan 2012
TL;DR: The construction of three annotated corpora to serve as gold standards for medical natural language processing (NLP) tasks are presented and their annotation schemas are aligned with a large scale, NIH-funded clinical text annotation project.
Abstract: We present the construction of three annotated corpora to serve as gold standards for medical natural language processing (NLP) tasks. Clinical notes from the medical record, clinical trial announcements, and FDA drug labels are annotated. We report high inter-annotator agreement (overall F-measures between 0.8467 and 0.9176) for the annotation of Personal Health Information (PHI) elements for a de-identification task, and of medications, diseases/disorders, and signs/symptoms for an information extraction (IE) task. The annotated corpora of clinical trials and FDA labels will be publicly released. To facilitate translational NLP tasks that require cross-corpora interoperability (e.g., clinical trial eligibility screening), their annotation schemas are aligned with a large-scale, NIH-funded clinical text annotation project.

68 citations


Proceedings Article
01 Jan 2012
TL;DR: Electronic health records contain important data elements for detection of novel adverse drug reactions, genotype/phenotype identification and psychosocial factor analysis, and the role of each of these as risk factors for suicidality warrants further investigation.
Abstract: Electronic health records contain important data elements for detection of novel adverse drug reactions, genotype/phenotype identification and psychosocial factor analysis, and the role of each of these as risk factors for suicidality warrants further investigation. Suicide and suicidal ideation are documented in clinical narratives. The specific purpose of this study was to define an algorithm for automated detection of this serious event. We found that ICD-9 E-Codes had the lowest positive predictive value: 0.55 (90% CI: 0.42-0.67), while combining ICD-9 and NLP had the best PPV: 0.97 (90% CI: 0.92-0.99). A qualitative analysis and classification of the types of errors by ICD-9 and NLP automated coding compared to manual review are also discussed.

68 citations


Proceedings Article
03 Nov 2012
TL;DR: The objective was to detect the presence of sepsis soon after the patient visits the emergency department using Dynamic Bayesian Networks, a temporal probabilistic technique to model a system whose state changes over time.
Abstract: Sepsis is a systemic inflammatory state due to an infection, and is associated with very high mortality and morbidity. Early diagnosis and prompt antibiotic and supportive therapy is associated with improved outcomes. Our objective was to detect the presence of sepsis soon after the patient visits the emergency department. We used Dynamic Bayesian Networks, a temporal probabilistic technique to model a system whose state changes over time. We built, trained and tested the model using data of 3,100 patients admitted to the emergency department, and measured the accuracy of detecting sepsis using data collected within the first 3 hours, 6 hours, 12 hours and 24 hours after admission. The area under the curve was 0.911, 0.915, 0.937 and 0.944 respectively. We describe the data, data preparation techniques, model, results, various statistical measures and the limitations of our experiments. We also briefly discuss techniques to improve accuracy, and the generalizability of our methods to other diseases.

64 citations
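
A sketch of the evaluation design only: AUC measured from data available at 3, 6, 12, and 24 hours. A plain logistic regression on simulated data stands in here for the paper's Dynamic Bayesian Network:

```python
# Evaluation-design sketch only: AUC for detection using data available at
# 3/6/12/24 h. Logistic regression stands in for the paper's DBN; data simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
sepsis = rng.integers(0, 2, n)  # simulated outcome

for hours in (3, 6, 12, 24):
    # Pretend vitals/labs: longer observation gives a less noisy signal.
    X = sepsis[:, None] + rng.normal(0.0, 24.0 / hours, (n, 4))
    X_tr, X_te, y_tr, y_te = train_test_split(X, sepsis, random_state=0)
    p = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{hours:>2} h window: AUC = {roc_auc_score(y_te, p):.3f}")
```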


Proceedings Article
01 Jan 2012
TL;DR: The use of EpiDEA is demonstrated for cohort identification through use of an intuitive visual query interface that can be directly used by clinical researchers.
Abstract: Sudden Unexpected Death in Epilepsy (SUDEP) is a poorly understood phenomenon. Patient cohorts to power statistical studies in SUDEP need to be drawn from multiple centers due to the low rate of reported SUDEP incidences. But the current practice of manual chart review of Epilepsy Monitoring Units (EMU) patient discharge summaries is time-consuming, tedious, and not scalable for large studies. To address this challenge in the multi-center NIH-funded Prevention and Risk Identification of SUDEP Mortality (PRISM) Project, we have developed the Epilepsy Data Extraction and Annotation (EpiDEA) system for effective processing of discharge summaries. EpiDEA uses a novel Epilepsy and Seizure Ontology (EpSO), which has been developed based on the International League Against Epilepsy (ILAE) classification system, as the core knowledge resource. By extending the cTAKES natural language processing tool developed at the Mayo Clinic, EpiDEA implements specialized functions to address the unique challenges of processing epilepsy and seizure-related clinical free text in discharge summaries. The EpiDEA system was evaluated on a corpus of 104 discharge summaries from the University Hospitals Case Medical Center EMU and achieved an overall precision of 93.59% and recall of 84.01% with an F-measure of 88.53%. The results were compared against a gold standard created by two epileptologists. We demonstrate the use of EpiDEA for cohort identification through use of an intuitive visual query interface that can be directly used by clinical researchers.

60 citations


Proceedings Article
03 Nov 2012
TL;DR: This paper presents the structure and two case studies of a framework that has provided the ability to create a number of decision support applications that are dependent on the integration of previous enterprise-wide data in addition to a patient’s current information in the EMR.
Abstract: The enormous amount of data being collected by electronic medical records (EMR) has found additional value when integrated and stored in data warehouses. The enterprise data warehouse (EDW) allows all data from an organization with numerous inpatient and outpatient facilities to be integrated and analyzed. We have found the EDW at Intermountain Healthcare to not only be an essential tool for management and strategic decision making, but also for patient specific clinical decision support. This paper presents the structure and two case studies of a framework that has provided us the ability to create a number of decision support applications that are dependent on the integration of previous enterprise-wide data in addition to a patient's current information in the EMR.

60 citations


Proceedings Article
03 Nov 2012
TL;DR: It is found that the use of HAND-IT led to fewer transition breakdowns, greater tool resilience, and likely led to better learning outcomes for less-experienced clinicians when compared to the current tool.
Abstract: Successful handoffs ensure smooth, efficient and safe patient care transitions. Tools and systems designed for standardization of clinician handoffs often focus on ensuring the communication activity during transitions, with limited support for preparatory activities such as information seeking and organization. We designed and evaluated a Handoff Intervention Tool (HAND-IT) based on a checklist-inspired, body-system format allowing structured information organization, and a problem-case narrative format allowing temporal description of patient care events. Based on a pre-post prospective study using a multi-method analysis, we evaluated the effectiveness of HAND-IT as a documentation tool. We found that the use of HAND-IT led to fewer transition breakdowns, greater tool resilience, and likely led to better learning outcomes for less-experienced clinicians when compared to the current tool. We discuss the implications of our results for improving patient safety with a continuity-of-care-based approach.

Proceedings Article
03 Nov 2012
TL;DR: The effectiveness of Peer-Led Proficiency Training of existing experienced clinician EHR users in improving self-reported efficiency and satisfaction with an EHR and improvements in perceived work-life balance and job satisfaction are highlighted.
Abstract: The best way to train clinicians to optimize their use of the Electronic Health Record (EHR) remains unclear. Approaches range from web-based training, classroom training, EHR functionality training, case-based training, role-based training, process-based training, and mock-clinic training to "on the job" training. Similarly, the optimal timing of training remains unclear: whether to engage in extensive pre-go-live training vs. minimal pre-go-live training followed by more extensive post-go-live training. In addition, the relative effectiveness of non-clinician trainers, clinician trainers, and peer-trainers remains poorly defined. This paper describes a program in which relatively experienced clinician users of an EHR underwent an intensive 3-day Peer-Led EHR advanced proficiency training, and the results of that training based on participant surveys. It highlights the effectiveness of Peer-Led Proficiency Training of existing experienced clinician EHR users in improving self-reported efficiency and satisfaction with an EHR and improvements in perceived work-life balance and job satisfaction.

Proceedings Article
01 Jan 2012
TL;DR: The Text Retrieval Conference (TREC) 2011 Medical Records Track was a challenge evaluation allowing comparison of systems and algorithms to retrieve patients eligible for clinical studies from a corpus of de-identified medical records, grouped by patient visit as mentioned in this paper.
Abstract: Objective: Secondary use of electronic health record (EHR) data relies on the ability to retrieve accurate and complete information about desired patient populations. The Text Retrieval Conference (TREC) 2011 Medical Records Track was a challenge evaluation allowing comparison of systems and algorithms to retrieve patients eligible for clinical studies from a corpus of de-identified medical records, grouped by patient visit. Participants retrieved cohorts of patients relevant to 35 different clinical topics, and visits were judged for relevance to each topic. This study identified the most common barriers to identifying specific clinical populations in the test collection. Methods: Using the runs from track participants and judged visits, we analyzed the five non-relevant visits most often retrieved and the five relevant visits most often overlooked. Categories were developed iteratively to group the reasons for incorrect retrieval for each of the 35 topics. Results: Reasons fell into nine categories for non-relevant visits and five categories for relevant visits. Non-relevant visits were most often retrieved because they contained a non-relevant reference to the topic terms. Relevant visits were most often overlooked because they used a synonym for a topic term. Conclusions: This failure analysis provides insight into areas for future improvement in EHR-based retrieval with techniques such as more widespread and complete use of standardized terminology in retrieval and data entry systems.

Proceedings Article
01 Jan 2012
TL;DR: Amongst the three expansion methods, the topic model-based method performed the best in terms of recall and F-measure, and was developed and tested for the retrieval of clinical documents.
Abstract: We present a study that developed and tested three query expansion methods for the retrieval of clinical documents. Finding relevant documents in a large clinical data warehouse is a challenging task. To address this issue, first, we implemented a synonym expansion strategy that used a few selected vocabularies. Second, we trained a topic model on a large set of clinical documents, which was then used to identify related terms for query expansion. Third, we obtained related terms from a large predicate database derived from Medline abstracts for query expansion. The three expansion methods were tested on a set of clinical notes. All three methods successfully achieved higher average recalls and average F-measures when compared with the baseline method. The average precisions and precision at 10, however, decreased with all expansions. Amongst the three expansion methods, the topic model-based method performed the best in terms of recall and F-measure.
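
The topic-model strategy can be made concrete with a toy sketch: fit LDA over clinical documents, find the query's dominant topic, and append that topic's top terms. The documents and query below are invented, and gensim's LdaModel is assumed:

```python
# Toy version of the topic-model expansion: fit LDA on notes, find the
# query's dominant topic, append that topic's top terms. Data invented.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    ["chest", "pain", "angina", "troponin"],
    ["cough", "fever", "pneumonia", "infiltrate"],
    ["angina", "troponin", "ischemia", "chest"],
    ["fever", "sputum", "pneumonia", "cough"],
]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20, random_state=0)

query = ["chest", "pain"]
bow = dictionary.doc2bow(query)
topic_id, _ = max(lda.get_document_topics(bow), key=lambda t: t[1])
expansion = [w for w, _ in lda.show_topic(topic_id, topn=5) if w not in query]
print(query + expansion)  # expanded query handed to the retrieval engine
```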

Proceedings Article
01 Jan 2012
TL;DR: It is shown that machine-learning methods are able to induce models that identify high-risk patients with accuracy that exceeds previously developed scoring models for VTE.
Abstract: We consider the task of predicting which patients are most at risk for post-hospitalization venothromboembolism (VTE) using information automatically elicited from an EHR. Given a set of cases and controls, we use machine-learning methods to induce models for making these predictions. Our empirical evaluation of this approach offers a number of interesting and important conclusions. We identify several risk factors for VTE that were not previously recognized. We show that machine-learning methods are able to induce models that identify high-risk patients with accuracy that exceeds previously developed scoring models for VTE. Additionally, we show that, even without having prior knowledge about relevant risk factors, we are able to learn accurate models for this task.

Proceedings Article
03 Nov 2012
TL;DR: A clinical tool capable of identifying discriminatory characteristics that can predict patients who will return within 72 hours to the Pediatric emergency department (PED) is developed and demonstrated to be consistent and predictive across multiple PED sites.
Abstract: The primary purpose of this study was to develop a clinical tool capable of identifying discriminatory characteristics that can predict patients who will return within 72 hours to the pediatric emergency department (PED). We studied 66,861 patients who were discharged from the EDs during the period from May 1, 2009 to December 31, 2009. We used a classification model to predict return visits based on factors extracted from patient demographic information, chief complaint, diagnosis, treatment, and real-time hospital ED census statistics. We began with a large pool of potentially important factors, and used particle swarm optimization techniques for feature selection coupled with an optimization-based discriminant analysis model (DAMIP) to identify a classification rule with relatively small subsets of discriminatory factors that can be used to predict, with 80% accuracy or greater, return within 72 hours. The analysis involves using a subset of the patient cohort for training and establishment of the predictive rule, and blindly predicting the return of the remaining patients. Good candidate factors for revisit prediction are obtained where the accuracy of cross-validation and blind prediction is over 80%. Among the predictive rules, the most frequent discriminatory factors identified include diagnosis (>97%), patient complaint (>97%), and provider type (>57%). There are significant differences in the readmission characteristics among different acuity levels. For Level 1 patients, critical readmission factors include patient complaint (>57%), time from when the patient arrived until he/she got an ED bed (>64%), and type/number of providers (>50%). For Level 4/5 patients, physician diagnosis (100%), patient complaint (99%), disposition type when the patient arrives at and leaves the ED (>30%), and whether the patient has a lab test (>33%) appear to be significant. The model was demonstrated to be consistent and predictive across multiple PED sites. The resulting tool could enable ED staff and administrators to use patient-specific values for each of a small number of discriminatory factors, and in return receive a prediction as to whether the patient will return to the ED within 72 hours. Prediction accuracy can exceed 85%. This provides an opportunity for improving care and offering additional care or guidance to reduce ED readmission.

Proceedings Article
03 Nov 2012
TL;DR: In this article, the authors examined the transportability of the smoking status detection module in the clinical Text Analysis and Knowledge Extraction System (cTAKES) on the Vanderbilt University Hospital's EMR data.
Abstract: Electronic Medical Records (EMRs) are valuable resources for clinical observational studies. Smoking status of a patient is one of the key factors for many diseases, but it is often embedded in narrative text. Natural language processing (NLP) systems have been developed for this specific task, such as the smoking status detection module in the clinical Text Analysis and Knowledge Extraction System (cTAKES). This study examined the transportability of the smoking module in cTAKES on the Vanderbilt University Hospital's EMR data. Our evaluation demonstrated that a modest effort of change is necessary to achieve desirable performance. We modified the system by filtering notes, annotating new data for training the machine learning classifier, and adding rules to the rule-based classifiers. Our results showed that the customized module achieved significantly higher F-measures at all levels of classification (i.e., sentence, document, patient) compared to the direct application of the cTAKES module to the Vanderbilt data.

Proceedings Article
01 Jan 2012
TL;DR: The objective was to delineate the prevalence of hedge phrase usage in clinical documentation which may have a profound impact on patient care and provider-patient communication, and may become a source of unintended consequences when such documents are made directly accessible to patients via patient portals.
Abstract: In this study, we quantified the use of uncertainty expressions, referred to as ‘hedge’ phrases, among a corpus of 100,000 clinical documents retrieved from our institution’s electronic health record system. The frequency of each hedge phrase appearing in the corpus was characterized across document types and clinical departments. We also used a natural language processing tool to identify clinical concepts that were spatially, and potentially semantically, associated with the hedge phrases identified. The objective was to delineate the prevalence of hedge phrase usage in clinical documentation which may have a profound impact on patient care and provider–patient communication, and may become a source of unintended consequences when such documents are made directly accessible to patients via patient portals.
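
The frequency characterization described above reduces to tallying a fixed phrase list across typed documents. A minimal sketch; the hedge list and notes are invented, not the study's lexicon or corpus:

```python
# Frequency sketch: tally a fixed hedge-phrase list per document type.
# The phrase list and notes are invented, not the study's lexicon or corpus.
import re
from collections import Counter

HEDGES = ["cannot rule out", "possibly", "suggestive of", "likely", "appears"]
notes = [
    ("Radiology", "Opacity suggestive of pneumonia; cannot rule out mass."),
    ("Progress", "Patient likely improving. Rash appears resolved."),
]

counts = Counter()
for doc_type, text in notes:
    for phrase in HEDGES:
        counts[(doc_type, phrase)] += len(
            re.findall(re.escape(phrase), text, flags=re.IGNORECASE))

for (doc_type, phrase), n in counts.most_common():
    if n:
        print(f"{doc_type}: {phrase!r} x{n}")
```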

Proceedings Article
01 Jan 2012
TL;DR: While patients' tracking behaviors without a real-time tracking tool were fragmented and sporadic, these behaviors with a tool were more consistent and used tracked data to see patterns among symptoms, feel psychosocial comfort, and improve symptom communication with clinicians.
Abstract: People with cancer experience many unanticipated symptoms and struggle to communicate them to clinicians. Although researchers have developed patient-reported outcome (PRO) tools to address this problem, such tools capture retrospective data intended for clinicians to review. In contrast, real-time tracking tools with visible results for patients could improve health outcomes and communication with clinicians, while also enhancing patients' symptom management. To understand potential benefits of such tools, we studied the tracking behaviors of 25 women with breast cancer. We provided 10 of these participants with a real-time tracking tool that served as a "technology probe" to uncover behaviors and benefits from voluntary use. Our findings showed that while patients' tracking behaviors without a tool were fragmented and sporadic, these behaviors with a tool were more consistent. Participants also used tracked data to see patterns among symptoms, feel psychosocial comfort, and improve symptom communication with clinicians. We conclude with design implications for future real-time tracking tools.

Proceedings Article
01 Jan 2012
TL;DR: The Intelligent Care Delivery Analytics platform (ICDA), a system which enables risk assessment analytics that process large collections of dynamic electronic medical data to identify at-risk patients, is described.
Abstract: The identification of high-risk patients is a critical component in improving patient outcomes and managing costs. This paper describes the Intelligent Care Delivery Analytics platform (ICDA), a system which enables risk assessment analytics that process large collections of dynamic electronic medical data to identify at-risk patients. ICDA works by ingesting large volumes of data into a common data model, then orchestrating a collection of analytics that identify at-risk patients. It also provides an interactive environment through which users can access and review the analytics results. In addition, ICDA provides APIs via which analytics results can be retrieved to surface in external applications. A detailed review of ICDA's architecture is provided. Descriptions of four use cases are included to illustrate ICDA's application within two different data environments. These use cases showcase the system's flexibility and exemplify the types of analytics it enables.

Proceedings Article
01 Jan 2012
TL;DR: It is concluded that while improving estimates of surgery durations is possible, the inherent variability in such estimates remains high, necessitating caution in their use when optimizing OR schedules.
Abstract: Inherent uncertainties in surgery durations impact many critical metrics about the performance of an operating room (OR) environment. OR schedules that are robust to natural variability in surgery durations require surgery duration estimates that are unbiased, with high accuracy, and with few cases with large absolute errors. Earlier studies have shown that factors such as patient severity, personnel, and procedure type greatly affect the accuracy of such estimations. In this paper we investigate whether operational and temporal factors can be used to improve these estimates further. We present an adjustment method based on a combination of these operational and temporal factors. We validate our method with two years of detailed operational data from an electronic medical record. We conclude that while improving estimates of surgery durations is possible, the inherent variability in such estimates remains high, necessitating caution in their use when optimizing OR schedules.
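
One simple form such an adjustment could take (a sketch under assumed grouping keys, not the paper's method) is shifting each scheduled estimate by the historical median error for the same operational context:

```python
# Sketch of one possible adjustment of the kind described: shift each
# scheduled estimate by the historical median error for the same
# operational context. Grouping keys and data are invented.
import pandas as pd

history = pd.DataFrame({
    "surgeon":   ["A", "A", "A", "B", "B", "B"],
    "slot":      ["am", "am", "pm", "am", "pm", "pm"],
    "scheduled": [60, 90, 120, 60, 90, 120],
    "actual":    [75, 110, 130, 55, 95, 150],
})
history["error"] = history["actual"] - history["scheduled"]
bias = history.groupby(["surgeon", "slot"])["error"].median()

def adjusted_estimate(surgeon, slot, scheduled):
    # Fall back to the raw estimate for unseen contexts.
    return scheduled + bias.get((surgeon, slot), 0)

print(adjusted_estimate("A", "am", 60))  # 60 + median historical error for A/am
```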

Proceedings Article
01 Jan 2012
TL;DR: This work proposes developing predictive models by first generating derived variables that characterize clinical phenotype that reduces the number of variables, reduces noise, introduces clinical knowledge into model building, and abstracts away the underlying data representation, thus facilitating use of standard data mining algorithms.
Abstract: Hospital readmissions depend on numerous factors. Automated risk calculation using electronic health record (EHR) data could allow targeting care to prevent them. EHRs usually are incomplete with respect to data relevant to readmissions prediction. Lack of standard data representations in EHRs restricts generalizability of predictive models. We propose developing such models by first generating derived variables that characterize clinical phenotype. This reduces the number of variables, reduces noise, introduces clinical knowledge into model building, and abstracts away the underlying data representation, thus facilitating use of standard data mining algorithms. We combined this pre-processing step with a random forest algorithm to compute risk for readmission within 30 days for patients in ten disease categories. Results were promising for encounters that our algorithm assigned very high or very low risk. Assigning patients to either of these two risk groups could be of value to patient care teams aiming to prevent readmissions.
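
A compact sketch of the two-stage idea, derived phenotype variables followed by a random forest, using scikit-learn; the raw variables, thresholds, and outcome are invented for illustration:

```python
# Two-stage sketch: derived phenotype variables, then a random forest for
# 30-day readmission. Variables, thresholds, and outcome are invented.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
creatinine_max = rng.lognormal(0.2, 0.4, n)  # raw EHR extracts (simulated)
hgb_min = rng.normal(11, 2, n)
prior_admits = rng.poisson(1.0, n)

# Derived variables abstract away the raw representation and reduce noise.
derived = np.column_stack([
    creatinine_max > 1.5,   # renal-impairment phenotype
    hgb_min < 10,           # anemia phenotype
    prior_admits >= 2,      # frequent-utilizer phenotype
]).astype(float)
readmit30 = (derived.sum(axis=1) + rng.normal(0, 1, n) > 2).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(derived, readmit30)
risk = rf.predict_proba(derived)[:, 1]
# Act on the confident tails, as the abstract suggests:
print("very high risk:", int((risk > 0.8).sum()), "very low risk:", int((risk < 0.2).sum()))
```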

Proceedings Article
01 Jan 2012
TL;DR: In this article, a corpus-driven semantic lexicon, MedLex, has been constructed where the semantics is based on the UMLS assisted with variants mined and usage information gathered from clinical text.
Abstract: A semantic lexicon which associates words and phrases in text to concepts is critical for extracting and encoding clinical information in free text and therefore achieving semantic interoperability between structured and unstructured data in Electronic Health Records (EHRs). Directly using existing standard terminologies may have limited coverage with respect to concepts and their corresponding mentions in text. In this paper, we analyze how tokens and phrases in a large corpus distribute and how well the UMLS captures the semantics. A corpus-driven semantic lexicon, MedLex, has been constructed where the semantics is based on the UMLS, assisted with variants mined and usage information gathered from clinical text. The detailed corpus analysis of tokens, chunks, and concept mentions shows the UMLS is an invaluable source for natural language processing. Increasing the semantic coverage of tokens provides a good foundation in capturing clinical information comprehensively. The study also yields some insights in developing practical NLP systems.

Proceedings Article
03 Nov 2012
TL;DR: In this paper, the authors evaluated the recently developed National Quality Forum (NQF) information model designed for EHR-based quality measures: the Quality Data Model (QDM).
Abstract: The development of Electronic Health Record (EHR)-based phenotype selection algorithms is a non-trivial and highly iterative process involving domain experts and informaticians. To make it easier to port algorithms across institutions, it is desirable to represent them using an unambiguous formal specification language. For this purpose we evaluated the recently developed National Quality Forum (NQF) information model designed for EHR-based quality measures: the Quality Data Model (QDM). We selected 9 phenotyping algorithms that had been previously developed as part of the eMERGE consortium and translated them into QDM format. Our study concluded that the QDM contains several core elements that make it a promising format for EHR-driven phenotyping algorithms for clinical research. However, we also found areas in which the QDM could be usefully extended, such as representing information extracted from clinical text, and the ability to handle algorithms that do not consist of Boolean combinations of criteria.

Proceedings Article
01 Jan 2012
TL;DR: In this paper, a framework and an approach for executing phenotyping criteria modeled in QDM using the Drools business rules management system is presented, and demonstrated their execution on real patient data from Mayo Clinic to identify cases for Coronary Artery Disease and Diabetes.
Abstract: With increasing adoption of electronic health records (EHRs), the need for formal representations for EHR-driven phenotyping algorithms has been recognized for some time. The recently proposed Quality Data Model from the National Quality Forum (NQF) provides an information model and a grammar that is intended to represent data collected during routine clinical care in EHRs as well as the basic logic required to represent the algorithmic criteria for phenotype definitions. The QDM is further aligned with Meaningful Use standards to ensure that the clinical data and algorithmic criteria are represented in a consistent, unambiguous and reproducible manner. However, phenotype definitions represented in QDM, while structured, cannot be executed readily on existing EHRs. Rather, human interpretation, and subsequent implementation is a required step for this process. To address this need, the current study investigates open-source JBoss® Drools rules engine for automatic translation of QDM criteria into rules for execution over EHR data. In particular, using Apache Foundation's Unstructured Information Management Architecture (UIMA) platform, we developed a translator tool for converting QDM defined phenotyping algorithm criteria into executable Drools rules scripts, and demonstrated their execution on real patient data from Mayo Clinic to identify cases for Coronary Artery Disease and Diabetes. To the best of our knowledge, this is the first study illustrating a framework and an approach for executing phenotyping criteria modeled in QDM using the Drools business rules management system.
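
The study's translator emits Drools (Java) rule scripts via UIMA. Purely to make executable phenotype criteria concrete, the Python sketch below evaluates a QDM-like criterion over toy records; the criterion structure and codes are invented:

```python
# Illustration only: executing a QDM-like criterion ("diabetes diagnosis
# AND >= 2 HbA1c results above 6.5") over toy records. The real pipeline
# compiles QDM into Drools rule scripts; this structure is invented.
criterion = {
    "all": [
        {"field": "diagnoses", "contains": "250.00"},      # diabetes ICD-9
        {"field": "hba1c", "min_count_above": (6.5, 2)},   # repeated lab threshold
    ]
}

def matches(patient, crit):
    for clause in crit["all"]:
        if "contains" in clause:
            if clause["contains"] not in patient[clause["field"]]:
                return False
        else:
            threshold, k = clause["min_count_above"]
            if sum(v > threshold for v in patient[clause["field"]]) < k:
                return False
    return True

patients = [
    {"id": 1, "diagnoses": ["250.00"], "hba1c": [7.1, 6.9]},
    {"id": 2, "diagnoses": ["401.9"], "hba1c": [7.5, 7.2]},
]
print([p["id"] for p in patients if matches(p, criterion)])  # -> [1]
```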

Proceedings Article
03 Nov 2012
TL;DR: A profile-based method that used dictated discharge summaries as an external source to automatically build sense profiles and applied them to disambiguate abbreviations in hospital admission notes via the vector space model showed that it performed better than two baseline methods and achieved a best average precision of 0.792.
Abstract: Abbreviations are widely used in clinical notes and are often ambiguous. Word sense disambiguation (WSD) for clinical abbreviations therefore is a critical task for many clinical natural language processing (NLP) systems. Supervised machine learning based WSD methods are known for their high performance. However, it is time consuming and costly to construct annotated samples for supervised WSD approaches and sense frequency information is often ignored by these methods. In this study, we proposed a profile-based method that used dictated discharge summaries as an external source to automatically build sense profiles and applied them to disambiguate abbreviations in hospital admission notes via the vector space model. Our evaluation using a test set containing 2,386 annotated instances from 13 ambiguous abbreviations in admission notes showed that the profile-based method performed better than two baseline methods and achieved a best average precision of 0.792. Furthermore, we developed a strategy to combine sense frequency information estimated from a clustering analysis with the profile-based method. Our results showed that the combined approach largely improved the performance and achieved a highest precision of 0.875 on the same test set, indicating that integrating sense frequency information with local context is effective for clinical abbreviation disambiguation.
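
The profile-based method amounts to one bag-of-words 'sense profile' per expansion, built from an external corpus and compared to the abbreviation's local context in a vector space. A toy version with scikit-learn tf-idf and cosine similarity (profiles and context invented):

```python
# Toy version of the profile-based method: tf-idf sense profiles from an
# external corpus, cosine similarity against the local context. Invented data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

profiles = {  # hypothetical sense profiles for the abbreviation "RA"
    "rheumatoid arthritis": "joint pain swelling methotrexate synovitis stiffness",
    "right atrium": "echocardiogram dilated atrium cardiac chamber pressure",
}
context = "echo shows dilated RA with elevated pressure"  # admission-note context

senses = list(profiles)
vec = TfidfVectorizer().fit(list(profiles.values()) + [context])
sims = cosine_similarity(vec.transform([context]),
                         vec.transform([profiles[s] for s in senses]))[0]
print(senses[sims.argmax()])  # -> right atrium
```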

Proceedings Article
03 Nov 2012
TL;DR: In this paper, the authors used Support Vector Machines (SVM), Naive Bayes (NB), and Decision Trees (DT) to optimize the window size and orientation and determine the minimum training sample size needed for optimal performance.
Abstract: Acronyms and abbreviations within electronic clinical texts are widespread and often associated with multiple senses. Automated acronym sense disambiguation (WSD), a task of assigning the context-appropriate sense to ambiguous clinical acronyms and abbreviations, represents an active problem for medical natural language processing (NLP) systems. In this paper, fifty clinical acronyms and abbreviations with 500 samples each were studied using supervised machine-learning techniques (Support Vector Machines (SVM), Naive Bayes (NB), and Decision Trees (DT)) to optimize the window size and orientation and determine the minimum training sample size needed for optimal performance. Our analysis of window size and orientation showed best performance using a larger left-sided and smaller right-sided window. To achieve an accuracy of over 90%, the minimum required training sample size was approximately 125 samples for SVM classifiers with inverted cross-validation. These findings support future work in clinical acronym and abbreviation WSD and require validation with other clinical texts.
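
The window experiment reduces to extracting an asymmetric context window around the ambiguous token and training a classifier on it. A toy sketch with scikit-learn's LinearSVC using the best-performing shape reported above (larger left, smaller right window); the samples and senses are invented:

```python
# Toy sketch of the window experiment: asymmetric context window (larger
# left, smaller right) feeding an SVM. Samples and senses are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def window(tokens, i, left=4, right=2):
    """Context features from an asymmetric window around token i."""
    return " ".join(tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right])

samples = [  # (tokens, index of the ambiguous token "pe", sense)
    ("patient history of pe on ct angiogram of chest".split(), 3, "pulmonary embolism"),
    ("normal pe noted on physical today".split(), 1, "physical exam"),
    ("ct chest positive for pe started heparin drip".split(), 4, "pulmonary embolism"),
    ("well appearing on pe no acute distress".split(), 3, "physical exam"),
]
X = [window(t, i) for t, i, _ in samples]
y = [sense for _, _, sense in samples]

clf = make_pipeline(CountVectorizer(), LinearSVC()).fit(X, y)
test = "repeat ct angiogram positive for pe heparin continued".split()
print(clf.predict([window(test, 5)]))  # -> ['pulmonary embolism']
```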

Proceedings Article
Jianying Hu1, Fei Wang2, Jimeng Sun1, Robert Sorrentino1, Shahram Ebadollahi1 
03 Nov 2012
TL;DR: The effectiveness of the framework is demonstrated using claims data collected from a population of 7667 diabetes patients, demonstrating the usefulness of the proposed approaches in identifying clinically meaningful instances for both hot spotting and anomaly detection.
Abstract: Patient medical records today contain vast amounts of information regarding patient conditions along with treatment and procedure records. Systematic healthcare resource utilization analysis leveraging such observational data can provide critical insights to guide resource planning and improve the quality of care delivery while reducing cost. Of particular interest to providers are hot spotting: the ability to identify in a timely manner heavy users of the systems and their patterns of utilization so that targeted intervention programs can be instituted, and anomaly detection: the ability to identify anomalous utilization cases where the patients incurred levels of utilization that are unexpected given their clinical characteristics, which may require corrective actions. Past work on medical utilization pattern analysis has focused on disease-specific studies. We present a framework for utilization analysis that can be easily applied to any patient population. The framework includes two main components: utilization profiling and hot spotting, where we use a vector space model to represent patient utilization profiles, and apply clustering techniques to identify utilization groups within a given population and isolate high utilizers of different types; and contextual anomaly detection for utilization, where models that map a patient's clinical characteristics to the utilization level are built in order to quantify the deviation between the expected and actual utilization levels and identify anomalies. We demonstrate the effectiveness of the framework using claims data collected from a population of 7,667 diabetes patients. Our analysis demonstrates the usefulness of the proposed approaches in identifying clinically meaningful instances for both hot spotting and anomaly detection. In future work we plan to incorporate additional sources of observational data including EMRs and disease registries, and develop analytics models to leverage temporal relationships among medical encounters to provide more in-depth insights.
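
Both components have small-scale analogues: clustering utilization vectors for hot spotting, and flagging large residuals of a characteristics-to-utilization model as contextual anomalies. A sketch on simulated data (not the study's claims data):

```python
# Sketch of both components on simulated data: cluster utilization vectors
# (hot spotting) and flag large residuals of a traits-to-utilization model
# (contextual anomaly detection). Not the study's claims data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 300
traits = rng.normal(size=(n, 3))  # clinical characteristics
visits = traits @ np.array([2.0, 1.0, 0.5]) + 10 + rng.normal(0, 1, n)
visits[:5] += 15                  # inject anomalous utilization

# (1) Hot spotting: cluster utilization profiles, inspect heavy-user groups.
util = np.column_stack([visits, rng.poisson(2, n)])  # e.g. ED visits, admissions
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(util)
print("cluster mean visits:", [round(util[labels == k, 0].mean(), 1) for k in range(3)])

# (2) Contextual anomaly: deviation between actual and expected utilization.
resid = visits - LinearRegression().fit(traits, visits).predict(traits)
print("flagged patients:", np.where(resid > 3 * resid.std())[0])
```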

Proceedings Article
01 Jan 2012
TL;DR: The development of a comprehensive Time Capture Tool (TimeCaT) is presented: a web application developed to support data capture for TMS and includes the development and validation of a realistic inter-observer reliability scoring algorithm, the creation of an online clinical tasks ontology, and a novel quantitative workflow comparison method.
Abstract: Time Motion Studies (TMS) have proved to be the gold standard method to measure and quantify clinical workflow, and have been widely used to assess the impact of health information systems implementation. Although there are tools available to conduct TMS, they provide different approaches for multitasking, interruptions, inter-observer reliability assessment and task taxonomy, making results across studies not comparable. We postulate that a significant contributing factor towards the standardization and spread of TMS would be the availability and spread of an accessible, scalable and dynamic tool. We present the development of a comprehensive Time Capture Tool (TimeCaT): a web application developed to support data capture for TMS. Ongoing and continuous development of TimeCaT includes the development and validation of a realistic inter-observer reliability scoring algorithm, the creation of an online clinical tasks ontology, and a novel quantitative workflow comparison method.

Proceedings Article
03 Nov 2012
TL;DR: In this paper, the authors examined the relationship between semantic relatedness among medical concepts found in clinical reports and biomedical literature and determined whether relations between medical concepts identified from Medline abstracts may be used to inform us as to the nature of the association between medical terms that appear to be closely related based on their distribution in clinical notes.
Abstract: In this paper we examined the relationship between semantic relatedness among medical concepts found in clinical reports and biomedical literature. Our objective is to determine whether relations between medical concepts identified from Medline abstracts may be used to inform us as to the nature of the association between medical concepts that appear to be closely related based on their distribution in clinical reports. We used a corpus of 800k inpatient clinical notes as a source of data for determining the strength of association between medical concepts, and the SemRep database as a source of labeled relations extracted from Medline abstracts. The same pair of medical concepts may be found with more than one predicate type in the SemRep database, but often with different frequencies. Our analysis shows that predicate type frequency information obtained from the SemRep database appears to be helpful for labeling semantic relations obtained with measures of semantic relatedness and similarity.
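
One standard measure for the distributional side (an assumption here; the abstract does not name the measure used) is pointwise mutual information over per-note co-occurrence counts:

```python
# Assumed measure for illustration: pointwise mutual information between
# concept pairs from per-note co-occurrence. The notes below are invented.
import math
from collections import Counter
from itertools import combinations

notes = [  # each note as the set of concepts extracted from it
    {"atrial fibrillation", "warfarin", "stroke"},
    {"atrial fibrillation", "warfarin"},
    {"pneumonia", "levofloxacin"},
    {"stroke", "warfarin", "pneumonia"},
]
n = len(notes)
single = Counter(c for note in notes for c in note)
pair = Counter(frozenset(p) for note in notes for p in combinations(sorted(note), 2))

def pmi(a, b):
    joint = pair[frozenset((a, b))] / n
    if joint == 0:
        return float("-inf")
    return math.log2(joint / ((single[a] / n) * (single[b] / n)))

# Strongly associated pairs become candidates for labeling with SemRep predicates.
print(f"PMI(afib, warfarin)     = {pmi('atrial fibrillation', 'warfarin'):.2f}")
print(f"PMI(afib, levofloxacin) = {pmi('atrial fibrillation', 'levofloxacin')}")
```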