
Showing papers presented at "American Medical Informatics Association Annual Symposium in 2014"


Proceedings Article
14 Nov 2014
TL;DR: A systematic study of tweets collected for 74 drugs to assess their value as sources of potential signals for adverse drug reactions (ADRs), creating an annotated corpus of 10,822 tweets and attempting a lexicon-based approach for concept extraction, with promising success.
Abstract: Recent research has shown that Twitter data analytics can have broad implications for public health research. However, its value for pharmacovigilance has been scarcely studied, with health-related forums and community support groups preferred for the task. We present a systematic study of tweets collected for 74 drugs to assess their value as sources of potential signals for adverse drug reactions (ADRs). We created an annotated corpus of 10,822 tweets. Each tweet was annotated for the presence or absence of ADR mentions, with the span and Unified Medical Language System (UMLS) concept ID noted for each ADR present. Using Cohen’s kappa, we calculated the inter-annotator agreement (IAA) for the binary annotations to be 0.69. To demonstrate the utility of the corpus, we attempted a lexicon-based approach for concept extraction, with promising success (54.1% precision, 62.1% recall, and 57.8% F-measure). A subset of the corpus is freely available at: http://diego.asu.edu/downloads.
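The inter-annotator agreement figure quoted above can be reproduced from raw binary annotations with a short Cohen's kappa routine. This is an illustrative sketch, not the authors' code, and the example labels are invented.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' binary labels."""
    n = len(labels_a)
    # Observed agreement: fraction of tweets both annotators labelled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical has-ADR / no-ADR annotations for five tweets:
print(cohens_kappa([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))  # ≈ 0.615
```

A kappa of 0.69, as reported, indicates substantial agreement beyond what the annotators' label frequencies alone would produce by chance.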

134 citations


Proceedings Article
14 Nov 2014
TL;DR: Three studies of the mHealth apps in Google Play are presented, showing that mHealth apps make widespread use of unsecured Internet communications and third-party servers, and suggesting that increased use of mHealth apps could lead to less secure treatment of health data unless mHealth vendors improve the way they communicate and store data.
Abstract: Mobile Health (mHealth) applications lie outside of regulatory protection such as HIPAA, which requires a baseline of privacy and security protections appropriate to sensitive medical data. However, mHealth apps, particularly those in the app stores for iOS and Android, are increasingly handling sensitive data for both professionals and patients. This paper presents a series of three studies of the mHealth apps in Google Play that show that mHealth apps make widespread use of unsecured Internet communications and third party servers. Both of these practices would be considered problematic under HIPAA, suggesting that increased use of mHealth apps could lead to less secure treatment of health data unless mHealth vendors make improvements in the way they communicate and store data.

97 citations


Proceedings Article
14 Nov 2014
TL;DR: ARX is presented, an anonymization tool that implements a wide variety of privacy methods in a highly efficient manner, provides an intuitive cross-platform graphical interface, and offers a programming interface for integration into other software systems.
Abstract: Collaboration and data sharing have become core elements of biomedical research. Especially when sensitive data from distributed sources are linked, privacy threats have to be considered. Statistical disclosure control allows the protection of sensitive data by introducing fuzziness. Reduction of data quality, however, needs to be balanced against gains in protection. Therefore, tools are needed which provide a good overview of the anonymization process to those responsible for data sharing. These tools require graphical interfaces and the use of intuitive and replicable methods. In addition, extensive testing, documentation and openness to reviews by the community are important. Existing publicly available software is limited in functionality, and often active support is lacking. We present ARX, an anonymization tool that i) implements a wide variety of privacy methods in a highly efficient manner, ii) provides an intuitive cross-platform graphical interface, iii) offers a programming interface for integration into other software systems, and iv) is well documented and actively supported.
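The core guarantee tools in this space evaluate can be illustrated with a toy k-anonymity check. This is a minimal sketch assuming records are dicts of already-generalized quasi-identifier values; it is not ARX's actual API.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values is shared
    by at least k records, so no individual stands out."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())

rows = [{"age": "30-40", "zip": "537**"},
        {"age": "30-40", "zip": "537**"},
        {"age": "40-50", "zip": "537**"}]
print(is_k_anonymous(rows, ["age", "zip"], 2))  # False: one group has a single record
```

The balancing act the abstract describes is exactly this: generalizing values (e.g. age ranges, truncated ZIP codes) until every group reaches size k, while keeping the data useful.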

77 citations


Proceedings Article
Ping Zhang1, Fei Wang1, Jianying Hu1
14 Nov 2014
TL;DR: Novel predictions of drug-disease associations were supported by clinical trials databases, showing that DDR could serve as a useful tool in drug discovery to efficiently identify potential novel uses for existing drugs.
Abstract: In response to the high cost and high risk associated with traditional de novo drug discovery, investigation of potential additional uses for existing drugs, also known as drug repositioning, has attracted increasing attention from both the pharmaceutical industry and the research community. In this paper, we propose a unified computational framework, called DDR, to predict novel drug-disease associations. DDR formulates the task of hypothesis generation for drug repositioning as a constrained nonlinear optimization problem. It utilizes multiple drug similarity networks, multiple disease similarity networks, and known drug-disease associations to explore potential new associations among drugs and diseases with no known links. A large-scale study was conducted using 799 drugs against 719 diseases. Experimental results demonstrated the effectiveness of the approach. In addition, DDR ranked drug and disease information sources based on their contributions to the prediction, thus paving the way for prioritizing multiple data sources and building more reliable drug repositioning models. Particularly, some of our novel predictions of drug-disease associations were supported by clinical trials databases, showing that DDR could serve as a useful tool in drug discovery to efficiently identify potential novel uses for existing drugs.

72 citations


Proceedings Article
14 Nov 2014
TL;DR: It is shown that agreement between image reading and clinical examinations was imperfect, as was inter-reader agreement, and that agreement improved under an image-based reference standard defined as the majority diagnosis given by three readers.
Abstract: Information systems managing image-based data for telemedicine or clinical research applications require a reference standard representing the correct diagnosis. Accurate reference standards are difficult to establish because of imperfect agreement among physicians, and discrepancies between clinical vs. image-based diagnosis. This study is designed to describe the development and evaluation of reference standards for image-based diagnosis, which combine diagnostic impressions of multiple image readers with the actual clinical diagnoses. We show that agreement between image reading and clinical examinations was imperfect (689 [32%] discrepancies in 2148 image readings), as was inter-reader agreement (kappa 0.490-0.652). This was improved by establishing an image-based reference standard defined as the majority diagnosis given by three readers (13% discrepancies with image readers). It was further improved by establishing an overall reference standard that incorporated the clinical diagnosis (10% discrepancies with image readers). These principles of establishing reference standards may be applied to improve robustness of real-world systems supporting image-based diagnosis.
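The majority-vote reference standard described above is straightforward to sketch. The helper below is illustrative (the label values are invented) and returns None when the readers all disagree.

```python
from collections import Counter

def majority_diagnosis(readings):
    """Image-based reference standard: the diagnosis given by a strict
    majority of readers, or None if no majority exists."""
    label, votes = Counter(readings).most_common(1)[0]
    return label if votes > len(readings) / 2 else None

print(majority_diagnosis(["disease", "disease", "normal"]))  # disease
print(majority_diagnosis(["disease", "normal", "other"]))    # None
```

The paper's overall reference standard goes one step further by folding the clinical diagnosis into this vote, which reduced discrepancies with image readers from 13% to 10%.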

62 citations


Proceedings Article
14 Nov 2014
TL;DR: A novel framework for learning to estimate and predict clinical state variables without labeled data is presented, along with a user interface that lets experts choose anchor variables in an informed manner, enabling real-time decision support in the emergency department.
Abstract: We present a novel framework for learning to estimate and predict clinical state variables without labeled data. The resulting models can be used for electronic phenotyping, triggering clinical decision support, and cohort selection. The framework relies on key observations which we characterize and term “anchor variables”. By specifying anchor variables, an expert encodes a certain amount of domain knowledge about the problem while the rest of learning proceeds in an unsupervised manner. The ability to build anchors upon standardized ontologies and the framework’s ability to learn from unlabeled data promote generalizability across institutions. We additionally develop a user interface to enable experts to choose anchor variables in an informed manner. The framework is applied to electronic medical record-based phenotyping to enable real-time decision support in the emergency department. We validate the learned models using a prospectively gathered set of gold-standard responses from emergency physicians for nine clinically relevant variables.

62 citations


Proceedings Article
14 Nov 2014
TL;DR: An automated phenotyping algorithm that can be deployed to rapidly identify diabetic and/or hypertensive CKD cases and controls in health systems with EMRs is developed; it dramatically outperformed identification by ICD-9-CM codes.
Abstract: Twenty-six million Americans are estimated to have chronic kidney disease (CKD), with increased risk for cardiovascular disease and end-stage renal disease. CKD is frequently undiagnosed and patients are unaware, hampering intervention. A tool for accurate and timely identification of CKD from electronic medical records (EMR) could improve healthcare quality and identify patients for research. As members of the eMERGE (electronic medical records and genomics) Network, we developed an automated phenotyping algorithm that can be deployed to rapidly identify diabetic and/or hypertensive CKD cases and controls in health systems with EMRs. It uses diagnostic codes, laboratory results, medication and blood pressure records, and textual information culled from notes. Validation statistics demonstrated a positive predictive value of 96% and a negative predictive value of 93.3%. Similar results were obtained on implementation by two independent eMERGE member institutions. The algorithm dramatically outperformed identification by ICD-9-CM codes, which yielded positive and negative predictive values of 63% and 54%, respectively.
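Validation statistics like the PPV and NPV reported above come directly from comparing algorithm output against a gold standard such as chart review. A minimal sketch with made-up labels:

```python
def predictive_values(predicted, actual):
    """PPV and NPV of a binary case/control algorithm versus a gold standard."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    return tp / (tp + fp), tn / (tn + fn)

ppv, npv = predictive_values([1, 1, 0, 0], [1, 0, 0, 1])
print(ppv, npv)  # 0.5 0.5
```

Unlike sensitivity and specificity, PPV and NPV depend on how prevalent the condition is in the validation sample, which is why the paper validates at multiple institutions.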

54 citations


Proceedings Article
14 Nov 2014
TL;DR: A pattern-based text-mining approach identifies pairs of CHV and professional terms from Wikipedia, a large text corpus created and maintained by the community, and shows great potential to produce a high-quality CHV that improves the performance of computational applications in processing consumer-generated health text.
Abstract: Community-generated text corpora can be a valuable resource to extract consumer health vocabulary (CHV) and link them to professional terminologies and alternative variants. In this research, we propose a pattern-based text-mining approach to identify pairs of CHV and professional terms from Wikipedia, a large text corpus created and maintained by the community. A novel measure, leveraging the ratio of frequency of occurrence, was used to differentiate consumer terms from professional terms. We empirically evaluated the applicability of this approach using a large data sample consisting of MedLine abstracts and all posts from an online health forum, MedHelp. The results show that the proposed approach is able to identify synonymous pairs and label the terms as either consumer or professional term with high accuracy. We conclude that the proposed approach provides great potential to produce a high quality CHV to improve the performance of computational applications in processing consumer-generated health text.
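The frequency-ratio idea can be sketched as follows. The corpus counts and the add-one smoothing here are illustrative assumptions, not the paper's exact measure.

```python
def label_term_pair(pair, lay_counts, pro_counts):
    """Given a synonymous pair, call the term that is relatively more
    frequent in the lay corpus the consumer term, the other professional.
    Add-one smoothing avoids division by zero for unseen terms."""
    def lay_ratio(term):
        return (lay_counts.get(term, 0) + 1) / (pro_counts.get(term, 0) + 1)
    consumer = max(pair, key=lay_ratio)
    professional = min(pair, key=lay_ratio)
    return consumer, professional

lay = {"heart attack": 50, "myocardial infarction": 2}   # e.g. forum posts
pro = {"heart attack": 5, "myocardial infarction": 40}   # e.g. MedLine abstracts
print(label_term_pair(("heart attack", "myocardial infarction"), lay, pro))
```

This mirrors the evaluation setup in the abstract: MedHelp posts supply the consumer-register counts and MedLine abstracts the professional-register counts.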

45 citations


Proceedings Article
14 Nov 2014
TL;DR: The development and pilot testing of a web-based patient centered toolkit (PCTK) prototype is described to improve access to health information and to engage hospitalized patients and caregivers in the plan of care.
Abstract: Patient engagement has been identified as a key strategy for improving patient outcomes. In this paper, we describe the development and pilot testing of a web-based patient centered toolkit (PCTK) prototype to improve access to health information and to engage hospitalized patients and caregivers in the plan of care. Individual and group interviews were used to identify plan of care functional and workflow requirements and user interface design enhancements. Qualitative methods within a participatory design approach supported the development of a PCTK prototype that will be implemented on intensive care and oncology units to engage patients and professional care team members developing their plan of care during an acute hospitalization.

45 citations


Proceedings Article
14 Nov 2014
TL;DR: It is suggested that transparency in the process of sharing is an important factor in the decision to share clinical data for research.
Abstract: We interviewed 70 healthy volunteers to understand their choices about how the information in their health record should be shared for research. Twenty-eight survey questions captured individual preferences of healthy volunteers. The results showed that respondents felt comfortable participating in research if they were given choices about which portions of their medical data would be shared, and with whom those data would be shared. Respondents indicated a strong preference towards controlling access to specific data (83%), and a large proportion (68%) indicated concern about the possibility of their data being used by for-profit entities. The results suggest that transparency in the process of sharing is an important factor in the decision to share clinical data for research.

42 citations


Proceedings Article
14 Nov 2014
TL;DR: TextHunter is presented, a tool for the creation of training data, the construction of concept-extraction machine learning models, and their application to documents; it achieved recall measurements as high as 99% in real-world use cases.
Abstract: Observational research using data from electronic health records (EHR) is a rapidly growing area, which promises both increased sample size and data richness - therefore unprecedented study power. However, in many medical domains, large amounts of potentially valuable data are contained within the free text clinical narrative. Manually reviewing free text to obtain desired information is an inefficient use of researcher time and skill. Previous work has demonstrated the feasibility of applying Natural Language Processing (NLP) to extract information. However, in real world research environments, the demand for NLP skills outweighs supply, creating a bottleneck in the secondary exploitation of the EHR. To address this, we present TextHunter, a tool for the creation of training data, construction of concept extraction machine learning models and their application to documents. Using confidence thresholds to ensure high precision (>90%), we achieved recall measurements as high as 99% in real world use cases.
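The precision/recall trade-off driven by a confidence threshold works roughly like this. The item ids, scores, and threshold below are invented for illustration.

```python
def thresholded_precision_recall(predictions, gold, threshold):
    """Keep only predictions whose model confidence clears the threshold,
    then score them against the gold-standard set of item ids."""
    kept = {item for item, confidence in predictions if confidence >= threshold}
    true_positives = len(kept & gold)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

preds = [("doc1", 0.97), ("doc2", 0.55), ("doc3", 0.93)]
print(thresholded_precision_recall(preds, {"doc1", "doc2"}, 0.9))  # (0.5, 0.5)
```

Raising the threshold discards low-confidence predictions, which typically pushes precision up (here, above the >90% target the paper cites) at the cost of recall.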

Proceedings Article
14 Nov 2014
TL;DR: This work presents the first machine learning-based method specifically for classifying consumer health questions, and describes, manually annotate, and automatically classify three important question elements that improve question classification over previous techniques.
Abstract: We present a method for automatically classifying consumer health questions. Our thirteen question types are designed to aid in the automatic retrieval of medical answers from consumer health resources. To our knowledge, this is the first machine learning-based method specifically for classifying consumer health questions. We demonstrate how previous approaches to medical question classification are insufficient to achieve high accuracy on this task. Additionally, we describe, manually annotate, and automatically classify three important question elements that improve question classification over previous techniques. Our results and analysis illustrate the difficulty of the task and the future directions that are necessary to achieve high-performing consumer health question classification.

Proceedings Article
14 Nov 2014
TL;DR: The Multi-Modality Epilepsy Data Capture and Integration System (MEDCIS) is developed that combines retrospective clinical free text processing using NLP, prospective structured data capture using an ontology-driven interface, interfaces for cohort search and signal visualization, all in a single integrated environment.
Abstract: Sudden Unexpected Death in Epilepsy (SUDEP) is the leading mode of epilepsy-related death and is most common in patients with intractable, frequent, and continuing seizures. A statistically significant cohort of patients for SUDEP study requires meticulous, prospective follow-up of a large population at elevated risk, best represented by the Epilepsy Monitoring Unit (EMU) patient population. Multiple EMUs need to collaborate and share data to build a larger cohort of potential SUDEP patients using a state-of-the-art informatics infrastructure. To address the challenges of data integration and data access across multiple EMUs, we developed the Multi-Modality Epilepsy Data Capture and Integration System (MEDCIS), which combines retrospective clinical free-text processing using NLP, prospective structured data capture using an ontology-driven interface, and interfaces for cohort search and signal visualization, all in a single integrated environment. A dedicated Epilepsy and Seizure Ontology (EpSO) has been used to streamline the user interfaces, enhance usability, and enable mappings across distributed databases so that federated queries can be executed. MEDCIS contained 936 patient data sets from the EMUs of University Hospitals Case Medical Center (UH CMC) in Cleveland and Northwestern Memorial Hospital (NMH) in Chicago. Patients from UH CMC and NMH were stored in different databases and then federated through MEDCIS using EpSO and our mapping module. More than 77 GB of multi-modal signal data were processed using the Cloudwave pipeline and made available for rendering through the web interface. About 74% of the 40 open clinical questions of interest were answerable accurately using the EpSO-driven VISual AGgregator and Explorer (VISAGE) interface. Questions that were not directly answerable were due either to their inherent computational complexity, the unavailability of primary information, or concepts beyond the scope of the existing EpSO terminology system.

Proceedings Article
14 Nov 2014
TL;DR: The ability of the automatically generated lexicons to detect new terms is assessed, and it is shown that a data-driven approach captures the sublanguage of members in these communities, all the while increasing coverage of general-purpose terminologies.
Abstract: Online health communities play an increasingly prevalent role for patients and are the source of a growing body of research. A lexicon that represents the sublanguage of an online community is an important resource to enable analysis and tool development over this data source. This paper investigates a method to generate a lexicon representative of the language of members in a given community with respect to specific semantic types. We experiment with a breast cancer community and detect terms that belong to three semantic types: medications, symptoms and side effects, and emotions. We assess the ability of our automatically generated lexicons to detect new terms, and show that a data-driven approach captures the sublanguage of members in these communities, all the while increasing coverage of general-purpose terminologies. The code and the generated lexicons are made available to the research community.

Proceedings Article
14 Nov 2014
TL;DR: An automated way to score Internet search queries and web pages as to the likelihood that a person making these queries or reading those pages would decide to vaccinate is developed and used to learn about the information acquisition process of people.
Abstract: Vaccination campaigns are one of the most important and successful public health programs ever undertaken. People who want to learn about vaccines in order to make an informed decision on whether to vaccinate are faced with a wealth of information on the Internet, both for and against vaccinations. In this paper we develop an automated way to score Internet search queries and web pages as to the likelihood that a person making these queries or reading those pages would decide to vaccinate. We apply this method to data from a major Internet search engine, while people seek information about the Measles, Mumps and Rubella (MMR) vaccine. We show that our method is accurate, and use it to learn about the information acquisition process of people. Our results show that people who are pro-vaccination as well as people who are anti-vaccination seek similar information, but browsing this information has differing effects on their future browsing. These findings demonstrate the need for health authorities to tailor their information according to the current stance of users.

Proceedings Article
14 Nov 2014
TL;DR: In this paper, the Synthetic Minority Over-sampling Technique was used to overcome the problem of between-class imbalance in stroke datasets, which leads to prediction bias and decreased performance.
Abstract: Several models have been developed to predict stroke outcomes (e.g., stroke mortality, patient dependence, etc.) in recent decades. However, there is little discussion regarding the problem of between-class imbalance in stroke datasets, which leads to prediction bias and decreased performance. In this paper, we demonstrate the use of the Synthetic Minority Over-sampling Technique to overcome such problems. We also compare state of the art machine learning methods and construct a six-variable support vector machine (SVM) model to predict stroke mortality at discharge. Finally, we discuss how the identification of a reduced feature set allowed us to identify additional cases in our research database for validation testing. Our classifier achieved a c-statistic of 0.865 on the cross-validated dataset, demonstrating good classification performance using a reduced set of variables.
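A simplified SMOTE-style oversampler is sketched below. The genuine technique interpolates between a minority sample and one of its k nearest minority neighbours; this sketch interpolates between random minority pairs purely to show the mechanics, and the feature vectors are invented.

```python
import random

def smote_like(minority, n_new, seed=0):
    """Create n_new synthetic minority samples by linear interpolation
    between two randomly chosen minority-class feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        gap = rng.random()  # position along the segment between a and b
        synthetic.append([ai + gap * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

new_points = smote_like([[0.0, 0.0], [1.0, 1.0]], 3)
print(len(new_points))  # 3
```

Because the synthetic points lie between real minority samples rather than duplicating them, the classifier sees a broader minority region, which is how SMOTE reduces the prediction bias the abstract describes.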

Proceedings Article
14 Nov 2014
TL;DR: This paper focuses on the Design Cycle of the ISR framework in which user-centered distributed information design methods and participatory action research methods were used to inform the design of a mobile application (app) for persons living with HIV (PLWH).
Abstract: Mobile health (mHealth) technology presents opportunities to enhance chronic illness management, which is especially relevant for persons living with HIV (PLWH). Since mHealth technology comprises evolving and adaptable hardware and software, it provides many challenging design problems. To address this challenge, our methods were guided by the Information System Research (ISR) framework. This paper focuses on the Design Cycle of the ISR framework in which we used user-centered distributed information design methods and participatory action research methods to inform the design of a mobile application (app) for PLWH. In the first design session, participants (N=5) identified features that are optimal for meeting the treatment and management needs of PLWH. In the second design session, participants (N=6) were presented with findings from the first design session and pictures of existing apps. Findings from the Design Cycle will be evaluated with usability inspection methods. Using a systematic approach has the potential to improve mHealth functionality and use and subsequent impact.

Proceedings Article
14 Nov 2014
TL;DR: Sophia, a rapid UMLS concept extraction annotator was developed to fulfill a mandate and address extraction where high throughput is needed while preserving performance, and is noted to be several fold faster than cTAKES and the scaled-out MetaMap service.
Abstract: An opportunity exists for meaningful concept extraction and indexing from large corpora of clinical notes in the Veterans Affairs (VA) electronic medical record. Currently available tools such as MetaMap, cTAKES and HITEx do not scale up to address this big-data need. Sophia, a rapid UMLS concept extraction annotator, was developed to fulfill a mandate and address extraction where high throughput is needed while preserving performance. We report on the development, testing and benchmarking of Sophia against MetaMap and cTAKES. Sophia demonstrated improved recall compared to cTAKES and MetaMap (0.71 vs 0.66 and 0.38). The overall F-score was similar to cTAKES and an improvement over MetaMap (0.53 vs 0.57 and 0.43). With regard to speed of processing records, Sophia was several-fold faster than cTAKES and the scaled-out MetaMap service. Sophia offers a viable alternative for high-throughput information extraction tasks.

Proceedings Article
14 Nov 2014
TL;DR: Increased patient information sharing in the inpatient setting is beneficial and desirable to patients, and generally acceptable to clinicians.
Abstract: Being a hospital patient can be isolating and anxiety-inducing. We conducted two experiments to better understand clinician and patient perceptions about giving patients access to their medical records during hospital encounters. The first experiment, a survey of physicians, nurses, and other care providers (N=53), showed that most respondents were comfortable with the idea of providing patients with their clinical information. Some expressed reservations that patients might misunderstand information and become unnecessarily alarmed or offended. In the second experiment, we provided eight hospital patients with a daily copy of their full medical record, including physician notes and diagnostic test results. From semi-structured interviews with seven of these patients, we found that they perceived the information as highly useful even if they did not fully understand complex medical terms. Our results suggest that increased patient information sharing in the inpatient setting is beneficial and desirable to patients, and generally acceptable to clinicians.

Proceedings Article
14 Nov 2014
TL;DR: An allergy module built on the MTERMS NLP system to identify and encode food, drug, and environmental allergies and allergic reactions demonstrates the feasibility of using NLP to extract and encode allergy information from clinical notes.
Abstract: Emergency department (ED) visits due to allergic reactions are common. Allergy information is often recorded in free-text provider notes; however, this domain has not yet been widely studied by the natural language processing (NLP) community. We developed an allergy module built on the MTERMS NLP system to identify and encode food, drug, and environmental allergies and allergic reactions. The module included updates to our lexicon using standard terminologies, and novel disambiguation algorithms. We developed an annotation schema and annotated 400 ED notes that served as a gold standard for comparison to MTERMS output. MTERMS achieved an F-measure of 87.6% for the detection of allergen names and no known allergies, 90% for identifying true reactions in each allergy statement where true allergens were also identified, and 69% for linking reactions to their allergen. These preliminary results demonstrate the feasibility of using NLP to extract and encode allergy information from clinical notes.

Proceedings Article
14 Nov 2014
TL;DR: This work used random forest and elastic net on 20,078 deidentified records with significant missing and noisy values to develop models that outperform existing ACS risk prediction tools and shows that random forest applied to noisy and sparse data can perform on par with previously developed scoring metrics.
Abstract: Acute coronary syndrome (ACS) accounts for 1.36 million hospitalizations and billions of dollars in costs in the United States alone. A major challenge to diagnosing and treating patients with suspected ACS is the significant symptom overlap between patients with and without ACS. There is a high cost to over- and under-treatment. Guidelines recommend early risk stratification of patients, but many tools lack sufficient accuracy for use in clinical practice. Prognostic indices often misrepresent clinical populations and rely on curated data. We used random forest and elastic net on 20,078 deidentified records with significant missing and noisy values to develop models that outperform existing ACS risk prediction tools. We found that the random forest (AUC = 0.848) significantly outperformed elastic net (AUC=0.818), ridge regression (AUC = 0.810), and the TIMI (AUC = 0.745) and GRACE (AUC = 0.623) scores. Our findings show that random forest applied to noisy and sparse data can perform on par with previously developed scoring metrics.
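The c-statistic (AUC) used to compare these models has a simple rank interpretation, which the short routine below computes. The risk scores are invented for illustration.

```python
def c_statistic(scores_pos, scores_neg):
    """Probability that a randomly chosen positive case receives a higher
    risk score than a randomly chosen negative one (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

print(c_statistic([0.9, 0.7], [0.2, 0.7]))  # 0.875
```

On this scale, 0.5 is chance-level discrimination, so the gap between the random forest (0.848) and the GRACE score (0.623) reported above is substantial.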

Proceedings Article
14 Nov 2014
TL;DR: There is a significant increase in sentiment of posts through time, with different patterns of sentiment trends for initial posts in threads and reply posts.
Abstract: A large number of patients rely on online health communities to exchange information and psychosocial support with their peers. Examining participation in a community and its impact on members' behaviors and attitudes is one of the key open research questions in the field of study of online health communities. In this paper, we focus on a large public breast cancer community and conduct sentiment analysis on all its posts. We investigate the impact of different factors on post sentiment, such as time since joining the community, posting activity, age of members, and cancer stage of members. We find that there is a significant increase in sentiment of posts through time, with different patterns of sentiment trends for initial posts in threads and reply posts. Each factor plays a role; for instance, stage-IV members form a particular sub-community with patterns of sentiment and usage distinct from other members.
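A sentiment trend over time like the one reported can be summarized by an ordinary least-squares slope. This is a minimal sketch on invented (days-since-joining, sentiment) pairs, not the paper's analysis code.

```python
def ols_slope(times, sentiments):
    """Least-squares slope of sentiment on time; a positive slope means
    post sentiment rises over a member's tenure in the community."""
    n = len(times)
    mean_t = sum(times) / n
    mean_s = sum(sentiments) / n
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(times, sentiments))
    var = sum((t - mean_t) ** 2 for t in times)
    return cov / var

print(ols_slope([0, 30, 60, 90], [-0.2, 0.0, 0.1, 0.3]))  # positive slope
```

Fitting this per post type (thread-initial vs reply) is one simple way to surface the differing trend patterns the abstract mentions.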

Proceedings Article
14 Nov 2014
TL;DR: An Unstructured Information Management Application-based natural language processing (NLP) module for automated extraction of family history information with functionality for identifying statements, observations, and predication ("indicator phrases") was developed and evaluated.
Abstract: Despite increased functionality for obtaining family history in a structured format within electronic health record systems, clinical notes often still contain this information. We developed and evaluated an Unstructured Information Management Application (UIMA)-based natural language processing (NLP) module for automated extraction of family history information with functionality for identifying statements, observations (e.g., disease or procedure), relative or side of family with attributes (i.e., vital status, age of diagnosis, certainty, and negation), and predication (“indicator phrases”), the latter of which was used to establish relationships between observations and family member. The family history NLP system demonstrated F-scores of 66.9, 92.4, 82.9, 57.3, 97.7, and 61.9 for detection of family history statements, family member identification, observation identification, negation identification, vital status, and overall extraction of the predications between family members and observations, respectively. While the system performed well for detection of family history statements and predication constituents, further work is needed to improve extraction of certainty and temporal modifications.

Proceedings Article
14 Nov 2014
TL;DR: Through interviews with surgical patients who experienced SSI, design considerations for such a post-acute care app are derived and a new framework for mHealth design based on illness duration and intensity is proposed.
Abstract: Many current mobile health applications ("apps") and most previous research have been directed at management of chronic illnesses. However, little is known about patient preferences and design considerations for apps intended to help in a post-acute setting. Our team is developing an mHealth platform to engage patients in wound tracking to identify and manage surgical site infections (SSI) after hospital discharge. Post-discharge SSIs are a major source of morbidity and expense, and occur at a critical care transition when patients are physically and emotionally stressed. Through interviews with surgical patients who experienced SSI, we derived design considerations for such a post-acute care app. Key design qualities include: meeting basic accessibility, usability and security needs; encouraging patient-centeredness; facilitating better, more predictable communication; and supporting personalized management by providers. We illustrate our application of these guiding design considerations and propose a new framework for mHealth design based on illness duration and intensity.

Proceedings Article
14 Nov 2014
TL;DR: While EHR performance varied, common themes were decreased trust due to poor quality documentation, incomplete communication, potential for increased effectiveness through better coordination, and the emerging role of the EHR in identifying performance gaps.
Abstract: Objective: Examine how the Electronic Health Record (EHR) and its related systems support or inhibit provider collaboration. Background: Health care systems in the US are simultaneously implementing EHRs and transitioning to more collaborative delivery systems; this study examines the interaction between these two changes. Methods: This qualitative study of five US EHR implementations included 49 interviews and over 60 hours of provider observation. We examined the role of the EHR in building relationships, communicating, coordinating, and collaborative decision-making. Results: The EHR plays four roles in collaboration: a repository, a messenger, an orchestrator, and a monitor. While EHR performance varied, common themes were decreased trust due to poor-quality documentation, incomplete communication, potential for increased effectiveness through better coordination, and the emerging role of the EHR in identifying performance gaps. Conclusion: Both organizational and technical innovations are needed if the EHR is to truly support collaborative behaviors.

Proceedings Article
14 Nov 2014
TL;DR: It is demonstrated that a properly designed multi-task learning algorithm is viable for joint disease risk prediction and it can discover clinical insights that single-task models would overlook.
Abstract: Disease risk prediction has been a central topic of medical informatics. Although various risk prediction models have been studied in the literature, the vast majority were designed to be single-task, i.e., they only consider one target disease at a time. This becomes a limitation when in practice we are dealing with two or more diseases that are related to each other in terms of sharing common comorbidities, symptoms, risk factors, etc., because single-task prediction models are not equipped to identify these associations across different tasks. In this paper we address this limitation by exploring the application of a multi-task learning framework to joint disease risk prediction. Specifically, we characterize the disease relatedness by assuming that the risk predictors underlying these diseases have overlap. We develop an optimization-based formulation that can simultaneously predict the risk for all diseases and learn the shared predictors. Our model is applied to a real Electronic Health Record (EHR) database with 7,839 patients, among which 1,127 developed Congestive Heart Failure (CHF) and 477 developed Chronic Obstructive Pulmonary Disease (COPD). We demonstrate that a properly designed multi-task learning algorithm is viable for joint disease risk prediction and it can discover clinical insights that single-task models would overlook.
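As a loose illustration of the idea (not the paper's actual formulation), one can couple per-disease logistic regressions with a penalty that pulls each task's weight vector toward the tasks' mean, encouraging the diseases to share predictors:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def joint_logistic(X, ys, lam=1.0, lr=0.1, steps=500):
    """Fit one logistic-regression weight vector per task, adding a
    coupling penalty lam * ||w_t - w_mean||^2 so related tasks share
    predictors. A toy stand-in for the paper's formulation; the mean
    is treated as fixed within each step (block-coordinate style)."""
    T, (n, d) = len(ys), X.shape
    W = np.zeros((T, d))
    for _ in range(steps):
        mean_w = W.mean(axis=0)
        for t in range(T):
            p = sigmoid(X @ W[t])
            grad = X.T @ (p - ys[t]) / n + 2 * lam * (W[t] - mean_w)
            W[t] -= lr * grad
    return W
```

Setting `lam` to zero recovers independent single-task models, which is the baseline the paper argues against.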

Proceedings Article
14 Nov 2014
TL;DR: This work surveys published algorithms for SVM learning on large data sets, choosing three (PROBE, SVMperf, and Liblinear) for comparison, and finds that SGD with a fixed number of iterations performs as well as these alternatives and is much faster to compute.
Abstract: Stochastic Gradient Descent (SGD) has gained popularity for solving large-scale supervised machine learning problems. It provides a rapid method for minimizing a number of loss functions and is applicable to Support Vector Machine (SVM) and logistic regression optimizations. However, SGD does not provide a convenient stopping criterion. Generally, an optimal number of iterations over the data may be determined using held-out data. Here we compare stopping predictions based on held-out data with simply stopping at a fixed number of iterations and show that the latter works as well as the former for a number of commonly studied text classification problems. In particular, fixed stopping works well for MeSH® predictions on PubMed® records. We also surveyed the published algorithms for SVM learning on large data sets and chose three (PROBE, SVMperf, and Liblinear) to compare against SGD with a fixed number of iterations. We find that SGD with a fixed number of iterations performs as well as these alternative methods and is much faster to compute. As an application, we made SGD-SVM predictions for all MeSH terms and used the Pool Adjacent Violators (PAV) algorithm to convert these predictions to probabilities. Such probabilistic predictions lead to ranked MeSH term predictions superior to previously published results on two test sets.
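A minimal numpy sketch of both pieces: Pegasos-style SGD for a linear SVM run for a fixed number of epochs (no held-out stopping), and a simple Pool Adjacent Violators pass mapping scores to probabilities. This is illustrative code, not the authors' implementation, and hyperparameters are arbitrary:

```python
import numpy as np

def sgd_svm(X, y, epochs=5, lam=0.01, seed=0):
    """Train a linear SVM by SGD on the hinge loss for a FIXED number
    of epochs, as the paper advocates. Labels y are +1/-1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, t = np.zeros(d), 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)          # Pegasos-style step size
            margin = y[i] * (X[i] @ w)
            w *= (1 - eta * lam)           # regularization shrinkage
            if margin < 1:                 # hinge loss is active
                w += eta * y[i] * X[i]
    return w

def pav(scores, labels):
    """Pool Adjacent Violators: fit a non-decreasing map from scores
    to empirical probabilities, returned per example."""
    order = np.argsort(scores)
    y = labels[order].astype(float)
    vals, wts = list(y), [1.0] * len(y)
    i = 0
    while i < len(vals) - 1:
        if vals[i] > vals[i + 1]:          # violator: pool the pair
            tot = wts[i] + wts[i + 1]
            vals[i] = (wts[i] * vals[i] + wts[i + 1] * vals[i + 1]) / tot
            wts[i] = tot
            del vals[i + 1], wts[i + 1]
            i = max(i - 1, 0)              # re-check backwards
        else:
            i += 1
    # expand pooled blocks back to per-example probabilities
    probs = np.repeat(vals, [int(w) for w in wts])
    out = np.empty(len(scores))
    out[order] = probs
    return out
```

The probabilities produced by `pav` are monotone in the SVM score, which is what makes ranked term predictions possible.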

Proceedings Article
14 Nov 2014
TL;DR: A novel data resource for analyzing commonalities in clinical trial target populations to facilitate knowledge reuse when designing eligibility criteria of future trials or to reveal potential systematic biases in selecting population subgroups for clinical research is presented.
Abstract: ClinicalTrials.gov presents great opportunities for analyzing commonalities in clinical trial target populations to facilitate knowledge reuse when designing eligibility criteria of future trials or to reveal potential systematic biases in selecting population subgroups for clinical research. Towards this goal, this paper presents a novel data resource for enabling such analyses. Our method includes two parts: (1) parsing and indexing eligibility criteria text; and (2) mining common eligibility features and attributes of common numeric features (e.g., A1c). We designed and built a database called "Commonalities in Target Populations of Clinical Trials" (COMPACT), which stores structured eligibility criteria and trial metadata in a readily computable format. We illustrate its use in an example analytic module called CONECT using COMPACT as the backend. Type 2 diabetes is used as an example to analyze commonalities in the target populations of 4,493 clinical trials on this disease.
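A toy version of step (2), pulling numeric eligibility features such as A1c thresholds out of criteria text, can be sketched with a regex. The pattern, unit list, and example criteria strings are illustrative assumptions; COMPACT's actual parser is more sophisticated:

```python
import re

# Toy pattern for numeric eligibility criteria such as "HbA1c >= 7.0%".
CRITERION = re.compile(
    r"(?P<feature>[A-Za-z][A-Za-z0-9 ]*?)\s*"
    r"(?P<op><=|>=|<|>|=)\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>%|mg/dL|kg/m2)?"
)

def parse_criteria(text):
    """Return one dict per numeric criterion found in `text`."""
    return [
        {
            "feature": m.group("feature").strip(),
            "op": m.group("op"),
            "value": float(m.group("value")),
            "unit": m.group("unit"),
        }
        for m in CRITERION.finditer(text)
    ]
```

Aggregating such rows across trials is what enables queries like "what A1c cutoffs do type 2 diabetes trials typically use?".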

Proceedings Article
14 Nov 2014
TL;DR: Three coding schemes were developed and applied to analyze 525 tobacco use entries from the social history module of an EHR system to characterize: potential reasons for using free-text, contents within the free- Text, and data quality issues.
Abstract: Recent initiatives have emphasized the potential role of Electronic Health Record (EHR) systems for improving tobacco use assessment and cessation. In support of these efforts, the goal of the present study was to examine tobacco use documentation in the EHR with an emphasis on free-text. Three coding schemes were developed and applied to analyze 525 tobacco use entries, including structured fields and a free-text comment field, from the social history module of an EHR system to characterize: (1) potential reasons for using free-text, (2) contents within the free-text, and (3) data quality issues. Free-text was most commonly used when the structured fields could not adequately describe tobacco use amount (23.2%), frequency (26.9%), and start or quit dates (28.2%), as well as secondhand smoke exposure (17.9%), which was expressed with a variety of words and phrases. The collective results provide insights for informing system enhancements, user training, natural language processing, and standards for tobacco use documentation.
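One reason free-text amount descriptions matter for downstream NLP is that they often encode computable quantities. As a hypothetical example (the "ppd x N years" pattern below is illustrative, not drawn from the study's corpus), a pack-year value can be normalized from such a comment:

```python
import re

# Toy parser for a free-text tobacco pattern like "1 ppd x 20 years"
# (packs per day times years of use), yielding pack-years.
PACK_YEARS = re.compile(
    r"(?P<ppd>\d+(?:\.\d+)?)\s*ppd\s*x\s*(?P<years>\d+)\s*(?:years|yrs)",
    re.IGNORECASE,
)

def pack_years(text):
    """Return pack-years if the pattern is present, else None."""
    m = PACK_YEARS.search(text)
    if not m:
        return None
    return float(m.group("ppd")) * int(m.group("years"))
```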

Proceedings Article
14 Nov 2014
TL;DR: This study demonstrates the feasibility of extracting positively asserted concepts related to homelessness from the free text of medical records.
Abstract: Mining the free text of electronic medical records (EMR) using natural language processing (NLP) is an effective method of extracting information not always captured in administrative data. We sought to determine if concepts related to homelessness, a non-medical condition, were amenable to extraction from the EMR of Veterans Affairs (VA) medical records. As there were no off-the-shelf products, a lexicon of terms related to homelessness was created. A corpus of free text documents from outpatient encounters was reviewed to create the reference standard for NLP training and testing. V3NLP Framework was used to detect instances of lexical terms and was compared to the reference standard. With a positive predictive value of 77% for extracting relevant concepts, this study demonstrates the feasibility of extracting positively asserted concepts related to homelessness from the free text of medical records.
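The core idea, lexicon matching with a check that the concept is positively asserted, can be sketched as follows. The mini-lexicon and negation triggers are illustrative assumptions; the study's actual lexicon was expert-built, and V3NLP's negation handling is more robust than this NegEx-style window check:

```python
import re

# Illustrative mini-lexicon; not the study's actual term list.
LEXICON = ["homeless", "homelessness", "lives in shelter", "no fixed address"]
NEGATION_TRIGGERS = ["no evidence of", "denies", "not "]

def extract_positive_mentions(text):
    """Return lexicon terms in `text` that are positively asserted,
    i.e. not preceded (within a short window) by a negation trigger."""
    text_l = text.lower()
    hits = []
    for term in LEXICON:
        for m in re.finditer(re.escape(term), text_l):
            window = text_l[max(0, m.start() - 30):m.start()]
            if not any(trig in window for trig in NEGATION_TRIGGERS):
                hits.append(term)
    return hits
```

Substring triggers like these overfire on real notes (e.g. inside longer words), which is one reason a reference standard and formal evaluation, as in the study, are needed.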