
Showing papers presented at the International Conference on Bioinformatics in 2014


Proceedings ArticleDOI
20 Sep 2014
TL;DR: With experiments on gene annotation data from the Gene Ontology project, it is shown that deep autoencoder networks achieve better performance than other standard machine learning methods, including the popular truncated singular value decomposition.
Abstract: The annotation of genomic information is a major challenge in biology and bioinformatics. Existing databases of known gene functions are incomplete and prone to errors, and the biomolecular experiments needed to improve these databases are slow and costly. While computational methods are not a substitute for experimental verification, they can help in two ways: algorithms can aid in the curation of gene annotations by automatically suggesting inaccuracies, and they can predict previously unidentified gene functions, accelerating the rate of gene function discovery. In this work, we develop an algorithm that achieves both goals using deep autoencoder neural networks. With experiments on gene annotation data from the Gene Ontology project, we show that deep autoencoder networks achieve better performance than other standard machine learning methods, including the popular truncated singular value decomposition.
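
As a rough illustration of the approach this abstract describes, here is a minimal sketch (not the authors' implementation) of an autoencoder reconstructing a binary gene-by-function matrix; all sizes, learning rates, and the synthetic data are illustrative assumptions.

```python
# Minimal sketch: a one-hidden-layer autoencoder over a toy annotation matrix.
# A low score on an existing annotation hints at a curation error; a high
# score on a missing one suggests a candidate new function.
import numpy as np

rng = np.random.default_rng(0)
X = (rng.random((500, 200)) < 0.05).astype(float)   # toy gene x GO-term matrix
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(0, 0.1, (200, 50))                  # encoder weights
W2 = rng.normal(0, 0.1, (50, 200))                  # decoder weights
for _ in range(200):                                # plain batch gradient descent
    H = sigmoid(X @ W1)
    Xhat = sigmoid(H @ W2)
    G2 = (Xhat - X) * Xhat * (1 - Xhat)             # grad at decoder pre-activation
    G1 = (G2 @ W2.T) * H * (1 - H)
    W2 -= 0.5 * (H.T @ G2) / len(X)
    W1 -= 0.5 * (X.T @ G1) / len(X)

scores = sigmoid(sigmoid(X @ W1) @ W2)              # reconstruction scores
new_hits = (scores > 0.5) & (X == 0)                # predicted but unannotated
suspects = (scores < 0.1) & (X == 1)                # annotated but unsupported
print(new_hits.sum(), "candidate annotations;", suspects.sum(), "suspects")
```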

179 citations


Proceedings ArticleDOI
20 Sep 2014
TL;DR: This work proposes a new statistical framework that allows for the possibility of an arbitrary number of causal variants when estimating the posterior probability of a variant being causal, and validates the approach using empirical data from an eQTL study of CHI3L2 to identify new causal variants that affect gene expression at this locus.
Abstract: Although genome-wide association studies have successfully identified thousands of risk loci for complex traits, only a handful of the biologically causal variants, responsible for association at these loci, have been successfully identified. Current statistical methods for identifying causal variants at risk loci either use the strength of the association signal in an iterative conditioning framework, or estimate probabilities for variants to be causal. A main drawback of existing methods is that they rely on the simplifying assumption of a single causal variant at each risk locus, which is typically invalid at many risk loci. In this work, we propose a new statistical framework that allows for the possibility of an arbitrary number of causal variants when estimating the posterior probability of a variant being causal. A direct benefit of our approach is that we predict a set of variants for each locus that, under reasonable assumptions, will contain all of the true causal variants with a high confidence level (e.g. 95%) even when the locus contains multiple causal variants. We use simulations to show that our approach provides a 20-50% improvement in our ability to identify the causal variants compared to existing methods at loci harboring multiple causal variants. We validate our approach using empirical data from an eQTL study of CHI3L2 to identify new causal variants that affect gene expression at this locus.
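
A minimal sketch of the credible-set idea the abstract describes (not the authors' implementation): given per-variant posterior probabilities of being causal, select the smallest set whose cumulative posterior reaches the chosen confidence level.

```python
import numpy as np

def credible_set(posteriors, rho=0.95):
    order = np.argsort(posteriors)[::-1]      # most probable variants first
    cum = np.cumsum(posteriors[order])
    k = int(np.searchsorted(cum, rho)) + 1    # smallest prefix reaching rho
    return order[:k]

post = np.array([0.40, 0.25, 0.15, 0.10, 0.05, 0.05])
print(credible_set(post))                     # variants kept at 95% confidence
```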

120 citations


Journal ArticleDOI
11 Sep 2014
TL;DR: In this article, the authors present the evolution of project management from ancient times to present times and its outlook in the future by outlining major events and developments throughout history, and present several examples of colossal projects successfully completed.
Abstract: Long before there was an institute for project management, or updated knowledge books and guides on how to manage projects, or even before the existence of Gantt charts, history offers several examples of colossal projects successfully completed. The Pyramids of Giza, Great Wall of China, and Coliseum are all good examples of such projects. Project Management, at its core, is concerned with creating an environment where people can work together to achieve a mutual objective, in order to deliver successful projects on time and on budget. Throughout the history of humanity, humans have been working on improving and refining the practices of project management. The goal of this paper is to present the evolution of project management since ancient times until present times and its outlook in the future by outlining major events and developments throughout history.

97 citations


Proceedings ArticleDOI
20 Sep 2014
TL;DR: This work investigates the utility of applying pseudo-relevance feedback (PRF), a query expansion method that performs well in keyword-based medical literature search, to CDS search, and obtains a statistically significant improvement in retrieval effectiveness, in terms of nDCG, over the baseline.
Abstract: Recent interest in search tools for Clinical Decision Support (CDS) has dramatically increased. These tools help clinicians assess a medical situation by providing actionable information in the form of a select few highly relevant recent medical papers. Unlike traditional search, which is designed to deal with short queries, queries in CDS are long and narrative. We investigate the utility of applying pseudo-relevance feedback (PRF), a query expansion method that performs well in keyword-based medical literature search, to CDS search. Using the optimal combination of PRF parameters, we obtain a statistically significant improvement in retrieval effectiveness, in terms of nDCG, over the baseline.
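
A minimal sketch of pseudo-relevance feedback in general (a Rocchio-style expansion, not the paper's exact parameterization), assuming scikit-learn is available; the toy corpus and the 0.5 feedback weight are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["asthma treatment with inhaled corticosteroids",
        "corticosteroids reduce airway inflammation in asthma",
        "knee arthroscopy recovery time"]
query = "middle aged patient with wheezing and chronic airway inflammation"

vec = TfidfVectorizer().fit(docs + [query])
D = vec.transform(docs)
q = vec.transform([query]).toarray()

top_k = np.argsort(-cosine_similarity(q, D)[0])[:2]    # initial retrieval
centroid = np.asarray(D[top_k].mean(axis=0))           # top-doc term centroid
expanded = q + 0.5 * centroid                          # Rocchio-style expansion
print(np.argsort(-cosine_similarity(expanded, D)[0]))  # re-ranked ordering
```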

73 citations


Proceedings ArticleDOI
20 Sep 2014
TL;DR: Using different sets of peptides from various allergen and bacterial antigens and HLA class II binding prediction tools from the IEDB, a strategy to predict the top epitopes from any antigen is designed.
Abstract: Computational prediction of HLA class II restricted T cell epitopes has great significance in many immunological studies including vaccine discovery. With the development of novel bioinformatics approaches, prediction of HLA class II binding has improved significantly, but a strategy to predict the most dominant HLA class II epitopes has not been defined. Using different sets of peptides from various allergen and bacterial antigens and HLA class II binding prediction tools from the IEDB, we have designed a strategy to predict the top epitopes from any antigen. We found that the top 21% of 15-mer peptides overlapping by 10 residues (based on the predicted binding to seven DRB1 and DRB3/4/5 alleles) capture 50% of the immune response. This corresponded to an IEDB consensus percentile rank of 19.82, which could be used as a universal prediction threshold.
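
A minimal sketch of the peptide-selection scheme described above: split an antigen into 15-mers overlapping by 10 residues, then keep peptides whose consensus percentile rank clears the reported 19.82 cutoff. The scoring stub here is hypothetical; the paper used the IEDB HLA class II binding tools.

```python
def fifteen_mers(seq, step=5):                 # 15-mers overlapping by 10
    return [seq[i:i + 15] for i in range(0, len(seq) - 14, step)]

def predict_rank(peptide):                     # hypothetical scorer stub,
    return sum(map(ord, peptide)) % 100        # stands in for IEDB output

antigen = "MKTIIALSYIFCLVFADYKDDDDKGSGSGSENLYFQGHMASMTGGQQMGRGS"
kept = [p for p in fifteen_mers(antigen) if predict_rank(p) <= 19.82]
print(len(fifteen_mers(antigen)), "peptides;", len(kept), "kept")
```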

59 citations


Proceedings ArticleDOI
20 Sep 2014
TL;DR: The feasibility of measuring stress in the minutes preceding events of interest is shown, sensor-derived stress is observed to rise prior to self-reported stress and smoking events, and a framework to analyze sensor data yield is proposed, finding that losses in the wireless channel are negligible.
Abstract: Stress can lead to headaches and fatigue, precipitate addictive behaviors (e.g., smoking, alcohol and drug use), and lead to cardiovascular diseases and cancer. Continuous assessment of stress from sensors can be used for timely delivery of a variety of interventions to reduce or avoid stress. We investigate the feasibility of continuous stress measurement via two field studies using wireless physiological sensors --- a four-week study with illicit drug users (n = 40), and a one-week study with daily smokers and social drinkers (n = 30). We find that 11+ hours/day of usable data can be obtained in a 4-week study. A significant learning effect is observed after the first week, and data yield is seen to increase over time even in the fourth week. We propose a framework to analyze sensor data yield and find that losses in the wireless channel are negligible; the main hurdle in further improving data yield is the attachment constraint. We show the feasibility of measuring stress in the minutes preceding events of interest and observe sensor-derived stress to be rising prior to self-reported stress and smoking events.

55 citations


Journal ArticleDOI
30 Jan 2014
TL;DR: A Slang Sentimental Words and Idioms Lexicon (SSWIL) of opinion words is built, and a Gaussian kernel SVM classifier for Arabic slang language is proposed to classify Arabic news comments on Facebook.
Abstract: Social networks have become one of our daily life activities, not only for socializing but also for e-commerce, e-learning, and politics. However, they have the greatest effect on the youth generation all over the world, specifically in the Middle East. Arabic slang is used on social networks far more widely than classical Arabic, since most users of social networks are young to middle-aged. However, Arabic slang poses challenges due to its new expressive (opinion) words and idioms as well as its unstructured format. Mining Arabic slang requires efficient techniques to extract youth opinions on various issues, such as news websites. In this paper, we construct a Slang Sentimental Words and Idioms Lexicon (SSWIL) of opinion words. In addition, we propose a Gaussian kernel SVM classifier for Arabic slang to classify Arabic news comments on Facebook. To test the performance of the proposed classifier, several Facebook news comments are used; an accuracy of 86.86% is obtained, with a precision of 88.63% and a recall of 78%.
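
A minimal sketch of the classifier component only (not the SSWIL pipeline, and with English stand-ins for Arabic slang text, which would be vectorized the same way): an SVM with a Gaussian (RBF) kernel over bag-of-words features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

comments = ["great decision, well done", "what a disaster, total failure",
            "this news made my day", "worst announcement ever"]
labels = [1, 0, 1, 0]                          # 1 = positive opinion (toy)

clf = make_pipeline(CountVectorizer(), SVC(kernel="rbf", gamma="scale"))
clf.fit(comments, labels)
print(clf.predict(["well done, great news"]))  # expected: [1]
```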

42 citations


Proceedings ArticleDOI
20 Sep 2014
TL;DR: It is demonstrated that the GeneMarkS-T self-training is robust with respect to the presence of errors in assembled transcripts, and that the accuracy of GeneMarkS-T in identification of protein-coding regions and, particularly, in prediction of gene starts compares favorably to other existing methods.
Abstract: Massive parallel sequencing of RNA transcripts by the next generation technology (RNA-Seq) is a powerful method of generating critically important data for discovery of the structure and function of eukaryotic genes. The transcripts may or may not carry protein-coding regions. If a protein-coding region is present, it should be a continuous (spliced) open reading frame. Gene finding in transcripts can be done by statistical (alignment-free) as well as by alignment-based methods. We describe a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions, complete or incomplete, in RNA transcripts assembled from RNA-Seq reads. An important feature of GeneMarkS-T is unsupervised estimation of the parameters of the algorithm, which makes several conventional steps in gene prediction protocols unnecessary, most importantly the manually curated preparation of training sets. We demonstrate that (i) the GeneMarkS-T self-training is robust with respect to the presence of errors in assembled transcripts and (ii) the accuracy of GeneMarkS-T in identification of protein-coding regions and, particularly, in prediction of gene starts compares favorably to other existing methods.
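
GeneMarkS-T itself is an unsupervised statistical gene finder; as a much simpler illustration of the underlying subtask, this sketch scans a transcript for its longest continuous open reading frame.

```python
import re

def longest_orf(transcript):
    # ATG followed by whole codons, ending at the first in-frame stop codon
    orfs = re.findall(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)", transcript)
    return max(orfs, key=len, default="")

print(longest_orf("CCATGGCTGCTTAAGTATGAAACCCGGGTTTTAGAC"))
```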

39 citations


Journal ArticleDOI
01 Jul 2014
TL;DR: In this paper, the authors identify gaps that exist in organizational practices and show how positive organizational behavior can be integrated to build sustainable organizations.
Abstract: Positive psychological principles and subsequently positive organizational behavior (POB) have become increasingly prevalent in the workplace in recent years. We have witnessed many struggles in the global economy where organizations across the world have experienced layoffs, lower productivity, lower employee morale, and a general struggle to remain competitive. Given these negative environments, what can organizations do across the world to enhance the positive practices that will create benefits for all of the stakeholders through POB? We also identify gaps that exist in organizational practices and how positive organizational behavior can be integrated to build sustainable organizations.

33 citations


Journal ArticleDOI
08 Dec 2014
TL;DR: A computational method to identify informative substrate motifs for O-GlcNAcylation sites with the consideration of substrate site specificity is proposed and may help unravel their mechanisms and roles in signaling, transcription, chronic disease, and cancer.
Abstract: Protein O-GlcNAcylation involves the attachment of a single N-acetylglucosamine (GlcNAc) to the hydroxyl group of serine or threonine residues. Elucidation of O-GlcNAcylation sites on proteins is required in order to decipher its crucial roles in regulating cellular processes and to aid in drug design. With an increasing number of O-GlcNAcylation sites identified by mass spectrometry (MS)-based proteomics, several methods have been proposed for the computational identification of O-GlcNAcylation sites. However, no previous work has focused on the investigation of O-GlcNAcylated substrate motifs. Thus, we were motivated to design a new method for the identification of protein O-GlcNAcylation sites with consideration of substrate site specificity. In this study, 375 experimentally verified O-GlcNAcylation sites were collected from dbOGAP, an integrated resource for protein O-GlcNAcylation. Due to the difficulty of characterizing the substrate motifs by conventional sequence logo analysis, a recursive statistical method was applied to obtain significant conserved motifs. To construct the predictive models learned from the identified substrate motifs, we adopted Support Vector Machines (SVMs). Five-fold cross validation was used to evaluate the predictive model, achieving sensitivity, specificity, and accuracy of 0.76, 0.80, and 0.78, respectively. Additionally, an independent testing set, fully blind to the training data of the predictive model, was used to demonstrate that the proposed method provides promising accuracy (0.94) and outperforms three other O-GlcNAcylation site prediction tools. This work proposed a computational method to identify informative substrate motifs for O-GlcNAcylation sites. The evaluation by cross validation and independent testing indicated that the identified motifs were effective in the identification of O-GlcNAcylation sites. A case study demonstrated that the proposed method could be a feasible means of conducting preliminary analyses of protein O-GlcNAcylation. We also anticipate that the revealed substrate motifs may facilitate the study of extensive crosstalk between O-GlcNAcylation and phosphorylation. This method may help unravel their mechanisms and roles in signaling, transcription, chronic disease, and cancer.
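
A minimal sketch of the evaluation setup described above (synthetic data, not the authors' motif features or the dbOGAP sites): one-hot encoded windows around candidate serine/threonine sites, scored with an SVM under five-fold cross-validation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

AA = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(1)
windows = ["".join(rng.choice(list(AA), 9)) for _ in range(200)]
y = rng.integers(0, 2, 200)                    # 1 = O-GlcNAcylated (toy labels)

def one_hot(w):
    v = np.zeros((len(w), len(AA)))
    v[np.arange(len(w)), [AA.index(a) for a in w]] = 1
    return v.ravel()

X = np.array([one_hot(w) for w in windows])
print(cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean())
```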

31 citations


Proceedings Article
01 Jan 2014
TL;DR: In this article, the authors present an introspection into the intensity of crime committed against women in India over a period of five years, highlighting head-wise incidents of crime against women and comparing incidents reported under the IPC and SLL from 2008 to 2012.
Abstract: The growth and development of a nation depends on the socio-economic status of its community. In our country, women constitute around 49 percent of the total population of approximately nine hundred million people. But studies reveal that for centuries women have been victims of exploitation by a male-dominated society. Women in our country have long been victims of ill-treatment, humiliation, torture, exploitation, rape, murder, etc. According to the Constitution of India, women are legal citizens of the country, and providing social justice is the keystone of the constitution; it grants women equal rights with men. But the actual situation is far from this. In modern society, crime against women is the most pervasive abuse in the country. Crime against women is not new; rather, it is a common evil in Indian society. It represents a form of assertion of dominance and use of the greater physical strength of men over women. Women face terrifying problems both within and outside the family structure. The National Crime Records Bureau (NCRB) reported over 32000 murders, 36500 molestation cases, 19000 rapes, and 7500 dowry deaths among the violent crimes committed against women in India in 2006. Thus, the current conditions have worsened women's lives. The present paper is an introspection into the intensity of crime committed against women in India over a period of five years. The paper also highlights head-wise incidents of crime against women and compares reported incidents of crime (both under IPC and SLL) for the period from 2008 to 2012. The percentage of crime against women is also compared with total IPC crimes so as to depict the actual position of crime against women in India. The study is based upon secondary data collected from the reports “Crime in India”, published annually by the National Crime Records Bureau. The data were analysed using percentage analysis, and interpretations were made accordingly. The data on cases registered under different heads revealed that all crime heads showed a rising trend except incidents reported under the Sati Prevention Act, 1987 and Importation of Girls (section 366-B IPC). Further, cases registered under Kidnapping and Abduction (sections 363-373 IPC), Torture (section 498-A IPC), and Molestation (section 354 IPC) showed a sharp increase over the period of five years. Thus, in order to protect women from this evil, the code of laws related to crimes against women should be amended. Women should be made aware of the legislation through awareness programmes, because the law alone cannot curb this menace of “Crime Against Women.”

Journal ArticleDOI
11 Sep 2014
TL;DR: In this paper, the authors show how 3D printing has evolved, why businesses are realizing the strategic potential of 3D printing to create a competitive advantage using a consumer technology business model, and why this could raise legal and ethical issues associated with existing laws related to the use of 3D technology.
Abstract: Although the technology for 3D printing has been around for more than three decades, its full potential is just beginning to be realized in the business world. Ideas for 3D printing run the gamut from the hobbyist printing jewelry and toys to the medical industry researching 3D printing of human organs. One way businesses are utilizing 3D printing is through support services within their own business processes, referred to in this paper as a consumer technology business model. As with any emerging use of a technology, legal and ethical issues will arise. This paper shows how 3D printing has evolved, why businesses are realizing the strategic potential for 3D printing to create a competitive advantage using a consumer technology business model and why this could raise legal and ethical issues associated with existing laws related to the use of 3D technology.

Proceedings ArticleDOI
20 Sep 2014
TL;DR: A taxonomy of intents is derived to capture user information needs in online health forums, and novel pattern-based features are proposed for use with a multiclass support vector machine (SVM) classifier to classify original thread posts according to their underlying intents.
Abstract: Online health forums provide a convenient way for patients to obtain medical information and connect with physicians and peers outside of clinical settings. However, large quantities of unstructured and diversified content generated on these forums make it difficult for users to digest and extract useful information. Understanding user intents would enable forums to more accurately and efficiently find relevant information by filtering out threads that do not match particular intents. In this paper, we derive a taxonomy of intents to capture user information needs in online health forums, and propose novel pattern-based features for use with a multiclass support vector machine (SVM) classifier to classify original thread posts according to their underlying intents. Since no dataset existed for this task, we employ three annotators to manually label a dataset of 1,200 HealthBoards posts spanning four forum topics. Experimental results show that SVM with pattern-based features is highly capable of identifying user intents in forum posts, reaching a maximum precision of 75%. Furthermore, comparable classification performance can be achieved by training and testing on posts from different forum topics (e.g. training on allergy posts, testing on depression posts). Finally, we run a trained classifier on a MedHelp dataset to analyze the distribution of intents of posts from different forum topics.
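
A minimal sketch of the classification step, with plain TF-IDF standing in for the paper's novel pattern-based features (purely illustrative; posts and intent labels are invented): a multiclass linear SVM mapping posts to intents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

posts = ["what dosage of this drug is safe",
         "has anyone else felt dizzy on this medication",
         "looking for a good allergist in my area",
         "sharing my recovery story to encourage others"]
intents = ["seek_info", "seek_peers", "seek_provider", "share_experience"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(posts, intents)
print(clf.predict(["any recommended cardiologist nearby"]))
```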

Journal ArticleDOI
28 Jan 2014
TL;DR: An overview of the portfolio’s role in initial teacher training programs is provided, followed by the advantages of the eportfolio over the paper portfolio, and a working conceptual framework is proposed for eportfolio use to support professional development in the Web 2.0 age.
Abstract: The portfolio is rapidly gaining attention in initial teacher training programs. It serves multiple uses and ends in the professional development and reflective practice of preservice teachers, and the technical advances of Web 2.0 will only increase the potential for learning opportunities. From now on, portfolio content that was formerly private territory can be generously shared. Against this background, this article provides an overview of the portfolio’s role in initial teacher training programs. The four main functions of the portfolio are addressed, followed by the advantages of the eportfolio over the paper portfolio. A working conceptual framework is then proposed for eportfolio use to support professional development in the Web 2.0 age. To provide a practical application for initial teacher training, we conclude with a presentation of Eduportfolio, an eportfolio that effectively taps the potential of Web 2.0.

Proceedings ArticleDOI
20 Sep 2014
TL;DR: A novel unsupervised approach to tap into the increasingly available health forums to mine the side effect symptoms of drugs mentioned by forum users is proposed, based on a novel probabilistic mixture model of symptoms.
Abstract: Automatic discovery of medical knowledge using data mining has great potential benefit in improving population health and reducing healthcare cost. Discovering adverse drug reaction (ADR) is especially important because of the significant morbidity of ADRs to patients. Recently, more and more patients describe the ADRs they experienced and seek for help through online health forums, creating great opportunities for these forums to discover previously unknown ADRs. In this paper, we propose a novel unsupervised approach to tap into the increasingly available health forums to mine the side effect symptoms of drugs mentioned by forum users. Our approach is based on a novel probabilistic mixture model of symptoms, where the side effect symptoms and disease symptoms are explicitly modeled with two separate component models, and discovery of side effect symptoms can be achieved in an unsupervised way through fitting the mixture model to the forum data. Extensive experiments on online health forums demonstrate that our proposed model is effective for discovering the reported ADRs on forums in a completely unsupervised way. The mined knowledge using our model is directly useful for increasing our understanding of more challenging ADRs, such as long-term side effects, drug-drug interactions, and rare side effects. Since our approach is unsupervised, it can be applied to mining large amounts of growing forum data to discover new knowledge about ADRs, helping many patients become aware of possible ADRs.
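
A minimal sketch of the core modeling idea (a bare two-component multinomial mixture fit by EM on synthetic counts, not the authors' full model): one component can absorb disease symptoms while the other captures side-effect symptoms.

```python
import numpy as np

rng = np.random.default_rng(2)
V, N = 20, 300                                  # symptom vocabulary, posts
X = rng.poisson(1.0, (N, V))                    # toy symptom-count matrix

theta = rng.dirichlet(np.ones(V), 2)            # per-component symptom dists
pi = np.array([0.5, 0.5])
for _ in range(50):                             # EM iterations
    logp = X @ np.log(theta.T) + np.log(pi)     # E-step: component posteriors
    logp -= logp.max(axis=1, keepdims=True)
    r = np.exp(logp); r /= r.sum(axis=1, keepdims=True)
    pi = r.mean(axis=0)                         # M-step: weights and dists
    theta = (r.T @ X) + 1e-6
    theta /= theta.sum(axis=1, keepdims=True)

print("component weights:", pi.round(3))
print("top symptoms of component 0:", np.argsort(-theta[0])[:5])
```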

Proceedings ArticleDOI
20 Sep 2014
TL;DR: This work proposes a hybrid approach by integrating a machine learning model with a pattern identification strategy to identify the antecedent and conjuncts regions of a concept mention, and then reassemble the composite mention using those identified regions.
Abstract: Many text-mining studies have focused on the issue of named entity recognition and normalization, especially in the field of biomedical natural language processing. However, entity recognition is a complicated and difficult task in biomedical text. One particular challenge is to identify and resolve composite named entities, where a single span refers to more than one concept (e.g., BRCA1/2). Most bioconcept recognition and normalization studies have either ignored this issue, used simple ad-hoc rules, or only handled coordination ellipsis, which is only one of the many types of composite mentions studied in this work. No systematic methods for simplifying composite mentions have been previously reported, making a robust approach greatly needed. To this end, we propose a hybrid approach by integrating a machine learning model with a pattern identification strategy to identify the antecedent and conjunct regions of a concept mention, and then reassemble the composite mention using those identified regions. Our method, which we have named SimConcept, is the first method to systematically handle most types of composite mentions. Our method achieves high performance in identifying and resolving composite mentions for three fundamental biological entities: genes (89.29% in F-measure), diseases (85.52% in F-measure) and chemicals (84.04% in F-measure). Furthermore, our results show that using SimConcept can subsequently help improve the performance of gene and disease concept recognition and normalization.
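
A minimal sketch of resolving one common composite-mention pattern (SimConcept handles many more types via its hybrid machine learning and pattern approach): expanding a mention such as "BRCA1/2" into its individual concepts.

```python
import re

def expand_composite(mention):
    # stem + first number + one or more "/number" conjuncts, e.g. BRCA1/2
    m = re.fullmatch(r"([A-Za-z]+)(\d+)((?:/\d+)+)", mention)
    if not m:
        return [mention]
    stem, first, rest = m.groups()
    return [stem + first] + [stem + n for n in rest.strip("/").split("/")]

print(expand_composite("BRCA1/2"))     # ['BRCA1', 'BRCA2']
print(expand_composite("HER2"))        # ['HER2'] (not composite)
```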

Proceedings Article
01 Jan 2014
TL;DR: The study presents a bibliometric analysis of articles published in the New England Journal of Medicine during the years 2006-2010, highlighting the distribution of articles, affiliation-wise distribution of articles, year-wise authorship pattern, and impact factor.
Abstract: The study presents a bibliometric analysis of articles published in the New England Journal of Medicine. A total of 2740 articles were published during the years 2006-2010. The study attempts to highlight the distribution of articles, affiliation-wise distribution of articles, year-wise authorship pattern, year-wise impact factor of articles, year-wise distribution of article length, country-wise distribution of articles, subject-wise distribution of articles, year-wise distribution of references, and contribution according to thrust areas.

Journal ArticleDOI
11 Oct 2014
TL;DR: This approach uses different product-based priority queues for different services in cloud computing, to guarantee that all nodes have an equal load, i.e. no single node is overloaded.
Abstract: Cloud computing is the dynamic delivery of information technology resources and capabilities as a service over the Internet. Cloud computing is a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet. It generally incorporates infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). [1] Load balancing is one of the biggest challenges in cloud computing. The concept of load balancing is to equally distribute the workload and resources across all the nodes, to guarantee that all nodes have an equal load, i.e. no single node is overloaded. Since cloud computing services are mainly product based, in this approach we use different product-based priority queues for different services.
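
A minimal sketch of the idea as described (assumptions, since the abstract gives no details: one priority queue per service type, priority given by current node load, least-loaded node served first).

```python
import heapq

queues = {s: [] for s in ("IaaS", "PaaS", "SaaS")}
for service, node, load in [("IaaS", "n1", 3), ("IaaS", "n2", 1),
                            ("SaaS", "n3", 2), ("PaaS", "n4", 5)]:
    heapq.heappush(queues[service], (load, node))  # product-based queue

def assign(service):
    load, node = heapq.heappop(queues[service])    # least-loaded node wins
    heapq.heappush(queues[service], (load + 1, node))
    return node

print([assign("IaaS") for _ in range(3)])  # n2 fills up, then n1 shares load
```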

Proceedings ArticleDOI
20 Sep 2014
TL;DR: This work proposes a new method for constructing the Multi-String Burrows-Wheeler Transform for a collection of strings based on previous work for merging two or more MSBWTs, and evaluates the speed of the algorithm on multiple datasets that vary in both quantity of strings and string length.
Abstract: The throughput of biological sequencing technologies has led to the necessity for compressed and accessible sequencing formats. Recently, the Multi-String Burrows-Wheeler Transform (MSBWT) has risen in prevalence as a method for transforming sequence data to improve compression while providing access to the reads through an auxiliary FM-index. While there are many algorithms for building the MSBWT for a collection of strings, they do not scale well as the length of those strings increases. We propose a new method for constructing the MSBWT for a collection of strings based on previous work for merging two or more MSBWTs. It requires O(N * LCPavg * log(m)) time and O(N) bits of memory for a collection of m strings composed of N symbols where the average longest common prefix of all suffixes is LCPavg. We evaluate the speed of the algorithm on multiple datasets that vary in both quantity of strings and string length. Availability: https://code.google.com/p/msbwt/source/browse/MUSCython/MultimergeCython.pyx
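
For orientation, here is a naive construction of a multi-string BWT (quadratic rotation sorting, illustrative only; the paper's contribution is a scalable merge-based algorithm, not this):

```python
def msbwt(strings):
    # terminate each string with a distinct low-sorting symbol (digits here,
    # safe for DNA alphabets), concatenate, sort rotations, take last column
    text = "".join(s + str(i) for i, s in enumerate(strings))
    n = len(text)
    order = sorted(range(n), key=lambda i: text[i:] + text[:i])
    return "".join(text[(i - 1) % n] for i in order)

print(msbwt(["ACGT", "ACGG"]))
```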

Proceedings ArticleDOI
20 Sep 2014
TL;DR: It is shown that significant improvement in predictive performance can be achieved by properly exploiting ICD-9 hierarchy, and a novel feature engineering approach is proposed and evaluated to leverage hierarchy, while simultaneously reducing feature dimensionality.
Abstract: ICD-9 codes are among the most important patient information recorded in electronic health records. They have been shown to be useful for predictive modeling of different adverse outcomes in patients, including diabetes and heart failure. An important characteristic of ICD-9 codes is the hierarchical relationships among different codes. Nevertheless, the most common feature representation used to incorporate ICD-9 codes in predictive models disregards the structural relationships. In this paper, we explore different methods to leverage the hierarchical structure in ICD-9 codes with the goal of improving performance of predictive models. We compare methods that leverage hierarchy by 1) incorporating the information during feature construction, 2) using a learning algorithm that addresses the structure in the ICD-9 codes when building a model, or 3) doing both. We propose and evaluate a novel feature engineering approach to leverage hierarchy, while simultaneously reducing feature dimensionality. Our experiments indicate that significant improvement in predictive performance can be achieved by properly exploiting ICD-9 hierarchy. Using two clinical tasks: predicting chronic kidney disease progression (Task-CKD), and predicting incident heart failure (Task-HF), we show that methods that use hierarchy outperform the conventional approach in F-score (0.44 vs 0.36 for Task-HF and 0.40 vs 0.37 for Task-CKD) and relative risk (4.6 vs 3.3 for Task-HF and 5.9 vs 3.8 for Task-CKD).
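
A minimal sketch of one hierarchy-aware feature construction (illustrative, not necessarily the paper's exact scheme): alongside each raw ICD-9 code, also fire its 3-digit ancestor, so related codes share features.

```python
def hierarchical_features(codes):
    feats = set()
    for code in codes:
        feats.add(code)                  # leaf code, e.g. 428.22
        feats.add(code.split(".")[0])    # 3-digit category, e.g. 428
    return sorted(feats)

print(hierarchical_features(["428.22", "428.30", "585.9"]))
# ['428', '428.22', '428.30', '585', '585.9'] -- the 428.* codes now share a feature
```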

Proceedings ArticleDOI
20 Sep 2014
TL;DR: The proposed algorithm takes the first steps toward answering the question of how sequence mutations affect function in proteins at the center of proteinopathies by providing the energy landscape as the intermediate explanatory link between protein sequence and function.
Abstract: The emerging picture of proteins as dynamic systems switching between structures to modulate function demands a comprehensive structural characterization only possible through an energy landscape treatment. Only sample-based representations of a protein energy landscape are viable in silico, and sampling-based exploration algorithms have to address the fundamental but challenging issue of balancing between exploration (broad view) and exploitation (going deep). We propose here a novel algorithm that achieves this balance by combining concepts from evolutionary computation and protein modeling research. The algorithm draws samples from a reduced space obtained via principal component analysis of known experimental structures. Samples are lifted from the reduced to an all-atom structure space where they are then mapped to nearby local minima in the all-atom energy landscape. From an algorithmic point of view, this paper makes several contributions, including the design of a local selection operator that is crucial to avoiding premature convergence. From an application point of view, this paper demonstrates the utility of the proposed evolutionary algorithm to advance understanding of multi-basin proteins. In particular, the proposed algorithm takes the first steps toward answering the question of how sequence mutations affect function in proteins at the center of proteinopathies by providing the energy landscape as the intermediate explanatory link between protein sequence and function.
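
A minimal sketch of the sample-and-lift step described above (toy data, not the authors' algorithm): draw a sample in a PCA-reduced space learned from known conformations, then lift it back to full coordinates for local minimization.

```python
import numpy as np

rng = np.random.default_rng(3)
structures = rng.normal(size=(30, 300))       # toy: 30 conformations, 100 atoms x 3

mean = structures.mean(axis=0)
U, S, Vt = np.linalg.svd(structures - mean, full_matrices=False)
components = Vt[:5]                           # top principal directions

z = rng.normal(scale=S[:5] / np.sqrt(len(structures)), size=5)  # reduced sample
lifted = mean + z @ components                # back to all-atom coordinates
print(lifted.shape)                           # (300,) -- ready for minimization
```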

Proceedings ArticleDOI
20 Sep 2014
TL;DR: Six methods developed for constrained treadmill walking are evaluated on obtaining gait speed during natural walking with older chronic pulmonary patients, a middleware is developed to provide readings comparable to medical accelerometers using only smartphones, and new models are trained to predict speed and distance.
Abstract: Mobile devices present the opportunity to continuously collect health data including movement and walking speed. Fitness trackers have become popular which record steps taken, distance walked and caloric expenditure. While useful for fitness purposes, medical monitoring requires precise accuracy and testing on real patients with a medically valid measure. Walking speed is closely linked to morbidity in patients and is also useful to determine distance walked during six-minute walk tests (6MWT), a standard assessment for both chronic obstructive pulmonary disease and congestive heart failure. Current generation smartphone hardware contains similar sensor chips as medical devices and popular fitness devices. We develop a middleware, MoveSense, to provide readings comparable to medical accelerometers using only smartphones. We evaluate six methods developed for constrained treadmill walking on obtaining gait speed during natural walking with older chronic pulmonary patients, and train new models to predict speed and distance. Natural walking is walking without the artificial speed constraints introduced during treadmill and nurse-assisted walking. We also compare our model's accuracy to popular fitness devices. Our models produce accurate 6MWT distances and more accurate distance estimation than dedicated fitness devices during unconstrained walking on patients, using a universally trained support vector machine model.
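
A minimal sketch of the speed-estimation step (illustrative window features and synthetic data, not the MoveSense pipeline): summarize accelerometer windows and regress walking speed with a support vector machine.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
def features(window):                          # simple per-window summaries
    mag = np.linalg.norm(window, axis=1)       # acceleration magnitude
    return [mag.mean(), mag.std(), np.abs(np.diff(mag)).mean()]

windows = rng.normal(size=(100, 128, 3))       # toy 3-axis sensor windows
speeds = rng.uniform(0.4, 1.4, 100)            # toy ground-truth speeds, m/s
X = np.array([features(w) for w in windows])

model = SVR().fit(X[:80], speeds[:80])
print(model.predict(X[80:85]).round(2))        # estimated gait speeds, m/s
```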

Proceedings ArticleDOI
20 Sep 2014
TL;DR: This paper develops and evaluates a plethora of specialized search methods that treat an entire unresolved forum post as a query, and retrieve forum threads discussing similar problems to help resolve it, and shows that these methods outperform state of the art retrieval methods for the task.
Abstract: Web communities such as healthcare web forums serve as popular platforms for users to get their complex medical queries resolved. A typical forum thread contains a query in its first post, and a discussion around it in subsequent posts. However, many users do not receive satisfactory responses from other members in the community, leaving them dissatisfied. We propose to help these users by exploiting an existing collection of discussion threads. Often many users suffer from the same medical condition and start multiple discussion threads on very similar queries. In this paper we develop and evaluate a plethora of specialized search methods that treat an entire unresolved forum post as a query, and retrieve forum threads discussing similar problems to help resolve it. The task is more challenging than a traditional document retrieval problem, since forum posts can contain a lot of irrelevant background information. The discussion threads to be retrieved are also quite different from traditional unstructured text documents. We evaluate our results on a dataset comprising over 350K discussion threads and show that our proposed methods outperform state-of-the-art retrieval methods for the task. In particular, methods based on non-uniform weighting of thread posts and semantic analysis of the query text perform quite well.
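
A minimal sketch of non-uniform post weighting (an illustrative weighting, not the paper's exact method): score candidate threads by a weighted average of similarities to their posts, trusting the opening post more than the replies.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

threads = [["persistent cough and fever for two weeks", "could be bronchitis",
            "see a doctor if it lasts"],
           ["knee pain after running", "try rest and ice"]]
query = "I have had a bad cough and mild fever since last month"

vec = TfidfVectorizer().fit([p for t in threads for p in t] + [query])
q = vec.transform([query])

def score(thread):
    sims = cosine_similarity(q, vec.transform(thread))[0]
    weights = np.array([2.0] + [1.0] * (len(thread) - 1))  # boost first post
    return (weights * sims).sum() / weights.sum()

print(max(range(len(threads)), key=lambda i: score(threads[i])))  # -> 0
```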

Proceedings ArticleDOI
Anna Ritz, T. M. Murali
20 Sep 2014
TL;DR: It is demonstrated that the shortest hyperpaths computed in signaling hypergraphs are far more informative than shortest paths found in corresponding graph representations.
Abstract: Signaling pathways play an important role in the cell's response to its environment. Signaling pathways are often represented as directed graphs, which are not adequate for modeling reactions such as complex assembly and dissociation, combinatorial regulation, and protein activation/inactivation. More accurate representations such as directed hypergraphs remain underutilized. In this paper, we present an extension of a directed hypergraph that we call a signaling hypergraph. We formulate a problem that asks what proteins and interactions must be involved in order to stimulate a specific response downstream of a signaling pathway. We relate this problem to computing the shortest acyclic B-hyperpath in a signaling hypergraph --- an NP-hard problem --- and present a mixed integer linear program to solve it. We demonstrate that the shortest hyperpaths computed in signaling hypergraphs are far more informative than shortest paths found in corresponding graph representations. Our results illustrate the potential of signaling hypergraphs as an improved representation of signaling pathways and motivate the development of novel hypergraph algorithms.
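
A minimal sketch of B-hyperpath relaxation (a Dijkstra-like fixed point under one common cost model, not the paper's mixed integer linear program): a hyperedge can fire only once all of its tail nodes have been reached.

```python
import heapq

def shortest_bhyperpath(hyperedges, source, target):
    # hyperedges: list of (tail_set, head, weight)
    dist = {source: 0.0}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, v = heapq.heappop(heap)
        if v in done:
            continue
        done.add(v)
        for tail, head, w in hyperedges:
            if tail <= done:                           # all tail nodes settled
                cand = max(dist[t] for t in tail) + w  # max-tail cost model
                if cand < dist.get(head, float("inf")):
                    dist[head] = cand
                    heapq.heappush(heap, (cand, head))
    return dist.get(target)

edges = [({"A"}, "B", 1), ({"A"}, "C", 2), ({"B", "C"}, "D", 1)]
print(shortest_bhyperpath(edges, "A", "D"))  # 3.0: D needs both B and C
```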


Proceedings ArticleDOI
20 Sep 2014
TL;DR: This paper compares the accuracy and performance characteristics of Strand against RDP using 16S rRNA sequence data from the RDP training dataset and the Greengenes sequence repository.
Abstract: The Super Threaded Reference-Free Alignment-Free N-sequence Decoder (Strand) is a highly parallel technique for the learning and classification of gene sequence data into any number of associated categories or gene sequence taxonomies. Current methods, including the state-of-the-art sequence classification method RDP, balance performance by using a shorter word length. Strand, in contrast, uses a much longer word length, and does so efficiently by implementing a divide-and-conquer algorithm leveraging MapReduce style processing and locality sensitive hashing. Strand is able to learn gene sequence taxonomies and classify new sequences approximately 20 times faster than the RDP classifier while still achieving comparable accuracy results. This paper compares the accuracy and performance characteristics of Strand against RDP using 16S rRNA sequence data from the RDP training dataset and the Greengenes sequence repository.
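
A minimal sketch of word-based sequence classification (Jaccard overlap of long k-mer sets; Strand adds MapReduce-style parallelism and locality sensitive hashing on top of this basic idea):

```python
def kmers(seq, k=30):                          # long words, in Strand's spirit
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

training = {"ClassA": kmers("ACGT" * 20), "ClassB": kmers("GGCA" * 20)}

def classify(seq):
    words = kmers(seq)
    return max(training, key=lambda c: len(words & training[c])
                                       / len(words | training[c]))

print(classify("ACGT" * 15))                   # -> ClassA
```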

Proceedings ArticleDOI
20 Sep 2014
TL;DR: This work presents several techniques to capture the pair-wise synchronization of EEG signals and apply unsupervised learning algorithms, such as k-means clustering and multiway modeling of third-order tensors, to analyze the labeled clinical data in the feature domain to detect the onset and origin location of the seizure.
Abstract: This work presents a novel modeling of neuronal activity of the brain by capturing the synchronization of EEG signals along the scalp. The pair-wise correspondence between electrodes recording EEG signals are used to establish edges between these electrodes which then become the nodes of a synchronization graph. As EEG signals are recorded over time, we discretize the time axis into overlapping epochs, and build a series of time-evolving synchronization graphs for each epoch and for each traditional frequency band. We show that graph theory provides a rich set of graph features that can be used for mining and learning from the EEG signals to determine temporal and spatial localization of epileptic seizures. We present several techniques to capture the pair-wise synchronization and apply unsupervised learning algorithms, such as k-means clustering and multiway modeling of third-order tensors, to analyze the labeled clinical data in the feature domain to detect the onset and origin location of the seizure. We use k-means clustering on two-way feature matrices for detection of seizures, and Tucker3 tensor decomposition for localization of seizures. We conduct an extensive parametric search to determine the best configuration of the model parameters including epoch length, synchronization metrics, and frequency bands, to achieve the highest accuracy. Our results are promising: we are able to detect the onset of seizure with an accuracy of 88.24%, and localize the onset of the seizure with an accuracy of 76.47%.
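
A minimal sketch of building one epoch's synchronization graph (correlation as the synchronization metric and an arbitrary threshold, both illustrative; the paper evaluates several metrics and frequency bands):

```python
import numpy as np

rng = np.random.default_rng(5)
epoch = rng.normal(size=(8, 512))              # 8 electrodes, one epoch

corr = np.corrcoef(epoch)                      # pair-wise synchronization
adj = (np.abs(corr) > 0.1) & ~np.eye(8, dtype=bool)   # threshold -> edges

degrees = adj.sum(axis=1)                      # simple per-node graph feature
print("edges:", adj.sum() // 2, "mean degree:", degrees.mean())
```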

Proceedings ArticleDOI
20 Sep 2014
TL;DR: This paper designs a novel dual-boundary classification algorithm that can identify patients at risk for AHE with nearly 95% accuracy up to 120 minutes before the episode begins and significantly outperforms existing approaches in predictive accuracy, sensitivity and specificity.
Abstract: An Acute Hypotensive Episode (AHE) is the sudden onset of a period of sustained low blood pressure and is one of the most critical conditions in Intensive Care Units (ICU). Without timely medical care, it can lead to irreversible organ damage and death. By identifying patients at risk for this complication, adequate medical intervention can save lives and improve patient outcomes. In this paper we study the problem of identifying patients at risk for AHE. We cast the problem as a supervised classification task and design a novel dual-boundary classification algorithm. Our algorithm uses only past blood pressure measurements of the patients thereby being much simpler than many existing methods that use multiple sources of data. It can also be used in online or batch mode which is advantageous in an ICU setting. Our extensive experiments on 1700 patients' records demonstrate that the algorithm significantly outperforms existing approaches in predictive accuracy, sensitivity and specificity. It can identify patients at risk for AHE with nearly 95% accuracy up to 120 minutes before the episode begins.
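
The dual-boundary algorithm itself is not detailed in this abstract; purely as an illustration of the problem setting, this toy rule flags risk from past mean arterial pressure (MAP) values using two thresholds, both of which are invented here.

```python
import numpy as np

def at_risk(map_series, soft=75.0, hard=65.0, window=30):
    recent = np.asarray(map_series[-window:])
    # two boundaries: sustained values under `soft`, or any dip under `hard`
    return recent.mean() < soft or (recent < hard).any()

readings = list(np.linspace(90, 68, 60))       # toy downward-trending MAPs
print(at_risk(readings))                       # -> True, drifting toward AHE
```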

Journal ArticleDOI
14 Jul 2014
TL;DR: A modified PSO algorithm has been introduced and implemented for solving the task scheduling problem in the cloud, and it is found that the modified MPOS algorithm outperforms the existing PSO.
Abstract: Cloud computing is a recent computing paradigm in which IT services are provided and delivered over the Internet on demand. The scheduling problem for the cloud computing environment has attracted much attention, as application tasks can be mapped to the available resources to achieve better results. One of the main existing algorithms for task scheduling on the available resources in the cloud environment is based on Particle Swarm Optimization (PSO). According to this PSO algorithm, the application's tasks are allocated to the available resources to minimize the computation cost only. In this paper, a modified PSO algorithm has been introduced and implemented for solving the task scheduling problem in the cloud. The main idea of the modified PSO is that tasks are allocated to the available resources to minimize the execution time in addition to the computation cost. This modified PSO algorithm is called Modified Particle Swarm Optimization (MPOS). The MPOS evaluations are illustrated using different time and cost parameters and their effects on performance measures such as utilization, speedup, and efficiency. According to the implementation results, it is found that the MPOS algorithm outperforms the existing PSO.
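
A minimal sketch of the MPOS fitness idea (illustrative; a random search over mappings stands in for the PSO swarm update, which is omitted): a candidate task-to-node mapping is scored on execution time plus computation cost rather than cost alone.

```python
import numpy as np

rng = np.random.default_rng(6)
n_tasks, n_nodes = 10, 3
time_m = rng.uniform(1, 5, (n_tasks, n_nodes))   # task x node runtimes (toy)
cost_m = rng.uniform(1, 5, (n_tasks, n_nodes))   # task x node prices (toy)

def fitness(mapping, alpha=0.5):                 # lower is better
    t = time_m[np.arange(n_tasks), mapping].sum()
    c = cost_m[np.arange(n_tasks), mapping].sum()
    return alpha * t + (1 - alpha) * c           # time + cost, per MPOS's goal

best = min((rng.integers(0, n_nodes, n_tasks) for _ in range(500)),
           key=fitness)                          # stand-in for the swarm search
print(fitness(best).round(2), best)
```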

Proceedings ArticleDOI
20 Sep 2014
TL;DR: A new algorithm, Fast Read Stitcher (FStitch), is described that takes advantage of two popular machine-learning techniques, a hidden Markov model (HMM) and logistic regression, to robustly classify which regions of the genome are transcribed.
Abstract: We present a fast and simple algorithm to detect nascent RNA transcription in global nuclear run-on sequencing (GRO-seq). GRO-seq is a relatively new protocol that captures nascent transcripts from actively engaged polymerase, providing a direct read-out on bona fide transcription. Most traditional assays, such as RNA-seq, measure steady state RNA levels, which are affected by transcription, post-transcriptional processing, and RNA stability. A detailed study of GRO-seq data has the potential to inform on many aspects of the transcription process. GRO-seq data, however, presents unique analysis challenges that are only beginning to be addressed. Here we describe a new algorithm, Fast Read Stitcher (FStitch), that takes advantage of two popular machine-learning techniques, a hidden Markov model (HMM) and logistic regression, to robustly classify which regions of the genome are transcribed. Our algorithm builds on the strengths of previous approaches but is accurate, dependent on very little training data, robust to varying read depth, annotation agnostic, and fast.
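
A minimal sketch of the segmentation idea (a bare two-state Viterbi over binned read coverage with hand-set emission and transition probabilities; FStitch couples its HMM with logistic-regression-based emissions): label each genomic bin as transcribed (1) or not (0).

```python
import numpy as np

cov = np.array([0, 0, 1, 4, 5, 6, 4, 0, 0, 0, 3, 5, 2, 0])  # toy bin coverage
logB = np.where(cov[:, None] > 1, np.log([[0.1, 0.9]]), np.log([[0.9, 0.1]]))
logA = np.log([[0.95, 0.05], [0.05, 0.95]])                  # sticky transitions

V = logB[0] + np.log([0.5, 0.5])                             # Viterbi init
back = np.zeros((len(cov), 2), dtype=int)
for i in range(1, len(cov)):
    scores = V[:, None] + logA                               # prev state x next
    back[i] = scores.argmax(axis=0)
    V = scores.max(axis=0) + logB[i]

path = [int(V.argmax())]                                     # backtrack
for i in range(len(cov) - 1, 0, -1):
    path.append(back[i][path[-1]])
print(list(reversed(path)))   # contiguous runs of 1 = transcribed regions
```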