Showing papers on "Knowledge extraction published in 2014"


Proceedings ArticleDOI
24 Aug 2014
TL;DR: The Knowledge Vault is a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories, and computes calibrated probabilities of fact correctness.
Abstract: Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft's Satori, and Google's Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the different information sources and extraction methods.
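
The fusion step described above lends itself to a compact illustration. Below is a minimal, hypothetical sketch (not the authors' actual system) of the general idea: a supervised classifier combines per-extractor confidence scores with a prior score into one calibrated probability per candidate fact. All feature names and numbers are invented.

```python
# Hypothetical sketch of the general idea: fuse per-extractor confidence
# scores with a prior into one calibrated probability per candidate fact.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per candidate triple: [text_score, table_score, dom_score, prior_score]
X_train = np.array([
    [0.9, 0.0, 0.8, 0.7],   # fact supported by several extractors
    [0.2, 0.1, 0.0, 0.3],   # weakly supported fact
    [0.0, 0.9, 0.6, 0.8],
    [0.1, 0.0, 0.2, 0.1],
])
y_train = np.array([1, 0, 1, 0])  # gold labels from an existing KB

fusion = LogisticRegression().fit(X_train, y_train)

# Calibrated probability that a new extraction is a true fact
candidate = np.array([[0.7, 0.0, 0.5, 0.6]])
print(fusion.predict_proba(candidate)[0, 1])
```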

1,657 citations


Proceedings ArticleDOI
Mu Li
04 Aug 2014
TL;DR: Views on the newly identified challenges are shared, and application scenarios such as micro-blog data analysis and data processing for building next-generation search engines are covered.
Abstract: Big data may contain big value, but it also brings many challenges to computing theory, architecture, frameworks, knowledge discovery algorithms, and domain-specific tools and applications. Beyond the 4-V or 5-V characterization of big datasets, data processing exhibits inexact, incremental, and inductive features. This opens new research opportunities across theory, systems, algorithms, and applications. Is there a new "theory" for big data? How can data computing algorithms be handled in an operable manner? This report shares views on the newly identified challenges and covers application scenarios such as micro-blog data analysis and data processing for building next-generation search engines.

1,364 citations


Book
30 Aug 2014
TL;DR: This book reviews the tasks that fill the gap between data acquisition from the source and the data mining process, offering a comprehensive, practical treatment that covers basic concepts and surveys the techniques proposed in the specialized literature.
Abstract: Data Preprocessing for Data Mining addresses one of the most important issues within the well-known Knowledge Discovery from Data process. Data taken directly from the source will likely have inconsistencies and errors and, most importantly, will not be ready to be considered for a data mining process. Furthermore, the increasing amount of data in recent science, industry and business applications calls for more complex tools to analyze it. Thanks to data preprocessing, it is possible to convert the impossible into the possible, adapting the data to fulfill the input demands of each data mining algorithm. Data preprocessing includes data reduction techniques, which aim at reducing the complexity of the data by detecting or removing irrelevant and noisy elements. This book reviews the tasks that fill the gap between data acquisition from the source and the data mining process. It offers a comprehensive look from a practical point of view, including basic concepts and a survey of the techniques proposed in the specialized literature. Each chapter is a stand-alone guide to a particular data preprocessing topic, from basic concepts and detailed descriptions of classical algorithms to an exhaustive catalog of recent developments. The in-depth technical descriptions make this book suitable for technical professionals, researchers, and senior undergraduate and graduate students in data science, computer science and engineering.
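
As a concrete illustration of the gap the book addresses, here is a minimal sketch, assuming a scikit-learn environment, of a preprocessing pipeline that chains imputation, scaling, and data reduction before any mining step. The data and parameter choices are invented.

```python
# A minimal, hypothetical preprocessing pipeline in the spirit of the book:
# imputation, scaling, and data reduction before any mining step.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

raw = np.array([
    [5.1, np.nan, 200.0],
    [4.9, 3.0, 180.0],
    [6.2, 3.4, np.nan],
    [5.8, 2.7, 210.0],
])

prep = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fix missing values
    ("scale", StandardScaler()),                 # normalize ranges
    ("reduce", PCA(n_components=2)),             # data reduction
])
print(prep.fit_transform(raw))
```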

678 citations


Journal ArticleDOI
01 Jun 2014
TL;DR: A data mining approach predicts the success of telemarketing calls for selling long-term bank deposits at a Portuguese retail bank, using data collected from 2008 to 2013 and thus including the effects of the recent financial crisis.
Abstract: We propose a data mining (DM) approach to predict the success of telemarketing calls for selling bank long-term deposits. A Portuguese retail bank was addressed, with data collected from 2008 to 2013, thus including the effects of the recent financial crisis. We analyzed a large set of 150 features related to bank client, product and social-economic attributes. A semi-automatic feature selection was explored in the modeling phase, performed with the data prior to July 2012, which allowed the selection of a reduced set of 22 features. We also compared four DM models: logistic regression, decision trees (DTs), neural network (NN) and support vector machine. Using two metrics, area of the receiver operating characteristic curve (AUC) and area of the LIFT cumulative curve (ALIFT), the four models were tested on an evaluation set, using the most recent data (after July 2012) and a rolling window scheme. The NN presented the best results (AUC = 0.8 and ALIFT = 0.7), reaching 79% of the subscribers when contacting only the better-classified half of the clients. Also, two knowledge extraction methods, a sensitivity analysis and a DT, were applied to the NN model and revealed several key attributes (e.g., Euribor rate, direction of the call and bank agent experience). Such knowledge extraction confirmed the model as credible and valuable for telemarketing campaign managers.
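
The two metrics can be sketched in a few lines. The following hypothetical example, assuming scikit-learn for the AUC and an invented score/label sample, shows one plausible way to compute AUC and approximate the area under the cumulative LIFT curve; it is not the paper's exact evaluation code.

```python
# Hedged sketch of the paper's two metrics: AUC and the area under the
# cumulative LIFT curve (ALIFT). Scores and labels are invented.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.6, 0.4, 0.3, 0.7, 0.1, 0.5, 0.2])

print("AUC:", roc_auc_score(y_true, y_score))

# Cumulative LIFT: fraction of all subscribers reached when contacting
# the top-ranked fraction of clients.
order = np.argsort(-y_score)
captured = np.cumsum(y_true[order]) / y_true.sum()
print("ALIFT:", captured.mean())  # simple approximation of the curve's area
# e.g., captured[len(y_true)//2 - 1] is the share of subscribers reached
# by contacting the better-classified half of clients.
```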

673 citations


Journal ArticleDOI
TL;DR: This survey provides guidance and a glue for researchers and anti-discrimination data analysts on concepts, problems, application areas, datasets, methods, and approaches from a multidisciplinary perspective.
Abstract: The collection and analysis of observational and experimental data represent the main tools for assessing the presence, the extent, the nature, and the trend of discrimination phenomena. Data analysis techniques have been proposed in the last 50 years in the economic, legal, statistical, and, recently, in the data mining literature. This is not surprising, since discrimination analysis is a multidisciplinary problem, involving sociological causes, legal argumentations, economic models, statistical techniques, and computational issues. The objective of this survey is to provide guidance and a glue for researchers and anti-discrimination data analysts on concepts, problems, application areas, datasets, methods, and approaches from a multidisciplinary perspective. We organize the approaches according to their method of data collection as observational, quasi-experimental, and experimental studies. A fourth line of recently blooming research on knowledge discovery based methods is also covered. Observational methods are further categorized on the basis of their application context: labor economics, social profiling, consumer markets, and others.

286 citations


Journal ArticleDOI
TL;DR: This article presents a discussion on eight open challenges for data stream mining, which cover the full cycle of knowledge discovery and involve such problems as protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analysis of complex data, and evaluation of stream mining algorithms.
Abstract: Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered as one of the main sources of what is called big data. While predictive modeling for data streams and big data have received a lot of attention over the last decade, many research approaches are typically designed for well-behaved controlled problem settings, overlooking important challenges imposed by real-world applications. This article presents a discussion on eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve such problems as: protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analysis of complex data, and evaluation of stream mining algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions concerning lines of future research in data stream mining.
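
One of the evaluation issues raised above can be made concrete with a prequential ("test-then-train") loop, a standard scheme for evaluating stream learners. The sketch below is a hypothetical illustration using scikit-learn's SGDClassifier on a synthetic stream, not code from the article.

```python
# Minimal prequential ("test-then-train") evaluation loop, one standard
# way stream mining algorithms are evaluated. Model and stream are
# invented placeholders.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier()
classes = np.array([0, 1])
correct, seen = 0, 0

for t in range(1000):                         # simulated stream
    x = rng.normal(size=(1, 3))
    y = np.array([int(x[0, 0] + x[0, 1] > 0)])
    if seen > 0:                              # test first ...
        correct += int(model.predict(x)[0] == y[0])
    model.partial_fit(x, y, classes=classes)  # ... then train
    seen += 1

print("prequential accuracy:", correct / (seen - 1))
```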

260 citations


Book
24 Aug 2014
TL;DR: The Encyclopedia of Social Network Analysis and Mining is the first major reference work to integrate fundamental concepts and research directions in social networks and their applications to data mining, and it also covers the application of social network methodologies to other domains, such as web networks and biological networks.
Abstract: The Encyclopedia of Social Network Analysis and Mining (ESNAM) is the first major reference work to integrate fundamental concepts and research directions in the areas of social networks and applications to data mining. While ESNAM reflects the state-of-the-art in social network research, the field had its start in the 1930s when fundamental issues in social network research were broadly defined. These early networks were limited to relatively small numbers of nodes (actors) and links. More recently the advent of electronic communication, and in particular on-line communities, has created social networks of hitherto unimaginable sizes. People around the world are directly or indirectly connected by popular social networks established using web-based platforms rather than by physical proximity. Reflecting the interdisciplinary nature of this unique field, the essential contributions of diverse disciplines, from computer science, mathematics, and statistics to sociology and behavioral science, are described among the 300 authoritative yet highly readable entries. Students will find a world of information and insight behind the familiar façade of the social networks in which they participate. Researchers and practitioners will benefit from a comprehensive perspective on the methodologies for analysis of constructed networks, and the data mining and machine learning techniques that have proved attractive for sophisticated knowledge discovery in complex applications. Also addressed is the application of social network methodologies to other domains, such as web networks and biological networks.

234 citations


Journal ArticleDOI
TL;DR: This special section on Intelligent Mobile Knowledge Discovery and Management Systems brings together top-quality articles on the art and practice of mobile knowledge discovery and management systems that exhibit a level of intelligence.
Abstract: Advances in wireless communication and mobile-information infrastructures such as GPS, WiFi, and mobile phone technologies have enabled us to collect, process, and manage massive amounts of mobile data from diverse information sources. These mobile data are fine-grained and information-rich, and provide unparalleled opportunities to understand mobile user behaviours and generate useful knowledge, which in turn allows the delivery of intelligence for real-time decision making in various real-world applications. In this context, knowledge discovery is the process of automatic extraction of interesting and useful knowledge from large amounts of mobile data, whereas knowledge management consists of a range of strategies and practices to identify, create, represent, distribute, and enable the adoption of novel insights and experiences for decision making. There is a critical emerging need to investigate knowledge discovery and management issues in the mobile context. The objective of this special section on Intelligent Mobile Knowledge Discovery and Management Systems is to bring together top-quality articles on the art and practice of mobile knowledge discovery and management systems that exhibit a level of intelligence. We received a total of 12 submissions, from which 3 articles were selected for publication after an extensive peer-review process. The first article, entitled "Mining Geographic-Temporal-Semantic Patterns in Trajectories for Location Prediction" by Ying et al., focuses on location prediction by mining human location traces. A unique perspective of this article is to exploit a user's geographic, temporal, and semantic information simultaneously for estimating the probability of a traveler visiting a location. The key idea underlying this study is the discovery of user trajectory patterns, which are used to capture frequent movements triggered by the user's geographic, temporal, and semantic intentions. The article "A Framework of Traveling Companion Discovery on Trajectory Data Streams" by Tang et al. studies the problem of discovering object groups that travel together (i.e., traveling companions) from trajectory data streams. Since the solution of this problem requires a large computational cost because of expensive spatial operations, the authors propose a smart data structure to facilitate scalable and flexible companion discovery from location traces. The article "Mondrian Tree: A Fast Index for Spatial Alarm Processing" authored by M. Doo and L. Liu promotes the efficient processing of spatial alarms, which remind us of the arrival of a future spatial event. A key research challenge in scaling spatial alarm processing is how to efficiently …

221 citations


Journal ArticleDOI
TL;DR: The main idea in this paper is to describe key papers and provide some guidelines to help medical practitioners to explore previous works and identify interesting areas for future research.
Abstract: Data mining is a powerful method to extract knowledge from data. Raw data present various challenges that make traditional methods unsuitable for knowledge extraction, and data mining is expected to handle various data types in all formats. The relevance of this paper is underlined by the fact that data mining is an object of research in many different areas. In this paper, we review previous works in the context of knowledge extraction from medical data. The main idea is to describe key papers and provide some guidelines to help medical practitioners. Medical data mining is a multidisciplinary field combining medicine and data mining, so previous works should be classified to cover the requirements of users from various fields. To this end, we have studied papers published between 1999 and 2013 that aim to extract knowledge from structured medical data. We clarify medical data mining and its main goals, and study each paper according to six medical tasks: screening, diagnosis, treatment, prognosis, monitoring and management. In each task, five data mining approaches are considered: classification, regression, clustering, association and hybrid. At the end of each task, a brief summary and discussion are given. A standard framework based on CRISP-DM is additionally adapted to manage all activities. As a discussion, current issues and future trends are mentioned. The amount of work published in this scope is substantial, and it is impossible to discuss all of it in a single paper. We hope this paper will make it possible to explore previous works and identify interesting areas for future research.

220 citations


Journal ArticleDOI
TL;DR: This study shows that DM techniques are valuable for knowledge discovery in BAS databases; however, solid domain knowledge is still needed to apply the discovered knowledge to achieve better building operational performance.

213 citations


Proceedings ArticleDOI
01 Oct 2014
TL;DR: A tensor decomposition approach for knowledge base embedding that is highly scalable and especially suitable for relation extraction; by leveraging relational domain knowledge about entity types, it is significantly faster than previous approaches and better able to discover new relations missing from the database.
Abstract: While relation extraction has traditionally been viewed as a task relying solely on textual data, recent work has shown that by taking as input existing facts in the form of entity-relation triples from both knowledge bases and textual data, the performance of relation extraction can be improved significantly. Following this new paradigm, we propose a tensor decomposition approach for knowledge base embedding that is highly scalable, and is especially suitable for relation extraction. By leveraging relational domain knowledge about entity type information, our learning algorithm is significantly faster than previous approaches and is better able to discover new relations missing from the database. In addition, when applied to a relation extraction task, our approach alone is comparable to several existing systems, and improves the weighted mean average precision of a state-of-the-art method by 10 points when used as a subcomponent.
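
The paper's method belongs to the family of bilinear tensor factorizations; a toy RESCAL-style sketch conveys the idea. The dimensions, training loop, and plain gradient descent below are invented simplifications, not the authors' scalable algorithm.

```python
# Toy sketch of a RESCAL-style bilinear factorization X_k ≈ A R_k A^T,
# the general family this KB-embedding method belongs to.
import numpy as np

rng = np.random.default_rng(0)
n_ent, n_rel, d = 5, 2, 3
X = rng.integers(0, 2, size=(n_rel, n_ent, n_ent)).astype(float)  # observed triples

A = rng.normal(scale=0.1, size=(n_ent, d))       # entity embeddings
R = rng.normal(scale=0.1, size=(n_rel, d, d))    # one matrix per relation

lr = 0.05
for step in range(500):                          # plain gradient descent
    for k in range(n_rel):
        E = A @ R[k] @ A.T - X[k]                # reconstruction error
        gA = E @ A @ R[k].T + E.T @ A @ R[k]
        gR = A.T @ E @ A
        A -= lr * gA
        R[k] -= lr * gR

# Score of a candidate triple (head=0, relation=1, tail=3)
print(A[0] @ R[1] @ A[3])
```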

Proceedings ArticleDOI
01 Jan 2014
TL;DR: It is shown that in the big data era, without any user input, it is possible to learn prior knowledge automatically from a large amount of review data available on the Web, which can then be used by a topic model to discover more coherent aspects.
Abstract: Aspect extraction is an important task in sentiment analysis. Topic modeling is a popular method for the task. However, unsupervised topic models often generate incoherent aspects. To address the issue, several knowledge-based models have been proposed to incorporate prior knowledge provided by the user to guide modeling. In this paper, we take a major step forward and show that in the big data era, without any user input, it is possible to learn prior knowledge automatically from a large amount of review data available on the Web. Such knowledge can then be used by a topic model to discover more coherent aspects. There are two key challenges: (1) learning quality knowledge from reviews of diverse domains, and (2) making the model fault-tolerant to handle possibly wrong knowledge. A novel approach is proposed to solve these problems. Experimental results using reviews from 36 domains show that the proposed approach achieves significant improvements over state-of-the-art baselines.
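
A loose sketch of the premise, learning candidate prior knowledge from raw multi-domain reviews with no user input, follows. The heuristic here (keeping words that recur across most domains) and the tiny corpus are invented stand-ins for the paper's actual knowledge-learning method.

```python
# Invented illustration: mine candidate prior knowledge (aspect terms
# that recur across many domains) from raw reviews, with no user input.
from collections import Counter

domain_reviews = {
    "camera": ["battery life is short", "great picture quality"],
    "laptop": ["battery drains fast", "screen quality is poor"],
    "phone":  ["battery lasts long", "camera quality is decent"],
}

stopwords = {"is", "the", "a", "for", "from"}
domain_counts = Counter()
for reviews in domain_reviews.values():
    words = {w for r in reviews for w in r.split()} - stopwords
    domain_counts.update(words)          # count once per domain

# Words appearing in most domains are candidate shared aspect knowledge
shared = sorted(w for w, c in domain_counts.items() if c >= 3)
print(shared)                            # ['battery', 'quality']
```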

Journal ArticleDOI
TL;DR: The Semanticscience Integrated Ontology is an ontology to facilitate biomedical knowledge discovery that provides an ontological foundation for the Bio2RDF linked data for the life sciences project and is used for semantic integration and discovery for SADI-based semantic web services.
Abstract: The Semanticscience Integrated Ontology (SIO) is an ontology to facilitate biomedical knowledge discovery. SIO features a simple upper level comprised of essential types and relations for the rich description of arbitrary (real, hypothesized, virtual, fictional) objects, processes and their attributes. SIO specifies simple design patterns to describe and associate qualities, capabilities, functions, quantities, and informational entities including textual, geometrical, and mathematical entities, and provides specific extensions in the domains of chemistry, biology, biochemistry, and bioinformatics. SIO provides an ontological foundation for the Bio2RDF linked data for the life sciences project and is used for semantic integration and discovery for SADI-based semantic web services. SIO is freely available to all users under a creative commons by attribution license. See website for further information: http://sio.semanticscience.org.

Journal ArticleDOI
TL;DR: An extended rough set model, called composite rough sets, is presented, and a novel matrix-based method for fast updating of approximations in dynamic composite information systems is proposed.

Journal ArticleDOI
TL;DR: This article assesses whether a public sentiment indicator extracted from daily Twitter messages can indeed improve the forecasting of social, economic, or commercial indicators, finding that nonlinear models do take advantage of Twitter data when forecasting trends in volatility indices, while linear ones fail systematically when forecasting any kind of financial time series.
Abstract: The dramatic rise in the use of social network platforms such as Facebook or Twitter has resulted in the availability of vast and growing user-contributed repositories of data. Exploiting this data by extracting useful information from it has become a great challenge in data mining and knowledge discovery. A recently popular way of extracting useful information from social network platforms is to build indicators, often in the form of a time series, of general public mood by means of sentiment analysis. Such indicators have been shown to correlate with a diverse variety of phenomena. In this article we follow this line of work and set out to assess, in a rigorous manner, whether a public sentiment indicator extracted from daily Twitter messages can indeed improve the forecasting of social, economic, or commercial indicators. To this end we have collected and processed a large amount of Twitter posts from March 2011 to the present date for two very different domains: stock market and movie box office revenue. For each of these domains, we build and evaluate forecasting models for several target time series both using and ignoring the Twitter-related data. If Twitter does help, then this should be reflected in the fact that the predictions of models that use Twitter-related data are better than the models that do not use this data. By systematically varying the models that we use and their parameters, together with other tuning factors such as lag or the way in which we build our Twitter sentiment index, we obtain a large dataset that allows us to test our hypothesis under different experimental conditions. Using a novel decision-tree-based technique that we call summary tree we are able to mine this large dataset and obtain automatically those configurations that lead to an improvement in the prediction power of our forecasting models. As a general result, we have seen that nonlinear models do take advantage of Twitter data when forecasting trends in volatility indices, while linear ones fail systematically when forecasting any kind of financial time series. In the case of predicting box office revenue trend, it is support vector machines that make best use of Twitter data. In addition, we conduct statistical tests to determine the relation between our Twitter time series and the different target time series.
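
The core experimental design, comparing forecasts with and without the Twitter index, can be sketched compactly. Everything below (synthetic series, linear model, MAE comparison) is an invented illustration of the setup, not the article's models.

```python
# Hedged sketch of the experimental design: forecast a target series from
# its own lags, with and without a Twitter sentiment index as an extra
# regressor, and compare errors. All series here are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
n = 300
sentiment = rng.normal(size=n)                      # daily sentiment index
target = np.zeros(n)                                # e.g., a volatility index
for t in range(1, n):
    target[t] = 0.6 * target[t - 1] + 0.4 * sentiment[t - 1] + 0.1 * rng.normal()

lag_y = target[:-1].reshape(-1, 1)                  # yesterday's value
X_base = lag_y
X_tw = np.hstack([lag_y, sentiment[:-1].reshape(-1, 1)])
y = target[1:]

split = 200
for name, X in [("no Twitter", X_base), ("with Twitter", X_tw)]:
    m = LinearRegression().fit(X[:split], y[:split])
    err = mean_absolute_error(y[split:], m.predict(X[split:]))
    print(name, round(err, 4))
```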

Journal ArticleDOI
TL;DR: A close look at the unique challenges in building an unimpeded network infrastructure for big data, covering each and every segment in this network highway: the access networks that connect data sources, the Internet backbone that bridges them to remote data centers, as well as the dedicated network among data centers and within a data center.
Abstract: Big data, with their promise to discover valuable insights for better decision making, have recently attracted significant interest from both academia and industry. Voluminous data are generated from a variety of users and devices, and are to be stored and processed in powerful data centers. As such, there is a strong demand for building an unimpeded network infrastructure to gather geographically distributed and rapidly generated data, and move them to data centers for effective knowledge discovery. The express network should also be seamlessly extended to interconnect multiple data centers as well as interconnect the server nodes within a data center. In this article, we take a close look at the unique challenges in building such a network infrastructure for big data. Our study covers each and every segment in this network highway: the access networks that connect data sources, the Internet backbone that bridges them to remote data centers, as well as the dedicated network among data centers and within a data center. We also present two case studies of real-world big data applications that are empowered by networking, highlighting interesting and promising future research directions.

Proceedings ArticleDOI
01 Nov 2014
TL;DR: Denoising autoencoders (DAs) are applied to identify and extract complex patterns from genomic data, constructing features that capture properties such as tumor versus normal status, estrogen receptor (ER) status, and molecular subtypes.
Abstract: Big data bring new opportunities for methods that efficiently summarize and automatically extract knowledge from such compendia. While both supervised learning algorithms and unsupervised clustering algorithms have been successfully applied to biological data, they are either dependent on known biology or limited to discerning the most significant signals in the data. Here we present denoising autoencoders (DAs), which employ a data-defined learning objective independent of known biology, as a method to identify and extract complex patterns from genomic data. We evaluate the performance of DAs by applying them to a large collection of breast cancer gene expression data. Results show that DAs successfully construct features that contain both clinical and molecular information. There are features that represent tumor or normal samples, estrogen receptor (ER) status, and molecular subtypes. Features constructed by the autoencoder generalize to an independent dataset collected using a distinct experimental platform. By integrating data from ENCODE for feature interpretation, we discover a feature representing ER status through association with key transcription factors in breast cancer. We also identify a feature highly predictive of patient survival that is enriched for the FOXM1 signaling pathway. The features constructed by DAs are often bimodally distributed with one peak near zero and another near one, which facilitates discretization. In summary, we demonstrate that DAs effectively extract key biological principles from gene expression data and summarize them into constructed features with convenient properties.
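
A denoising autoencoder itself fits in a short numpy sketch: corrupt the input, encode with a sigmoid layer, and train to reconstruct the clean input with tied weights. The data, sizes, and hyperparameters below are invented; the paper's exact architecture may differ.

```python
# Minimal numpy sketch of a denoising autoencoder: corrupt the input,
# encode with a sigmoid layer, reconstruct the clean input (tied weights,
# cross-entropy loss). Data are random stand-ins for expression profiles.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 50))          # stand-in profiles scaled to [0, 1]

n_in, n_hid, lr, noise = X.shape[1], 10, 0.5, 0.2
W = rng.normal(scale=0.1, size=(n_in, n_hid))
b_h = np.zeros(n_hid)
b_o = np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(200):
    mask = rng.random(X.shape) > noise   # masking corruption
    X_tilde = X * mask
    H = sigmoid(X_tilde @ W + b_h)       # encoder
    X_hat = sigmoid(H @ W.T + b_o)       # tied-weight decoder
    d_out = (X_hat - X) / len(X)         # cross-entropy output gradient
    d_hid = (d_out @ W) * H * (1 - H)
    W -= lr * (X_tilde.T @ d_hid + (H.T @ d_out).T)
    b_o -= lr * d_out.sum(axis=0)
    b_h -= lr * d_hid.sum(axis=0)

H = sigmoid(X @ W + b_h)                 # constructed features
print(H.mean(axis=0))
```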

Book ChapterDOI
24 Nov 2014
TL;DR: In this paper, the authors present a framework for integrating procedural knowledge into the Linked Data Cloud, which allows the automatic generation of links between different know-how resources and other online knowledge bases, such as DBpedia.
Abstract: This paper presents the first framework for integrating procedural knowledge, or “know-how”, into the Linked Data Cloud. Know-how available on the Web, such as step-by-step instructions, is largely unstructured and isolated from other sources of online knowledge. To overcome these limitations, we propose extending to procedural knowledge the benefits that Linked Data has already brought to representing, retrieving and reusing declarative knowledge. We describe a framework for representing generic know-how as Linked Data and for automatically acquiring this representation from existing resources on the Web. This system also allows the automatic generation of links between different know-how resources, and between those resources and other online knowledge bases, such as DBpedia. We discuss the results of applying this framework to a real-world scenario and we show how it outperforms existing manual community-driven integration efforts.
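
To make the representation idea tangible, here is an illustrative rdflib sketch of a step-by-step "know-how" resource published as Linked Data with a link into DBpedia. The namespace and property names are invented, not the framework's actual vocabulary.

```python
# Illustrative sketch (not the authors' actual vocabulary) of representing
# procedural "know-how" as Linked Data, with a link to DBpedia.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, RDFS

KH = Namespace("http://example.org/knowhow/")   # hypothetical vocabulary
g = Graph()

task = URIRef(KH["MakeEspresso"])
step1 = URIRef(KH["MakeEspresso/step1"])
g.add((task, RDF.type, KH.Process))
g.add((task, RDFS.label, Literal("How to make espresso")))
g.add((task, KH.hasStep, step1))
g.add((step1, RDFS.label, Literal("Grind the coffee beans")))
# Automatic entity link into the Linked Data Cloud
g.add((step1, KH.requires, URIRef("http://dbpedia.org/resource/Coffee")))

print(g.serialize(format="turtle"))
```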

Book ChapterDOI
01 Jan 2014
TL;DR: This paper describes some of the most important challenges, including the need to develop and apply novel methods, algorithms and tools for the integration, fusion, pre-processing, mapping, analysis and interpretation of complex biomedical data, with the aim of identifying testable hypotheses and building realistic models.
Abstract: Biomedical research is drowning in data, yet starving for knowledge. Current challenges in biomedical research and clinical practice include information overload – the need to combine vast amounts of structured, semi-structured, weakly structured data and vast amounts of unstructured information – and the need to optimize workflows, processes and guidelines, to increase capacity while reducing costs and improving efficiencies. In this paper we provide a very short overview on interactive and integrative solutions for knowledge discovery and data mining. In particular, we emphasize the benefits of including the end user in the "interactive" knowledge discovery process. We describe some of the most important challenges, including the need to develop and apply novel methods, algorithms and tools for the integration, fusion, pre-processing, mapping, analysis and interpretation of complex biomedical data, with the aim of identifying testable hypotheses and building realistic models. The HCI-KDD approach, which is a synergistic combination of methodologies and approaches from two areas, Human–Computer Interaction (HCI) and Knowledge Discovery & Data Mining (KDD), offers ideal conditions for solving these challenges, with the goal of supporting human intelligence with machine intelligence. There is an urgent need for integrative and interactive machine learning solutions, because no medical doctor or biomedical researcher can keep pace today with the increasingly large and complex data sets – often called "Big Data".

Journal Article
TL;DR: This book provides a comprehensive and accessible introduction to the cutting-edge statistical methods needed to efficiently analyze complex data sets from astronomical surveys such as the Panoramic Survey Telescope and Rapid Response System, the Dark Energy Survey, and the upcoming Large Synoptic Survey Telescope.

Journal ArticleDOI
Paul Cooper
TL;DR: Data mining is the application of improved techniques for data organization, storage, and analysis to large datasets, and it has led to the discovery of previously unknown knowledge and relationships.

Journal ArticleDOI
TL;DR: A novel method is proposed to efficiently provide better Web-page recommendation through semantic enhancement, by integrating the domain and Web usage knowledge of a website.
Abstract: Web-page recommendation plays an important role in intelligent Web systems. Useful knowledge discovery from Web usage data and satisfactory knowledge representation for effective Web-page recommendations are crucial and challenging. This paper proposes a novel method to efficiently provide better Web-page recommendation through semantic enhancement, by integrating the domain and Web usage knowledge of a website. Two new models are proposed to represent the domain knowledge. The first model uses an ontology to represent the domain knowledge. The second model uses an automatically generated semantic network to represent domain terms, Web-pages, and the relations between them. Another new model, the conceptual prediction model, is proposed to automatically generate a semantic network of the semantic Web usage knowledge, which is the integration of domain knowledge and Web usage knowledge. A number of effective queries have been developed to query these knowledge bases. Based on these queries, a set of recommendation strategies have been proposed to generate Web-page candidates. The recommendation results have been compared with the results obtained from an advanced existing Web Usage Mining (WUM) method. The experimental results demonstrate that the proposed method produces significantly higher performance than the WUM method.
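
The semantic-network model can be pictured as a graph linking domain terms and Web pages that is queried for page candidates. The networkx sketch below, with invented structure and a simple shared-neighbor ranking, is only a rough illustration of that idea.

```python
# Simplified sketch of the semantic-network idea: a graph linking domain
# terms to Web pages, queried to produce page candidates. Invented data.
import networkx as nx

net = nx.Graph()
net.add_edge("term:knowledge discovery", "page:/kdd-intro.html")
net.add_edge("term:knowledge discovery", "page:/data-mining.html")
net.add_edge("term:web usage", "page:/web-usage-mining.html")
net.add_edge("page:/kdd-intro.html", "page:/data-mining.html")  # co-visited

def recommend(current_page, k=2):
    """Rank candidate pages by shared neighbors with the current page."""
    scores = {}
    for nbr in net.neighbors(current_page):
        for cand in net.neighbors(nbr):
            if cand != current_page and cand.startswith("page:"):
                scores[cand] = scores.get(cand, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("page:/kdd-intro.html"))
```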

Journal ArticleDOI
TL;DR: This work exploits the fact that RDTs can naturally fit into a parallel and fully distributed architecture, and develops protocols to implement privacy-preserving RDTs that enable general and efficient distributed privacy-preserving knowledge discovery.
Abstract: Distributed data is ubiquitous in modern information driven applications. With multiple sources of data, the natural challenge is to determine how to collaborate effectively across proprietary organizational boundaries while maximizing the utility of collected information. Since using only local data gives suboptimal utility, techniques for privacy-preserving collaborative knowledge discovery must be developed. Existing cryptography-based work for privacy-preserving data mining is still too slow to be effective for the large-scale data sets of today's big data challenge. Previous work on random decision trees (RDT) shows that it is possible to generate equivalent and accurate models with much smaller cost. We exploit the fact that RDTs can naturally fit into a parallel and fully distributed architecture, and develop protocols to implement privacy-preserving RDTs that enable general and efficient distributed privacy-preserving knowledge discovery.
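
The key property the protocols exploit, that an RDT's structure is generated without looking at any data, is easy to illustrate. In the hypothetical sketch below, the tree is built data-free and each party could then populate leaf counts locally; the real protocols add cryptographic machinery omitted here.

```python
# Sketch of the random decision tree (RDT) idea the protocols build on:
# the structure is generated randomly, without looking at any data, so
# parties can later populate leaf statistics with their private records.
import numpy as np

rng = np.random.default_rng(0)

def build_random_tree(n_features, depth):
    """Recursively pick random (feature, threshold) splits; data-free."""
    if depth == 0:
        return {"counts": np.zeros(2)}        # leaf: class counts
    return {
        "feature": rng.integers(n_features),
        "threshold": rng.uniform(),
        "left": build_random_tree(n_features, depth - 1),
        "right": build_random_tree(n_features, depth - 1),
    }

def add_example(node, x, y):
    """Each party can run this locally on its own records."""
    while "feature" in node:
        side = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[side]
    node["counts"][y] += 1

tree = build_random_tree(n_features=4, depth=3)
X = rng.uniform(size=(100, 4))
y = (X[:, 0] > 0.5).astype(int)
for xi, yi in zip(X, y):
    add_example(tree, xi, yi)
```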

Journal ArticleDOI
TL;DR: A keyword extraction method for tweet collections, TKG, represents texts as graphs and applies centrality measures to find the relevant vertices (keywords), proving to be a novel and robust way to extract keywords from texts, particularly from short messages such as tweets.
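
The TKG idea reduces to building a word co-occurrence graph and ranking vertices by centrality. The networkx sketch below uses invented tweets, naive whitespace tokenization, and degree centrality; the paper evaluates several graph constructions and centrality measures.

```python
# Rough sketch of the TKG idea: represent short texts as a word
# co-occurrence graph and rank vertices by a centrality measure.
import itertools
import networkx as nx

tweets = [
    "big data needs scalable mining",
    "scalable stream mining for big data",
    "keyword extraction from tweets",
]

G = nx.Graph()
for tweet in tweets:
    words = tweet.split()
    for w1, w2 in itertools.combinations(set(words), 2):
        G.add_edge(w1, w2)             # co-occurrence within a tweet

centrality = nx.degree_centrality(G)   # other measures work the same way
keywords = sorted(centrality, key=centrality.get, reverse=True)[:3]
print(keywords)
```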

Journal ArticleDOI
TL;DR: The largest existing taxonomy of common knowledge is blended with a natural-language-based semantic network of common-sense knowledge, and multidimensional scaling is applied to the resulting knowledge base for open-domain opinion mining and sentiment analysis.
Abstract: The ability to understand natural language text is far from being emulated in machines. One of the main hurdles to overcome is that computers lack both the common and common-sense knowledge that humans normally acquire during the formative years of their lives. To really understand natural language, a machine should be able to comprehend this type of knowledge, rather than merely relying on the valence of keywords and word co-occurrence frequencies. In this article, the largest existing taxonomy of common knowledge is blended with a natural-language-based semantic network of common-sense knowledge. Multidimensional scaling is applied on the resulting knowledge base for open-domain opinion mining and sentiment analysis.
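
The final step, applying multidimensional scaling to the blended knowledge base, can be illustrated with scikit-learn's MDS on a tiny invented concept-by-feature matrix; affectively similar concepts should land close together.

```python
# Hedged sketch of the final step: apply multidimensional scaling to a
# concept-by-feature matrix so similar concepts end up close together.
import numpy as np
from sklearn.manifold import MDS

concepts = ["birthday", "gift", "accident", "funeral"]
features = np.array([                  # rows: concepts, cols: semantic features
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.0, 0.1],
    [0.0, 0.1, 1.0, 0.8],
    [0.1, 0.0, 0.9, 1.0],
])

coords = MDS(n_components=2, random_state=0).fit_transform(features)
for name, (x_coord, y_coord) in zip(concepts, coords):
    print(f"{name}: ({x_coord:.2f}, {y_coord:.2f})")
```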

Journal ArticleDOI
TL;DR: This work reviews studies that applied supervised and unsupervised data mining techniques to detect health care fraud and abuse, and recommends seven general steps for data mining of health care claims.
Abstract: Inappropriate payments by insurance organizations or third party payers occur because of errors, abuse and fraud. The scale of this problem is large enough to make it a priority issue for health systems. Traditional methods of detecting health care fraud and abuse are time-consuming and inefficient. Combining automated methods and statistical knowledge led to the emergence of a new interdisciplinary branch of science named Knowledge Discovery from Databases (KDD). Data mining is the core of the KDD process. Data mining can help third-party payers such as health insurance organizations to extract useful information from thousands of claims and identify a smaller subset of the claims or claimants for further assessment. We reviewed studies that applied data mining techniques, both supervised and unsupervised, to detecting health care fraud and abuse. Most available studies have focused on algorithmic data mining without an emphasis on or application to fraud detection efforts in the context of health service provision or health insurance policy. More studies are needed to connect sound and evidence-based diagnosis and treatment approaches to fraudulent or abusive behaviors. Ultimately, based on available studies, we recommend seven general steps for data mining of health care claims.

Journal ArticleDOI
TL;DR: An original Experience Feedback process dedicated to maintenance is suggested, making it possible to capitalize on past activities by formalizing the domain knowledge and experiences in a visual knowledge representation formalism with logical foundations, and by extracting new knowledge with association rule mining algorithms.
Abstract: Knowledge is nowadays considered as a significant source of performance improvement, but may be difficult to identify, structure, analyse and reuse properly. A possible source of knowledge is in the data and information stored in various modules of industrial information systems, like CMMS (Computerized Maintenance Management Systems) for maintenance. In that context, the main objective of this paper is to propose a framework for managing and generating knowledge from information on past experiences, in order to improve the decisions related to the maintenance activity. To that end, we suggest an original Experience Feedback process dedicated to maintenance, making it possible to capitalize on past activities by (i) formalizing the domain knowledge and experiences using a visual knowledge representation formalism with logical foundations (Conceptual Graphs); (ii) extracting new knowledge using association rule mining algorithms, through an innovative interactive approach; and (iii) interpreting and evaluating this new knowledge using the reasoning operations of Conceptual Graphs. The suggested method is illustrated on a case study based on real data dealing with the maintenance of overhead cranes.
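
Step (ii) of the process can be illustrated with off-the-shelf association rule mining. The sketch below uses mlxtend's apriori on invented maintenance records; the events, thresholds, and tooling are assumptions, not the authors' exact setup.

```python
# Toy sketch of step (ii): mining association rules from past maintenance
# experiences. Events and the mlxtend workflow are illustrative assumptions.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One row per past intervention, one column per observed event/action
records = pd.DataFrame([
    {"motor_overheat": 1, "worn_brake": 1, "replace_brake": 1},
    {"motor_overheat": 1, "worn_brake": 0, "replace_brake": 0},
    {"motor_overheat": 0, "worn_brake": 1, "replace_brake": 1},
    {"motor_overheat": 1, "worn_brake": 1, "replace_brake": 1},
]).astype(bool)

frequent = apriori(records, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```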

Journal ArticleDOI
TL;DR: A quantitative evaluation shows a significant improvement when using an enriched version of SenticNet for polarity classification and a qualitative evaluation sheds light on the strengths and weaknesses of the concept grounding, and on the quality of the enrichment process.
Abstract: This paper presents a novel method for contextualizing and enriching large semantic knowledge bases for opinion mining, with a focus on Web intelligence platforms and other high-throughput big data applications. The method is not only applicable to traditional sentiment lexicons, but also to more comprehensive, multi-dimensional affective resources such as SenticNet. It comprises the following steps: (i) identify ambiguous sentiment terms, (ii) provide context information extracted from a domain-specific training corpus, and (iii) ground this contextual information to structured background knowledge sources such as ConceptNet and WordNet. A quantitative evaluation shows a significant improvement when using an enriched version of SenticNet for polarity classification. Crowdsourced gold standard data in conjunction with a qualitative evaluation sheds light on the strengths and weaknesses of the concept grounding, and on the quality of the enrichment process.
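
Step (i), identifying ambiguous sentiment terms, can be approximated by flagging terms whose observed polarity varies strongly across contexts. The sketch below uses invented polarity observations and an arbitrary variance threshold.

```python
# Illustrative sketch of step (i): flag sentiment terms whose polarity
# varies strongly across domain contexts. Scores are invented.
import statistics

context_polarity = {             # term -> polarity observed in each context
    "cold":  [-0.6, 0.4, -0.5],  # "cold coffee" vs "cold beer"
    "small": [-0.4, 0.5, -0.3, 0.6],
    "good":  [0.7, 0.8, 0.6],
}

ambiguous = [t for t, scores in context_polarity.items()
             if statistics.pstdev(scores) > 0.3]
print(ambiguous)                 # candidates for contextual grounding
```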

Journal ArticleDOI
TL;DR: This survey discusses successful techniques and methods for effective information retrieval in text mining, and describes the types of situations in which each technique may be useful to users.
Abstract: In recent years, the growth of digital data has accelerated, and knowledge discovery and data mining have attracted great attention given the need to turn such data into useful information and knowledge. The information and knowledge extracted from large amounts of data benefit many applications, such as market analysis and business management. In many applications, databases store information in text form, which makes text mining one of the most active research areas. Extracting the information a user requires is the challenging issue. Text mining is an important step of the knowledge discovery process: it extracts hidden information from unstructured and semi-structured data, automatically discovering new, previously unknown information from different written resources. This survey covers text mining techniques and methods that address these challenges, discusses successful techniques for effective information retrieval in text mining, and describes the types of situations in which each technique may be useful to users.