
Showing papers on "Knowledge extraction published in 2016"


Journal ArticleDOI
01 Jan 2016-Database
TL;DR: The Harmonizome is a comprehensive resource of knowledge about genes and proteins that enables researchers to discover novel relationships between biological entities, as well as form novel data-driven hypotheses for experimental validation.
Abstract: Genomics, epigenomics, transcriptomics, proteomics and metabolomics efforts rapidly generate a plethora of data on the activity and levels of biomolecules within mammalian cells. At the same time, curation projects that organize knowledge from the biomedical literature into online databases are expanding. Hence, there is a wealth of information about genes, proteins and their associations, with an urgent need for data integration to achieve better knowledge extraction and data reuse. For this purpose, we developed the Harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins from over 70 major online resources. We extracted, abstracted and organized data into ∼72 million functional associations between genes/proteins and their attributes. Such attributes could be physical relationships with other biomolecules, expression in cell lines and tissues, genetic associations with knockout mouse or human phenotypes, or changes in expression after drug treatment. We stored these associations in a relational database along with rich metadata for the genes/proteins, their attributes and the original resources. The freely available Harmonizome web portal provides a graphical user interface, a web service and a mobile app for querying, browsing and downloading all of the collected data. To demonstrate the utility of the Harmonizome, we computed and visualized gene-gene and attribute-attribute similarity networks, and through unsupervised clustering, identified many unexpected relationships by combining pairs of datasets such as the association between kinase perturbations and disease signatures. We also applied supervised machine learning methods to predict novel substrates for kinases, endogenous ligands for G-protein coupled receptors, mouse phenotypes for knockout genes, and classified unannotated transmembrane proteins for likelihood of being ion channels. The Harmonizome is a comprehensive resource of knowledge about genes and proteins, and as such, it enables researchers to discover novel relationships between biological entities, as well as form novel data-driven hypotheses for experimental validation. Database URL: http://amp.pharm.mssm.edu/Harmonizome.

962 citations
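
To make the similarity-network step concrete, here is a minimal sketch (not the Harmonizome's actual pipeline) of how a gene-gene similarity network can be derived from a binary gene-by-attribute association matrix; the genes and attribute columns are toy examples.

```python
# Minimal sketch (not the Harmonizome pipeline): given a binary gene-by-attribute
# association matrix, compute a gene-gene cosine-similarity network.
import numpy as np

# Toy data: rows are genes, columns are attributes (e.g., pathways, phenotypes).
genes = ["TP53", "BRCA1", "EGFR"]
associations = np.array([
    [1, 0, 1, 1],   # TP53
    [1, 1, 1, 0],   # BRCA1
    [0, 1, 0, 1],   # EGFR
], dtype=float)

# Cosine similarity between every pair of gene attribute vectors.
norms = np.linalg.norm(associations, axis=1, keepdims=True)
unit = associations / norms
similarity = unit @ unit.T

for i, gi in enumerate(genes):
    for j, gj in enumerate(genes):
        if i < j:
            print(f"{gi} ~ {gj}: {similarity[i, j]:.2f}")
```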


Journal ArticleDOI
TL;DR: A survey of such knowledge graph refinement approaches, with a dual look at both the methods being proposed as well as the evaluation methodologies used.
Abstract: In recent years, different Web knowledge graphs, both free and commercial, have been created. While Google coined the term "Knowledge Graph" in 2012, there are also a few openly available knowledge graphs, with DBpedia, YAGO, and Freebase being among the most prominent ones. Those graphs are often constructed from semi-structured knowledge, such as Wikipedia, or harvested from the web with a combination of statistical and linguistic methods. The result is large-scale knowledge graphs that try to make a good trade-off between completeness and correctness. In order to further increase the utility of such knowledge graphs, various refinement methods have been proposed, which try to infer and add missing knowledge to the graph, or identify erroneous pieces of information. In this article, we provide a survey of such knowledge graph refinement approaches, with a dual look at both the methods being proposed as well as the evaluation methodologies used.

915 citations
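
As a concrete illustration of the error-detection side of refinement, the following hedged sketch flags suspicious numeric literals for a property using a simple interquartile-range outlier test. The triples, the `populationTotal` property, and its values are illustrative stand-ins, and this is not any specific method from the survey.

```python
# Minimal sketch of one simple refinement heuristic in the error-detection family:
# flag suspicious numeric literals for a property via the interquartile range.
import statistics

triples = [
    ("Berlin", "populationTotal", 3_645_000),
    ("Hamburg", "populationTotal", 1_841_000),
    ("Munich", "populationTotal", 1_472_000),
    ("Cologne", "populationTotal", 1_086_000),
    ("Frankfurt", "populationTotal", 753_000),
    ("Stuttgart", "populationTotal", 634_000),
    ("Dortmund", "populationTotal", 588_000),
    ("Essen", "populationTotal", 583_000),
    ("SmallTown", "populationTotal", 9_999_999_999),  # likely an extraction error
]

values = sorted(v for _, _, v in triples)
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

for s, p, v in triples:
    if not (low <= v <= high):
        print(f"Suspicious literal: <{s}, {p}, {v}>")
```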


Proceedings ArticleDOI
Qi Wu, Peng Wang, Chunhua Shen, Anthony Dick, Anton van den Hengel
27 Jun 2016
TL;DR: In this article, the authors propose a method for visual question answering which combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions.
Abstract: We propose a method for visual question answering which combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions. This allows more complex questions to be answered using the predominant neural network-based approach than has previously been possible. It particularly allows questions to be asked about the contents of an image, even when the image itself does not contain the whole answer. The method constructs a textual representation of the semantic content of an image, and merges it with textual information sourced from a knowledge base, to develop a deeper understanding of the scene viewed. Priming a recurrent neural network with this combined information, and the submitted question, leads to a very flexible visual question answering approach. We are specifically able to answer questions posed in natural language, that refer to information not contained in the image. We demonstrate the effectiveness of our model on two publicly available datasets, Toronto COCO-QA [23] and VQA [1] and show that it produces the best reported results in both cases.

316 citations


Journal ArticleDOI
TL;DR: The survey shows that, while there are numerous interesting research works performed, the full potential of the Semantic Web and Linked Open Data for data mining and KDD is still to be unlocked.

266 citations


Journal ArticleDOI
TL;DR: The proposition that manual content analysis of injury reports can be eliminated using natural language processing (NLP) is tested and the results indicate that the NLP system is capable of quickly and automatically scanning unstructured injury reports for 101 attributes and outcomes with over 95% accuracy.

204 citations
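
The following is only a hedged sketch of the kind of keyword/rule-based scan such a system might perform on unstructured injury reports; the three-attribute lexicon and regular expressions are illustrative stand-ins, not the 101 attributes or the actual NLP system evaluated in the paper.

```python
# Hedged sketch only: a keyword/rule-based scan for a few injury-report attributes.
import re

ATTRIBUTE_PATTERNS = {
    "ladder": re.compile(r"\bladder\b", re.I),
    "fall_from_height": re.compile(r"\bfell (from|off)\b", re.I),
    "hand_injury": re.compile(r"\b(hand|finger|thumb)\b", re.I),
}

def scan_report(text: str) -> set[str]:
    """Return the set of attribute labels whose pattern matches the report text."""
    return {name for name, pat in ATTRIBUTE_PATTERNS.items() if pat.search(text)}

report = "Worker fell from a ladder and lacerated his left hand."
print(scan_report(report))  # expected: ladder, fall_from_height, hand_injury
```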


Journal ArticleDOI
TL;DR: The CityPulse framework supports smart city service creation by means of a distributed system for semantic discovery, data analytics, and interpretation of large-scale (near-)real-time Internet of Things data and social media data streams to break away from silo applications and enable cross-domain data integration.
Abstract: Our world and our lives are changing in many ways. Communication, networking, and computing technologies are among the most influential enablers that shape our lives today. Digital data and connected worlds of physical objects, people, and devices are rapidly changing the way we work, travel, socialize, and interact with our surroundings, and they have a profound impact on different domains, such as healthcare, environmental monitoring, urban systems, and control and management applications, among several other areas. Cities currently face an increasing demand for providing services that can have an impact on people’s everyday lives. The CityPulse framework supports smart city service creation by means of a distributed system for semantic discovery, data analytics, and interpretation of large-scale (near-)real-time Internet of Things data and social media data streams. The goal is to break away from silo applications and enable cross-domain data integration. The CityPulse framework integrates multimodal, mixed quality, uncertain and incomplete data to create reliable, dependable information and continuously adapts data processing techniques to meet the quality of information requirements from end users. Different from existing solutions that mainly offer unified views of the data, the CityPulse framework is also equipped with powerful data analytics modules that perform intelligent data aggregation, event detection, quality assessment, contextual filtering, and decision support. This paper presents the framework, describes its components, and demonstrates how they interact to support easy development of custom-made applications for citizens. The benefits and the effectiveness of the framework are demonstrated in a use-case scenario implementation presented in this paper.

199 citations


Proceedings Article
13 Aug 2016
TL;DR: The 2016 ACM Conference on Knowledge Discovery and Data Mining (KDD'16) attracted a significant number of submissions from countries all over the world; in particular, the research track attracted 784 submissions and the applied data science track attracted 331 submissions.
Abstract: It is our great pleasure to welcome you to the 2016 ACM Conference on Knowledge Discovery and Data Mining -- KDD'16. We hope that the content and the professional network at KDD'16 will help you succeed professionally by enabling you to: identify technology trends early; make new/creative contributions; increase your productivity by using newer/better tools, processes or ways of organizing teams; identify new job opportunities; and hire new team members. We are living in an exciting time for our profession. On the one hand, we are witnessing the industrialization of data science, and the emergence of the industrial assembly line processes characterized by the division of labor, integrated processes/pipelines of work, standards, automation, and repeatability. Data science practitioners are organizing themselves in more sophisticated ways, embedding themselves in larger teams in many industry verticals, improving their productivity substantially, and achieving a much larger scale of social impact. On the other hand, we are also witnessing astonishing progress from research in algorithms and systems -- for example the field of deep neural networks has revolutionized speech recognition, NLP, computer vision, image recognition, etc. By facilitating interaction between practitioners at large companies & startups on the one hand, and the algorithm development researchers including leading academics on the other, KDD'16 fosters technological and entrepreneurial innovation in the area of data science. This year's conference continues its tradition of being the premier forum for presentation of results in the field of data mining, both in the form of cutting edge research, and in the form of insights from the development and deployment of real world applications. Further, the conference continues with its tradition of a strong tutorial and workshop program on leading edge issues of data mining. The mission of this conference has broadened in recent years even as we placed a significant amount of focus on both the research and applied aspects of data mining. As an example of this broadened focus, this year we have introduced a strong hands-on tutorial program during the conference in which participants will learn how to use practical tools for data mining. KDD'16 also gives researchers and practitioners a unique opportunity to form professional networks, and to share their perspectives with others interested in the various aspects of data mining. For example, we have introduced office hours for budding entrepreneurs from our community to meet leading Venture Capitalists investing in this area. We hope that the KDD 2016 conference will serve as a meeting ground for researchers, practitioners, funding agencies, and investors to help create new algorithms and commercial products. The call for papers attracted a significant number of submissions from countries all over the world. In particular, the research track attracted 784 submissions and the applied data science track attracted 331 submissions. Papers were accepted either as full papers or as posters. The overall acceptance rate either as full papers or posters was less than 20%. For full papers in the research track, the acceptance rate was lower than 10%. This is consistent with the fact that the KDD Conference is a premier conference in data mining and the acceptance rates historically tend to be low. It is noteworthy that the applied data science track received a larger number of submissions compared to previous years.
We view this as an encouraging sign that research in data mining is increasingly becoming relevant to industrial applications. All papers were reviewed by at least three program committee members and then discussed by the PC members in a discussion moderated by a meta-reviewer. Borderline papers were thoroughly reviewed by the program chairs before final decisions were made.

179 citations


Journal ArticleDOI
TL;DR: In this paper, the authors discuss the evolution of big data computing, differences between traditional data warehousing and big data, taxonomy of Big Data computing and underpinning technologies, integrated platform of Big data and clouds known as big data clouds, layered architecture and components of big Data cloud, and finally open-technical challenges and future directions.
Abstract: Advances in information technology and its widespread growth in several areas of business, engineering, medical, and scientific studies are resulting in an information/data explosion. Knowledge discovery and decision-making from such rapidly growing voluminous data are challenging tasks in terms of data organization and processing, which is an emerging trend known as big data computing, a new paradigm that combines large-scale compute, new data-intensive techniques, and mathematical models to build data analytics. Big data computing demands huge storage and computing resources for data curation and processing that could be delivered from on-premise or cloud infrastructures. This paper discusses the evolution of big data computing, differences between traditional data warehousing and big data, taxonomy of big data computing and underpinning technologies, integrated platform of big data and clouds known as big data clouds, layered architecture and components of big data cloud, and finally open technical challenges and future directions. Copyright © 2015 John Wiley & Sons, Ltd.

141 citations


Journal ArticleDOI
TL;DR: The selection of the right and appropriate text mining technique helps to enhance the speed and decrease the time and effort required to extract valuable information.
Abstract: Rapid progress in digital data acquisition techniques has led to a huge volume of data. More than 80 percent of today’s data is composed of unstructured or semi-structured data. The discovery of appropriate patterns and trends to analyze the text documents from a massive volume of data is a big issue. Text mining is a process of extracting interesting and non-trivial patterns from a huge amount of text documents. There exist different techniques and tools to mine the text and discover valuable information for future prediction and decision making processes. The selection of the right and appropriate text mining technique helps to enhance the speed and decrease the time and effort required to extract valuable information. This paper briefly discusses and analyzes the text mining techniques and their applications in diverse fields of life. Moreover, the issues in the field of text mining that affect the accuracy and relevance of results are identified.

136 citations
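
As one concrete example of the techniques such a survey covers, the sketch below runs a common text-mining pipeline, TF-IDF vectorization followed by k-means clustering, on a handful of toy documents; it is illustrative only and not tied to any specific tool discussed in the paper.

```python
# A minimal sketch of one common text-mining pipeline: TF-IDF vectorization of
# documents followed by k-means clustering of the resulting vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stock market prices fell sharply today",
    "the central bank raised interest rates",
    "the team won the championship game",
    "injury forces star player to miss the season",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for doc, label in zip(docs, km.labels_):
    print(label, doc)
```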


Journal ArticleDOI
TL;DR: This article proposes a distributed implementation of one of the most well-known discretizers based on Information Theory, the entropy minimization discretizer proposed by Fayyad and Irani.
Abstract: Discretization of numerical data is one of the most influential data preprocessing tasks in knowledge discovery and data mining. The purpose of attribute discretization is to find concise data representations as categories which are adequate for the learning task, retaining as much information in the original continuous attribute as possible. In this article, we present an updated overview of discretization techniques in conjunction with a complete taxonomy of the leading discretizers. Despite the great impact of discretization as a data preprocessing technique, few elementary approaches have been developed in the literature for Big Data. The purpose of this article is twofold: to present a comprehensive taxonomy of discretization techniques that helps practitioners in the use of the algorithms, and to demonstrate that standard discretization methods can be parallelized in Big Data platforms such as Apache Spark, boosting both performance and accuracy. We thus propose a distributed implementation of one of the most well-known discretizers based on Information Theory: the entropy minimization discretizer proposed by Fayyad and Irani. Our scheme goes beyond a simple parallelization and it is intended to be the first to face the Big Data challenge. WIREs Data Mining Knowl Discov 2016, 6:5-21. doi: 10.1002/widm.1173

132 citations
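
To illustrate the information-theoretic core of entropy-based discretization, the sketch below picks the single cut point that minimizes the weighted class entropy of the two induced intervals; it omits the MDL stopping criterion of the Fayyad-Irani method and, of course, the distributed Apache Spark implementation proposed in the article.

```python
# Sketch of the core information-theoretic step behind entropy-based discretization:
# pick the boundary that minimizes the weighted class entropy of the two induced bins.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        w = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if w < best[0]:
            best = (w, cut)
    return best  # (weighted class entropy, cut point)

values = [1.0, 1.2, 1.5, 3.8, 4.0, 4.1]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_cut(values, labels))  # cut near 2.65 cleanly separates the classes
```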


Journal ArticleDOI
11 May 2016
TL;DR: A certain perspective is given on how to use set theory for management of information granules for rules/rule terms and different types of computational logic for reduction of learning bias in rule-based systems.
Abstract: A rule-based system is a special type of expert system, which typically consists of a set of if–then rules. Such rules can be used in the real world for both academic and practical purposes. In general, rule-based systems are involved in knowledge discovery tasks for both purposes and predictive modeling tasks for the latter purpose. In the context of granular computing, each of the rules that make up a rule-based system can be seen as a granule. This is due to the fact that granulation in general means decomposition of a whole into several parts. Similarly, each rule consists of a number of rule terms. From this point of view, each rule term can also be seen as a granule. As mentioned above, rule-based systems can be used for the purpose of knowledge discovery, which means to extract information or knowledge discovered from data. Therefore, rules and rule terms that make up a rule-based system are considered as information granules. This paper positions the research of rule-based systems in the granular computing context, which explores ways of achieving advances in the former area through the novel use of theories and techniques in the latter area. In particular, this paper gives a certain perspective on how to use set theory for management of information granules for rules/rule terms and different types of computational logic for reduction of learning bias. The effectiveness is critically analyzed and discussed. Further directions of this research area are recommended towards achieving advances in rule-based systems through the use of granular computing theories and techniques.

Proceedings ArticleDOI
01 Dec 2016
TL;DR: This work proposes to augment three information sources into one learning objective function, so that the interplay among the three sources is enforced by requiring the learned network representations to be consistent with node content and topology structure, and to follow the social homophily constraints in the learned space.
Abstract: Advances in social networking and communication technologies have witnessed an increasing number of applications where data is not only characterized by rich content information, but also connected with complex relationships representing social roles and dependencies between individuals. To enable knowledge discovery from such networked data, network representation learning (NRL) aims to learn vector representations for network nodes, such that off-the-shelf machine learning algorithms can be directly applied. To date, existing NRL methods either primarily focus on network structure or simply combine node content and topology for learning. We argue that in information networks, information mainly originates from three sources: (1) homophily, (2) topology structure, and (3) node content. Homophily refers to the social phenomenon in which individuals sharing similar attributes (content) tend to be directly connected through local relational ties, while topology structure emphasizes global connections. To ensure effective network representation learning, we propose to augment the three information sources into one learning objective function, so that the interplay among the three sources is enforced by requiring the learned network representations (1) to be consistent with node content and topology structure, and (2) to follow the social homophily constraints in the learned space. Experiments on multi-class node classification demonstrate that the representations learned by the proposed method consistently outperform state-of-the-art NRL methods, especially for very sparsely labeled networks.
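
The hedged sketch below only illustrates the general idea of fusing topology and node content into one node representation (low-rank factors of the adjacency matrix concatenated with low-rank factors of TF-IDF content vectors); it is not the paper's joint objective function with homophily constraints, and the toy graph and node texts are invented.

```python
# Hedged illustration only: a simple way to fuse topology and node content into one
# node representation. The paper instead optimizes a single joint objective with a
# homophily constraint; this concatenation baseline is not that method.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy network: adjacency matrix over 4 nodes, plus a short text per node.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
texts = ["data mining graphs", "graph mining", "social networks", "protein networks"]

struct = TruncatedSVD(n_components=2, random_state=0).fit_transform(A)
content = TruncatedSVD(n_components=2, random_state=0).fit_transform(
    TfidfVectorizer().fit_transform(texts))

embeddings = np.hstack([struct, content])  # one vector per node
print(embeddings.shape)  # (4, 4)
```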

Proceedings ArticleDOI
14 May 2016
TL;DR: A manual investigation is performed to understand why users submit duplicate questions in Stack Overflow and a classification technique is proposed that uses a number of carefully chosen features to identify duplicate questions with reasonable accuracy.
Abstract: Stack Overflow is a popular question answering site that is focused on programming problems. Despite efforts to prevent asking questions that have already been answered, the site contains duplicate questions. This may cause developers to unnecessarily wait for a question to be answered when it has already been asked and answered. The site currently depends on its moderators and users with high reputation to manually mark those questions as duplicates, which not only results in delayed responses but also requires additional efforts. In this paper, we first perform a manual investigation to understand why users submit duplicate questions in Stack Overflow. Based on our manual investigation we propose a classification technique that uses a number of carefully chosen features to identify duplicate questions. Evaluation using a large number of questions shows that our technique can detect duplicate questions with reasonable accuracy. We also compare our technique with DupPredictor, a state-of-the-art technique for detecting duplicate questions, and we found that our proposed technique has a better recall rate than that technique.
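
The sketch below shows the general shape of such a feature-based duplicate classifier: compute a few pairwise features and train an off-the-shelf classifier on labeled question pairs. The two features here (title-token overlap and tag overlap) and the toy pairs are illustrative stand-ins, not the authors' carefully chosen feature set or dataset.

```python
# Sketch only: a feature-based classifier for duplicate-question detection.
from sklearn.linear_model import LogisticRegression

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def features(q1, q2):
    return [
        jaccard(q1["title"].lower().split(), q2["title"].lower().split()),
        jaccard(q1["tags"], q2["tags"]),
    ]

pairs = [  # (question 1, question 2, is_duplicate)
    ({"title": "How to parse JSON in Python", "tags": {"python", "json"}},
     {"title": "Parsing JSON with Python", "tags": {"python", "json"}}, 1),
    ({"title": "How to parse JSON in Python", "tags": {"python", "json"}},
     {"title": "Center a div with CSS", "tags": {"css", "html"}}, 0),
    ({"title": "Sort a list in Java", "tags": {"java", "sorting"}},
     {"title": "Sorting a list in Java", "tags": {"java"}}, 1),
    ({"title": "Sort a list in Java", "tags": {"java", "sorting"}},
     {"title": "Read a file in C", "tags": {"c", "file-io"}}, 0),
]

X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
clf = LogisticRegression().fit(X, y)
print(clf.predict([features(pairs[0][0], pairs[0][1])]))
```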

Journal ArticleDOI
TL;DR: The proposed Generalized Logistic algorithm is simple yet effective, robust to outliers, so no additional denoising or outlier detection step is needed in data preprocessing, and empirical results show models learned from data scaled by the GL algorithm have higher accuracy compared to the commonly used data scaling algorithms.
Abstract: Background: Machine learning models have been adapted in biomedical research and practice for knowledge discovery and decision support. While mainstream biomedical informatics research focuses on developing more accurate models, the importance of data preprocessing draws less attention. We propose the Generalized Logistic (GL) algorithm that scales data uniformly to an appropriate interval by learning a generalized logistic function to fit the empirical cumulative distribution function of the data. The GL algorithm is simple yet effective; it is intrinsically robust to outliers, so it is particularly suitable for diagnostic/classification models in clinical/medical applications where the number of samples is usually small; it scales the data in a nonlinear fashion, which leads to potential improvement in accuracy.
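
A minimal sketch of the idea as the abstract describes it: fit a generalized logistic curve to the empirical cumulative distribution function and scale the data by evaluating the fitted curve. The Richards-style parameterization and the synthetic data below are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of the GL idea: fit a generalized logistic (Richards) curve to the empirical
# CDF, then scale data by evaluating the fitted curve (values land in (0, 1)).
# The parameterization is one common choice, not necessarily the paper's.
import numpy as np
from scipy.optimize import curve_fit

def gen_logistic(x, m, b, nu):
    return 1.0 / (1.0 + np.exp(-b * (x - m))) ** nu

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=200)
data[:5] = 30.0  # a few gross outliers

# Empirical CDF of the raw values.
xs = np.sort(data)
ecdf = np.arange(1, len(xs) + 1) / len(xs)

params, _ = curve_fit(gen_logistic, xs, ecdf,
                      p0=[np.median(data), 1.0, 1.0], maxfev=10000)
scaled = gen_logistic(data, *params)
print(scaled.min(), scaled.max())
```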

Book
27 May 2016
TL;DR: This is the first textbook on attribute exploration, its theory, its algorithms for applications, and some of its many possible generalizations, which provides an introduction to Formal Concept Analysis with emphasis on its ability to derive algebraic structures from qualitative data.
Abstract: This is the first textbook on attribute exploration, its theory, its algorithms for applications, and some of its many possible generalizations. Attribute exploration is useful for acquiring structured knowledge through an interactive process, by asking queries to an expert. Generalizations that handle incomplete, faulty, or imprecise data are discussed, but the focus lies on knowledge extraction from a reliable information source. The method is based on Formal Concept Analysis, a mathematical theory of concepts and concept hierarchies, and uses its expressive diagrams. The presentation is self-contained. It provides an introduction to Formal Concept Analysis with emphasis on its ability to derive algebraic structures from qualitative data, which can be represented in meaningful and precise graphics.

Journal ArticleDOI
TL;DR: An overview of the studies undertaking the two main data mining tasks (i.e. predictive tasks and descriptive tasks) in the building field is provided.

Journal ArticleDOI
19 Oct 2016-PLOS ONE
TL;DR: BEST, a biomedical entity search tool, is introduced; it is the only system that processes free text queries and returns up-to-date results in real time, including mutation information in the results.
Abstract: As the volume of publications rapidly increases, searching for relevant information from the literature becomes more challenging. To complement standard search engines such as PubMed, it is desirable to have an advanced search tool that directly returns relevant biomedical entities such as targets, drugs, and mutations rather than a long list of articles. Some existing tools submit a query to PubMed and process retrieved abstracts to extract information at query time, resulting in a slow response time and limited coverage of only a fraction of the PubMed corpus. Other tools preprocess the PubMed corpus to speed up the response time; however, they are not constantly updated, and thus produce outdated results. Further, most existing tools cannot process sophisticated queries such as searches for mutations that co-occur with query terms in the literature. To address these problems, we introduce BEST, a biomedical entity search tool. BEST returns, as a result, a list of 10 different types of biomedical entities including genes, diseases, drugs, targets, transcription factors, miRNAs, and mutations that are relevant to a user’s query. To the best of our knowledge, BEST is the only system that processes free text queries and returns up-to-date results in real time including mutation information in the results. BEST is freely accessible at http://best.korea.ac.kr.

Journal ArticleDOI
TL;DR: The aim of this paper is to review the state of the art on linguistic summarization in the framework of fuzzy sets, focusing on the methods for evaluating linguistic summaries and the current applications, and to propose a taxonomy for the methods.
Abstract: We review the evaluation methods for linguistic summarization. We propose a taxonomy for the methods. We present the differences between methods. We illustrate that fuzzy cardinality based methods provide consistent results. While the rapid development of information technology has made it easy to store and access huge amounts of data, it also brings another problem, that of how to extract potentially useful knowledge not only in an efficient way but also in a way that is easily understandable by humans. One of the solutions to this problem is linguistic summarization, the aim of which is to generate explicit and concise summaries from data that are more compatible with the human cognitive mechanism. The most crucial step in linguistic summarization is certainly the evaluation of linguistic summaries, since they are the most important element of fuzzy rule based systems commonly used in expert systems and intelligent systems. Therefore, the selection of an appropriate method for evaluating linguistic summaries with respect to different views such as quality, quantity, relevance and simplicity becomes vital. The aim of this paper is to review the state of the art on linguistic summarization in the framework of fuzzy sets, focusing on the methods for evaluating linguistic summaries and the current applications. A taxonomy is proposed to identify the existing methods depending on the type of fuzzy sets (i.e., type-1 fuzzy set and type-2 fuzzy set) and the type of cardinalities (i.e., scalar cardinality and fuzzy cardinality). The recent studies on linguistic summarization are also presented to give a comprehensive framework for the future directions. The paper ends with conclusions, addressing some important issues and open questions which can be subject for future research.
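
As a concrete example of the scalar-cardinality family of evaluation methods the review covers, the sketch below computes the degree of truth of a quantified summary such as "most employees are young" in the classic Zadeh style; the membership functions for "young" and "most" are illustrative choices, not taken from the paper.

```python
# Minimal sketch of scalar-cardinality (Zadeh-style) evaluation of a linguistic
# summary "Q of the records are A", e.g. "most employees are young".

def mu_young(age):
    # Fuzzy set "young": 1 up to age 30, decreasing linearly to 0 at age 50.
    if age <= 30:
        return 1.0
    if age >= 50:
        return 0.0
    return (50 - age) / 20.0

def mu_most(proportion):
    # Fuzzy quantifier "most": 0 below 0.3, 1 above 0.8, linear in between.
    if proportion <= 0.3:
        return 0.0
    if proportion >= 0.8:
        return 1.0
    return (proportion - 0.3) / 0.5

ages = [24, 28, 35, 41, 29, 55, 62, 33]
sigma_count = sum(mu_young(a) for a in ages)      # scalar cardinality of "young"
truth = mu_most(sigma_count / len(ages))          # degree of truth of the summary
print(f"T('most employees are young') = {truth:.2f}")
```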

Journal ArticleDOI
TL;DR: Results indicated that latent knowledge can be identified to support location selection decisions, and the proposed data mining framework consists of four stages: problem definition and data collection; RST analysis; rule validation; and knowledge extraction and usage.

Journal ArticleDOI
TL;DR: It is demonstrated that the system performs at state-of-the-art level for various subtasks in the four languages of the project, and that the full integration of these tasks in an overall system for reading text shows the capacity of the system to perform at an unprecedented scale.
Abstract: In this article, we describe a system that reads news articles in four different languages and detects what happened, who is involved, where and when. This event-centric information is represented as episodic situational knowledge on individuals in an interoperable RDF format that allows for reasoning on the implications of the events. Our system covers the complete path from unstructured text to structured knowledge, for which we defined a formal model that links interpreted textual mentions of things to their representation as instances. The model forms the skeleton for interoperable interpretation across different sources and languages. The real content, however, is defined using multilingual and cross-lingual knowledge resources, both semantic and episodic. We explain how these knowledge resources are used for the processing of text and ultimately define the actual content of the episodic situational knowledge that is reported in the news. The knowledge and model in our system can be seen as an example of how the Semantic Web helps NLP. However, our system also generates massive episodic knowledge of the same type as the Semantic Web is built on. We thus envision a cycle of knowledge acquisition and NLP improvement on a massive scale. This article reports on the details of the system but also on the performance of various high-level components. We demonstrate that our system performs at state-of-the-art level for various subtasks in the four languages of the project, but we also consider the full integration of these tasks in an overall system with the purpose of reading text. We applied our system to millions of news articles, generating billions of triples expressing formal semantic properties. This shows the capacity of the system to perform at an unprecedented scale.

Journal ArticleDOI
TL;DR: A survey of GO semantic similarity tools to provide a comprehensive view of the challenges and advances made to avoid redundant effort in developing features that already exist, or implementing ideas already proven to be obsolete in the context of GO.
Abstract: Gene Ontology (GO) semantic similarity tools enable retrieval of semantic similarity scores, which incorporate biological knowledge embedded in the GO structure for comparing or classifying different proteins or list of proteins based on their GO annotations. This facilitates a better understanding of biological phenomena underlying the corresponding experiment and enables the identification of processes pertinent to different biological conditions. Currently, about 14 tools are available, which may play an important role in improving protein analyses at the functional level using different GO semantic similarity measures. Here we survey these tools to provide a comprehensive view of the challenges and advances made in this area to avoid redundant effort in developing features that already exist, or implementing ideas already proven to be obsolete in the context of GO. This helps researchers, tool developers, as well as end users, understand the underlying semantic similarity measures implemented through knowledge of pertinent features of, and issues related to, a particular tool. This should empower users to make appropriate choices for their biological applications and ensure effective knowledge discovery based on GO annotations.

Proceedings ArticleDOI
13 Nov 2016
TL;DR: ScaleMine is proposed; a novel parallel frequent subgraph mining system for a single large graph that scales to 8,192 cores on a Cray XC40; supports graphs with one billion edges (10× larger than competitors), and is at least an order of magnitude faster than existing solutions.
Abstract: Frequent Subgraph Mining is an essential operation for graph analytics and knowledge extraction. Due to its high computational cost, parallel solutions are necessary. Existing approaches either suffer from load imbalance, or high communication and synchronization overheads. In this paper we propose ScaleMine; a novel parallel frequent subgraph mining system for a single large graph. ScaleMine introduces a novel two-phase approach. The first phase is approximate; it quickly identifies subgraphs that are frequent with high probability, while collecting various statistics. The second phase computes the exact solution by employing the results of the approximation to achieve good load balance; prune the search space; generate efficient execution plans; and guide intra-task parallelism. Our experiments show that ScaleMine scales to 8,192 cores on a Cray XC40 (12× more than competitors); supports graphs with one billion edges (10× larger than competitors), and is at least an order of magnitude faster than existing solutions.

Journal ArticleDOI
TL;DR: This paper provides a general approach for the concepts and processes related to the generation of linguistic descriptions of time series, and incorporates as a core element a description model, which is based on three pillars: a knowledge representation formalism, an expression language, and a quality framework.

Journal ArticleDOI
TL;DR: In this article, the authors describe a method to discover frequent behavioral patterns in event logs and express these patterns as local process models, which can be positioned in between process discovery and episode/sequential pattern mining.

Journal ArticleDOI
01 Jun 2016-Sensors
TL;DR: The step-by-step methodology that has to be applied to extract knowledge from raw data traces is illustrated, which clarifies when, why and how to use data science in wireless network research.
Abstract: Data science or "data-driven research" is a research approach that uses real-life data to gain insight about the behavior of systems. It enables the analysis of small, simple as well as large and more complex systems in order to assess whether they function according to the intended design and as seen in simulation. Data science approaches have been successfully applied to analyze networked interactions in several research areas such as large-scale social networks, advanced business and healthcare processes. Wireless networks can exhibit unpredictable interactions between algorithms from multiple protocol layers, interactions between multiple devices, and hardware specific influences. These interactions can lead to a difference between real-world functioning and design time functioning. Data science methods can help to detect the actual behavior and possibly help to correct it. Data science is increasingly used in wireless research. To support data-driven research in wireless networks, this paper illustrates the step-by-step methodology that has to be applied to extract knowledge from raw data traces. To this end, the paper (i) clarifies when, why and how to use data science in wireless network research; (ii) provides a generic framework for applying data science in wireless networks; (iii) gives an overview of existing research papers that utilized data science approaches in wireless networks; (iv) illustrates the overall knowledge discovery process through an extensive example in which device types are identified based on their traffic patterns; (v) provides the reader the necessary datasets and scripts to go through the tutorial steps themselves.
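
The hedged sketch below mirrors the flavor of the tutorial's device-identification example: train a classifier on simple per-device traffic features. The feature choice, the synthetic values, and the two device classes are assumptions for illustration; the paper supplies its own datasets and scripts.

```python
# Hedged sketch: identify device types from per-device traffic features.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Each row: [mean packet size (bytes), mean inter-arrival time (s)] per device.
X = [[80, 5.0], [75, 4.5], [90, 6.0],       # sensor nodes: small, infrequent packets
     [900, 0.1], [1100, 0.05], [950, 0.2]]  # laptops: large, frequent packets
y = ["sensor", "sensor", "sensor", "laptop", "laptop", "laptop"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```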

Journal ArticleDOI
TL;DR: A space-efficient bitwise data structure for capturing interdependency among social entities; a time-efficient data mining algorithm that makes the best use of the proposed data structure; and another time-efficient data mining algorithm for concurrent computation and discovery of groups of frequently followed social entities in parallel so as to handle high volumes of social network data.
Abstract: Social networking sites (e.g., Facebook, Google+, and Twitter) have become popular for sharing valuable knowledge and information among social entities (e.g., individual users and organizations), who are often linked by some interdependency such as friendship. As social networking sites keep growing, there are situations in which a user wants to find those frequently followed groups of social entities so that he can follow the same groups. In this article, we present (i) a space-efficient bitwise data structure for capturing interdependency among social entities; (ii) a time-efficient data mining algorithm that makes the best use of our proposed data structure for serial discovery of groups of frequently followed social entities; and (iii) another time-efficient data mining algorithm for concurrent computation and discovery of groups of frequently followed social entities in parallel so as to handle high volumes of social network data. Evaluation results show the efficiency and practicality of our data structure and social network data mining algorithms. Copyright © 2016 John Wiley & Sons, Ltd.
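
A conceptual sketch of the bitwise idea only (not the authors' actual data structure or algorithms): encode, for each social entity, the set of users following it as a bitmap, so that the support of a group of entities is the popcount of the bitwise AND of their bitmaps. The follower bitmaps and the minimum-support threshold below are invented.

```python
# Conceptual sketch: bitmaps of followers per entity; a group of entities is
# "frequently followed together" if the AND of their bitmaps has enough set bits.
from itertools import combinations

# Bit i of each bitmap is 1 if user i follows that entity (5 users here).
follows = {
    "alice":  0b10111,
    "bob":    0b10110,
    "carol":  0b00011,
    "dataco": 0b10100,
}
MINSUP = 2  # at least 2 users must follow every member of the group

for size in (2, 3):
    for group in combinations(follows, size):
        common = follows[group[0]]
        for member in group[1:]:
            common &= follows[member]
        support = bin(common).count("1")
        if support >= MINSUP:
            print(group, "followed together by", support, "users")
```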

BookDOI
14 Jun 2016
TL;DR: A completely new problem in the pattern mining field, mining of exceptional relationships between patterns, is discussed; in this problem, the goal is to identify patterns whose distribution is exceptionally different from the distribution in the complete set of data records.
Abstract: This book provides a comprehensive overview of the field of pattern mining with evolutionary algorithms. To do so, it covers formal definitions about patterns, pattern mining, types of patterns and the usefulness of patterns in the knowledge discovery process. As described within the book, the discovery process suffers from both high runtime and memory requirements, especially when high dimensional datasets are analyzed. To solve this issue, many pruning strategies have been developed. Nevertheless, with the growing interest in the storage of information, more and more datasets comprise such a dimensionality that the discovery of interesting patterns becomes a challenging process. In this regard, the use of evolutionary algorithms for mining patterns enables the computation capacity to be reduced, providing sufficiently good solutions. This book offers a survey on evolutionary computation with particular emphasis on genetic algorithms and genetic programming. Also included is an analysis of the set of quality measures most widely used in the field of pattern mining with evolutionary algorithms. This book serves as a review of the most important evolutionary algorithms for pattern mining. It considers the analysis of different algorithms for mining different types of patterns and relationships between patterns, such as frequent patterns, infrequent patterns, patterns defined in a continuous domain, or even positive and negative patterns. A completely new problem in the pattern mining field, mining of exceptional relationships between patterns, is discussed. In this problem the goal is to identify patterns whose distribution is exceptionally different from the distribution in the complete set of data records. Finally, the book deals with the subgroup discovery task, a method to identify a subgroup of interesting patterns that is related to a dependent variable or target attribute. This subgroup of patterns satisfies two essential conditions: interpretability and interestingness.

Book ChapterDOI
14 Nov 2016
TL;DR: This position paper critically reviews these four approaches to analysing massive data sets and produces a framework, which provides a visual representation of the relationship between them to effectively support their identification and easier differentiation.
Abstract: Confusion, ambiguity and misunderstanding of the concepts and terminology regarding different approaches concerned with analysing massive data sets (Business Intelligence, Big Data, Data Analytics and Knowledge Discovery) was identified as a significant issue faced by many academics, fellow researchers, industry professionals and domain experts. In that context, a need to critically evaluate these concepts and approaches, focusing on their similarities, differences and relationships, was identified as useful for further research and industry professionals. In this position paper, we critically review these four approaches and produce a framework, which provides a visual representation of the relationship between them to effectively support their identification and easier differentiation.

Journal ArticleDOI
TL;DR: This work defines the Time Aware Knowledge Extraction (TAKE) methodology, which relies on a temporal extension of Fuzzy Formal Concept Analysis; a microblog summarization algorithm is also defined that filters the concepts organized by TAKE into a time-dependent hierarchy.

Journal ArticleDOI
TL;DR: The main contribution of the proposed method is that it extracts explorative knowledge based on recognition of the data structure and categorizes instances through the clustering technique while performing simultaneous optimization for the artificial neural networks modeling.
Abstract: We examined the effectiveness of an optimized cluster-based undersampling technique. We used a GA-based optimization approach for selecting the appropriate instances. A critical issue of real-world knowledge extraction is the data imbalance problem. The proposed method is successfully applied to the bankruptcy prediction problem. We suggest an optimization approach of cluster-based undersampling to select appropriate instances. This approach can solve the data imbalance problem, which can lead to knowledge extraction for improving the performance of existing data mining techniques. Although data mining techniques among various big data analytics technologies have been successfully applied and proven in terms of classification performance in various domains, such as marketing, accounting and finance areas, the data imbalance problem has been regarded as one of the most important issues to be considered. We examined the effectiveness of a hybrid method using a clustering technique and genetic algorithms based on the artificial neural networks model to balance the proportion between the minority class and majority class. The objective of this paper is to construct the most suitable training dataset for both decreasing data imbalance and improving the classification accuracy. We extracted a properly balanced dataset composed of optimal or near-optimal instances for the artificial neural networks model. The main contribution of the proposed method is that we extract explorative knowledge based on recognition of the data structure and categorize instances through the clustering technique while performing simultaneous optimization for the artificial neural networks modeling. In addition, we can easily understand why the instances are selected by the rule-format knowledge representation, increasing the expressive power of the criteria for selecting instances. The proposed method is successfully applied to the bankruptcy prediction problem using financial data for which the proportion of small- and medium-sized bankruptcy firms in the manufacturing industry is extremely small compared to that of non-bankruptcy firms.
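
The sketch below illustrates only the cluster-based undersampling step on synthetic data: cluster the majority class and draw a proportional number of instances from each cluster until the classes are roughly balanced. The GA-based optimization of the selected instances and the neural-network model from the paper are omitted.

```python
# Sketch of cluster-based undersampling (the paper additionally optimizes the
# instance selection with a genetic algorithm, omitted here).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_major = rng.normal(0, 1, size=(200, 2))   # e.g., non-bankrupt firms (majority)
X_minor = rng.normal(3, 1, size=(20, 2))    # e.g., bankrupt firms (minority)

k = 5
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_major)

# Draw a proportional number of majority instances from each cluster so the
# undersampled majority set roughly matches the minority class size.
target = len(X_minor)
picked = []
for c in range(k):
    idx = np.where(km.labels_ == c)[0]
    n_c = max(1, round(target * len(idx) / len(X_major)))
    picked.extend(rng.choice(idx, size=min(n_c, len(idx)), replace=False))

X_balanced = np.vstack([X_major[picked], X_minor])
y_balanced = np.array([0] * len(picked) + [1] * len(X_minor))
print(X_balanced.shape, y_balanced.mean())
```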