Journal ArticleDOI

Toward semantic data imputation for a dengue dataset

TL;DR: An improvement in the efficiency of predicting missing data using Particle Swarm Optimization (PSO), applied to the numerical data-cleansing problem, with PSO's performance enhanced by K-means clustering to help determine the fitness value.
Abstract: Missing data are a major problem that affects data analysis techniques for forecasting. Traditional methods that rely on simple techniques, e.g., mean and mode substitution, perform poorly at predicting missing values. In this paper, we present and discuss a novel method of imputing missing values semantically with the use of an ontology model. We make three new contributions to the field: first, an improvement in the efficiency of predicting missing data utilizing Particle Swarm Optimization (PSO), which is applied to the numerical data cleansing problem, with the performance of PSO being enhanced using K-means to help determine the fitness value. Second, the incorporation of an ontology with PSO for the purpose of narrowing the search space, so that PSO predicts numerical missing values more accurately while converging on the answer more quickly. Third, a framework for substituting nominal data missing from the dataset, using the relationships between concepts and a reasoning mechanism over the knowledge-based model. The experimental results indicated that the proposed method could estimate missing data more efficiently and with less chance of error than conventional methods, as measured by the root mean square error.
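As a rough illustration of the numerical branch described above, the sketch below runs a standard global-best PSO in which each candidate completion of an incomplete record is scored by its distance to the nearest K-means centroid fitted on the complete records, and the search is clipped to caller-supplied bounds standing in for the ontology-narrowed range. The three-cluster setting, the centroid-distance fitness, and the `bounds` argument are assumptions made for this example, not the paper's implementation (Python, NumPy, scikit-learn).

```python
import numpy as np
from sklearn.cluster import KMeans

def pso_impute(complete_rows, partial_row, missing_idx, bounds,
               n_particles=20, n_iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Estimate the missing entries of `partial_row` with global-best PSO.

    Fitness (to minimize): distance from the candidate-completed row to the
    nearest K-means centroid fitted on the complete rows. `bounds` is one
    (low, high) pair per missing attribute, e.g. narrowed by domain knowledge.
    """
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=3, n_init=10, random_state=seed).fit(complete_rows)

    lo, hi = np.array(bounds).T                      # each of shape (n_missing,)
    pos = rng.uniform(lo, hi, size=(n_particles, len(missing_idx)))
    vel = np.zeros_like(pos)

    def fitness(candidate):
        row = partial_row.copy()
        row[missing_idx] = candidate
        return np.min(np.linalg.norm(km.cluster_centers_ - row, axis=1))

    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmin(pbest_fit)].copy()

    for _ in range(n_iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)             # keep candidates in the narrowed range
        fit = np.array([fitness(p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[np.argmin(pbest_fit)].copy()

    return gbest

# Example: one record missing attribute 2, whose ontology-implied range is 20..45.
# filled = pso_impute(complete, record, missing_idx=[2], bounds=[(20.0, 45.0)])
```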
Citations
Journal ArticleDOI
TL;DR: In this article, the authors present a comprehensive review of the current state of the art and future trends in real-time modelling of flood forecasting in urban drainage systems.
Abstract:
• Recent works on real-time flood forecasting in urban drainage systems are reviewed.
• A bibliometric and in-depth critical review is conducted.
• All points are classified into data collection and preparation, model development, and performance assessment.
• Real-time data requirements and developed real-time urban flood forecasting models are discussed.
There has been a strong tendency in recent decades to develop real-time urban flood prediction models for early warning to the public, due to the large number of urban flood occurrences worldwide and their disastrous consequences. While significant breakthroughs have been made so far, there are still some potential knowledge gaps that need further investigation. This paper presents a comprehensive review of the current state of the art and future trends of real-time modelling of flood forecasting in urban drainage systems. Findings showed that the combination of various real-time sources of rainfall measurement and the inclusion of other real-time data such as soil moisture, wind flow patterns, evaporation, fluvial flow and infiltration should be investigated further in real-time flood forecasting models. Additionally, artificial intelligence is also present in most of the new real-time flood forecasting (RTFF) models in urban drainage systems (UDS), and further developments of this technique are consequently expected in future work.

31 citations

Journal ArticleDOI
TL;DR: A comprehensive overview of the literature on domain ontologies as used in the various semantic data-mining tasks, such as preprocessing, modeling, and postprocessing, is provided.
Abstract: Data mining is the discovery of meaningful information or unrevealed patterns in data. Traditional data-mining approaches, using statistical calculations, machine learning, artificial intelligence, and database technology, cannot interpret data on a conceptual or semantic level and fail to reveal the meanings within the data. As a result, users are left unable to analyze the data and determine their significance and implications. Several semantic data-mining approaches have been proposed in the past decade that overcome these limitations by using a domain ontology as background knowledge to enable and enhance data-mining performance. The main contributions of this literature survey include organizing the surveyed articles in a new way that provides ease of understanding for interested researchers, and the provision of a critical analysis and summary of the surveyed articles, identifying the contribution of these papers to the field and the limitations of the analysis methods and approaches discussed in this corpus, with the intention of informing researchers in this growing field in their innovative approaches to new research. Finally, we identify the future trends and challenges in this study track that will be of concern to future researchers, such as dynamic knowledge-based methods or big-data tool collaboration. This survey article provides a comprehensive overview of the literature on domain ontologies as used in the various semantic data-mining tasks, such as preprocessing, modeling, and postprocessing. We investigated the role of semantic data mining in the field of data science and the processes and methods of applying semantic data mining to a data resource description framework.

15 citations


Cites methods from "Toward semantic data imputation for..."

  • Kamkhad et al. [117] proposed a method that combines particle swarm optimization (PSO) with semantic data mining for the data-preparation process, an essential first step in providing 'clean' data as input to ontology construction and the semantic data-mining process.

Journal ArticleDOI
TL;DR: Wang et al. propose a virtual-sensor-based imputed graph attention network, which uses a generative adversarial network (GAN) to generate signals that impute periods of sensor record failure, and a graph attention network (GAT) to extract features from the complete signals formed by mixing real and generated signals.

12 citations

Posted Content
TL;DR: The weighted nearest neighbors approach is extended to impute missing values in categorical variables, and simulation results show that the weighting of attributes yields smaller imputation errors than existing approaches.
Abstract: Missing values are a common phenomenon in all areas of applied research. While various imputation methods are available for metrically scaled variables, methods for categorical data are scarce. An imputation method that has been shown to work well for high dimensional metrically scaled variables is the imputation by nearest neighbor methods. In this paper, we extend the weighted nearest neighbors approach to impute missing values in categorical variables. The proposed method, called $\mathtt{wNNSel_{cat}}$, explicitly uses the information on association among attributes. The performance of different imputation methods is compared in terms of the proportion of falsely imputed values. Simulation results show that the weighting of attributes yields smaller imputation errors than existing approaches. A variety of real data sets is used to support the results obtained by simulations.
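The paper's $\mathtt{wNNSel_{cat}}$ estimator is not reproduced here; the sketch below only illustrates the general idea of attribute-weighted nearest-neighbour imputation for a categorical column, using a weighted simple-matching distance and caller-supplied attribute weights as a stand-in for the association-based weights the paper derives.

```python
import numpy as np

def weighted_nn_impute(X, target_col, attr_weights, k=5):
    """Impute missing values in categorical column `target_col` of X.

    X            : 2-D object array of categorical values; np.nan marks missing.
    attr_weights : one weight per column, meant to reflect how strongly each
                   attribute is associated with `target_col` (supplied by the
                   caller here rather than estimated from the data).
    Assumes the non-target attributes are complete.
    """
    X = X.copy()
    is_missing = np.array([isinstance(v, float) and np.isnan(v) for v in X[:, target_col]])
    donors = X[~is_missing]

    other_cols = [c for c in range(X.shape[1]) if c != target_col]
    w = np.array([attr_weights[c] for c in other_cols], dtype=float)

    for i in np.where(is_missing)[0]:
        # Weighted simple-matching distance to every complete row.
        mismatch = (donors[:, other_cols] != X[i, other_cols]).astype(float)
        dist = mismatch @ w
        nearest = np.argsort(dist)[:k]

        # Distance-weighted vote over the k nearest donors' categories.
        votes = {}
        for j in nearest:
            label = donors[j, target_col]
            votes[label] = votes.get(label, 0.0) + 1.0 / (1.0 + dist[j])
        X[i, target_col] = max(votes, key=votes.get)

    return X
```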

11 citations

Journal ArticleDOI
TL;DR: A new strategy that incorporates knowledge-based models into a framework, named the Semantic-based Star-schema Designer, which automates the construction of star schemas and their relationship information without human intervention using homegrown algorithms.
Abstract: Most data-warehouse construction processes are performed manually by experts, which is laborious, time-consuming, and prone to error. Furthermore, special knowledge is required to design complex multidimensional models, such as a star schema. This predicament has motivated computer scientists to propose automation techniques to generate such models. For this reason, we present a new strategy that incorporates knowledge-based models into a framework, named the Semantic-based Star-schema Designer, that assists the automation of star schema construction. Our models provide reasoning capabilities needed by star schema designs, including those that can disambiguate heterogeneous terms, detect appropriate data types and attribute sizes, and organize data hierarchies to support online analytical processes. We also propose strategies to overcome the uncertainty arising when attribute names are not available in the data source. The names of unknown attributes are thus predicted using an arithmetic coding technique to infer column names. Our system also generates star schemas from semi-structured data (e.g., comma-separated-value files and spreadsheets), which do not provide primary keys, foreign keys, or relationship cardinalities between tables. Our framework facilitates the construction of star schemas and their relationship information without human intervention using homegrown algorithms. Experiments demonstrate that our technique predicts column names and data types, enabling the generation of star schemas more effectively than baseline approaches.
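The ontology reasoning and arithmetic-coding name inference are beyond a short example, but the flat-file side of the problem can be sketched: read a key-less CSV, infer rough column types, and split low-cardinality columns into dimension tables with surrogate keys. The pandas-based approach, the cardinality threshold, and the fact/dimension split rule below are illustrative assumptions, not the framework's algorithm.

```python
import pandas as pd

def sketch_star_schema(csv_path, max_dim_cardinality=50):
    """Split a key-less CSV into one fact table plus simple dimension tables.

    Low-cardinality text-like columns become dimensions with surrogate keys;
    everything else stays in the fact table. The cardinality threshold is an
    arbitrary choice made for this sketch.
    """
    df = pd.read_csv(csv_path)
    fact = df.copy()
    dimensions = {}

    for col in df.columns:
        inferred = pd.api.types.infer_dtype(df[col], skipna=True)
        is_texty = inferred in ("string", "categorical", "boolean")
        if is_texty and df[col].nunique(dropna=True) <= max_dim_cardinality:
            dim = (df[[col]].drop_duplicates()
                            .reset_index(drop=True)
                            .rename_axis(f"{col}_key")
                            .reset_index())
            dimensions[col] = dim
            # Replace the raw values in the fact table with the surrogate key.
            fact = fact.merge(dim, on=col, how="left").drop(columns=[col])

    return fact, dimensions
```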

5 citations

References
Proceedings ArticleDOI
06 Aug 2002
TL;DR: A concept for the optimization of nonlinear functions using particle swarm methodology is introduced, the evolution of several paradigms is outlined, and an implementation of one of the paradigms is discussed.
Abstract: A concept for the optimization of nonlinear functions using particle swarm methodology is introduced. The evolution of several paradigms is outlined, and an implementation of one of the paradigms is discussed. Benchmark testing of the paradigm is described, and applications, including nonlinear function optimization and neural network training, are proposed. The relationships between particle swarm optimization and both artificial life and genetic algorithms are described.

35,104 citations

Book
01 Jan 2008
TL;DR: In this book, the authors present an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections.
Abstract: Class-tested and coherent, this groundbreaking new textbook teaches web-era information retrieval, including web search and the related areas of text classification and text clustering from basic concepts. Written from a computer science perspective by three leading experts in the field, it gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections. All the important ideas are explained using examples and figures, making it perfect for introductory courses in information retrieval for advanced undergraduates and graduate students in computer science. Based on feedback from extensive classroom experience, the book has been carefully structured in order to make teaching more natural and effective. Although originally designed as the primary text for a graduate or advanced undergraduate course in information retrieval, the book will also create a buzz for researchers and professionals alike.

11,804 citations

Journal ArticleDOI
TL;DR: A Monte Carlo simulation examined the performance of 4 missing data methods in structural equation models and found that full information maximum likelihood (FIML) estimation was superior across all conditions of the design.
Abstract: A Monte Carlo simulation examined the performance of 4 missing data methods in structural equation models: full information maximum likelihood (FIML), listwise deletion, pairwise deletion, and similar response pattern imputation. The effects of 3 independent variables were examined (factor loading magnitude, sample size, and missing data rate) on 4 outcome measures: convergence failures, parameter estimate bias, parameter estimate efficiency, and model goodness of fit. Results indicated that FIML estimation was superior across all conditions of the design. Under ignorable missing data conditions (missing completely at random and missing at random), FIML estimates were unbiased and more efficient than the other methods. In addition, FIML yielded the lowest proportion of convergence failures and provided near-optimal Type 1 error rates across both simulations.
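The simulation itself is not reproduced here, but the core idea of FIML, that each case contributes the likelihood of only its observed variables so no rows are deleted, can be sketched directly for a multivariate normal. In practice a dedicated SEM package would be used; the Nelder-Mead optimizer and Cholesky parameterization below are choices made only for this example.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def fiml_fit(data):
    """Full-information ML estimate of the mean vector and covariance matrix
    of a multivariate normal from data with missing entries (NaN).

    Each case contributes the likelihood of its *observed* variables only.
    Returns (mu_hat, sigma_hat).
    """
    data = np.asarray(data, dtype=float)
    n, p = data.shape

    def unpack(theta):
        mu = theta[:p]
        L = np.zeros((p, p))
        L[np.tril_indices(p)] = theta[p:]
        return mu, L @ L.T                      # Cholesky factor keeps sigma PSD

    def negloglik(theta):
        mu, sigma = unpack(theta)
        total = 0.0
        for row in data:
            obs = ~np.isnan(row)
            if not obs.any():
                continue
            total += multivariate_normal.logpdf(
                row[obs], mean=mu[obs], cov=sigma[np.ix_(obs, obs)],
                allow_singular=True)
        return -total

    # Start from columnwise means and an identity covariance.
    theta0 = np.concatenate([np.nanmean(data, axis=0),
                             np.eye(p)[np.tril_indices(p)]])
    result = minimize(negloglik, theta0, method="Nelder-Mead",
                      options={"maxiter": 20000, "xatol": 1e-6, "fatol": 1e-6})
    return unpack(result.x)
```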

3,748 citations

Journal ArticleDOI
TL;DR: A HACE theorem is presented that characterizes the features of the Big Data revolution, and a Big Data processing model is proposed, from the data mining perspective, which involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations.
Abstract: Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and the data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including physical, biological and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model, from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.

2,233 citations

Journal ArticleDOI
TL;DR: A novel evolutionary optimization strategy based on the derandomized evolution strategy with covariance matrix adaptation (CMA-ES), intended to reduce the number of generations required for convergence to the optimum, which results in a highly parallel algorithm which scales favorably with large numbers of processors.
Abstract: This paper presents a novel evolutionary optimization strategy based on the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). This new approach is intended to reduce the number of generations required for convergence to the optimum. Reducing the number of generations, i.e., the time complexity of the algorithm, is important if a large population size is desired: (1) to reduce the effect of noise; (2) to improve global search properties; and (3) to implement the algorithm on (highly) parallel machines. Our method results in a highly parallel algorithm which scales favorably with large numbers of processors. This is accomplished by efficiently incorporating the available information from a large population, thus significantly reducing the number of generations needed to adapt the covariance matrix. The original version of the CMA-ES was designed to reliably adapt the covariance matrix in small populations but it cannot exploit large populations efficiently. Our modifications scale up the efficiency to population sizes of up to 10n, where n is the problem dimension. This method has been applied to a large number of test problems, demonstrating that in many cases the CMA-ES can be advanced from quadratic to linear time complexity.
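CMA-ES is rarely reimplemented from scratch; the ask/tell loop below relies on the third-party `cma` (pycma) package, which is an assumption of this sketch rather than something referenced here, and uses an enlarged population of roughly 10n in the spirit of the population-size scaling discussed above.

```python
import cma  # third-party package, e.g. `pip install cma` (assumed for this sketch)

def rosenbrock(x):
    """Classic nonconvex benchmark with a curved valley; minimum at all ones."""
    return sum(100.0 * (x[i + 1] - x[i] ** 2) ** 2 + (1.0 - x[i]) ** 2
               for i in range(len(x) - 1))

# 10-dimensional problem, initial step size 0.5, enlarged population (~10n).
es = cma.CMAEvolutionStrategy(10 * [0.0], 0.5, {"popsize": 100, "seed": 1})
while not es.stop():
    solutions = es.ask()                                   # sample a new population
    es.tell(solutions, [rosenbrock(s) for s in solutions])  # update mean, step size, covariance
es.disp()
print("best solution found:", es.result.xbest)
```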

2,144 citations