Topic: Data modeling

About: Data modeling is a research topic. Over its lifetime, 29,624 publications have been published within this topic, receiving 470,187 citations.


Papers
Journal Article
TL;DR: If the goal as a field is to use data to solve problems, then the statistical community needs to move away from exclusive dependence on data models and adopt a more diverse set of tools.
Abstract: There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

1,735 citations
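
For readers less familiar with the dichotomy, the sketch below (an illustration of ours, not Breiman's code; it assumes NumPy and scikit-learn) fits a "data model" in the first culture's sense (linear regression with interpretable coefficients) and an "algorithmic model" (a random forest judged only by predictive fit) to the same synthetic nonlinear data.

```python
# A minimal sketch of Breiman's two cultures, assuming scikit-learn:
# a stochastic "data model" (linear regression) vs. an "algorithmic
# model" (random forest) fit to the same synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(500)  # nonlinear truth

# Data-modeling culture: assume y = Xb + noise, interpret coefficients.
data_model = LinearRegression().fit(X, y)

# Algorithmic-modeling culture: treat the mechanism as unknown and
# judge the model by predictive accuracy alone.
algo_model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

print("linear R^2:", data_model.score(X, y))
print("forest R^2:", algo_model.score(X, y))
```

On data like this, where the assumed linear form is wrong, the algorithmic model fits far better, which is exactly the kind of case the paper uses to argue for a more diverse toolset.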

Journal ArticleDOI
John Mullahy
TL;DR: These alternatives permit more flexible specification of the data-generating process (DGP) than do familiar count data models, and provide a natural means for modeling data that are over- or underdispersed by the standards of the basic models.

1,700 citations
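
As a hedged illustration of the dispersion issue the paper addresses (this uses off-the-shelf statsmodels fits, not Mullahy's own modified estimators), the sketch below generates over-dispersed counts and compares a basic Poisson fit with a negative binomial fit.

```python
# Over-dispersion sketch: counts whose variance exceeds the mean break
# the basic Poisson model; a negative binomial model fits better.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
x = rng.standard_normal(n)
mu = np.exp(0.5 + 0.8 * x)
# Gamma-mixed Poisson draws => marginal variance exceeds the mean.
y = rng.poisson(mu * rng.gamma(shape=2.0, scale=0.5, size=n))

X = sm.add_constant(x)
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
nb_fit = sm.GLM(y, X, family=sm.families.NegativeBinomial()).fit()

# Lower AIC for the negative binomial signals the Poisson's equal
# mean-variance assumption is too restrictive for these data.
print("Poisson AIC:", poisson_fit.aic)
print("NegBin  AIC:", nb_fit.aic)
```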

Journal Article
TL;DR: This work classifies data quality problems that are addressed by data cleaning and provides an overview of the main solution approaches and discusses current tool support for data cleaning.
Abstract: We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.

1,675 citations
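
The sketch below is a minimal, hypothetical pandas example of the problem classes the survey covers: duplicate records, inconsistent formats, and missing values. The column names and cleaning rules are made up for illustration; they are not from the paper.

```python
# Hypothetical single-source cleaning pass: normalize formats, fix
# types, drop missing keys, and remove exact duplicates.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["Ann", "ann ", "Bob", "Bob", None],
    "city":     ["NYC", "nyc", "Boston", "Boston", "Boston"],
    "amount":   ["10.5", "10.5", "7", None, "3.2"],
})

clean = (
    raw
    .assign(
        customer=raw["customer"].str.strip().str.title(),  # unify case/whitespace
        city=raw["city"].str.upper(),                      # unify value format
        amount=pd.to_numeric(raw["amount"], errors="coerce"),  # fix data type
    )
    .dropna(subset=["customer"])  # records without a key are unusable
    .drop_duplicates()            # remove exact duplicate records
)
print(clean)
```

In an ETL setting, steps like these would run in the transformation stage, after extraction from the heterogeneous sources and before loading into the warehouse.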

Journal ArticleDOI
01 Aug 2003
TL;DR: The paper describes the basic processing model and architecture of Aurora, a new system for managing data streams for monitoring applications, along with its stream-oriented set of operators.
Abstract: This paper describes the basic processing model and architecture of Aurora, a new system to manage data streams for monitoring applications. Monitoring applications differ substantially from conventional business data processing. The fact that a software system must process and react to continual inputs from many sources (e.g., sensors) rather than from human operators requires one to rethink the fundamental architecture of a DBMS for this application area. In this paper, we present Aurora, a new DBMS currently under construction at Brandeis University, Brown University, and M.I.T. We first provide an overview of the basic Aurora model and architecture and then describe in detail a stream-oriented set of operators.

1,518 citations
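
The generator pipeline below is a rough sketch of the stream-operator idea, not Aurora's actual operator set or API: continual tuples from a source flow through composable filter, map, and sliding-window operators rather than being queried at rest.

```python
# Stream operators as Python generators: each operator consumes an
# upstream iterator and yields results downstream as tuples arrive.
from collections import deque

def op_filter(stream, pred):
    for item in stream:
        if pred(item):
            yield item

def op_map(stream, fn):
    for item in stream:
        yield fn(item)

def op_window(stream, size):
    """Sliding window over the last `size` tuples, emitted per arrival."""
    buf = deque(maxlen=size)
    for item in stream:
        buf.append(item)
        if len(buf) == size:
            yield list(buf)

sensor = iter([3, 9, 4, 12, 8, 15])  # stand-in for a continual sensor feed
pipeline = op_window(
    op_map(op_filter(sensor, lambda v: v > 3),  # drop out-of-range readings
           lambda v: v * 1.8 + 32),             # convert units
    size=2)
for window in pipeline:
    print(window)
```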

Journal ArticleDOI
TL;DR: In this paper, the authors consider the problem of learning model parameters from data distributed across multiple edge nodes, without sending raw data to a centralized place, and propose a control algorithm that determines the best tradeoff between local update and global parameter aggregation to minimize the loss function under a given resource budget.
Abstract: Emerging technologies and applications including Internet of Things, social networking, and crowd-sourcing generate large amounts of data at the network edge. Machine learning models are often built from the collected data to enable the detection, classification, and prediction of future events. Due to bandwidth, storage, and privacy concerns, it is often impractical to send all the data to a centralized location. In this paper, we consider the problem of learning model parameters from data distributed across multiple edge nodes, without sending raw data to a centralized place. Our focus is on a generic class of machine learning models that are trained using gradient-descent-based approaches. We analyze the convergence bound of distributed gradient descent from a theoretical point of view, based on which we propose a control algorithm that determines the best tradeoff between local updates and global parameter aggregation to minimize the loss function under a given resource budget. The performance of the proposed algorithm is evaluated via extensive experiments with real datasets, both on a networked prototype system and in a larger-scale simulated environment. The experimental results show that our proposed approach performs close to the optimum with various machine learning models and different data distributions.

1,441 citations
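
Below is a simplified sketch of the local-update versus global-aggregation tradeoff the paper analyzes, written in plain NumPy. The fixed step count tau stands in for the quantity the authors' control algorithm adapts under a resource budget; this is our illustration, not the authors' code.

```python
# Distributed gradient descent on edge nodes: tau local steps per node,
# then the server averages the local models into a new global model.
import numpy as np

rng = np.random.default_rng(2)
true_w = np.array([1.0, -2.0, 0.5])

# Four "edge nodes", each holding its own local dataset (raw data never leaves).
data = []
for _ in range(4):
    X = rng.standard_normal((50, 3))
    y = X @ true_w + 0.1 * rng.standard_normal(50)
    data.append((X, y))

def local_update(w, X, y, steps, lr=0.01):
    """Run `steps` gradient steps on one node's local squared loss."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

w_global = np.zeros(3)
tau = 5  # local steps between global aggregations (the tuned knob)
for _ in range(40):
    # Each node refines the current global model on its own data,
    # then the server averages the results.
    local_models = [local_update(w_global.copy(), X, y, steps=tau) for X, y in data]
    w_global = np.mean(local_models, axis=0)

print("estimate:", np.round(w_global, 2))  # approaches true_w
```

Larger tau saves communication rounds but lets the local models drift apart; smaller tau aggregates more often at higher network cost, which is precisely the tradeoff the paper's control algorithm navigates.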


Network Information
Related Topics (5)
Feature extraction: 111.8K papers, 2.1M citations, 92% related
Cluster analysis: 146.5K papers, 2.9M citations, 88% related
Wireless sensor network: 142K papers, 2.4M citations, 88% related
Artificial neural network: 207K papers, 4.5M citations, 88% related
Deep learning: 79.8K papers, 2.1M citations, 87% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    124
2022    410
2021    2,121
2020    2,034
2019    1,550
2018    2,042