Topic

Data management

About: Data management is a research topic. Over its lifetime, 31,574 publications have been published within this topic, receiving 424,326 citations.


Papers
Proceedings ArticleDOI
01 Sep 2006
TL;DR: This paper presents a quality management framework that leverages well-studied feedback control techniques and achieves significantly fewer QoS violations with the same or lower level of data loss, as compared to current strategies utilized in DSMSs.
Abstract: In Data Stream Management Systems (DSMSs), query processing has to meet various Quality-of-Service (QoS) requirements. In many data stream applications, processing delay is the most critical quality requirement since the value of query results decreases dramatically over time. The ability to remain within a desired level of delay is significantly hampered under situations of overloading, which are common in data stream systems. When overloaded, DSMSs employ load shedding in order to meet quality requirements and keep pace with the high rate of data arrivals. Data stream applications are extremely dynamic due to bursty data arrivals and time-varying data processing costs. Current approaches ignore system status information in decision-making and consequently are unable to achieve desired control of quality under dynamic load. In this paper, we present a quality management framework that leverages well studied feedback control techniques. We discuss the design and implementation of such a framework in a real DSMS - the Borealis stream manager. Our data management framework is built on the advantages of system identification and rigorous controller analysis. Experimental results show that our solution achieves significantly fewer QoS (delay) violations with the same or lower level of data loss, as compared to current strategies utilized in DSMSs. It is also robust and bears negligible computational overhead.
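The core idea above, closing a feedback loop between the measured processing delay and the amount of load shed, can be illustrated with a minimal sketch. The code below is not the Borealis implementation described in the paper; it is a hypothetical proportional-integral (PI) controller, with assumed gains, names, and batch interface, that adjusts a shedding fraction so the measured delay tracks a target.

```python
# Minimal sketch, not the Borealis implementation: a PI controller that picks
# the fraction of input tuples to shed so that measured processing delay
# tracks a target. Gains, names, and the batch interface are assumptions.

class DelayController:
    def __init__(self, target_delay_ms, kp=0.002, ki=0.0005):
        self.target = target_delay_ms
        self.kp = kp            # proportional gain (assumed value)
        self.ki = ki            # integral gain (assumed value)
        self.integral = 0.0     # accumulated delay error (anti-windup omitted)

    def shed_fraction(self, measured_delay_ms):
        """Return the fraction of tuples to drop, clamped to [0, 1]."""
        error = measured_delay_ms - self.target
        self.integral += error
        u = self.kp * error + self.ki * self.integral
        return min(1.0, max(0.0, u))


def process(batches, controller):
    """Shed a controller-chosen fraction of each batch before processing it."""
    for measured_delay_ms, batch in batches:
        frac = controller.shed_fraction(measured_delay_ms)
        keep = batch[: int(len(batch) * (1.0 - frac))]
        yield keep   # downstream operators only ever see the retained tuples


# Example: delays above the 100 ms target push the shedding fraction up.
ctrl = DelayController(target_delay_ms=100)
for kept in process([(250, list(range(10))), (180, list(range(10)))], ctrl):
    print(len(kept))
```

The point of the feedback formulation is that the shedding rate is driven by observed system state (the measured delay) rather than by static thresholds, which is what lets it adapt to bursty arrivals and time-varying processing costs.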

104 citations

Book
29 Dec 2015
TL;DR: The authors begin by explaining how Big Data can propel an organization forward by solving a spectrum of previously intractable business problems and show how a Big Data solution environment can be built and integrated to offer competitive advantages.
Abstract: "This text should be required reading for everyone in contemporary business." --Peter Woodhull, CEO, Modus21. "The one book that clearly describes and links Big Data concepts to business utility." --Dr. Christopher Starr, PhD. "Simply, this is the best Big Data book on the market!" --Sam Rostam, Cascadian IT Group. "...one of the most contemporary approaches I've seen to Big Data fundamentals..." --Joshua M. Davis, PhD. The Definitive Plain-English Guide to Big Data for Business and Technology Professionals. Big Data Fundamentals provides a pragmatic, no-nonsense introduction to Big Data. Best-selling IT author Thomas Erl and his team clearly explain key Big Data concepts, theory and terminology, as well as fundamental technologies and techniques. All coverage is supported with case study examples and numerous simple diagrams. The authors begin by explaining how Big Data can propel an organization forward by solving a spectrum of previously intractable business problems. Next, they demystify key analysis techniques and technologies and show how a Big Data solution environment can be built and integrated to offer competitive advantages. Coverage includes:
- Discovering Big Data's fundamental concepts and what makes it different from previous forms of data analysis and data science
- Understanding the business motivations and drivers behind Big Data adoption, from operational improvements through innovation
- Planning strategic, business-driven Big Data initiatives
- Addressing considerations such as data management, governance, and security
- Recognizing the 5 V characteristics of datasets in Big Data environments: volume, velocity, variety, veracity, and value
- Clarifying Big Data's relationships with OLTP, OLAP, ETL, data warehouses, and data marts
- Working with Big Data in structured, unstructured, semi-structured, and metadata formats
- Increasing value by integrating Big Data resources with corporate performance monitoring
- Understanding how Big Data leverages distributed and parallel processing
- Using NoSQL and other technologies to meet Big Data's distinct data processing requirements
- Leveraging statistical approaches of quantitative and qualitative analysis
- Applying computational analysis methods, including machine learning

104 citations

01 Jan 2005
TL;DR: The ATLAS Computing Model establishes the environment and operational requirements that ATLAS data-handling systems must support, and, together with the operational experience gained to date in test beams and data challenges, provides the primary guidance for the development of the data management systems.
Abstract: The ATLAS Computing Model embraces the Grid paradigm and a high degree of decentralization and sharing of computing resources. The required level of computing resources means that off-site facilities will be vital to the operation of ATLAS in a way that was not the case for previous CERN-based experiments. The primary event processing occurs at CERN in a Tier-0 facility. The RAW data is archived at CERN and copied (along with the primary processed data) to the Tier-1 facilities around the world. These facilities archive the raw data, provide the reprocessing capacity, provide access to the various processed versions, and allow scheduled analysis of the processed data by physics analysis groups. Derived datasets produced by the physics groups are copied to the Tier-2 facilities for further analysis. The Tier-2 facilities also provide the simulation capacity for the experiment, with the simulated data housed at Tier-1s. In addition, Tier-2 centres will provide analysis facilities, and some will provide the capacity to produce calibrations based on processing raw data. A CERN Analysis Facility provides an additional analysis capacity, with an important role in the calibration and algorithmic development work. ATLAS has adopted an object-oriented approach to software, based primarily on the C++ programming language, but with some components implemented using FORTRAN and Java. A component-based model has been adopted, whereby applications are built up from collections of plug-compatible components based on a variety of configuration files. This capability is supported by a common framework that provides common data-processing support. This approach results in great flexibility in meeting the basic processing needs of the experiment, and also for responding to changing requirements throughout its lifetime. The heavy use of abstract interfaces allows for different implementations to be provided, supporting different persistency technologies, or optimized for the offline or high-level trigger environments. The Athena framework is an enhanced version of the Gaudi framework that was originally developed by the LHCb experiment, but is now a common ATLAS-LHCb project. Major design principles are the clear separation of data and algorithms, and of transient (in-memory) and persistent (in-file) data. All levels of processing of ATLAS data, from high-level trigger to event simulation, reconstruction and analysis, take place within the Athena framework; in this way it is easier for code developers and users to test and run algorithmic code, with the assurance that all geometry and conditions data will be the same for all types of applications (simulation, reconstruction, analysis, visualization). One of the principal challenges for ATLAS computing is to develop and operate a data storage and management infrastructure able to meet the demands of a yearly data volume of O(10 PB) utilized by data processing and analysis activities spread around the world. The ATLAS Computing Model establishes the environment and operational requirements that ATLAS data-handling systems must support, and, together with the operational experience gained to date in test beams and data challenges, provides the primary guidance for the development of the data management systems. 
The ATLAS Databases and Data Management Project (DB Project) leads and coordinates ATLAS activities in these areas, with a scope encompassing technical databases (detector production, installation and survey data), detector geometry, online/TDAQ databases, conditions databases (online and offline), event data, offline processing configuration and book-keeping, distributed data management, and distributed database and data management services. The project is responsible for ensuring the coherent development, integration, and operational capability of the distributed database and data management software and infrastructure for ATLAS across these areas. The ATLAS Computing Model foresees the distribution of raw and processed data to Tier-1 and Tier-2 centres, so as to be able to exploit fully the computing resources that are made available to the Collaboration. Additional computing resources will be available for data processing and analysis at Tier-3 centres and other computing facilities to which ATLAS may have access. A complex set of tools and distributed services, enabling the automatic distribution and processing of the large amounts of data, has been developed and deployed by ATLAS in cooperation with the LHC Computing Grid (LCG) Project and with the middleware providers of the three large Grid infrastructures we use: EGEE, OSG and NorduGrid. The tools are designed in a flexible way, in order to have the possibility to extend them to use other types of Grid middleware in the future. These tools, and the service infrastructure on which they depend, were initially developed in the context of centrally managed, distributed Monte Carlo production exercises. They will be re-used wherever possible to create systems and tools for individual users to access data and compute resources, providing a distributed analysis environment for general usage by the ATLAS Collaboration. The first version of the production system was deployed in summer 2004 and has been used since the second half of 2004. It was used for Data Challenge 2, for the production of simulated data for the 5th ATLAS Physics Workshop (Rome, June 2005) and for the reconstruction and analysis of the 2004 Combined Test-Beam data. The main computing operations that ATLAS will have to run comprise the preparation, distribution and validation of ATLAS software, and the computing and data management operations run centrally on Tier-0, Tier-1s and Tier-2s. The ATLAS Virtual Organization will allow production and analysis users to run jobs and access data at remote sites using the ATLAS-developed Grid tools. In the past few years the Computing Model has been tested and developed by running Data Challenges of increasing scope and magnitude, as was proposed by the LHC Computing Review in 2001. We have run two major Data Challenges since 2002 and performed other massive productions in order to provide simulated data to the physicists and to reconstruct and analyse real data coming from test-beam activities; this experience is now useful in setting up the operations model for the start of LHC data-taking in 2007. The Computing Model, together with the knowledge of the resources needed to store and process each ATLAS event, gives rise to estimates of required resources that can be used to design and set up the various facilities. 
It is not assumed that all Tier-1s or Tier-2s will be of the same size; however, in order to ensure a smooth operation of the Computing Model, all Tier-1s should have broadly similar proportions of disk, tape and CPU, and the same should apply for the Tier-2s. The organization of the ATLAS Software & Computing Project reflects all areas of activity within the project itself. Strong high-level links have been established with other parts of the ATLAS organization, such as the T-DAQ Project and Physics Coordination, through cross-representation in the respective steering boards. The Computing Management Board, and in particular the Planning Officer, acts to make sure that software and computing developments take place coherently across sub-systems and that the project as a whole meets its milestones. The International Computing Board assures the information flow between the ATLAS Software & Computing Project and the national resources and their Funding Agencies.
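The component-based Athena design described above (plug-compatible algorithm components, a shared transient event store, and configuration that determines which components run and in what order) is a general framework pattern. The Python sketch below is a hypothetical illustration of that pattern only; it is not Athena or Gaudi code, and all class, method, and key names are assumptions.

```python
# Hypothetical illustration of the component pattern described above:
# algorithms are plug-compatible components that exchange data only through a
# transient event store, and the framework sequences them from configuration.
# This is not Athena/Gaudi code; every name here is illustrative.

class EventStore:
    """Transient (in-memory) whiteboard; persistence would be a separate layer."""
    def __init__(self):
        self._data = {}

    def record(self, key, obj):
        self._data[key] = obj

    def retrieve(self, key):
        return self._data[key]


class Algorithm:
    """Base interface: algorithms never call each other, only the event store."""
    def execute(self, store: EventStore) -> None:
        raise NotImplementedError


class ClusterFinder(Algorithm):
    def execute(self, store):
        hits = store.retrieve("RawHits")
        store.record("Clusters", [h for h in hits if h > 0.5])  # toy "clustering"


class TrackFitter(Algorithm):
    def execute(self, store):
        clusters = store.retrieve("Clusters")
        store.record("Tracks", {"n_tracks": len(clusters)})     # toy "fit"


def run_sequence(events, sequence):
    """Framework loop: configuration decides which components run, in what order."""
    for raw_hits in events:
        store = EventStore()
        store.record("RawHits", raw_hits)
        for alg in sequence:
            alg.execute(store)
        yield store.retrieve("Tracks")


# Example "configuration": the algorithm sequence is data, not code.
results = list(run_sequence([[0.1, 0.7, 0.9], [0.6, 0.2]],
                            [ClusterFinder(), TrackFitter()]))
```

The design choice the sketch illustrates is the one the abstract emphasizes: because algorithms exchange data only through the event store and are assembled from configuration, the same algorithmic code can be run across different application types (simulation, reconstruction, analysis, visualization) with consistent geometry and conditions data.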

104 citations

Journal ArticleDOI
George G. Dodd
TL;DR: This paper describes the basic types of data management techniques, the relation of each to the hardware on which it is used, and how these basic elements can be used as building blocks to describe and build more complex data management systems.
Abstract: Many different data management techniques have been designed, described in the literature, and marketed. With each technique there are claims of added flexibility and speed. However, because new terms are invented to describe each new technique, the observer is left in a state of confusion when he tries to understand how they work. A description is given of the basic types of data management techniques, as well as the relation of each to the hardware on which it is used. Then it is shown how these basic elements can be used as building blocks to describe and build more complex data management systems. Finally, there is a discussion of the languages used for programming data management systems.
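The building-block view in this abstract can be made concrete with a small sketch: one basic technique (a sequential, append-only file) combined with another (a hash index over record ids) yields a simple keyed store, i.e. a more complex data management system assembled from basic elements. The code below is an illustrative assumption, not material from the article.

```python
# Illustrative sketch of the "building blocks" idea from the abstract:
# two basic techniques -- a sequential (append-only) file and a hash index --
# are combined into a simple keyed store. Not from the original article.

class SequentialFile:
    """Basic block 1: append-only storage, addressed by record number."""
    def __init__(self):
        self.records = []

    def append(self, record) -> int:
        self.records.append(record)
        return len(self.records) - 1   # record id

    def read(self, rid):
        return self.records[rid]


class HashIndex:
    """Basic block 2: maps a key to the record id where the value lives."""
    def __init__(self):
        self.slots = {}

    def put(self, key, rid):
        self.slots[key] = rid

    def get(self, key):
        return self.slots.get(key)


class KeyedStore:
    """Composite system: the two blocks together give keyed insert/lookup."""
    def __init__(self):
        self.file = SequentialFile()
        self.index = HashIndex()

    def insert(self, key, value):
        rid = self.file.append((key, value))
        self.index.put(key, rid)

    def lookup(self, key):
        rid = self.index.get(key)
        return None if rid is None else self.file.read(rid)[1]


store = KeyedStore()
store.insert("part-42", {"qty": 7})
print(store.lookup("part-42"))   # {'qty': 7}
```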

104 citations


Network Information
Related Topics (5)
Information system
107.5K papers, 1.8M citations
90% related
Software
130.5K papers, 2M citations
88% related
Cluster analysis
146.5K papers, 2.9M citations
83% related
The Internet
213.2K papers, 3.8M citations
82% related
Cloud computing
156.4K papers, 1.9M citations
81% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    218
2022    485
2021    959
2020    1,435
2019    1,745
2018    1,719