Topic

Data access

About: Data access is a research topic. Over its lifetime, 13141 publications have been published within this topic, receiving 172859 citations.


Papers
Proceedings ArticleDOI
Yi Shan, Bo Wang, Jing Yan, Yu Wang, Ningyi Xu, Huazhong Yang
21 Feb 2010
TL;DR: FPMR, a MapReduce framework on FPGA, provides a programming abstraction, hardware architecture, and basic building blocks to developers so that more attention can be paid to the application itself; the speedup of the framework is demonstrated.
Abstract: Machine learning and data mining are attracting increasing attention from the computing community. FPGA provides a highly parallel, low-power, and flexible hardware platform for this domain, but the difficulty of programming FPGAs greatly limits their adoption. MapReduce is a parallel programming framework that can easily exploit the inherent parallelism in algorithms. In this paper, we describe FPMR, a MapReduce framework on FPGA, which provides a programming abstraction, hardware architecture, and basic building blocks to developers. An on-chip processor scheduler is implemented to maximize the utilization of computation resources and achieve better load balancing. An efficient data access scheme is carefully designed to maximize data reuse and throughput. Meanwhile, the FPMR framework hides task control, synchronization, and communication from designers so that more attention can be paid to the application itself. A case study of RankBoost acceleration based on FPMR demonstrates that FPMR improves development productivity, with a speedup of 31.8x over a CPU-based implementation. This performance is comparable to a fully manually designed version, which achieves a 33.5x speedup. Two other applications, SVM and PageRank, are also discussed to show the generality of the framework.
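
The abstract above describes the programming abstraction a MapReduce framework exposes: developers supply map and reduce functions while the framework handles scheduling, synchronization, and data movement. The sketch below illustrates that abstraction in plain Python; the function names and the software-only shuffle are illustrative assumptions, not the actual FPMR hardware interface.

```python
# Minimal sketch of the MapReduce programming abstraction that a framework
# such as FPMR exposes; run_mapreduce and its signature are hypothetical,
# not the actual FPMR API. In FPMR the shuffle, scheduling, and
# synchronization below would be handled by on-chip hardware.
from collections import defaultdict
from typing import Any, Callable, Iterable, Tuple


def run_mapreduce(records: Iterable[Any],
                  map_fn: Callable[[Any], Iterable[Tuple[Any, Any]]],
                  reduce_fn: Callable[[Any, list], Any]) -> dict:
    """Apply map_fn to every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:                      # "map" phase
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values)         # "reduce" phase
            for key, values in groups.items()}


# Usage example: word count, the canonical MapReduce workload.
lines = ["fpga map reduce", "map reduce map"]
counts = run_mapreduce(
    lines,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda _key, values: sum(values),
)
print(counts)  # {'fpga': 1, 'map': 3, 'reduce': 2}
```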

154 citations

Patent
30 Oct 1990
TL;DR: In this paper, a two-level lock management system is used to prevent data corruption due to unsynchronized data access by multiple processors in a multi-processor computer system in which each processor is under the control of separate system software and accesses a common database.
Abstract: A multi-processor computer system in which each processor is under the control of separate system software and accesses a common database. A two-level lock management system is used to prevent data corruption due to unsynchronized data access by the multiple processors. Under this system, subsets of data in the database are assigned respectively different lock entities. Before a task running on one of the processors accesses data in the database, it first requests permission to access the data in a given mode with reference to the appropriate lock entity. A first-level lock manager handles these requests synchronously, using a simplified model of the locking system with shared and exclusive lock modes to either grant or deny the request. All requests are then forwarded to a second-level lock manager, which grants or denies the requests based on a more robust model of the locking system and queues denied requests. The denied requests are granted, in turn, as the tasks which have been granted access finish processing data in the database.
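
To make the two-level scheme concrete, here is a minimal single-process sketch: requests are answered synchronously using shared/exclusive rules, and denied requests are queued and retried as holders release. The class and method names are hypothetical illustrations, not taken from the patent.

```python
# Illustrative sketch of the two-level locking idea from the abstract:
# requests are granted or denied immediately under shared/exclusive rules,
# and denied requests are queued and granted later as holders release.
from collections import deque


class TwoLevelLockManager:
    def __init__(self):
        self.shared_holders = set()     # tasks holding a shared lock
        self.exclusive_holder = None    # task holding the exclusive lock
        self.wait_queue = deque()       # (task, mode) pairs denied so far

    def request(self, task, mode):
        """First level: grant or deny synchronously using shared/exclusive rules."""
        if mode == "shared" and self.exclusive_holder is None:
            self.shared_holders.add(task)
            return True
        if mode == "exclusive" and self.exclusive_holder is None and not self.shared_holders:
            self.exclusive_holder = task
            return True
        # Second level: queue the denied request and grant it later.
        self.wait_queue.append((task, mode))
        return False

    def release(self, task):
        """Release the task's lock and retry queued requests in arrival order."""
        self.shared_holders.discard(task)
        if self.exclusive_holder == task:
            self.exclusive_holder = None
        retry = list(self.wait_queue)
        self.wait_queue.clear()
        for waiting_task, mode in retry:
            self.request(waiting_task, mode)


# Usage: the writer is queued until both readers release their shared locks.
mgr = TwoLevelLockManager()
mgr.request("reader-1", "shared")    # granted
mgr.request("reader-2", "shared")    # granted
mgr.request("writer", "exclusive")   # denied, queued
mgr.release("reader-1")
mgr.release("reader-2")              # writer now acquires the exclusive lock
print(mgr.exclusive_holder)          # writer
```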

153 citations

Patent
09 Aug 2007
TL;DR: In this paper, the authors present systems and methods for automating EII using a smart integration engine based on metadata, which is used for seamless integration of a fully distributed organization with many data sources and technologies.
Abstract: The present invention discloses systems and methods for automating EII using a smart integration engine based on metadata. On-line execution (i.e. data access, retrieval, or update) is automated by integrating heterogeneous data sources via a centralized smart engine based on the metadata of all data sources, managed in a metadata repository. The data-source assets are mapped to business metadata (terminology), giving programmers the ability to use business terms rather than technical terms. IT departments can use the business-level terms for easy and fast programming of all services "at the business level". The integration is performed by the engine (via pre-configuration) automatically, dynamically, and on-line, regardless of topology or technology changes, without user or administrator intervention. MDOA is a high-level concept in which the metadata maps low-level technical terms to high-level business terms. MDOA is used for seamless integration of a fully distributed organization with many data sources and technologies.
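
The core mechanism described above is a metadata repository that maps business-level terms to technical data-source locations so callers never touch low-level names. The sketch below shows one way such a mapping could look; the repository entries, source names, and the resolve helper are hypothetical examples, not the patent's actual design.

```python
# Illustrative sketch of metadata-driven term mapping: a repository maps
# business terms to technical data-source fields so callers program against
# business terminology. All entries below are invented for illustration.
metadata_repository = {
    "customer name": {"source": "crm_db", "table": "customers", "column": "cust_nm"},
    "order total":   {"source": "erp_db", "table": "orders",    "column": "amt_total"},
}


def resolve(business_term: str) -> str:
    """Translate a business-level term into the technical location of the data."""
    entry = metadata_repository[business_term]
    return f'{entry["source"]}.{entry["table"]}.{entry["column"]}'


# A business-level request is rewritten to technical terms automatically;
# if a source is migrated, only the repository entry changes, not the caller.
print(resolve("order total"))   # erp_db.orders.amt_total
```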

152 citations

Journal ArticleDOI
TL;DR: A set of principles for designing highly scalable distributed storage systems that are optimized for heavy data access concurrency, and a set of versioning algorithms that enable high throughput under concurrency, are proposed.
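
The TL;DR mentions versioning algorithms that sustain throughput under heavy access concurrency. As a rough illustration of why versioning helps, the sketch below keeps immutable versions so readers never block writers; it is a generic multi-version scheme assumed for illustration, not the paper's actual algorithms.

```python
# Illustrative multi-version store: writers publish new immutable versions,
# readers pick a version without blocking, so reads and writes do not
# contend on the same lock. Class and method names are hypothetical.
import threading


class VersionedValue:
    def __init__(self, initial):
        self._versions = [initial]            # append-only list of versions
        self._write_lock = threading.Lock()   # serializes writers only

    def read(self, version=None):
        """Readers never block writers: they read an already-published version."""
        versions = self._versions
        index = len(versions) - 1 if version is None else version
        return versions[index]

    def write(self, new_value):
        """Writers serialize among themselves and publish a new version."""
        with self._write_lock:
            self._versions.append(new_value)
            return len(self._versions) - 1


v = VersionedValue({"size": 0})
v.write({"size": 42})
print(v.read())      # latest version: {'size': 42}
print(v.read(0))     # older version:  {'size': 0}
```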

151 citations

Proceedings ArticleDOI
13 May 2012
TL;DR: This paper builds a mathematical model of scheduling in MapReduce, proposes an algorithm that schedules multiple tasks simultaneously rather than one by one to achieve optimal data locality, and runs extensive experiments to quantify the performance improvement of the proposed algorithm and measure how different factors impact data locality.
Abstract: Traditional HPC architectures separate compute nodes and storage nodes, which are interconnected with high-speed links to satisfy data access requirements in multi-user environments. However, the capacity of those high-speed links is still much less than the aggregate bandwidth of all compute nodes. In data-parallel systems such as GFS/MapReduce, clusters are built with commodity hardware and each node takes the roles of both computation and storage, which makes it possible to bring compute to data. Data locality is a significant advantage of data-parallel systems over traditional HPC systems. Good data locality reduces cross-switch network traffic, one of the bottlenecks in data-intensive computing. In this paper, we investigate data locality in depth. First, we build a mathematical model of scheduling in MapReduce and theoretically analyze the impact of configuration factors, such as the numbers of nodes and tasks, on data locality. Second, we find that the default Hadoop scheduling is non-optimal and propose an algorithm that schedules multiple tasks simultaneously rather than one by one to give optimal data locality. Third, we run extensive experiments to quantify the performance improvement of our proposed algorithms, measure how different factors impact data locality, and investigate how data locality influences job execution time in both single-cluster and cross-cluster environments.
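
The key idea in the abstract is that scheduling several tasks at once, rather than one at a time, lets the scheduler pick placements that maximize node-local data access. The greedy batch matcher below is a small illustrative stand-in for that idea, using hypothetical task and node names; it is not the paper's actual algorithm.

```python
# Illustrative batch scheduler: consider all idle slots and all tasks
# together and prefer node-local placements, instead of assigning tasks
# one by one as the default scheduler does.
def schedule_batch(tasks, idle_nodes, block_locations):
    """Assign tasks to idle nodes, preferring nodes that hold the task's data.

    tasks: list of task ids
    idle_nodes: list of node ids with a free slot
    block_locations: dict mapping task -> set of nodes holding its input block
    """
    assignment = {}
    free = set(idle_nodes)
    # First pass: give every task a data-local node if one is still free.
    for task in tasks:
        local = block_locations.get(task, set()) & free
        if local:
            node = local.pop()
            assignment[task] = node
            free.discard(node)
    # Second pass: place remaining tasks anywhere, paying remote data access.
    for task in tasks:
        if task not in assignment and free:
            assignment[task] = free.pop()
    return assignment


tasks = ["t1", "t2", "t3"]
idle_nodes = ["n1", "n2", "n3"]
block_locations = {"t1": {"n2"}, "t2": {"n2", "n3"}, "t3": {"n1"}}
print(schedule_batch(tasks, idle_nodes, block_locations))
# {'t1': 'n2', 't2': 'n3', 't3': 'n1'} -- all three placements are data-local
```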

151 citations


Network Information
Related Topics (5)
Software: 130.5K papers, 2M citations, 86% related
Cloud computing: 156.4K papers, 1.9M citations, 86% related
Cluster analysis: 146.5K papers, 2.9M citations, 85% related
The Internet: 213.2K papers, 3.8M citations, 85% related
Information system: 107.5K papers, 1.8M citations, 83% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2023    51
2022    125
2021    403
2020    721
2019    906
2018    816