Showing papers by "Kai Li published in 2008"

PDF

Open Access

Proceedings Article•DOI•

The PARSEC benchmark suite: characterization and architectural implications

[...]

Christian Bienia¹, Sanjeev Kumar², Jaswinder Pal Singh¹, Kai Li¹•Institutions (2)

25 Oct 2008

TL;DR: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs), and shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic.

...read moreread less

Abstract: This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previous available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multithreaded commercial programs. Our characterization shows that the benchmark suite covers a wide spectrum of working sets, locality, data sharing, synchronization and off-chip traffic. The benchmark suite has been made available to the public.

...read moreread less

3,514 citations

Proceedings Article•

Avoiding the disk bottleneck in the data domain deduplication file system

[...]

Benjamin Zhu¹, Kai Li², Hugo Patterson¹•Institutions (2)

Data Domain¹, Princeton University²

26 Feb 2008

TL;DR: Three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck are described, which enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/ sec for multi- stream throughput.

...read moreread less

Abstract: Disk-based deduplication storage has emerged as the new-generation storage system for enterprise data protection to replace tape libraries. Deduplication removes redundant data segments to compress data into a highly compact form and makes it economical to store backups on disk instead of tape. A crucial requirement for enterprise data protection is high throughput, typically over 100 MB/sec, which enables backups to complete quickly. A significant challenge is to identify and eliminate duplicate data segments at this rate on a low-cost system that cannot afford enough RAM to store an index of the stored segments and may be forced to access an on-disk index for every input segment. This paper describes three techniques employed in the production Data Domain deduplication file system to relieve the disk bottleneck. These techniques include: (1) the Summary Vector, a compact in-memory data structure for identifying new segments; (2) Stream-Informed Segment Layout, a data layout method to improve on-disk locality for sequentially accessed segments; and (3) Locality Preserved Caching, which maintains the locality of the fingerprints of duplicate segments to achieve high cache hit ratios. Together, they can remove 99% of the disk accesses for deduplication of real world workloads. These techniques enable a modern two-socket dual-core system to run at 90% CPU utilization with only one shelf of 15 disks and achieve 100 MB/sec for single-stream throughput and 210 MB/sec for multi-stream throughput.

...read moreread less

934 citations

Proceedings Article•DOI•

PARSEC vs. SPLASH-2: A quantitative comparison of two multithreaded benchmark suites on Chip-Multiprocessors

[...]

Christian Bienia¹, Sanjeev Kumar², Kai Li¹•Institutions (2)

Princeton University¹, Intel²

30 Sep 2008

TL;DR: It is shown that PARSEC workloads are fundamentally different from SPLASH-2 benchmarks, and the observed differences can be explained with two technology trends, the proliferation of CMPs and the accelerating growth of world data.

...read moreread less

Abstract: The PARSEC benchmark suite was recently released and has been adopted by a significant number of users within a short amount of time. This new collection of workloads is not yet fully understood by researchers. In this study we compare the SPLASH-2 and PARSEC benchmark suites with each other to gain insights into differences and similarities between the two program collections. We use standard statistical methods and machine learning to analyze the suites for redundancy and overlap on chip-multiprocessors (CMPs). Our analysis shows that PARSEC workloads are fundamentally different from SPLASH-2 benchmarks. The observed differences can be explained with two technology trends, the proliferation of CMPs and the accelerating growth of world data.

...read moreread less

239 citations

Proceedings Article•DOI•

Modeling LSH for performance tuning

[...]

Wei Dong¹, Zhe Wang¹, William Josephson¹, Moses Charikar¹, Kai Li¹ - Show less +1 more•Institutions (1)

Princeton University¹

26 Oct 2008

TL;DR: A statistical performance model of Multi-probe LSH, a state-of-the-art variance of LSH is presented, which can accurately predict the average search quality and latency given a small sample dataset and an adaptive LSH search algorithm is devised to determine the probing parameter dynamically for each query.

...read moreread less

Abstract: Although Locality-Sensitive Hashing (LSH) is a promising approach to similarity search in high-dimensional spaces, it has not been considered practical partly because its search quality is sensitive to several parameters that are quite data dependent. Previous research on LSH, though obtained interesting asymptotic results, provides little guidance on how these parameters should be chosen, and tuning parameters for a given dataset remains a tedious process.To address this problem, we present a statistical performance model of Multi-probe LSH, a state-of-the-art variance of LSH. Our model can accurately predict the average search quality and latency given a small sample dataset. Apart from automatic parameter tuning with the performance model, we also use the model to devise an adaptive LSH search algorithm to determine the probing parameter dynamically for each query. The adaptive probing method addresses the problem that even though the average performance is tuned for optimal, the variance of the performance is extremely high. We experimented with three different datasets including audio, images and 3D shapes to evaluate our methods. The results show the accuracy of the proposed model: the recall errors predicted are within 5% from the real values for most cases; the adaptive search method reduces the standard deviation of recall by about 50% over the existing method.

...read moreread less

164 citations

Book Chapter•DOI•

Towards Scalable Dataset Construction: An Active Learning Approach

[...]

Brendan M. Collins¹, Jia Deng¹, Kai Li¹, Li Fei-Fei¹•Institutions (1)

Princeton University¹

20 Oct 2008

TL;DR: This work presents a discriminative learning process which employs active, online learning to quickly classify many images with minimal user input, and demonstrates precision which is often superior to the state-of-the-art, with scalability which exceeds previous work.

...read moreread less

Abstract: As computer vision research considers more object categories and greater variation within object categories, it is clear that larger and more exhaustive datasets are necessary. However, the process of collecting such datasets is laborious and monotonous. We consider the setting in which many images have been automatically collected for a visual category (typically by automatic internet search), and we must separate relevant images from noise. We present a discriminative learning process which employs active, online learning to quickly classify many images with minimal user input. The principle advantage of this work over previous endeavors is its scalability. We demonstrate precision which is often superior to the state-of-the-art, with scalability which exceeds previous work.

...read moreread less

150 citations

Proceedings Article•DOI•

Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces

[...]

Wei Dong¹, Moses Charikar¹, Kai Li¹•Institutions (1)

Princeton University¹

20 Jul 2008

TL;DR: An efficient sketch algorithm for similarity search with L2 distances and a novel asymmetric distance estimation technique that takes advantage of the original feature vector of the query to boost the distance estimation accuracy.

...read moreread less

Abstract: Efficient similarity search in high-dimensional spaces is important to content-based retrieval systems. Recent studies have shown that sketches can effectively approximate L1 distance in high-dimensional spaces, and that filtering with sketches can speed up similarity search by an order of magnitude. It is a challenge to further reduce the size of sketches, which are already compact, without compromising accuracy of distance estimation.This paper presents an efficient sketch algorithm for similarity search with L2 distances and a novel asymmetric distance estimation technique. Our new asymmetric estimator takes advantage of the original feature vector of the query to boost the distance estimation accuracy. We also apply this asymmetric method to existing sketches for cosine similarity and L1 distance. Evaluations with datasets extracted from images and telephone records show that our L2 sketch outperforms existing methods, and the asymmetric estimators consistently improve the accuracy of different sketch methods. To achieve the same search quality, asymmetric estimators can reduce the sketch size by 10% to 40%.

...read moreread less

118 citations

Patent•

Delta compression after identity deduplication

[...]

Mark Huang¹, Edward K. Lee¹, Kai Li², Philip Shilane, Grant Wallace, Ming Benjamin Zhu - Show less +2 more•Institutions (2)

Data Domain¹, EMC Corporation²

14 Nov 2008

TL;DR: In this article, the identity deduplication is disclosed and a first data segment is determined to be identical to a first previous data segment, and a second data segment was determined not to be similar to a third previous segment.

...read moreread less

Abstract: Delta compression after identity deduplication is disclosed. A first data segment is determined to be identical to a first previous data segment. A second data segment, not determined to be identical to a second previous data segment, is then determined to be similar to a third previous data segment.

...read moreread less

94 citations

Patent•

Cluster storage using delta compression

[...]

R. Hugo Patterson¹, Kai Li¹, Ming Benjamin Zhu², Sazzala Reddy¹, Umesh Maheshwari¹, Edward K. Lee² - Show less +2 more•Institutions (2)

Data Domain¹, EMC Corporation²

09 Apr 2008

TL;DR: In this article, a data stream or a data block is split into segments, and a cluster node is selected for each segment, and in the event that a similar segment to the segment is identified that is already managed by the selected cluster node, a reference to the similar segment and a delta between the similar segments and the segment are caused to be stored on the selected node.

...read moreread less

Abstract: Cluster storage is disclosed A data stream or a data block is received The data stream or the data block is broken into segments For each segment, a cluster node is selected, and in the event that a similar segment to the segment is identified that is already managed by the selected cluster node, a reference to the similar segment and a delta between the similar segment and the segment is caused to be stored on the selected cluster node

...read moreread less

73 citations

Proceedings Article•DOI•

Efficiently matching sets of features with random histograms

[...]

Wei Dong¹, Zhe Wang¹, Moses Charikar¹, Kai Li¹•Institutions (1)

Princeton University¹

26 Oct 2008

TL;DR: A randomized algorithm to embed a set of features into a single high-dimensional vector to simplify the feature-set matching problem and can achieve accuracy comparable to the state-of-the-art feature- set matching methods, while requiring significantly less space and time.

...read moreread less

Abstract: As the commonly used representation of a feature-rich data object has evolved from a single feature vector to a set of feature vectors, a key challenge in building a content-based search engine for feature-rich data is to match feature-sets efficiently. Although substantial progress has been made during the past few years, existing approaches are still inefficient and inflexible for building a search engine for massive datasets. This paper presents a randomized algorithm to embed a set of features into a single high-dimensional vector to simplify the feature-set matching problem. The main idea is to project feature vectors into an auxiliary space using locality sensitive hashing and to represent a set of features as a histogram in the auxiliary space. A histogram is simply a high dimensional vector, and efficient similarity measures like L1 and L2 distances can be employed to approximate feature-set distance measures.We evaluated the proposed approach under three different task settings, i.e. content-based image search, image object recognition and near-duplicate video clip detection. The experimental results show that the proposed approach is indeed effective and flexible. It can achieve accuracy comparable to the feature-set matching methods, while requiring significantly less space and time. For object recognition with Caltech 101 dataset, our method runs 25 times faster to achieve the same precision as Pyramid Matching Kernel, the state-of-the-art feature-set matching method.

...read moreread less

55 citations