Book

Managing Gigabytes: Compressing and Indexing Documents and Images

TL;DR: A guide to the MG system and its applications is provided, along with a guide to the NZDL references index.
Abstract: Preface; 1. Overview; 2. Text Compression; 3. Indexing; 4. Querying; 5. Index Construction; 6. Image Compression; 7. Textual Images; 8. Mixed Text and Images; 9. Implementation; 10. The Information Explosion; A. Guide to the MG System; B. Guide to the NZDL References Index.
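The chapters above centre on compressed full-text indexes. As a rough illustration of the kind of technique the book treats (not code from the book; the function names are made up for this example), the sketch below gap-encodes a sorted posting list and compresses the gaps with Elias gamma codes:

```python
# Illustrative sketch (not from the book): gap encoding plus Elias gamma codes,
# a standard way to compress the posting lists of an inverted index.

def elias_gamma(n: int) -> str:
    """Encode a positive integer as an Elias gamma bit string."""
    binary = bin(n)[2:]                      # binary form, no leading zeros
    return "0" * (len(binary) - 1) + binary  # unary length prefix, then binary

def compress_postings(doc_ids):
    """Compress a sorted list of document ids as gamma-coded gaps."""
    bits, previous = [], 0
    for doc_id in doc_ids:
        gap = doc_id - previous              # gaps stay small for common terms
        bits.append(elias_gamma(gap))
        previous = doc_id
    return "".join(bits)

print(compress_postings([3, 7, 11, 23, 29]))  # 25 bits for gaps 3, 4, 4, 12, 6
```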
Citations
Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
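The hyperlink-structure analysis this paper is known for (PageRank) can be sketched, in simplified form, as a power iteration over the link graph. The tiny graph, damping factor, and iteration count below are illustrative assumptions, not values from the paper:

```python
# Simplified PageRank-style power iteration over a made-up link graph.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                 # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```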

14,696 citations

Journal Article
TL;DR: Google as discussed by the authors is a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.

13,327 citations

Proceedings ArticleDOI
Sivic, Zisserman
13 Oct 2003
TL;DR: An approach to object and scene retrieval which searches for and localizes all the occurrences of a user outlined object in a video, represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion.
Abstract: We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user outlined object in a video. The object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject unstable regions and reduce the effects of noise in the descriptors. The analogy with text retrieval is in the implementation where matches on descriptors are pre-computed (using vector quantization), and inverted file systems and document rankings are used. The result is that retrieval is immediate, returning a ranked list of key frames/shots in the manner of Google. The method is illustrated for matching in two full length feature films.
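The text-retrieval analogy described above, quantizing region descriptors into "visual words" and indexing frames with an inverted file, can be sketched roughly as follows. The codebook, descriptor dimensions, and function names are illustrative assumptions rather than the paper's implementation; codebook training (e.g. by k-means) is assumed to have been done elsewhere:

```python
# Rough sketch of the "visual words" idea: quantize local descriptors against a
# pre-trained codebook and index frames/shots in an inverted file.
import numpy as np
from collections import defaultdict

def quantize(descriptors, codebook):
    """Map each descriptor (one per row) to the index of its nearest codebook centre."""
    distances = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    return distances.argmin(axis=1)          # one visual-word id per descriptor

def build_inverted_file(frames, codebook):
    """frames: dict frame_id -> descriptor array. Returns word -> {frame_id: count}."""
    inverted = defaultdict(lambda: defaultdict(int))
    for frame_id, descriptors in frames.items():
        for word in quantize(descriptors, codebook):
            inverted[word][frame_id] += 1
    return inverted

# Toy example with random 8-D descriptors and a 4-word codebook.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 8))
frames = {"shot_1": rng.normal(size=(10, 8)), "shot_2": rng.normal(size=(12, 8))}
inverted = build_inverted_file(frames, codebook)
print({word: dict(counts) for word, counts in inverted.items()})
```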

6,938 citations


Cites background from "Managing Gigabytes: Compressing and..."

  • ...All of the above steps are carried out in advance of actual retrieval, and the set of vectors representing all the documents in a corpus are organized as an inverted file [18] to facilitate efficient retrieval....


Posted Content
01 Jan 2001
TL;DR: This text gives an overview of data mining and its relation to statistics, blending contributions from information science, computer science, and statistics.
Abstract: The growing interest in data mining is motivated by a common problem across disciplines: how does one store, access, model, and ultimately describe and understand very large data sets? Historically, different aspects of data mining have been addressed independently by different disciplines. This is the first truly interdisciplinary text on data mining, blending the contributions of information science, computer science, and statistics. The book consists of three sections. The first, foundations, provides a tutorial overview of the principles underlying data mining algorithms and their application. The presentation emphasizes intuition rather than rigor. The second section, data mining algorithms, shows how algorithms are constructed to solve specific problems in a principled manner. The algorithms covered include trees and rules for classification and regression, association rules, belief networks, classical statistical models, nonlinear models such as neural networks, and local "memory-based" models. The third section shows how all of the preceding analysis fits together when applied to real-world data mining problems. Topics include the role of metadata, how to handle missing data, and data preprocessing.

3,765 citations

Book
01 Jan 1996
TL;DR: The author explains the development of the Huffman coding algorithm, the techniques used in its implementation, and some of its applications; the book also covers bi-level image compression using the JBIG standard.
Abstract: Preface
1 Introduction 1.1 Compression Techniques 1.1.1 Lossless Compression 1.1.2 Lossy Compression 1.1.3 Measures of Performance 1.2 Modeling and Coding 1.3 Organization of This Book 1.4 Summary 1.5 Projects and Problems
2 Mathematical Preliminaries 2.1 Overview 2.2 A Brief Introduction to Information Theory 2.3 Models 2.3.1 Physical Models 2.3.2 Probability Models 2.3.3 Markov Models 2.4 Summary 2.5 Projects and Problems
3 Huffman Coding 3.1 Overview 3.2 "Good" Codes 3.3 The Huffman Coding Algorithm 3.3.1 Minimum Variance Huffman Codes 3.3.2 Length of Huffman Codes 3.3.3 Extended Huffman Codes 3.4 Nonbinary Huffman Codes 3.5 Adaptive Huffman Coding 3.5.1 Update Procedure 3.5.2 Encoding Procedure 3.5.3 Decoding Procedure 3.6 Applications of Huffman Coding 3.6.1 Lossless Image Compression 3.6.2 Text Compression 3.6.3 Audio Compression 3.7 Summary 3.8 Projects and Problems
4 Arithmetic Coding 4.1 Overview 4.2 Introduction 4.3 Coding a Sequence 4.3.1 Generating a Tag 4.3.2 Deciphering the Tag 4.4 Generating a Binary Code 4.4.1 Uniqueness and Efficiency of the Arithmetic Code 4.4.2 Algorithm Implementation 4.4.3 Integer Implementation 4.5 Comparison of Huffman and Arithmetic Coding 4.6 Applications 4.6.1 Bi-Level Image Compression: The JBIG Standard 4.6.2 Image Compression 4.7 Summary 4.8 Projects and Problems
5 Dictionary Techniques 5.1 Overview 5.2 Introduction 5.3 Static Dictionary 5.3.1 Digram Coding 5.4 Adaptive Dictionary 5.4.1 The LZ77 Approach 5.4.2 The LZ78 Approach 5.5 Applications 5.5.1 File Compression: UNIX COMPRESS 5.5.2 Image Compression: The Graphics Interchange Format (GIF) 5.5.3 Compression over Modems: V.42 bis 5.6 Summary 5.7 Projects and Problems
6 Lossless Image Compression 6.1 Overview 6.2 Introduction 6.3 Facsimile Encoding 6.3.1 Run-Length Coding 6.3.2 CCITT Group 3 and 4: Recommendations T.4 and T.6 6.3.3 Comparison of MH, MR, MMR, and JBIG 6.4 Progressive Image Transmission 6.5 Other Image Compression Approaches 6.5.1 Linear Prediction Models 6.5.2 Context Models 6.5.3 Multiresolution Models 6.5.4 Modeling Prediction Errors 6.6 Summary 6.7 Projects and Problems
7 Mathematical Preliminaries 7.1 Overview 7.2 Introduction 7.3 Distortion Criteria 7.3.1 The Human Visual System 7.3.2 Auditory Perception 7.4 Information Theory Revisited 7.4.1 Conditional Entropy 7.4.2 Average Mutual Information 7.4.3 Differential Entropy 7.5 Rate Distortion Theory 7.6 Models 7.6.1 Probability Models 7.6.2 Linear System Models 7.6.3 Physical Models 7.7 Summary 7.8 Projects and Problems
8 Scalar Quantization 8.1 Overview 8.2 Introduction 8.3 The Quantization Problem 8.4 Uniform Quantizer 8.5 Adaptive Quantization 8.5.1 Forward Adaptive Quantization 8.5.2 Backward Adaptive Quantization 8.6 Nonuniform Quantization 8.6.1 pdf-Optimized Quantization 8.6.2 Companded Quantization 8.7 Entropy-Coded Quantization 8.7.1 Entropy Coding of Lloyd-Max Quantizer Outputs 8.7.2 Entropy-Constrained Quantization 8.7.3 High-Rate Optimum Quantization 8.8 Summary 8.9 Projects and Problems
9 Vector Quantization 9.1 Overview 9.2 Introduction 9.3 Advantages of Vector Quantization over Scalar Quantization 9.4 The Linde-Buzo-Gray Algorithm 9.4.1 Initializing the LBG Algorithm 9.4.2 The Empty Cell Problem 9.4.3 Use of LBG for Image Compression 9.5 Tree-Structured Vector Quantizers 9.5.1 Design of Tree-Structured Vector Quantizers 9.6 Structured Vector Quantizers 9.6.1 Pyramid Vector Quantization 9.6.2 Polar and Spherical Vector Quantizers 9.6.3 Lattice Vector Quantizers 9.7 Variations on the Theme 9.7.1 Gain-Shape Vector Quantization 9.7.2 Mean-Removed Vector Quantization 9.7.3 Classified Vector Quantization 9.7.4 Multistage Vector Quantization 9.7.5 Adaptive Vector Quantization 9.8 Summary 9.9 Projects and Problems
10 Differential Encoding 10.1 Overview 10.2 Introduction 10.3 The Basic Algorithm 10.4 Prediction in DPCM 10.5 Adaptive DPCM (ADPCM) 10.5.1 Adaptive Quantization in DPCM 10.5.2 Adaptive Prediction in DPCM 10.6 Delta Modulation 10.6.1 Constant Factor Adaptive Delta Modulation (CFDM) 10.6.2 Continuously Variable Slope Delta Modulation 10.7 Speech Coding 10.7.1 G.726 10.8 Summary 10.9 Projects and Problems
11 Subband Coding 11.1 Overview 11.2 Introduction 11.3 The Frequency Domain and Filtering 11.3.1 Filters 11.4 The Basic Subband Coding Algorithm 11.4.1 Bit Allocation 11.5 Application to Speech Coding: G.722 11.6 Application to Audio Coding: MPEG Audio 11.7 Application to Image Compression 11.7.1 Decomposing an Image 11.7.2 Coding the Subbands 11.8 Wavelets 11.8.1 Families of Wavelets 11.8.2 Wavelets and Image Compression 11.9 Summary 11.10 Projects and Problems
12 Transform Coding 12.1 Overview 12.2 Introduction 12.3 The Transform 12.4 Transforms of Interest 12.4.1 Karhunen-Loeve Transform 12.4.2 Discrete Cosine Transform 12.4.3 Discrete Sine Transform 12.4.4 Discrete Walsh-Hadamard Transform 12.5 Quantization and Coding of Transform Coefficients 12.6 Application to Image Compression: JPEG 12.6.1 The Transform 12.6.2 Quantization 12.6.3 Coding 12.7 Application to Audio Compression 12.8 Summary 12.9 Projects and Problems
13 Analysis/Synthesis Schemes 13.1 Overview 13.2 Introduction 13.3 Speech Compression 13.3.1 The Channel Vocoder 13.3.2 The Linear Predictive Coder (Gov. Std. LPC-10) 13.3.3 Code Excited Linear Prediction (CELP) 13.3.4 Sinusoidal Coders 13.4 Image Compression 13.4.1 Fractal Compression 13.5 Summary 13.6 Projects and Problems
14 Video Compression 14.1 Overview 14.2 Introduction 14.3 Motion Compensation 14.4 Video Signal Representation 14.5 Algorithms for Videoconferencing and Videophones 14.5.1 ITU-T Recommendation H.261 14.5.2 Model-Based Coding 14.6 Asymmetric Applications 14.6.1 The MPEG Video Standard 14.7 Packet Video 14.7.1 ATM Networks 14.7.2 Compression Issues in ATM Networks 14.7.3 Compression Algorithms for Packet Video 14.8 Summary 14.9 Projects and Problems
A Probability and Random Processes A.1 Probability A.2 Random Variables A.3 Distribution Functions A.4 Expectation A.5 Types of Distribution A.6 Stochastic Process A.7 Projects and Problems
B A Brief Review of Matrix Concepts B.1 A Matrix B.2 Matrix Operations
C Codes for Facsimile Encoding
D The Root Lattices
Bibliography
Index
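The table of contents above opens with Huffman coding. As a generic, textbook-style illustration (not code from the book), the sketch below builds a Huffman code from symbol frequencies using a priority queue:

```python
# Minimal Huffman code construction (textbook-style sketch, not from the book).
import heapq

def huffman_codes(frequencies):
    """frequencies: dict symbol -> count. Returns dict symbol -> bit string."""
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}); the tie breaker
    # keeps tuple comparison away from the dict when weights are equal.
    heap = [(weight, i, {symbol: ""}) for i, (symbol, weight) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # merge the two least-frequent subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
```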

2,311 citations


Cites background or methods from "Managing Gigabytes: Compressing and..."

  • ...(For a more general look at the problem of managing large amounts of information, see [81]....


  • ...However, when we have two pixels of each kind, we run into trouble; consistently replacing the four pixels with either a white or black pixel causes a severe loss of detail, and randomly replacing with a black or white pixel introduces a considerable amount of noise into the image [81]....


References
Proceedings ArticleDOI
30 Mar 1993
TL;DR: A new method gives compression comparable with the JPEG lossless mode, with about five times the speed, based on a novel use of two neighboring pixels for both prediction and error modeling.
Abstract: A new method gives compression comparable with the JPEG lossless mode, with about five times the speed. FELICS is based on a novel use of two neighboring pixels for both prediction and error modeling. For coding, the authors use single bits, adjusted binary codes, and Golomb or Rice codes. For the latter they present and analyze a provably good method for estimating the single coding parameter.
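The Golomb/Rice codes mentioned above can be illustrated with a generic Rice coder. This sketch is not FELICS itself, and the parameter k is fixed by hand rather than estimated by the paper's procedure:

```python
# Generic Rice (Golomb power-of-two) coder: a sketch of the kind of code FELICS
# applies to prediction errors. The parameter k is chosen by hand here.

def rice_encode(value: int, k: int) -> str:
    """Encode a non-negative integer with Rice parameter k (divisor 2**k)."""
    quotient, remainder = value >> k, value & ((1 << k) - 1)
    unary = "1" * quotient + "0"                        # quotient in unary
    binary = format(remainder, f"0{k}b") if k else ""   # remainder in k bits
    return unary + binary

# Small prediction errors get short codes; larger ones get longer codes.
for error in (0, 1, 2, 5, 13):
    print(error, rice_encode(error, k=2))
```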

259 citations