Book

Managing Gigabytes: Compressing and Indexing Documents and Images

TL;DR: A guide to the MG system and its applications is provided, along with a guide to the NZDL references index.
Abstract: Preface; 1. Overview; 2. Text Compression; 3. Indexing; 4. Querying; 5. Index Construction; 6. Image Compression; 7. Textual Images; 8. Mixed Text and Images; 9. Implementation; 10. The Information Explosion; A. Guide to the MG System; B. Guide to the NZDL References Index.
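The chapters above centre on compressed full-text indexes. As a rough illustration of the kind of technique the book treats (not code from the book; the function names are made up for this example), the sketch below gap-encodes a sorted posting list and compresses the gaps with Elias gamma codes:

```python
# Illustrative sketch (not from the book): gap encoding plus Elias gamma codes,
# a standard way to compress the posting lists of an inverted index.

def elias_gamma(n: int) -> str:
    """Encode a positive integer as an Elias gamma bit string."""
    binary = bin(n)[2:]                      # binary form, no leading zeros
    return "0" * (len(binary) - 1) + binary  # unary length prefix, then binary

def compress_postings(doc_ids):
    """Compress a sorted list of document ids as gamma-coded gaps."""
    bits, previous = [], 0
    for doc_id in doc_ids:
        gap = doc_id - previous              # gaps stay small for common terms
        bits.append(elias_gamma(gap))
        previous = doc_id
    return "".join(bits)

print(compress_postings([3, 7, 11, 23, 29]))  # 25 bits for gaps 3, 4, 4, 12, 6
```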
Citations
Journal ArticleDOI
01 Apr 1998
TL;DR: This paper provides an in-depth description of Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and looks at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Abstract: In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
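The hyperlink-structure analysis this paper is known for (PageRank) can be sketched, in simplified form, as a power iteration over the link graph. The tiny graph, damping factor, and iteration count below are illustrative assumptions, not values from the paper:

```python
# Simplified PageRank-style power iteration over a made-up link graph.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                 # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```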

14,696 citations

Journal Article
TL;DR: Google as discussed by the authors is a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext and is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems.

13,327 citations

Proceedings ArticleDOI
Sivic, Zisserman
13 Oct 2003
TL;DR: An approach to object and scene retrieval which searches for and localizes all the occurrences of a user outlined object in a video, represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion.
Abstract: We describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user outlined object in a video. The object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. The temporal continuity of the video within a shot is used to track the regions in order to reject unstable regions and reduce the effects of noise in the descriptors. The analogy with text retrieval is in the implementation where matches on descriptors are pre-computed (using vector quantization), and inverted file systems and document rankings are used. The result is that retrieval is immediate, returning a ranked list of key frames/shots in the manner of Google. The method is illustrated for matching in two full length feature films.
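The text-retrieval analogy described above, quantizing region descriptors into "visual words" and indexing frames with an inverted file, can be sketched roughly as follows. The codebook, descriptor dimensions, and function names are illustrative assumptions rather than the paper's implementation; codebook training (e.g. by k-means) is assumed to have been done elsewhere:

```python
# Rough sketch of the "visual words" idea: quantize local descriptors against a
# pre-trained codebook and index frames/shots in an inverted file.
import numpy as np
from collections import defaultdict

def quantize(descriptors, codebook):
    """Map each descriptor (one per row) to the index of its nearest codebook centre."""
    distances = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    return distances.argmin(axis=1)          # one visual-word id per descriptor

def build_inverted_file(frames, codebook):
    """frames: dict frame_id -> descriptor array. Returns word -> {frame_id: count}."""
    inverted = defaultdict(lambda: defaultdict(int))
    for frame_id, descriptors in frames.items():
        for word in quantize(descriptors, codebook):
            inverted[word][frame_id] += 1
    return inverted

# Toy example with random 8-D descriptors and a 4-word codebook.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 8))
frames = {"shot_1": rng.normal(size=(10, 8)), "shot_2": rng.normal(size=(12, 8))}
inverted = build_inverted_file(frames, codebook)
print({word: dict(counts) for word, counts in inverted.items()})
```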

6,938 citations


Cites background from "Managing Gigabytes: Compressing and..."

  • ...All of the above steps are carried out in advance of actual retrieval, and the set of vectors representing all the documents in a corpus are organized as an inverted file [18] to facilitate efficient retrieval....


Posted Content
01 Jan 2001
TL;DR: This text gives an overview of data mining and its relation to statistics, blending contributions from information science, computer science, and statistics.
Abstract: The growing interest in data mining is motivated by a common problem across disciplines: how does one store, access, model, and ultimately describe and understand very large data sets? Historically, different aspects of data mining have been addressed independently by different disciplines. This is the first truly interdisciplinary text on data mining, blending the contributions of information science, computer science, and statistics. The book consists of three sections. The first, foundations, provides a tutorial overview of the principles underlying data mining algorithms and their application. The presentation emphasizes intuition rather than rigor. The second section, data mining algorithms, shows how algorithms are constructed to solve specific problems in a principled manner. The algorithms covered include trees and rules for classification and regression, association rules, belief networks, classical statistical models, nonlinear models such as neural networks, and local "memory-based" models. The third section shows how all of the preceding analysis fits together when applied to real-world data mining problems. Topics include the role of metadata, how to handle missing data, and data preprocessing.

3,765 citations

Book
01 Jan 1996
TL;DR: The author explains the development of the Huffman coding algorithm, the techniques used in its implementation, and some of its applications; the book also covers bi-level image compression using the JBIG standard.
Abstract: Preface
1 Introduction 1.1 Compression Techniques 1.1.1 Lossless Compression 1.1.2 Lossy Compression 1.1.3 Measures of Performance 1.2 Modeling and Coding 1.3 Organization of This Book 1.4 Summary 1.5 Projects and Problems
2 Mathematical Preliminaries 2.1 Overview 2.2 A Brief Introduction to Information Theory 2.3 Models 2.3.1 Physical Models 2.3.2 Probability Models 2.3.3 Markov Models 2.4 Summary 2.5 Projects and Problems
3 Huffman Coding 3.1 Overview 3.2 "Good" Codes 3.3 The Huffman Coding Algorithm 3.3.1 Minimum Variance Huffman Codes 3.3.2 Length of Huffman Codes 3.3.3 Extended Huffman Codes 3.4 Nonbinary Huffman Codes 3.5 Adaptive Huffman Coding 3.5.1 Update Procedure 3.5.2 Encoding Procedure 3.5.3 Decoding Procedure 3.6 Applications of Huffman Coding 3.6.1 Lossless Image Compression 3.6.2 Text Compression 3.6.3 Audio Compression 3.7 Summary 3.8 Projects and Problems
4 Arithmetic Coding 4.1 Overview 4.2 Introduction 4.3 Coding a Sequence 4.3.1 Generating a Tag 4.3.2 Deciphering the Tag 4.4 Generating a Binary Code 4.4.1 Uniqueness and Efficiency of the Arithmetic Code 4.4.2 Algorithm Implementation 4.4.3 Integer Implementation 4.5 Comparison of Huffman and Arithmetic Coding 4.6 Applications 4.6.1 Bi-Level Image Compression: The JBIG Standard 4.6.2 Image Compression 4.7 Summary 4.8 Projects and Problems
5 Dictionary Techniques 5.1 Overview 5.2 Introduction 5.3 Static Dictionary 5.3.1 Digram Coding 5.4 Adaptive Dictionary 5.4.1 The LZ77 Approach 5.4.2 The LZ78 Approach 5.5 Applications 5.5.1 File Compression: UNIX COMPRESS 5.5.2 Image Compression: The Graphics Interchange Format (GIF) 5.5.3 Compression over Modems: V.42 bis 5.6 Summary 5.7 Projects and Problems
6 Lossless Image Compression 6.1 Overview 6.2 Introduction 6.3 Facsimile Encoding 6.3.1 Run-Length Coding 6.3.2 CCITT Group 3 and 4: Recommendations T.4 and T.6 6.3.3 Comparison of MH, MR, MMR, and JBIG 6.4 Progressive Image Transmission 6.5 Other Image Compression Approaches 6.5.1 Linear Prediction Models 6.5.2 Context Models 6.5.3 Multiresolution Models 6.5.4 Modeling Prediction Errors 6.6 Summary 6.7 Projects and Problems
7 Mathematical Preliminaries 7.1 Overview 7.2 Introduction 7.3 Distortion Criteria 7.3.1 The Human Visual System 7.3.2 Auditory Perception 7.4 Information Theory Revisited 7.4.1 Conditional Entropy 7.4.2 Average Mutual Information 7.4.3 Differential Entropy 7.5 Rate Distortion Theory 7.6 Models 7.6.1 Probability Models 7.6.2 Linear System Models 7.6.3 Physical Models 7.7 Summary 7.8 Projects and Problems
8 Scalar Quantization 8.1 Overview 8.2 Introduction 8.3 The Quantization Problem 8.4 Uniform Quantizer 8.5 Adaptive Quantization 8.5.1 Forward Adaptive Quantization 8.5.2 Backward Adaptive Quantization 8.6 Nonuniform Quantization 8.6.1 pdf-Optimized Quantization 8.6.2 Companded Quantization 8.7 Entropy-Coded Quantization 8.7.1 Entropy Coding of Lloyd-Max Quantizer Outputs 8.7.2 Entropy-Constrained Quantization 8.7.3 High-Rate Optimum Quantization 8.8 Summary 8.9 Projects and Problems
9 Vector Quantization 9.1 Overview 9.2 Introduction 9.3 Advantages of Vector Quantization over Scalar Quantization 9.4 The Linde-Buzo-Gray Algorithm 9.4.1 Initializing the LBG Algorithm 9.4.2 The Empty Cell Problem 9.4.3 Use of LBG for Image Compression 9.5 Tree-Structured Vector Quantizers 9.5.1 Design of Tree-Structured Vector Quantizers 9.6 Structured Vector Quantizers 9.6.1 Pyramid Vector Quantization 9.6.2 Polar and Spherical Vector Quantizers 9.6.3 Lattice Vector Quantizers 9.7 Variations on the Theme 9.7.1 Gain-Shape Vector Quantization 9.7.2 Mean-Removed Vector Quantization 9.7.3 Classified Vector Quantization 9.7.4 Multistage Vector Quantization 9.7.5 Adaptive Vector Quantization 9.8 Summary 9.9 Projects and Problems
10 Differential Encoding 10.1 Overview 10.2 Introduction 10.3 The Basic Algorithm 10.4 Prediction in DPCM 10.5 Adaptive DPCM (ADPCM) 10.5.1 Adaptive Quantization in DPCM 10.5.2 Adaptive Prediction in DPCM 10.6 Delta Modulation 10.6.1 Constant Factor Adaptive Delta Modulation (CFDM) 10.6.2 Continuously Variable Slope Delta Modulation 10.7 Speech Coding 10.7.1 G.726 10.8 Summary 10.9 Projects and Problems
11 Subband Coding 11.1 Overview 11.2 Introduction 11.3 The Frequency Domain and Filtering 11.3.1 Filters 11.4 The Basic Subband Coding Algorithm 11.4.1 Bit Allocation 11.5 Application to Speech Coding: G.722 11.6 Application to Audio Coding: MPEG Audio 11.7 Application to Image Compression 11.7.1 Decomposing an Image 11.7.2 Coding the Subbands 11.8 Wavelets 11.8.1 Families of Wavelets 11.8.2 Wavelets and Image Compression 11.9 Summary 11.10 Projects and Problems
12 Transform Coding 12.1 Overview 12.2 Introduction 12.3 The Transform 12.4 Transforms of Interest 12.4.1 Karhunen-Loeve Transform 12.4.2 Discrete Cosine Transform 12.4.3 Discrete Sine Transform 12.4.4 Discrete Walsh-Hadamard Transform 12.5 Quantization and Coding of Transform Coefficients 12.6 Application to Image Compression: JPEG 12.6.1 The Transform 12.6.2 Quantization 12.6.3 Coding 12.7 Application to Audio Compression 12.8 Summary 12.9 Projects and Problems
13 Analysis/Synthesis Schemes 13.1 Overview 13.2 Introduction 13.3 Speech Compression 13.3.1 The Channel Vocoder 13.3.2 The Linear Predictive Coder (Gov. Std. LPC-10) 13.3.3 Code Excited Linear Prediction (CELP) 13.3.4 Sinusoidal Coders 13.4 Image Compression 13.4.1 Fractal Compression 13.5 Summary 13.6 Projects and Problems
14 Video Compression 14.1 Overview 14.2 Introduction 14.3 Motion Compensation 14.4 Video Signal Representation 14.5 Algorithms for Videoconferencing and Videophones 14.5.1 ITU-T Recommendation H.261 14.5.2 Model-Based Coding 14.6 Asymmetric Applications 14.6.1 The MPEG Video Standard 14.7 Packet Video 14.7.1 ATM Networks 14.7.2 Compression Issues in ATM Networks 14.7.3 Compression Algorithms for Packet Video 14.8 Summary 14.9 Projects and Problems
A Probability and Random Processes A.1 Probability A.2 Random Variables A.3 Distribution Functions A.4 Expectation A.5 Types of Distribution A.6 Stochastic Process A.7 Projects and Problems
B A Brief Review of Matrix Concepts B.1 A Matrix B.2 Matrix Operations
C Codes for Facsimile Encoding
D The Root Lattices
Bibliography
Index
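The table of contents above opens with Huffman coding. As a generic, textbook-style illustration (not code from the book), the sketch below builds a Huffman code from symbol frequencies using a priority queue:

```python
# Minimal Huffman code construction (textbook-style sketch, not from the book).
import heapq

def huffman_codes(frequencies):
    """frequencies: dict symbol -> count. Returns dict symbol -> bit string."""
    # Heap entries: (weight, tie_breaker, {symbol: code_so_far}); the tie breaker
    # keeps tuple comparison away from the dict when weights are equal.
    heap = [(weight, i, {symbol: ""}) for i, (symbol, weight) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # merge the two least-frequent subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_codes({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
```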

2,311 citations


Cites background or methods from "Managing Gigabytes: Compressing and..."

  • ...(For a more general look at the problem of managing large amounts of information, see [81]....


  • ...However, when we have two pixels of each kind, we run into trouble; consistently replacing the four pixels with either a white or black pixel causes a severe loss of detail, and randomly replacing with a black or white pixel introduces a considerable amount of noise into the image [81]....


References
Proceedings ArticleDOI
30 Mar 1993
TL;DR: A new method gives compression comparable with the JPEG lossless mode, with about five times the speed, based on a novel use of two neighboring pixels for both prediction and error modeling.
Abstract: A new method gives compression comparable with the JPEG lossless mode, with about five times the speed. FELICS is based on a novel use of two neighboring pixels for both prediction and error modeling. For coding, the authors use single bits, adjusted binary codes, and Golomb or Rice codes. For the latter they present and analyze a provably good method for estimating the single coding parameter.
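The Golomb/Rice codes mentioned above can be illustrated with a generic Rice coder. This sketch is not FELICS itself, and the parameter k is fixed by hand rather than estimated by the paper's procedure:

```python
# Generic Rice (Golomb power-of-two) coder: a sketch of the kind of code FELICS
# applies to prediction errors. The parameter k is chosen by hand here.

def rice_encode(value: int, k: int) -> str:
    """Encode a non-negative integer with Rice parameter k (divisor 2**k)."""
    quotient, remainder = value >> k, value & ((1 << k) - 1)
    unary = "1" * quotient + "0"                        # quotient in unary
    binary = format(remainder, f"0{k}b") if k else ""   # remainder in k bits
    return unary + binary

# Small prediction errors get short codes; larger ones get longer codes.
for error in (0, 1, 2, 5, 13):
    print(error, rice_encode(error, k=2))
```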

259 citations