
Showing papers in "Sequence" in 1997


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.
Abstract: Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using Rabin (1981) fingerprints.
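As a rough illustration of the resemblance and containment measures described above, the sketch below (not from the paper; the word-shingle size, the MD5 hashing, and the sample size are my assumptions) computes r(A, B) and c(A, B) from shingle sets and approximates r from a fixed-size min-wise sample, in the spirit of the random-sampling idea.

```python
import hashlib

def shingles(text, w=3):
    """Return the set of contiguous w-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=3):
    """r(A, B) = |S(A) & S(B)| / |S(A) | S(B)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)

def containment(a, b, w=3):
    """c(A, B) = |S(A) & S(B)| / |S(A)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa)

def min_sample(text, w=3, k=50):
    """Fixed-size sample: the k smallest hash values of the shingle set."""
    hs = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles(text, w))
    return set(hs[:k])

def estimated_resemblance(a, b, w=3, k=50):
    """Estimate r(A, B) from the fixed-size samples alone."""
    sa, sb = min_sample(a, w, k), min_sample(b, w, k)
    union_sample = set(sorted(sa | sb)[:k])   # k smallest hashes of the union
    return len(union_sample & sa & sb) / len(union_sample)

if __name__ == "__main__":
    A = "the quick brown fox jumps over the lazy dog " * 20
    B = "the quick brown fox leaps over the lazy dog " * 20
    print(resemblance(A, B), containment(A, B), estimated_resemblance(A, B))
```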

1,989 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: The authors isolate the most basic issues in molecular biological group testing and formulate a set of novel group testing problems for designing cost effective experiments.
Abstract: Group testing is a basic paradigm for experimental design. In computational biology, group testing problems come up in designing experiments with sequences for mapping, screening libraries, etc. While a great deal of classical research has been done on group testing over the last fifty years, the current biological applications bring up many new issues in group testing which had not been previously considered. The authors isolate the most basic issues in molecular biological group testing. Given these, they formulate a set of novel group testing problems for designing cost effective experiments. For some of these problems they give solutions, while leaving others open.
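The abstract does not spell out a particular design, but the following toy sketch (my own illustration, not the paper's formulation; pool sizes and counts are arbitrary) shows the basic group-testing economy: random pools are tested, any item appearing in a negative pool is cleared, and the survivors are the candidate positives.

```python
import random

def run_group_test(n=100, positives=(7, 42), num_pools=30, pool_size=20, seed=1):
    """Toy non-adaptive group testing: test pools, then clear items seen in negative pools."""
    rng = random.Random(seed)
    pools = [rng.sample(range(n), pool_size) for _ in range(num_pools)]
    outcomes = [any(item in positives for item in pool) for pool in pools]

    candidates = set(range(n))
    for pool, positive in zip(pools, outcomes):
        if not positive:                 # a negative pool clears all of its members
            candidates -= set(pool)
    return candidates                    # a superset of the true positives

if __name__ == "__main__":
    print(run_group_test())              # 30 pooled tests instead of 100 individual ones
```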

101 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: Questions related to counting and representing code and parse trees are discussed, along with variants of Huffman coding in which the assignment of 0s and 1s within codewords is significant, such as bidirectionality and synchronization.
Abstract: This paper surveys the theoretical literature on fixed-to-variable-length lossless source code trees, called code trees, and on variable-to-fixed-length lossless source code trees, called parse trees. In particular, the following code tree topics are outlined in this survey: characteristics of the Huffman (1952) code tree; Huffman-type coding for infinite source alphabets and universal coding; the Huffman problem subject to a lexicographic constraint, or the Hu-Tucker (1982) problem; the Huffman problem subject to maximum codeword length constraints; code trees which minimize other functions besides average codeword length; coding for unequal cost code symbols, or the Karp problem, and finite state channels; and variants of Huffman coding in which the assignment of 0s and 1s within codewords is significant, such as bidirectionality and synchronization. The literature on parse tree topics is less extensive. Treated here are: variants of Tunstall (1968) parsing; dualities between parsing and coding; dual tree coding in which parsing and coding are combined to yield variable-length-to-variable-length codes; and parsing and random number generation. Finally, questions related to counting and representing code and parse trees are also discussed.
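For reference, a minimal Huffman code-tree construction (the standard textbook algorithm, not anything specific to this survey) is sketched below; it merges the two least-frequent subtrees with a priority queue and reads the codewords off the root-to-leaf paths.

```python
import heapq

def huffman_code(freqs):
    """Build a binary Huffman code for a dict {symbol: frequency}."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate single-symbol alphabet
        return {s: "0" for s in freqs}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)     # two least-frequent subtrees
        f2, i, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
    return heap[0][2]

if __name__ == "__main__":
    print(huffman_code({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
```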

84 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This work proposes the use of a signature-based technique to "shrink" the data sequences into signatures, and search the signatures instead of the real sequences, with further comparison being required only when a possible match is indicated.
Abstract: Jagadish et al. (see Proc. ACM SIGACT-SIGMOD-SIGART PODS, p.36-45, 1995) developed a general framework for posing queries based on similarity. The framework enables a formal definition of the notion of similarity for an application domain of choice, and then its use in queries to perform similarity-based search. We adapt this framework to the specialized domain of real-valued sequences, although some of the ideas we present are applicable to other types of data as well. In particular we focus on whole-match queries. By whole-match query we mean the case where the user has to specify the whole sequence. Similarity-based search can be computationally very expensive. The computation cost depends heavily on the length of the sequences being compared. To make such similarity testing feasible on large data sets, we propose the use of a signature-based technique. In a nutshell, our approach is to "shrink" the data sequences into signatures, and search the signatures instead of the real sequences, with further comparison being required only when a possible match is indicated. Being shorter, signatures can usually be compared much faster than the original sequences. In addition, signatures are usually easier to index. For such a signature-based technique to be effective one has to ensure that (1) the signature comparison is fast, and (2) the signature comparison gives few false alarms and no false dismissals. We obtain measures of goodness for our technique. The technique is illustrated with a couple of very different examples.
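To make the filter-then-verify idea concrete, here is a small sketch of mine (the specific signature, a coarse piecewise mean of the sequence, and the Euclidean threshold are assumptions, not the paper's definitions): signatures are compared first, and the full sequences are compared only when the signature test cannot rule out a match.

```python
import math

def signature(seq, k=8):
    """Shrink a real-valued sequence to k piecewise segment means (a toy signature)."""
    n = len(seq)
    sig = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        segment = seq[lo:hi] or [0.0]
        sig.append(sum(segment) / len(segment))
    return sig

def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def whole_match(query, database, eps):
    """Filter on signatures first, verify survivors on the full sequences."""
    q_sig = signature(query)
    hits = []
    for seq in database:
        # Averaging within segments can only shrink Euclidean distance, so a
        # signature distance above eps safely rules the sequence out.
        if dist(q_sig, signature(seq)) <= eps:
            if dist(query, seq) <= eps:          # the full, expensive comparison
                hits.append(seq)
    return hits

if __name__ == "__main__":
    db = [[math.sin(i / 10 + phase) for i in range(128)] for phase in (0.0, 0.05, 1.5)]
    print(len(whole_match(db[0], db, eps=1.0)))
```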

71 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: It is shown that the description of the alphabet can be separated from that of the sequence in such a way that the encoding of the actual sequence can be performed independently of the alphabet description, and sequential coding methods for such sequences are presented.
Abstract: For lossless universal source coding of memoryless sequences with an a priori unknown alphabet size (multialphabet coding), the alphabet of the sequence must be described as well as the sequence itself. Usually an efficient description of the alphabet can be made only by taking into account some additional information. We show that these descriptions can be separated in such a way that the encoding of the actual sequence can be performed independently of the alphabet description, and present sequential coding methods for such sequences. Such methods have applications in coding methods where the alphabet description is made available sequentially, such as PPM.

46 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: The frequency of approximate occurrences of the pattern H in a random text is studied when overlapping copies of the approximate pattern are counted separately; exact and asymptotic formulae for the mean, variance and probability of occurrence are provided, as well as asymptotic results including the central limit theorem and large deviations.
Abstract: Consider a given pattern H and a random text T generated according to the Bernoulli model. We study the frequency of approximate occurrences of the pattern H in a random text when overlapping copies of the approximate pattern are counted separately. We provide exact and asymptotic formulae for the mean, variance and probability of occurrence, as well as asymptotic results including the central limit theorem and large deviations. Our approach is combinatorial: we first construct language expressions that characterize pattern occurrences, translate them into generating functions, and finally use analytical methods to extract the asymptotic behaviour of the pattern frequency. Applications of these results include molecular biology, source coding, synchronization, wireless communications, approximate pattern matching, games, and stock market analysis. These findings are of particular interest to information theory (e.g., second-order properties of the relative frequency) and to molecular biology problems (e.g., finding patterns with unexpectedly high or low frequencies, and gene recognition).
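A brute-force check of the quantity being analyzed, counting overlapping approximate occurrences (here: within a given Hamming distance) of a pattern in Bernoulli-model text, might look like the sketch below; the mismatch model and the parameters are illustrative assumptions, not the paper's exact setting.

```python
import random

def count_approx_occurrences(text, pattern, max_mismatches):
    """Count overlapping windows of the text within max_mismatches of the pattern."""
    m = len(pattern)
    return sum(
        sum(a != b for a, b in zip(text[i:i + m], pattern)) <= max_mismatches
        for i in range(len(text) - m + 1)
    )

def bernoulli_text(n, p=0.5, seed=0):
    """Binary text with independent symbols, P('1') = p."""
    rng = random.Random(seed)
    return "".join("1" if rng.random() < p else "0" for _ in range(n))

if __name__ == "__main__":
    H = "10101"
    counts = [count_approx_occurrences(bernoulli_text(10_000, seed=s), H, 1)
              for s in range(20)]
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    print(mean, var)   # empirical mean/variance of the overlapping occurrence count
```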

34 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This work considers aspects of estimating conditional and unconditional densities in conjunction with Bayes-risk weighted vector quantization for joint compression and classification.
Abstract: The connection between compression and the estimation of probability distributions has long been known for the case of discrete alphabet sources and lossless coding. A universal lossless code which does a good job of compressing must implicitly also do a good job of modeling. In particular, with a collection of codebooks, one for each possible class or model, if codewords are chosen from among the ensemble of codebooks so as to minimize bit rate, then the codebook selected provides an implicit estimate of the underlying class. Less is known about the corresponding connections between lossy compression and continuous sources. We consider aspects of estimating conditional and unconditional densities in conjunction with Bayes-risk weighted vector quantization for joint compression and classification.

33 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: It is shown that the compression ratio of the Lempel-Ziv algorithms can be much higher than the zeroth-order entropy H_0 of the input string, and it is proved that for any string s the compression ratio achieved by LZ77 is bounded by 8H_0(s).
Abstract: We compare the compression ratio of the Lempel-Ziv algorithms with the empirical entropy of the input string. We show that although these algorithms are optimal according to the generally accepted definition, we can find families of low-entropy strings which are not compressed optimally. More precisely, we show that the compression ratio achieved by LZ78 (resp. LZ77) can be much higher than the zeroth-order entropy H_0 (resp. the first-order entropy H_1). We present a compression algorithm which combines LZ78 with run-length encoding, and we show that for any string s the new algorithm achieves a compression ratio bounded by 3H_0(s). Finally, we prove that for any string s the compression ratio achieved by LZ77 is bounded by 8H_0(s).
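To see the kind of gap the paper is about, one can compare the zeroth-order empirical entropy H_0(s) with the cost of a plain LZ78 parse on a low-entropy string. The sketch below is a toy comparison of mine (the per-phrase bit cost is a rough assumption), not the paper's construction or its modified algorithm.

```python
import math
from collections import Counter

def h0(s):
    """Zeroth-order empirical entropy in bits per symbol."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def lz78_phrases(s):
    """Number of phrases in the LZ78 incremental parse of s."""
    dictionary, phrase, count = {""}, "", 0
    for ch in s:
        phrase += ch
        if phrase not in dictionary:
            dictionary.add(phrase)
            count += 1
            phrase = ""
    if phrase:
        count += 1
    return count

if __name__ == "__main__":
    n = 100_000
    s = "a" * (n - 1) + "b"            # a very low zeroth-order entropy string
    c = lz78_phrases(s)
    # Rough cost model: each phrase takes about log2(c) bits for the back-reference
    # plus one literal byte.
    lz78_bits_per_symbol = c * (math.log2(c) + 8) / n
    print(f"H0 = {h0(s):.5f} bits/symbol, LZ78 ~ {lz78_bits_per_symbol:.3f} bits/symbol")
```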

33 citations


Proceedings ArticleDOI
Andrew Mayer, Moti Yung
11 Jun 1997-Sequence
TL;DR: Two basic primitives, generalized secret sharing and group-key distribution, are considered; it is proved that the two are related, and a reduction is given showing that group-key distribution implies secret sharing under pseudo-random functions (i.e., one-way functions).
Abstract: We relate two basic primitives: generalized secret sharing and group-key distribution. We suggest cryptographic implementations for both and show that they are provably secure according to exact definitions and assumptions given in the present paper. Both solutions require small secret space (namely, short keys). We first consider secret sharing with arbitrary access structures which is a basic primitive for controlling retrieval of secret information. We consider the computational security model, where cryptographic assumptions are allowed. Our design of a general secret-sharing scheme requires considerably less secure memory (i.e., shorter keys) than before. We then introduce the notion of a (single source) group-key distribution protocol which allows a center in an integrated network to securely and repeatedly send different keys to different groups. Such a capability is of increasing importance as it is a building block for secret information dissemination to various groups of participants in the presence of eavesdropping in a network environment. There are only a few previous investigations concerning this primitive and they either require a large amount of storage of secret information (due to their information theoretic security model) or lack rigorous definitions and proofs of security. We base both primitives on pseudo-random functions. We prove that the two are related; we give a reduction showing that group-key distribution implies secret-sharing under pseudo-random functions (i.e., one-way functions).

25 citations


Proceedings ArticleDOI
K. Sadakane
11 Jun 1997-Sequence
TL;DR: The asymptotic optimality of a variation of block sorting is proved and the relation among the RRC, context sorting, block sorting and PPM* is derived.
Abstract: A block sorting compression scheme was developed and its relation to a statistical scheme was studied, but a theoretical analysis of its performance has not been carried out fully. Context sorting is a compression scheme based on context similarity; it is regarded as an on-line version of block sorting and it is asymptotically optimal. However, its compression speed is slower and its real performance is not better. We propose a compression scheme using a recency rank code with context (RRC), which is based on context similarity. The proposed method encodes characters to recency ranks according to their contexts. It can be implemented using a suffix tree, and the recency rank code is realized by move-to-front transformation of edges in the suffix tree. It is faster than context sorting and is also asymptotically optimal. The performance is improved by changing models according to the length of the context and by combining some characters into a code. However, it is still inferior to block sorting in both performance and speed. We investigate the reason for the poor performance; we also prove the asymptotic optimality of a variation of block sorting and derive the relation among the RRC, context sorting, block sorting and PPM*.
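The recency-rank component can be illustrated with a plain move-to-front transform over the alphabet; this sketch of mine ignores the per-context modelling and the suffix-tree machinery that the paper actually uses.

```python
def mtf_encode(s, alphabet):
    """Replace each symbol by its recency rank and move it to the front."""
    table = list(alphabet)
    ranks = []
    for ch in s:
        r = table.index(ch)
        ranks.append(r)
        table.pop(r)
        table.insert(0, ch)          # the most recently seen symbol gets rank 0
    return ranks

def mtf_decode(ranks, alphabet):
    table = list(alphabet)
    out = []
    for r in ranks:
        ch = table.pop(r)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)

if __name__ == "__main__":
    s = "mississippi"
    alphabet = sorted(set(s))
    ranks = mtf_encode(s, alphabet)
    print(ranks)                      # runs of small ranks signal locality
    assert mtf_decode(ranks, alphabet) == s
```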

16 citations


Proceedings ArticleDOI
P.G. Howard
11 Jun 1997-Sequence
TL;DR: This paper provides three extensions to block Melcode (a coder based on interleaved run-length codes) that allow its use with multisymbol alphabets, allow its use with an extended class of prefix codes, and reduce its worst-case inefficiency by almost two thirds.
Abstract: The paper addresses several issues involved in interleaving compressed output from multiple non-prefix codes or from a combination of prefix and non-prefix codes. The technique used throughout is decoder-synchronized encoding, in which the encoder manipulates the data stream to allow just-in-time decoding. We provide three extensions to block Melcode (a coder based on interleaved run-length codes) that allow its use with multisymbol alphabets, allow its use with an extended class of prefix codes, and reduce its worst-case inefficiency by almost two thirds. We also show that it is possible to interleave output from an arithmetic coder with output from a prefix coder, such as a Huffman coder; we present an encoder back-end that handles all the details transparently, requiring only minor changes to the encoders and no changes to the decoders.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This work considers the problem of finding the longest common subsequence of two strings, and develops significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems.
Abstract: Measuring the similarity between two strings, through such standard measures as Hamming distance, edit distance, and longest common subsequence, is one of the fundamental problems in pattern matching. We consider the problem of finding the longest common subsequence of two strings. A well-known dynamic programming algorithm computes the longest common subsequence of strings X and Y in O(|X|·|Y|) time. We develop significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems. A string S is run-length encoded if it is described as an ordered sequence of pairs (σ, i), each consisting of an alphabet symbol σ and an integer i. Each pair corresponds to a run in S consisting of i consecutive occurrences of σ. For example, the string aaaabbbbcccabbbbcc can be encoded as a^4 b^4 c^3 a^1 b^4 c^2. Such a run-length encoded string can be significantly shorter than the expanded string representation. Indeed, run-length coding serves as a popular image compression technique, since many classes of images, such as binary images in facsimile transmission, typically contain large patches of identically-valued pixels.
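For concreteness, here is the run-length encoding from the example together with the standard O(|X|·|Y|) dynamic-programming LCS it is meant to beat; this is the baseline, not the faster run-length-aware algorithm developed in the paper.

```python
from itertools import groupby

def run_length_encode(s):
    """aaaabbbbcccabbbbcc -> [('a', 4), ('b', 4), ('c', 3), ('a', 1), ('b', 4), ('c', 2)]."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def lcs_length(x, y):
    """Classic O(|X|*|Y|) dynamic program on the expanded strings."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

if __name__ == "__main__":
    s = "aaaabbbbcccabbbbcc"
    print(run_length_encode(s))
    print(lcs_length(s, "aabbbcccbbcc"))
```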

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: It is shown that the converse does not hold, i.e., that there are sequences with perfectly balanced asymptotic statistics that the Lempel-Ziv algorithm compresses optimally.
Abstract: We consider the performance of the Lempel-Ziv (1978) algorithm on finite strings and infinite sequences having unbalanced statistics. We show that such strings and sequences are compressed by the Lempel-Ziv algorithm. We show that the converse does not hold, i.e., that there are sequences with perfectly balanced asymptotic statistics that the Lempel-Ziv algorithm compresses optimally.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This article makes the first attempt at the small-space string-matching problem in which sublinear-time algorithms are delivered, showing that all occurrences of one- or two-dimensional patterns can be found in O(n/r) average time with constant memory, where r is the repetition size (the length of the longest repeated subword) of P.
Abstract: Given two strings, a pattern P of length m and a text T of length n, the string-matching problem is to find all occurrences of the pattern P in the text T. We present a simple string-matching algorithm which works in o(n) average time with constant additional space for one-dimensional texts and two-dimensional arrays. This is the first attempt at the small-space string-matching problem in which sublinear-time algorithms are delivered. More precisely, we show that all occurrences of one- or two-dimensional patterns can be found in O(n/r) average time with constant memory, where r is the repetition size (the length of the longest repeated subword) of P.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This paper addresses the problem of annotating a statistical index with such parameters as the expected value and variance of the number of occurrences of each substring.
Abstract: A statistical index for string x is a digital-search tree or trie that returns, for any query string ω and in a number of comparisons bounded by the length of ω, the number of occurrences of ω in x. Clever algorithms are available that support the construction and weighting of such indices in time and space linear in the length of x. This paper addresses the problem of annotating a statistical index with such parameters as the expected value and variance of the number of occurrences of each substring.
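A toy stand-in for such an index is sketched below: a dictionary of substring occurrence counts built in quadratic time rather than the linear-time trie construction the paper assumes, annotated with the expected count under a simple i.i.d. symbol model (my illustrative assumption; the paper's annotation also covers the variance).

```python
from collections import Counter, defaultdict

def occurrence_index(x, max_len=4):
    """Map every substring of x (up to max_len) to its number of occurrences."""
    index = defaultdict(int)
    for i in range(len(x)):
        for j in range(i + 1, min(i + max_len, len(x)) + 1):
            index[x[i:j]] += 1
    return index

def expected_occurrences(w, n, probs):
    """Expected count of w in an i.i.d. text of length n (a simple annotation example)."""
    p = 1.0
    for ch in w:
        p *= probs[ch]
    return (n - len(w) + 1) * p

if __name__ == "__main__":
    x = "abracadabra"
    idx = occurrence_index(x)
    probs = {ch: c / len(x) for ch, c in Counter(x).items()}
    for w in ("ab", "bra", "ra"):
        print(w, idx[w], round(expected_occurrences(w, len(x), probs), 3))
```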

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: The focus of this paper is on how these two algorithms can work together, for their combination is far more powerful than either alone and it is shown how they combine to generate the kind of structure sought in the original motivating example.
Abstract: In a wide variety of sequences from various sources, from music and text to DNA and computer programs, two different but related kinds of structure can be discerned. First, some segments tend to be repeated exactly, such as motifs in music, words or phrases in text, and identifiers and syntactic idioms in computer programs. Second, these segments interact with each other in variable but constrained ways. For example, in English text only certain syntactic word classes can appear after the word 'the'; many parts of speech (such as verbs) are necessarily excluded. This paper shows how these kinds of structure can be inferred automatically from sequences. We begin with an example that both illustrates the utility of inferring the kinds of structure we seek and shows what our techniques can do. Next we present an efficient and non-obvious algorithm for identifying exact repetitions, including nested repetitions, in time linear in the length of the sequence. Then we describe a very simple algorithm for identifying interactions between sequence elements. The focus of this paper is on how these two algorithms can work together, for their combination is far more powerful than either alone. We show how they combine to generate the kind of structure sought in the original motivating example. Although the two methods work well together on many simple examples, the results frequently conflict with intuition in the inference of branching structure. The minimum description length principle seems to provide the only satisfactory general approach.
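As a crude illustration of the first kind of structure, exact repetition, the sketch below finds the longest repeated substring by sorting suffixes; this is a quadratic toy baseline of mine, not the linear-time algorithm presented in the paper, and it says nothing about the second, interaction-modelling step.

```python
def longest_repeated_substring(s):
    """Longest substring occurring at least twice, via sorted suffixes (toy, not linear time)."""
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # Longest common prefix of two adjacent suffixes in sorted order.
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best

if __name__ == "__main__":
    print(longest_repeated_substring("sing a song of sixpence, sing a song of joy"))
```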

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: A new solution to the problem of an accurate choice of thresholds is presented; it is based on the concept of local contrast and exploits the localization properties of wavelets and a maximization of the entropy to find the optimal threshold for the wavelet coefficients.
Abstract: The paper addresses the problem of thresholding wavelet coefficients in a transform-based algorithm for still image compression. Processing data before the quantization phase is a crucial step in a compression algorithm, especially in applications which require high compression ratios. In the paper, after a review on the applications of wavelets to image compression, a new solution to the problem of an accurate choice of thresholds is presented. It is based on the concept of local contrast and exploits the localization properties of wavelets and a maximization of the entropy to find the optimal threshold for the wavelet coefficients. The results are compared with standard thresholding techniques which do not include considerations about local distribution of pixel information within the image. At the end, examples of compression are given, where the algorithm includes the complete processing of transform coefficients (thresholding, quantization and coding).
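A minimal illustration of thresholding in a wavelet domain is sketched below: a one-level 1D Haar transform with a single global hard threshold. The transform, signal, and threshold value are my assumptions for illustration; the paper's method instead chooses thresholds from local contrast and entropy maximization on image data.

```python
import numpy as np

def haar_1d(x):
    """One-level Haar transform: (averages, details) of an even-length signal."""
    x = np.asarray(x, dtype=float)
    avg = (x[0::2] + x[1::2]) / np.sqrt(2)
    det = (x[0::2] - x[1::2]) / np.sqrt(2)
    return avg, det

def inverse_haar_1d(avg, det):
    out = np.empty(2 * len(avg))
    out[0::2] = (avg + det) / np.sqrt(2)
    out[1::2] = (avg - det) / np.sqrt(2)
    return out

if __name__ == "__main__":
    t = np.linspace(0, 1, 256)
    signal = np.sin(8 * np.pi * t) + 0.05 * np.random.default_rng(0).standard_normal(256)
    avg, det = haar_1d(signal)
    threshold = 0.1                             # a fixed global threshold, for illustration
    det_thr = np.where(np.abs(det) > threshold, det, 0.0)
    rec = inverse_haar_1d(avg, det_thr)
    kept = int(np.count_nonzero(det_thr))
    print(f"kept {kept}/{len(det)} detail coefficients, "
          f"max error {np.max(np.abs(rec - signal)):.4f}")
```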

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: An algorithm is described that gives a progression of compressed versions of a single image: each stage of the progression is a lossy compression of the image, with the distortion decreasing at each stage, until the last image is losslessly compressed.
Abstract: We describe an algorithm that gives a progression of compressed versions of a single image. Each stage of the progression is a lossy compression of the image, with the distortion decreasing in each stage, until the last image is losslessly compressed. Progressive encodings are useful in applications such as Web browsing and multicast, where the best rate/distortion tradeoff often is not known in advance. With progressive encoding, the system can respond dynamically: for example, a low-quality version of an image is sufficient when a user wishes to browse quickly, or when a slow link is encountered in a multicast. Our algorithm assumes an initial vector quantization step which maps important information of an image, such as intensity values, into higher-order bits. The bit planes are then sent successively using a progressive Ziv-Lempel (1978) algorithm. We propose data structuring techniques for selectively coding only those entries in a Ziv-Lempel dictionary that are feasible matches, based on shared knowledge of the data transmitted in earlier stages. Our technique, when applied to sample images on the Web, gives significant improvements over interlaced GIF in both image quality and compression rate. Our progressive LZ algorithm runs in amortized linear time.
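The bit-plane step can be pictured with the fragment below: pixel values (here raw intensities, whereas the paper first applies a vector quantization that concentrates important information in the high-order bits) are split into planes from most to least significant, each of which would then be fed to the progressive Ziv-Lempel stage.

```python
import numpy as np

def bit_planes(image, bits=8):
    """Split an 8-bit image into bit planes, most significant first."""
    return [((image >> b) & 1).astype(np.uint8) for b in range(bits - 1, -1, -1)]

def reconstruct(planes, total_bits=8):
    """Progressively rebuild the image from however many (MSB-first) planes have arrived."""
    img = np.zeros_like(planes[0], dtype=np.uint16)
    for i, plane in enumerate(planes):
        img |= plane.astype(np.uint16) << (total_bits - 1 - i)
    return img

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(4, 4), dtype=np.uint8)
    planes = bit_planes(image)
    coarse = reconstruct(planes[:3])      # after 3 of 8 planes: a lossy preview
    full = reconstruct(planes)            # after all planes: lossless
    print(np.max(np.abs(image.astype(int) - coarse.astype(int))),
          np.array_equal(full, image))
```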

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This work shows another parallel decoding algorithm for LZ2 compression using the ID update heuristic, which works in O(log^2 n) time with O(n/log n) processors on an EREW PRAM, where n is the length of the output string.
Abstract: The LZ2 compression method seems hardly parallelizable since some related heuristics are known to be P-complete. In spite of this negative result, the decoding process can be parallelized efficiently for the next-character heuristic. We show another parallel decoding algorithm for LZ2 compression using the ID update heuristic. The algorithm works in O(log^2 n) time with O(n/log n) processors on an EREW PRAM, where n is the length of the output string.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: An extension of the SW algorithm using different prediction schemes in the zerotree mechanism is described, which leads to a significant improvement in the compression performance of SW.
Abstract: This paper describes an algorithm and a software package SW (Spherical Wavelets) that implements a method for compression of scalar functions defined on 3D objects. This method combines discrete second generation wavelet transforms with an extension of the embedded zerotree coding method. We present some results on optimizing the performance of the SW algorithm via the use of arithmetic coding, different scaling and norms of the wavelet coefficients. We describe an extension of the SW algorithm using different prediction schemes in the zerotree mechanism. The combined use of those techniques leads to a significant improvement of the compression performance of SW.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This work presents an efficient protocol that allows questions as to whether a person is in a list L to be reliably answered without compromising the data concerning anybody else; the solution has very strong privacy protection properties.
Abstract: Summary form only given. The issues of privacy and reliability of personal data are of paramount importance. If L is a list of people carrying some harmful defective gene, we want questions as to whether a person is in L to be reliably answered without compromising the data concerning anybody else. Reliability means that once the list is formed, nobody can play with the answer. Thus the answer should be checkable by the agent posing the question. We present an efficient protocol for this task. Our solution has very strong privacy protection properties.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: Dense coding is an enhanced variant of interval coding, where redundancies are mostly removed with a new technique called conditional coding, and achieves nearly the same compact code as arithmetic coding.
Abstract: With dense coding a new method for minimum redundancy coding is introduced. An analysis of arithmetic coding shows that it is essentially identical to an encoding of discrete intervals. Interval coding is introduced, which encodes symbols directly by encoding the corresponding discrete intervals. Dense coding is an enhanced variant of interval coding, where redundancies are mostly removed with a new technique called conditional coding. Conditional coding is at most 0.086071... bits per encoding step (0.057304... bits on average) longer than optimal encoding. Dense coding uses conditional coding twice and is therefore 0.114608... bits per encoding step worse than the theoretical limit (unlimited-precision arithmetic coding). Dense coding is a lot faster than arithmetic coding or Huffman coding and achieves nearly the same compact code as arithmetic coding.
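The "encoding of discrete intervals" view can be illustrated with a plain arithmetic-style interval subdivision in floating point, a toy sketch of mine with none of the discrete-interval bookkeeping, conditional coding, or speed advantages discussed in the paper.

```python
import math

def encode_interval(symbols, probs):
    """Map a symbol sequence to its subinterval of [0, 1) by repeated subdivision."""
    # Cumulative probabilities give each symbol a fixed slice of the current interval.
    cum, acc = {}, 0.0
    for s in sorted(probs):
        cum[s] = acc
        acc += probs[s]
    low, high = 0.0, 1.0
    for s in symbols:
        width = high - low
        high = low + width * (cum[s] + probs[s])
        low = low + width * cum[s]
    return low, high

if __name__ == "__main__":
    probs = {"a": 0.6, "b": 0.3, "c": 0.1}
    low, high = encode_interval("aabac", probs)
    # Any number inside [low, high) identifies the sequence; specifying one takes about
    # -log2(high - low) bits, the information content of the sequence.
    print(low, high, math.ceil(-math.log2(high - low)))
```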

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: By incorporating the proposed recursive WFA encoding techniques into the context-modelling-based nearly-lossless CALIC (context-based adaptive lossless image codec), the authors were able to increase its PSNR by 1.5 dB or more and to improve compression rates by 15 per cent or more over the original CALIC.
Abstract: We study high-fidelity image compression with a given tight bound on the maximum error magnitude. We propose a weighted finite automata (WFA) recursive encoding scheme on the adaptive-context-modelling-based quantized prediction residue images. By incorporating the proposed recursive WFA encoding techniques into the context-modelling-based nearly-lossless CALIC (context-based adaptive lossless image codec), we were able to increase its PSNR by 1.5 dB or more and to improve compression rates by 15 per cent or more over the original CALIC. By combining wavelet methods and WFA encoding, we were able to obtain competitive PSNR results against the best wavelet coders in both the L_2 and L_∞ metrics, while obtaining a much smaller maximum error magnitude than the latter.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: A very simple way is devised to distribute the blind trie data structure among the p processors so that the communication cost is balanced; spatial locality can possibly help in taking advantage of the bandwidth of routers.
Abstract: We have studied the worst-case complexity of the multi-string search problem in the bulk synchronous parallel (BSP) model (Valiant 1990). For this purpose, we have devised a very simple way to distribute the blind trie data structure among the p processors so that the communication cost is balanced. In the light of the very efficient algorithms and data structures known for external memory and the ones designed for the BSP model in this paper, it becomes a very challenging task to investigate the multi-string search problem in the parallel disk model (Vitter and Shriver, 1994), which combines I/O, computation and communication complexities. In this setting, it would also be interesting to study the dynamic version of the multi-string search problem, in which the set of indexed texts can be changed by inserting or deleting individual texts (Ferragina and Grossi 1995). Another interesting direction of research consists of investigating the multi-string search problem on some variants of the BSP model that have been previously introduced to encourage the use of spatial locality. In our setting, pieces of strings have to be moved among the processors to perform the lexicographic comparisons, and thus spatial locality can possibly help in taking advantage of the bandwidth of routers.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: To arrive at a model selection criterion with wider applicability, the present derivation relies upon results from information theory and the theory of rate-distortion.
Abstract: Rissanen (1978) proposed the idea that the goodness of fit of a parametric model of the probability density of a random variable could be thought of as an information coding problem. He argued that the best model was the one able to describe the training data together with the model parameters using the fewest bits of information (Occam's razor). This paper builds upon that basic insight and derives a more general result than did Rissanen, dealing as he was with time series analysis. To arrive at a model selection criterion with wider applicability, the present derivation relies upon results from information theory and rate-distortion theory.
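The two-part code idea can be made concrete with a small polynomial-fitting example: the total description length is the cost of the k parameters (roughly (k/2)·log2 n bits under the usual asymptotics) plus the cost of the residuals under a Gaussian model. The formula and data below are a generic MDL illustration of mine, not the more general criterion derived in the paper.

```python
import math
import numpy as np

def mdl_score(x, y, degree):
    """Two-part description length (in bits) of a polynomial fit of the given degree."""
    n = len(x)
    k = degree + 1                                   # number of parameters
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(float(np.mean(residuals ** 2)), 1e-12)
    data_bits = 0.5 * n * math.log2(2 * math.pi * math.e * sigma2)   # Gaussian code length
    model_bits = 0.5 * k * math.log2(n)                              # cost of the parameters
    return model_bits + data_bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 200)
    y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.1 * rng.standard_normal(200)  # true degree is 2
    for d in range(6):
        print(d, round(mdl_score(x, y, d), 1))       # score is minimized near the true degree
```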

Proceedings ArticleDOI
Serap A. Savari
11 Jun 1997-Sequence
TL;DR: A simple derivation of the asymptotic performance of the prefix condition code that minimizes the average transmission cost when the source symbols are equiprobable is provided.
Abstract: Renewal theory is a powerful tool in the analysis of source codes. In this paper, we use renewal theory to obtain some asymptotic properties of finite-state noiseless channels. We discuss the relationship between these results and earlier uses of renewal theory to analyze the Lempel-Ziv codes and the Tunstall code. As a new application of our results, we provide a simple derivation of the asymptotic performance of the prefix condition code that minimizes the average transmission cost when the source symbols are equiprobable.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: In this article, the authors investigate topological, combinatorial, statistical, and enumeration properties of finite graphs with high Kolmogorov complexity using the novel incompressibility method.
Abstract: We investigate topological, combinatorial, statistical, and enumeration properties of finite graphs with high Kolmogorov complexity (almost all graphs) using the novel incompressibility method. Example results are: (i) the mean and variance of the number of (possibly overlapping) ordered labeled subgraphs of a labeled graph as a function of its randomness deficiency and (ii) a new elementary proof for the number of unlabeled graphs.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: It is shown that the number of phrases created by Ziv/Lempel '78 parsing of a binary sequence and of its reversal can vary by a factor that grows at least as fast as the logarithm of the sequence length.
Abstract: We compare the number of phrases created by Ziv/Lempel '78 parsing of a binary sequence and of its reversal. We show that the two parsings can vary by a factor that grows at least as fast as the logarithm of the sequence length. We then show that under a suitable condition the factor can even become polynomial, and argue that the condition may not be necessary.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: The k-error protocol, a technique for protecting a dynamic dictionary method from error propagation as the result of any k errors on the communication channel or compressed file, is further developed, and experimental evidence is provided that this approach is highly effective in practice against a noisy channel or faulty storage medium.
Abstract: In earlier work we presented the k-error protocol, a technique for protecting a dynamic dictionary method from error propagation as the result of any k errors on the communication channel or compressed file. Here we further develop this approach and provide experimental evidence that this approach is highly effective in practice against a noisy channel or faulty storage medium. That is, for LZ2-based methods that "blow up" as a result of a single error, with the protocol in place, high error rates (with far more than the k errors for which the protocol was previously designed) can be sustained with no error propagation (the only corrupted bytes decoded are those that are part of the string represented by a pointer that was corrupted). Our experiments include the use of adaptive deletion, which can provide "insurance" for changing sources.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: The authors prove that the EBFC problem, as well as a number of its variants, are NP-complete, and identify another problem formalized as binary shift cut problem motivated by the fact that there might be missing fragments at the beginnings and/or the ends of the molecules, and prove it to be NP- complete.
Abstract: Optical mapping is a new technology for constructing restriction maps. Associated computational problems include aligning multiple partial restriction maps into a single "consensus" restriction map, and determining the correct orientation of each molecule, which was formalized as the exclusive binary flip cut (EBFC) problem by Muthukrishnan and Parida (see Proc. of the First ACM Conference on Computational Molecular Biology (RECOMB), Santa Fe, p.209-19, 1997). Here, the authors prove that the EBFC problem, as well as a number of its variants, are NP-complete. They also identify another problem formalized as binary shift cut (BSC) problem motivated by the fact that there might be missing fragments at the beginnings and/or the ends of the molecules, and prove it to be NP-complete. Therefore, they do not have efficient, that is, polynomial time solutions unless P=NP.