
Showing papers in "Sequence" in 1997


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.
Abstract: Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document. This paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using Rabin (1981) fingerprints.
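As a rough illustration of the resemblance and containment measures described above, the sketch below (not from the paper; the word-shingle size, the MD5 hashing, and the sample size are my assumptions) computes r(A, B) and c(A, B) from shingle sets and approximates r from a fixed-size min-wise sample, in the spirit of the random-sampling idea.

```python
import hashlib

def shingles(text, w=3):
    """Return the set of contiguous w-word shingles of a document."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=3):
    """r(A, B) = |S(A) & S(B)| / |S(A) | S(B)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)

def containment(a, b, w=3):
    """c(A, B) = |S(A) & S(B)| / |S(A)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa)

def min_sample(text, w=3, k=50):
    """Fixed-size sample: the k smallest hash values of the shingle set."""
    hs = sorted(int(hashlib.md5(s.encode()).hexdigest(), 16) for s in shingles(text, w))
    return set(hs[:k])

def estimated_resemblance(a, b, w=3, k=50):
    """Estimate r(A, B) from the fixed-size samples alone."""
    sa, sb = min_sample(a, w, k), min_sample(b, w, k)
    union_sample = set(sorted(sa | sb)[:k])   # k smallest hashes of the union
    return len(union_sample & sa & sb) / len(union_sample)

if __name__ == "__main__":
    A = "the quick brown fox jumps over the lazy dog " * 20
    B = "the quick brown fox leaps over the lazy dog " * 20
    print(resemblance(A, B), containment(A, B), estimated_resemblance(A, B))
```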

1,989 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: The authors isolate the most basic issues in molecular biological group testing and formulate a set of novel group testing problems for designing cost effective experiments.
Abstract: Group testing is a basic paradigm for experimental design. In computational biology, group testing problems come up in designing experiments with sequences for mapping, screening libraries, etc. While a great deal of classical research has been done on group testing over the last fifty years, the current biological applications bring up many new issues in group testing which had not been previously considered. The authors isolate the most basic issues in molecular biological group testing. Given these, they formulate a set of novel group testing problems for designing cost effective experiments. For some of these problems they give solutions, while leaving others open.
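The abstract does not spell out a particular design, but the following toy sketch (my own illustration, not the paper's formulation; pool sizes and counts are arbitrary) shows the basic group-testing economy: random pools are tested, any item appearing in a negative pool is cleared, and the survivors are the candidate positives.

```python
import random

def run_group_test(n=100, positives=(7, 42), num_pools=30, pool_size=20, seed=1):
    """Toy non-adaptive group testing: test pools, then clear items seen in negative pools."""
    rng = random.Random(seed)
    pools = [rng.sample(range(n), pool_size) for _ in range(num_pools)]
    outcomes = [any(item in positives for item in pool) for pool in pools]

    candidates = set(range(n))
    for pool, positive in zip(pools, outcomes):
        if not positive:                 # a negative pool clears all of its members
            candidates -= set(pool)
    return candidates                    # a superset of the true positives

if __name__ == "__main__":
    print(run_group_test())              # 30 pooled tests instead of 100 individual ones
```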

101 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: Questions related to counting and representing code and parse trees are discussed, along with variants of Huffman coding in which the assignment of 0s and 1s within codewords is significant, such as bidirectionality and synchronization.
Abstract: This paper surveys the theoretical literature on fixed-to-variable-length lossless source code trees, called code trees, and on variable-to-fixed-length lossless source code trees, called parse trees. In particular, the following code tree topics are outlined in this survey: characteristics of the Huffman (1952) code tree; Huffman-type coding for infinite source alphabets and universal coding; the Huffman problem subject to a lexicographic constraint, or the Hu-Tucker (1982) problem; the Huffman problem subject to maximum codeword length constraints; code trees which minimize other functions besides average codeword length; coding for unequal cost code symbols, or the Karp problem, and finite state channels; and variants of Huffman coding in which the assignment of 0s and 1s within codewords is significant, such as bidirectionality and synchronization. The literature on parse tree topics is less extensive. Treated here are: variants of Tunstall (1968) parsing; dualities between parsing and coding; dual tree coding in which parsing and coding are combined to yield variable-length-to-variable-length codes; and parsing and random number generation. Finally, questions related to counting and representing code and parse trees are also discussed.
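For reference, a minimal Huffman code-tree construction (the standard textbook algorithm, not anything specific to this survey) is sketched below; it merges the two least-frequent subtrees with a priority queue and reads the codewords off the root-to-leaf paths.

```python
import heapq

def huffman_code(freqs):
    """Build a binary Huffman code for a dict {symbol: frequency}."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate single-symbol alphabet
        return {s: "0" for s in freqs}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)     # two least-frequent subtrees
        f2, i, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
    return heap[0][2]

if __name__ == "__main__":
    print(huffman_code({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))
```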

84 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This work proposes the use of a signature-based technique to "shrink" the data sequences into signatures, and search the signatures instead of the real sequences, with further comparison being required only when a possible match is indicated.
Abstract: Jagadish et al. (see Proc. ACM SIGACT-SIGMOD-SIGART PODS, p.36-45, 1995) developed a general framework for posing queries based on similarity. The framework enables a formal definition of the notion of similarity for an application domain of choice, and then its use in queries to perform similarity-based search. We adapt this framework to the specialized domain of real-valued sequences, although some of the ideas we present are applicable to other types of data as well. In particular we focus on whole-match queries. By whole-match query we mean the case where the user has to specify the whole sequence. Similarity-based search can be computationally very expensive. The computation cost depends heavily on the length of the sequences being compared. To make such similarity testing feasible on large data sets, we propose the use of a signature-based technique. In a nutshell, our approach is to "shrink" the data sequences into signatures, and search the signatures instead of the real sequences, with further comparison being required only when a possible match is indicated. Being shorter, signatures can usually be compared much faster than the original sequences. In addition, signatures are usually easier to index. For such a signature-based technique to be effective one has to ensure that (1) the signature comparison is fast, and (2) the signature comparison gives few false alarms and no false dismissals. We obtain measures of goodness for our technique. The technique is illustrated with a couple of very different examples.
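To make the filter-then-verify idea concrete, here is a small sketch of mine (the specific signature, a coarse piecewise mean of the sequence, and the Euclidean threshold are assumptions, not the paper's definitions): signatures are compared first, and the full sequences are compared only when the signature test cannot rule out a match.

```python
import math

def signature(seq, k=8):
    """Shrink a real-valued sequence to k piecewise segment means (a toy signature)."""
    n = len(seq)
    sig = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        segment = seq[lo:hi] or [0.0]
        sig.append(sum(segment) / len(segment))
    return sig

def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def whole_match(query, database, eps):
    """Filter on signatures first, verify survivors on the full sequences."""
    q_sig = signature(query)
    hits = []
    for seq in database:
        # Averaging within segments can only shrink Euclidean distance, so a
        # signature distance above eps safely rules the sequence out.
        if dist(q_sig, signature(seq)) <= eps:
            if dist(query, seq) <= eps:          # the full, expensive comparison
                hits.append(seq)
    return hits

if __name__ == "__main__":
    db = [[math.sin(i / 10 + phase) for i in range(128)] for phase in (0.0, 0.05, 1.5)]
    print(len(whole_match(db[0], db, eps=1.0)))
```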

71 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: It is shown that the description of the alphabet can be separated from that of the sequence in such a way that the encoding of the actual sequence can be performed independently of the alphabet description, and sequential coding methods for such sequences are presented.
Abstract: For lossless universal source coding of memoryless sequences with an a priori unknown alphabet size (multialphabet coding), the alphabet of the sequence must be described as well as the sequence itself. Usually an efficient description of the alphabet can be made only by taking into account some additional information. We show that these descriptions can be separated in such a way that the encoding of the actual sequence can be performed independently of the alphabet description, and present sequential coding methods for such sequences. Such methods have applications in coding methods where the alphabet description is made available sequentially, such as PPM.

46 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: The frequency of approximate occurrences of the pattern H in a random text is studied when overlapping copies of the approximate pattern are counted separately; exact and asymptotic formulae for the mean, variance and probability of occurrence are provided, as well as asymptotic results including the central limit theorem and large deviations.
Abstract: Consider a given pattern H and a random text T generated according to the Bernoulli model. We study the frequency of approximate occurrences of the pattern H in a random text when overlapping copies of the approximate pattern are counted separately. We provide exact and asymptotic formulae for the mean, variance and probability of occurrence, as well as asymptotic results including the central limit theorem and large deviations. Our approach is combinatorial: we first construct language expressions that characterize pattern occurrences, translate them into generating functions, and finally use analytical methods to extract the asymptotic behaviour of the pattern frequency. Applications of these results include molecular biology, source coding, synchronization, wireless communications, approximate pattern matching, games, and stock market analysis. These findings are of particular interest to information theory (e.g., second-order properties of the relative frequency) and to molecular biology problems (e.g., finding patterns with unexpectedly high or low frequencies, and gene recognition).
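A brute-force check of the quantity being analyzed, counting overlapping approximate occurrences (here: within a given Hamming distance) of a pattern in Bernoulli-model text, might look like the sketch below; the mismatch model and the parameters are illustrative assumptions, not the paper's exact setting.

```python
import random

def count_approx_occurrences(text, pattern, max_mismatches):
    """Count overlapping windows of the text within max_mismatches of the pattern."""
    m = len(pattern)
    return sum(
        sum(a != b for a, b in zip(text[i:i + m], pattern)) <= max_mismatches
        for i in range(len(text) - m + 1)
    )

def bernoulli_text(n, p=0.5, seed=0):
    """Binary text with independent symbols, P('1') = p."""
    rng = random.Random(seed)
    return "".join("1" if rng.random() < p else "0" for _ in range(n))

if __name__ == "__main__":
    H = "10101"
    counts = [count_approx_occurrences(bernoulli_text(10_000, seed=s), H, 1)
              for s in range(20)]
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    print(mean, var)   # empirical mean/variance of the overlapping occurrence count
```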

34 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This work considers aspects of estimating conditional and unconditional densities in conjunction with Bayes-risk weighted vector quantization for joint compression and classification.
Abstract: The connection between compression and the estimation of probability distributions has long been known for the case of discrete alphabet sources and lossless coding. A universal lossless code which does a good job of compressing must implicitly also do a good job of modeling. In particular, with a collection of codebooks, one for each possible class or model, if codewords are chosen from among the ensemble of codebooks so as to minimize bit rate, then the codebook selected provides an implicit estimate of the underlying class. Less is known about the corresponding connections between lossy compression and continuous sources. We consider aspects of estimating conditional and unconditional densities in conjunction with Bayes-risk weighted vector quantization for joint compression and classification.

33 citations


Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: It is shown that the compression ratio of the Lempel-Ziv algorithms can be much higher than the zeroth-order entropy H_0 of the input string, and it is proved that for any string s the compression ratio achieved by LZ77 is bounded by 8H_0(s).
Abstract: We compare the compression ratio of the Lempel-Ziv algorithms with the empirical entropy of the input string. We show that although these algorithms are optimal according to the generally accepted definition, we can find families of low-entropy strings which are not compressed optimally. More precisely, we show that the compression ratio achieved by LZ78 (resp. LZ77) can be much higher than the zeroth-order entropy H_0 (resp. the first-order entropy H_1). We present a compression algorithm which combines LZ78 with run-length encoding, and we show that for any string s the new algorithm achieves a compression ratio bounded by 3H_0(s). Finally, we prove that for any string s the compression ratio achieved by LZ77 is bounded by 8H_0(s).
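To see the kind of gap the paper is about, one can compare the zeroth-order empirical entropy H_0(s) with the cost of a plain LZ78 parse on a low-entropy string. The sketch below is a toy comparison of mine (the per-phrase bit cost is a rough assumption), not the paper's construction or its modified algorithm.

```python
import math
from collections import Counter

def h0(s):
    """Zeroth-order empirical entropy in bits per symbol."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def lz78_phrases(s):
    """Number of phrases in the LZ78 incremental parse of s."""
    dictionary, phrase, count = {""}, "", 0
    for ch in s:
        phrase += ch
        if phrase not in dictionary:
            dictionary.add(phrase)
            count += 1
            phrase = ""
    if phrase:
        count += 1
    return count

if __name__ == "__main__":
    n = 100_000
    s = "a" * (n - 1) + "b"            # a very low zeroth-order entropy string
    c = lz78_phrases(s)
    # Rough cost model: each phrase takes about log2(c) bits for the back-reference
    # plus one literal byte.
    lz78_bits_per_symbol = c * (math.log2(c) + 8) / n
    print(f"H0 = {h0(s):.5f} bits/symbol, LZ78 ~ {lz78_bits_per_symbol:.3f} bits/symbol")
```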

33 citations


Proceedings ArticleDOI
Andrew Mayer, Moti Yung
11 Jun 1997-Sequence
TL;DR: Two basic primitives, generalized secret sharing and group-key distribution, are considered; it is proved that the two are related, and a reduction is given showing that group-key distribution implies secret sharing under pseudo-random functions (i.e., one-way functions).
Abstract: We relate two basic primitives: generalized secret sharing and group-key distribution. We suggest cryptographic implementations for both and show that they are provably secure according to exact definitions and assumptions given in the present paper. Both solutions require small secret space (namely, short keys). We first consider secret sharing with arbitrary access structures which is a basic primitive for controlling retrieval of secret information. We consider the computational security model, where cryptographic assumptions are allowed. Our design of a general secret-sharing scheme requires considerably less secure memory (i.e., shorter keys) than before. We then introduce the notion of a (single source) group-key distribution protocol which allows a center in an integrated network to securely and repeatedly send different keys to different groups. Such a capability is of increasing importance as it is a building block for secret information dissemination to various groups of participants in the presence of eavesdropping in a network environment. There are only a few previous investigations concerning this primitive and they either require a large amount of storage of secret information (due to their information theoretic security model) or lack rigorous definitions and proofs of security. We base both primitives on pseudo-random functions. We prove that the two are related; we give a reduction showing that group-key distribution implies secret-sharing under pseudo-random functions (i.e., one-way functions).

25 citations


Proceedings ArticleDOI
K. Sadakane
11 Jun 1997-Sequence
TL;DR: The asymptotic optimality of a variation of block sorting is proved and the relation among the RRC, context sorting, block sorting and PPM* is derived.
Abstract: A block sorting compression scheme was developed and its relation to a statistical scheme was studied, but a theoretical analysis of its performance has not been carried out fully. Context sorting is a compression scheme based on context similarity; it is regarded as an on-line version of block sorting and it is asymptotically optimal. However, its compression speed is slower and its real performance is not better. We propose a compression scheme using a recency rank code with context (RRC), which is based on context similarity. The proposed method encodes characters to recency ranks according to their contexts. It can be implemented using a suffix tree, and the recency rank code is realized by move-to-front transformation of edges in the suffix tree. It is faster than context sorting and is also asymptotically optimal. The performance is improved by changing models according to the length of the context and by combining some characters into a code. However, it is still inferior to block sorting in both performance and speed. We investigate the reason for the poor performance; we also prove the asymptotic optimality of a variation of block sorting and derive the relation among the RRC, context sorting, block sorting and PPM*.
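The recency-rank component can be illustrated with a plain move-to-front transform over the alphabet; this sketch of mine ignores the per-context modelling and the suffix-tree machinery that the paper actually uses.

```python
def mtf_encode(s, alphabet):
    """Replace each symbol by its recency rank and move it to the front."""
    table = list(alphabet)
    ranks = []
    for ch in s:
        r = table.index(ch)
        ranks.append(r)
        table.pop(r)
        table.insert(0, ch)          # the most recently seen symbol gets rank 0
    return ranks

def mtf_decode(ranks, alphabet):
    table = list(alphabet)
    out = []
    for r in ranks:
        ch = table.pop(r)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)

if __name__ == "__main__":
    s = "mississippi"
    alphabet = sorted(set(s))
    ranks = mtf_encode(s, alphabet)
    print(ranks)                      # runs of small ranks signal locality
    assert mtf_decode(ranks, alphabet) == s
```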

16 citations


Proceedings ArticleDOI
P.G. Howard
11 Jun 1997-Sequence
TL;DR: This paper provides three extensions to block Melcode (a coder based on interleaved run-length codes) that allow its use with multisymbol alphabets, allow its use with an extended class of prefix codes, and reduce its worst-case inefficiency by almost two thirds.
Abstract: The paper addresses several issues involved in interleaving compressed output from multiple non-prefix codes or from a combination of prefix and non-prefix codes. The technique used throughout is decoder-synchronized encoding, in which the encoder manipulates the data stream to allow just-in-time decoding. We provide three extensions to block Melcode (a coder based on interleaved run-length codes) that allow its use with multisymbol alphabets, allow its use with an extended class of prefix codes, and reduce its worst-case inefficiency by almost two thirds. We also show that it is possible to interleave output from an arithmetic coder with output from a prefix coder, such as a Huffman coder; we present an encoder back-end that handles all the details transparently, requiring only minor changes to the encoders and no changes to the decoders.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This work considers the problem of finding the longest common subsequence of two strings, and develops significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems.
Abstract: Measuring the similarity between two strings, through such standard measures as Hamming distance, edit distance, and longest common subsequence, is one of the fundamental problems in pattern matching. We consider the problem of finding the longest common subsequence of two strings. A well-known dynamic programming algorithm computes the longest common subsequence of strings X and Y in O(|X|·|Y|) time. We develop significantly faster algorithms for a special class of strings which emerge frequently in pattern matching problems. A string S is run-length encoded if it is described as an ordered sequence of pairs (σ, i), each consisting of an alphabet symbol σ and an integer i. Each pair corresponds to a run in S consisting of i consecutive occurrences of σ. For example, the string aaaabbbbcccabbbbcc can be encoded as a^4 b^4 c^3 a^1 b^4 c^2. Such a run-length encoded string can be significantly shorter than the expanded string representation. Indeed, run-length coding serves as a popular image compression technique, since many classes of images, such as binary images in facsimile transmission, typically contain large patches of identically-valued pixels.
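For concreteness, here is the run-length encoding from the example together with the standard O(|X|·|Y|) dynamic-programming LCS it is meant to beat; this is the baseline, not the faster run-length-aware algorithm developed in the paper.

```python
from itertools import groupby

def run_length_encode(s):
    """aaaabbbbcccabbbbcc -> [('a', 4), ('b', 4), ('c', 3), ('a', 1), ('b', 4), ('c', 2)]."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def lcs_length(x, y):
    """Classic O(|X|*|Y|) dynamic program on the expanded strings."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

if __name__ == "__main__":
    s = "aaaabbbbcccabbbbcc"
    print(run_length_encode(s))
    print(lcs_length(s, "aabbbcccbbcc"))
```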

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: It is shown that the converse does not hold, i.e., that there are sequences with perfectly balanced asymptotic statistics that the Lempel-Ziv algorithm compresses optimally.
Abstract: We consider the performance of the Lempel-Ziv (1978) algorithm on finite strings and infinite sequences having unbalanced statistics. We show that such strings and sequences are compressed by the Lempel-Ziv algorithm. We show that the converse does not hold, i.e., that there are sequences with perfectly balanced asymptotic statistics that the Lempel-Ziv algorithm compresses optimally.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This article makes the first attempt at the small-space string-matching problem in which sublinear-time algorithms are delivered, showing that all occurrences of one- or two-dimensional patterns can be found in O(n/r) average time with constant memory, where r is the repetition size (the length of the longest repeated subword) of P.
Abstract: Given two strings, a pattern P of length m and a text T of length n, the string-matching problem is to find all occurrences of the pattern P in the text T. We present a simple string-matching algorithm which works in o(n) average time with constant additional space for one-dimensional texts and two-dimensional arrays. This is the first attempt at the small-space string-matching problem in which sublinear-time algorithms are delivered. More precisely, we show that all occurrences of one- or two-dimensional patterns can be found in O(n/r) average time with constant memory, where r is the repetition size (the length of the longest repeated subword) of P.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This paper addresses the problem of annotating a statistical index with such parameters as the expected value and variance of the number of occurrences of each substring.
Abstract: A statistical index for string x is a digital-search tree or trie that returns, for any query string ω and in a number of comparisons bounded by the length of ω, the number of occurrences of ω in x. Clever algorithms are available that support the construction and weighting of such indices in time and space linear in the length of x. This paper addresses the problem of annotating a statistical index with such parameters as the expected value and variance of the number of occurrences of each substring.
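A toy stand-in for such an index is sketched below: a dictionary of substring occurrence counts built in quadratic time rather than the linear-time trie construction the paper assumes, annotated with the expected count under a simple i.i.d. symbol model (my illustrative assumption; the paper's annotation also covers the variance).

```python
from collections import Counter, defaultdict

def occurrence_index(x, max_len=4):
    """Map every substring of x (up to max_len) to its number of occurrences."""
    index = defaultdict(int)
    for i in range(len(x)):
        for j in range(i + 1, min(i + max_len, len(x)) + 1):
            index[x[i:j]] += 1
    return index

def expected_occurrences(w, n, probs):
    """Expected count of w in an i.i.d. text of length n (a simple annotation example)."""
    p = 1.0
    for ch in w:
        p *= probs[ch]
    return (n - len(w) + 1) * p

if __name__ == "__main__":
    x = "abracadabra"
    idx = occurrence_index(x)
    probs = {ch: c / len(x) for ch, c in Counter(x).items()}
    for w in ("ab", "bra", "ra"):
        print(w, idx[w], round(expected_occurrences(w, len(x), probs), 3))
```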

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: The focus of this paper is on how these two algorithms can work together, for their combination is far more powerful than either alone and it is shown how they combine to generate the kind of structure sought in the original motivating example.
Abstract: In a wide variety of sequences from various sources, from music and text to DNA and computer programs, two different but related kinds of structure can be discerned. First, some segments tend to be repeated exactly, such as motifs in music, words or phrases in text, and identifiers and syntactic idioms in computer programs. Second, these segments interact with each other in variable but constrained ways. For example, in English text only certain syntactic word classes can appear after the word 'the'; many parts of speech (such as verbs) are necessarily excluded. This paper shows how these kinds of structure can be inferred automatically from sequences. We begin with an example that both illustrates the utility of inferring the kinds of structure we seek and shows what our techniques can do. Next we present an efficient and non-obvious algorithm for identifying exact repetitions, including nested repetitions, in time linear in the length of the sequence. Then we describe a very simple algorithm for identifying interactions between sequence elements. The focus of this paper is on how these two algorithms can work together, for their combination is far more powerful than either alone. We show how they combine to generate the kind of structure sought in the original motivating example. Although the two methods work well together on many simple examples, the results frequently conflict with intuition in the inference of branching structure. The minimum description length principle seems to provide the only satisfactory general approach.
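As a crude illustration of the first kind of structure, exact repetition, the sketch below finds the longest repeated substring by sorting suffixes; this is a quadratic toy baseline of mine, not the linear-time algorithm presented in the paper, and it says nothing about the second, interaction-modelling step.

```python
def longest_repeated_substring(s):
    """Longest substring occurring at least twice, via sorted suffixes (toy, not linear time)."""
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # Longest common prefix of two adjacent suffixes in sorted order.
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best

if __name__ == "__main__":
    print(longest_repeated_substring("sing a song of sixpence, sing a song of joy"))
```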

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: A new solution to the problem of an accurate choice of thresholds is presented; it is based on the concept of local contrast and exploits the localization properties of wavelets and a maximization of the entropy to find the optimal threshold for the wavelet coefficients.
Abstract: The paper addresses the problem of thresholding wavelet coefficients in a transform-based algorithm for still image compression. Processing data before the quantization phase is a crucial step in a compression algorithm, especially in applications which require high compression ratios. In the paper, after a review on the applications of wavelets to image compression, a new solution to the problem of an accurate choice of thresholds is presented. It is based on the concept of local contrast and exploits the localization properties of wavelets and a maximization of the entropy to find the optimal threshold for the wavelet coefficients. The results are compared with standard thresholding techniques which do not include considerations about local distribution of pixel information within the image. At the end, examples of compression are given, where the algorithm includes the complete processing of transform coefficients (thresholding, quantization and coding).
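A minimal illustration of thresholding in a wavelet domain is sketched below: a one-level 1D Haar transform with a single global hard threshold. The transform, signal, and threshold value are my assumptions for illustration; the paper's method instead chooses thresholds from local contrast and entropy maximization on image data.

```python
import numpy as np

def haar_1d(x):
    """One-level Haar transform: (averages, details) of an even-length signal."""
    x = np.asarray(x, dtype=float)
    avg = (x[0::2] + x[1::2]) / np.sqrt(2)
    det = (x[0::2] - x[1::2]) / np.sqrt(2)
    return avg, det

def inverse_haar_1d(avg, det):
    out = np.empty(2 * len(avg))
    out[0::2] = (avg + det) / np.sqrt(2)
    out[1::2] = (avg - det) / np.sqrt(2)
    return out

if __name__ == "__main__":
    t = np.linspace(0, 1, 256)
    signal = np.sin(8 * np.pi * t) + 0.05 * np.random.default_rng(0).standard_normal(256)
    avg, det = haar_1d(signal)
    threshold = 0.1                             # a fixed global threshold, for illustration
    det_thr = np.where(np.abs(det) > threshold, det, 0.0)
    rec = inverse_haar_1d(avg, det_thr)
    kept = int(np.count_nonzero(det_thr))
    print(f"kept {kept}/{len(det)} detail coefficients, "
          f"max error {np.max(np.abs(rec - signal)):.4f}")
```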

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: An algorithm is described that gives a progression of compressed versions of a single image: each stage of the progression is a lossy compression of the image, with the distortion decreasing at each stage, until the last image is losslessly compressed.
Abstract: We describe an algorithm that gives a progression of compressed versions of a single image. Each stage of the progression is a lossy compression of the image, with the distortion decreasing in each stage, until the last image is losslessly compressed. Progressive encodings are useful in applications such as Web browsing and multicast, where the best rate/distortion tradeoff often is not known in advance. With progressive encoding, the system can respond dynamically: for example, a low-quality version of an image is sufficient when a user wishes to browse quickly, or when a slow link is encountered in a multicast. Our algorithm assumes an initial vector quantization step which maps important information of an image, such as intensity values, into higher-order bits. The bit planes are then sent successively using a progressive Ziv-Lempel (1978) algorithm. We propose data structuring techniques for selectively coding only those entries in a Ziv-Lempel dictionary that are feasible matches, based on shared knowledge of the data transmitted in earlier stages. Our technique, when applied to sample images on the Web, gives significant improvements over interlaced GIF in both image quality and compression rate. Our progressive LZ algorithm runs in amortized linear time.
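The bit-plane step can be pictured with the fragment below: pixel values (here raw intensities, whereas the paper first applies a vector quantization that concentrates important information in the high-order bits) are split into planes from most to least significant, each of which would then be fed to the progressive Ziv-Lempel stage.

```python
import numpy as np

def bit_planes(image, bits=8):
    """Split an 8-bit image into bit planes, most significant first."""
    return [((image >> b) & 1).astype(np.uint8) for b in range(bits - 1, -1, -1)]

def reconstruct(planes, total_bits=8):
    """Progressively rebuild the image from however many (MSB-first) planes have arrived."""
    img = np.zeros_like(planes[0], dtype=np.uint16)
    for i, plane in enumerate(planes):
        img |= plane.astype(np.uint16) << (total_bits - 1 - i)
    return img

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(4, 4), dtype=np.uint8)
    planes = bit_planes(image)
    coarse = reconstruct(planes[:3])      # after 3 of 8 planes: a lossy preview
    full = reconstruct(planes)            # after all planes: lossless
    print(np.max(np.abs(image.astype(int) - coarse.astype(int))),
          np.array_equal(full, image))
```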

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This work shows another parallel decoding algorithm for LZ2 compression using the ID update heuristic, which works in O(log^2 n) time with O(n/log n) processors on an EREW PRAM, where n is the length of the output string.
Abstract: The LZ2 compression method seems hardly parallelizable since some related heuristics are known to be P-complete. In spite of this negative result, the decoding process can be parallelized efficiently for the next-character heuristic. We show another parallel decoding algorithm for LZ2 compression using the ID update heuristic. The algorithm works in O(log^2 n) time with O(n/log n) processors on an EREW PRAM, where n is the length of the output string.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: An extension of the SW algorithm using different prediction schemes in the zerotree mechanism is described, which leads to a significant improvement in the compression performance of SW.
Abstract: This paper describes an algorithm and a software package SW (Spherical Wavelets) that implements a method for compression of scalar functions defined on 3D objects. This method combines discrete second generation wavelet transforms with an extension of the embedded zerotree coding method. We present some results on optimizing the performance of the SW algorithm via the use of arithmetic coding, different scaling and norms of the wavelet coefficients. We describe an extension of the SW algorithm using different prediction schemes in the zerotree mechanism. The combined use of those techniques leads to a significant improvement of the compression performance of SW.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: This work presents an efficient protocol that allows questions as to whether a person is in a list L to be reliably answered without compromising the data concerning anybody else; the solution has very strong privacy protection properties.
Abstract: Summary form only given. The issues of privacy and reliability of personal data are of paramount importance. If L is a list of people carrying some harmful defective gene, we want questions as to whether a person is in L to be reliably answered without compromising the data concerning anybody else. Reliability means that once the list is formed, nobody can play with the answer. Thus the answer should be checkable by the agent posing the question. We present an efficient protocol for this task. Our solution has very strong privacy protection properties.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: Dense coding is an enhanced variant of interval coding, where redundancies are mostly removed with a new technique called conditional coding, and achieves nearly the same compact code as arithmetic coding.
Abstract: With dense coding a new method for minimum redundancy coding is introduced. An analysis of arithmetic coding shows that it is essentially identical to an encoding of discrete intervals. Interval coding is introduced, which encodes symbols directly by encoding the corresponding discrete intervals. Dense coding is an enhanced variant of interval coding, where redundancies are mostly removed with a new technique called conditional coding. Conditional coding is at most 0.086071... bits per encoding step (0.057304... bits on average) longer than optimal encoding. Dense coding uses conditional coding twice and is therefore 0.114608... bits per encoding step worse than the theoretical limit (unlimited-precision arithmetic coding). Dense coding is a lot faster than arithmetic coding or Huffman coding and achieves nearly the same compact code as arithmetic coding.
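The "encoding of discrete intervals" view can be illustrated with a plain arithmetic-style interval subdivision in floating point, a toy sketch of mine with none of the discrete-interval bookkeeping, conditional coding, or speed advantages discussed in the paper.

```python
import math

def encode_interval(symbols, probs):
    """Map a symbol sequence to its subinterval of [0, 1) by repeated subdivision."""
    # Cumulative probabilities give each symbol a fixed slice of the current interval.
    cum, acc = {}, 0.0
    for s in sorted(probs):
        cum[s] = acc
        acc += probs[s]
    low, high = 0.0, 1.0
    for s in symbols:
        width = high - low
        high = low + width * (cum[s] + probs[s])
        low = low + width * cum[s]
    return low, high

if __name__ == "__main__":
    probs = {"a": 0.6, "b": 0.3, "c": 0.1}
    low, high = encode_interval("aabac", probs)
    # Any number inside [low, high) identifies the sequence; specifying one takes about
    # -log2(high - low) bits, the information content of the sequence.
    print(low, high, math.ceil(-math.log2(high - low)))
```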

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: By incorporating the proposed recursive WFA encoding techniques into the context-modelling-based nearly-lossless CALIC (context-based adaptive lossless image codec), the authors were able to increase its PSNR by 1.5 dB or more and to improve compression rates by 15 per cent or more over the original CALIC.
Abstract: We study high-fidelity image compression with a given tight bound on the maximum error magnitude. We propose a weighted finite automata (WFA) recursive encoding scheme on the adaptive-context-modelling-based quantized prediction residue images. By incorporating the proposed recursive WFA encoding techniques into the context-modelling-based nearly-lossless CALIC (context-based adaptive lossless image codec), we were able to increase its PSNR by 1.5 dB or more and to improve compression rates by 15 per cent or more over the original CALIC. By combining wavelet methods and WFA encoding, we were able to obtain competitive PSNR results against the best wavelet coders in both the L_2 and L_∞ metrics, while obtaining a much smaller maximum error magnitude than the latter.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: A very simple way is devised to distribute the blind trie data structure among the p processors so that the communication cost is balanced; spatial locality can possibly help in taking advantage of the bandwidth of routers.
Abstract: We have studied the worst-case complexity of the multi-string search problem in the bulk synchronous parallel (BSP) model (Valiant 1990). For this purpose, we have devised a very simple way to distribute the blind trie data structure among the p processors so that the communication cost is balanced. In the light of the very efficient algorithms and data structures known for external memory and the ones designed for the BSP model in this paper, it becomes a very challenging task to investigate the multi-string search problem in the parallel disk model (Vitter and Shriver, 1994), which combines I/O, computation and communication complexities. In this setting, it would also be interesting to study the dynamic version of the multi-string search problem, in which the set of indexed texts can be changed by inserting or deleting individual texts (Ferragina and Grossi 1995). Another interesting direction of research consists of investigating the multi-string search problem on some variants of the BSP model that have been previously introduced to encourage the use of spatial locality. In our setting, pieces of strings have to be moved among the processors to perform the lexicographic comparisons, and thus spatial locality can possibly help in taking advantage of the bandwidth of routers.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: To arrive at a model selection criterion with wider applicability, the present derivation relies upon results from information theory and the theory of rate-distortion.
Abstract: Rissanen (1978) proposed the idea that the goodness of fit of a parametric model of the probability density of a random variable could be thought of as an information coding problem. He argued that the best model was the one able to describe the training data together with the model parameters using the fewest bits of information (Occam's razor). This paper builds upon that basic insight and derives a more general result than did Rissanen, dealing as he was with time series analysis. To arrive at a model selection criterion with wider applicability, the present derivation relies upon results from information theory and rate-distortion theory.
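The two-part code idea can be made concrete with a small polynomial-fitting example: the total description length is the cost of the k parameters (roughly (k/2)·log2 n bits under the usual asymptotics) plus the cost of the residuals under a Gaussian model. The formula and data below are a generic MDL illustration of mine, not the more general criterion derived in the paper.

```python
import math
import numpy as np

def mdl_score(x, y, degree):
    """Two-part description length (in bits) of a polynomial fit of the given degree."""
    n = len(x)
    k = degree + 1                                   # number of parameters
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(float(np.mean(residuals ** 2)), 1e-12)
    data_bits = 0.5 * n * math.log2(2 * math.pi * math.e * sigma2)   # Gaussian code length
    model_bits = 0.5 * k * math.log2(n)                              # cost of the parameters
    return model_bits + data_bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 200)
    y = 1.0 + 2.0 * x - 3.0 * x ** 2 + 0.1 * rng.standard_normal(200)  # true degree is 2
    for d in range(6):
        print(d, round(mdl_score(x, y, d), 1))       # score is minimized near the true degree
```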

Proceedings ArticleDOI
Serap A. Savari
11 Jun 1997-Sequence
TL;DR: A simple derivation of the asymptotic performance of the prefix condition code that minimizes the average transmission cost when the source symbols are equiprobable is provided.
Abstract: Renewal theory is a powerful tool in the analysis of source codes. In this paper, we use renewal theory to obtain some asymptotic properties of finite-state noiseless channels. We discuss the relationship between these results and earlier uses of renewal theory to analyze the Lempel-Ziv codes and the Tunstall code. As a new application of our results, we provide a simple derivation of the asymptotic performance of the prefix condition code that minimizes the average transmission cost when the source symbols are equiprobable.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: In this article, the authors investigate topological, combinatorial, statistical, and enumeration properties of finite graphs with high Kolmogorov complexity using the novel incompressibility method.
Abstract: We investigate topological, combinatorial, statistical, and enumeration properties of finite graphs with high Kolmogorov complexity (almost all graphs) using the novel incompressibility method. Example results are: (i) the mean and variance of the number of (possibly overlapping) ordered labeled subgraphs of a labeled graph as a function of its randomness deficiency and (ii) a new elementary proof for the number of unlabeled graphs.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: It is shown that the number of phrases created by Ziv/Lempel '78 parsing of a binary sequence and of its reversal can vary by a factor that grows at least as fast as the logarithm of the sequence length.
Abstract: We compare the number of phrases created by Ziv/Lempel '78 parsing of a binary sequence and of its reversal. We show that the two parsings can vary by a factor that grows at least as fast as the logarithm of the sequence length. We then show that under a suitable condition the factor can even become polynomial, and argue that the condition may not be necessary.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: The k-error protocol, a technique for protecting a dynamic dictionary method from error propagation as the result of any k errors on the communication channel or compressed file, is further developed, and experimental evidence is provided that this approach is highly effective in practice against a noisy channel or faulty storage medium.
Abstract: In earlier work we presented the k-error protocol, a technique for protecting a dynamic dictionary method from error propagation as the result of any k errors on the communication channel or compressed file. Here we further develop this approach and provide experimental evidence that this approach is highly effective in practice against a noisy channel or faulty storage medium. That is, for LZ2-based methods that "blow up" as a result of a single error, with the protocol in place, high error rates (with far more than the k errors for which the protocol was previously designed) can be sustained with no error propagation (the only corrupted bytes decoded are those that are part of the string represented by a pointer that was corrupted). Our experiments include the use of adaptive deletion, which can provide "insurance" for changing sources.

Proceedings ArticleDOI
11 Jun 1997-Sequence
TL;DR: The authors prove that the EBFC problem, as well as a number of its variants, are NP-complete, and identify another problem formalized as binary shift cut problem motivated by the fact that there might be missing fragments at the beginnings and/or the ends of the molecules, and prove it to be NP- complete.
Abstract: Optical mapping is a new technology for constructing restriction maps. Associated computational problems include aligning multiple partial restriction maps into a single "consensus" restriction map, and determining the correct orientation of each molecule, which was formalized as the exclusive binary flip cut (EBFC) problem by Muthukrishnan and Parida (see Proc. of the First ACM Conference on Computational Molecular Biology (RECOMB), Santa Fe, p.209-19, 1997). Here, the authors prove that the EBFC problem, as well as a number of its variants, are NP-complete. They also identify another problem formalized as binary shift cut (BSC) problem motivated by the fact that there might be missing fragments at the beginnings and/or the ends of the molecules, and prove it to be NP-complete. Therefore, they do not have efficient, that is, polynomial time solutions unless P=NP.