
Showing papers on "Compressed pattern matching" published in 2017


Journal ArticleDOI
TL;DR: A new semistatic data compression model that has a fast coding process and allows compressed pattern matching is introduced; its compression ratio is better than that of ETDC only on a file written in Turkish.
Abstract: In this study, a new semistatic data compression model that has a fast coding process and that allows compressed pattern matching (CPM) is introduced. The proposed model is named the tagged word-based compression algorithm (TWBCA) because both its coding and its compressed matching are word based. The model has two phases. In the first phase, a dictionary is constructed by adding phrases while paying attention to word boundaries; in the second phase, compression is done by using the codewords of the phrases in this dictionary. The first byte of a codeword determines whether the word is compressed or not. Because of this rule, the CPM process can be conducted on a word basis. In addition, the proposed method makes it possible to search for a group of consecutive compressed words. Any existing pattern matching algorithm can be used as a black box for compressed pattern matching. The duration of the CPM process is always less than the duration of the same process on texts coded by the Gzip tool. While matching longer patterns, compressed pattern matching takes more time on the texts coded by compress and end-tagged dense code (ETDC). However, searching for shorter patterns takes less time on texts coded by our approach than on texts compressed with compress. Besides this, the compression ratio of our algorithm is better than that of ETDC only on a file written in Turkish. The compression performance of TWBCA is stable and does not vary by more than 6% across different text files.

4 citations
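
The idea of a tag that marks whether a word is compressed, combined with an off-the-shelf matcher used as a black box, can be illustrated with a toy sketch. This is not the TWBCA codeword format, which the abstract does not specify. It assumes ASCII input, one-byte codewords in the range 0x80 to 0xFF for words that occur at least twice, and literal words stored as raw ASCII followed by a space. The names `build_dictionary`, `encode`, `decode`, and `search` are hypothetical, and Python's built-in `bytes.find` stands in for the black-box matcher.

```python
from collections import Counter

SPACE = 0x20  # terminator for literal (uncompressed) words; an assumption of this toy format


def build_dictionary(text, max_entries=128, min_count=2):
    """Phase 1: give each frequent word a one-byte codeword in 0x80..0xFF."""
    counts = Counter(text.split())
    frequent = [w for w, c in counts.most_common() if c >= min_count][:max_entries]
    return {word: 0x80 + i for i, word in enumerate(frequent)}


def encode(text, codebook):
    """Phase 2: emit a codeword for dictionary words, raw ASCII plus a space otherwise."""
    out = bytearray()
    for word in text.split():
        if word in codebook:
            out.append(codebook[word])           # high bit set: compressed word
        else:
            out += word.encode("ascii")          # high bit clear: literal word
            out.append(SPACE)
    return bytes(out)


def decode(compressed, codebook):
    """Invert encode(); the first byte of each token tells which case applies."""
    reverse = {code: word for word, code in codebook.items()}
    words, literal = [], bytearray()
    for b in compressed:
        if b >= 0x80:
            words.append(reverse[b])
        elif b == SPACE:
            words.append(literal.decode("ascii"))
            literal.clear()
        else:
            literal.append(b)
    return " ".join(words)


def search(compressed, pattern, codebook):
    """Find word-aligned occurrences of pattern without decompressing."""
    needle = encode(pattern, codebook)
    hits = []
    pos = compressed.find(needle)                # black-box byte matcher
    while pos != -1:
        # A match is word-aligned if it starts the stream or follows a
        # codeword byte or the space that terminates a literal word.
        if pos == 0 or compressed[pos - 1] >= 0x80 or compressed[pos - 1] == SPACE:
            hits.append(pos)
        pos = compressed.find(needle, pos + 1)
    return hits


text = "the cat sat on the mat and the cat ran"
codebook = build_dictionary(text)
compressed = encode(text, codebook)
print(len(text), len(compressed))                # 38 24
print(search(compressed, "the cat", codebook))   # [0, 18]
```

Because codeword bytes and literal bytes occupy disjoint ranges in this toy format, a phrase of consecutive compressed words becomes a short byte string that any byte-level matcher can search for directly. The alignment check is only needed when the pattern starts with a literal word.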


Book ChapterDOI
26 Sep 2017
TL;DR: The notion of optimal skeleton trees is introduced and an algorithm for achieving such trees is investigated; the resulting more compact trees can be used to further enhance the time and space complexities of the corresponding algorithms.
Abstract: A skeleton Huffman tree is a Huffman tree from which all complete subtrees of depth \(h \ge 1\) have been pruned. Skeleton Huffman trees are used to save storage and enhance processing time in several applications such as decoding, compressed pattern matching and wavelet trees for random access. However, the straightforward way of basing the construction of a skeleton tree on a canonical Huffman tree does not necessarily yield the least number of nodes. The notion of optimal skeleton trees is introduced, and an algorithm for achieving such trees is investigated. The resulting more compact trees can be used to further enhance the time and space complexities of the corresponding algorithms.

4 citations
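
The pruning step in the definition above can be sketched as follows. This is a minimal illustration of plain skeleton-tree pruning on an ordinary Huffman tree, not the optimal-skeleton algorithm the chapter investigates. The `Node` class and the functions `huffman_tree`, `complete_height`, `skeleton`, and `count_nodes` are hypothetical names introduced here, and symbols are assumed to be strings.

```python
import heapq
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Node:
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    symbol: Optional[str] = None
    pruned_height: int = 0  # > 0 marks a skeleton leaf that replaced a complete subtree


def huffman_tree(freqs):
    """Build a Huffman tree from a {symbol: frequency} mapping."""
    tie = count()  # unique tie-breaker so Node objects are never compared
    heap = [(f, next(tie), Node(symbol=s)) for s, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), Node(left=a, right=b)))
    return heap[0][2]


def complete_height(node):
    """Return the height if the subtree is a complete binary tree, else None."""
    if node.symbol is not None:
        return 0
    lh, rh = complete_height(node.left), complete_height(node.right)
    return lh + 1 if lh is not None and lh == rh else None


def skeleton(node):
    """Replace every maximal complete subtree of depth >= 1 by a single leaf."""
    if node.symbol is not None:
        return Node(symbol=node.symbol)
    h = complete_height(node)  # recomputed per node for clarity, not for speed
    if h is not None:
        return Node(pruned_height=h)
    return Node(left=skeleton(node.left), right=skeleton(node.right))


def count_nodes(node):
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)


tree = huffman_tree({"a": 8, "b": 4, "c": 2, "d": 1, "e": 1})
print(count_nodes(tree), count_nodes(skeleton(tree)))  # 9 7
```

A skeleton leaf only needs to record the height of the subtree it replaced, since in a complete subtree all remaining codeword bits have the same length and can be consumed in one step. With a uniform distribution over four symbols, the whole tree is complete and this sketch collapses it to a single node.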



Proceedings ArticleDOI
01 Oct 2017
TL;DR: This paper explores a pattern matching technique that detects the presence of known patterns in compressed genomic data without decompressing it.
Abstract: Compressing genomic data for efficient storage, transmission and retrieval has been a challenge for biologists as well as computer scientists across the globe for the past decade. Researchers and scientists have settled on many methods to compress and store genomic data. The present challenge faced by the research and scientific community is the analysis of these compressed data. It is always possible to decompress the data and then do the analysis, but researchers do not consider this an efficient method because it nullifies the advantages of compressing the genomic data. Analyzing genomic data involves identifying the presence of microsatellites, tandem repeats, genes, etc. Pattern matching is an efficient way to detect the presence of a known pattern within a sequence. Compressed pattern matching attempts to detect the presence of known patterns in a compressed sequence without decompressing it. Compressed pattern matching has been successfully implemented for textual data. This paper attempts to explore a pattern matching technique in compressed genomic data without decompressing it.

1 citation