Tree compression and optimization with applications

International Journal of Foundations of Computer Science Vol. 1 No. 4 (1990), 425-447
World Scientific Publishing Company
TREE COMPRESSION AND OPTIMIZATION WITH APPLICATIONS
Dedicated to the memory of Markku Tamminen (1945-1989)
JYRKI KATAJAINEN
Department of Computer Science, University of Turku,
Lemminkäisenk. 14, SF-20520 Turku, Finland
and
ERKKI MÄKINEN
Department of Computer Science, University of Tampere
P.O. Box 607, SF-33101 Tampere, Finland
Received 20 October 1989
Revised 1 March 1990
Different methods for compressing trees are surveyed and developed. Tree compression can be
seen as a trade-off problem between time and space in which we can choose different strategies
depending on whether we prefer better compression results or more efficient operations in the
compressed structure. Of special interest is the case where space can be saved while preserving
the functionality of the operations; this is called data optimization. The general compression
scheme employed here consists of separate linearization of the tree structure and the data stored
in the tree. Also some applications of the tree compression methods are explored. These
include the syntax-directed compression of program files, the compression of pixel trees, trie
compaction and dictionaries maintained as implicit data structures.
Keywords: data optimization, linearization of tree structure, linearization of data,
syntax-directed compression, pixel tree, trie, implicit dictionary.
1. Introduction
Data compression is a standard operation for which there are well-known utilities,
e.g. compact and compress in UNIX^a. Traditionally, the input consists of a stream of
tokens, which can be bits, characters or words of fixed or varying length.
Concatenation of consecutive tokens is the structure in the input. However, the data to
be compressed is often stored within a structure of some special form, e.g. a list, tree,
graph, etc. The plain data is usually worthless or at least of little value outside its
context. Thus, besides the data itself, we must also store, and if possible compress, the
structure in which the data is stored. In this paper we suppose that the data structure to
be compressed and then manipulated is a tree. For a general survey of data
compression the reader is referred to the recent work by Lelewer and Hirschberg [1].
^a UNIX is a trademark of AT&T Bell Laboratories.
Trees are widely used structures for maintaining data. They are also used as
auxiliary data structures when compressing data (see e.g. [2,3]). Often it is also
necessary to reduce the space needed for storing a tree itself. In this paper different
tree compression methods are surveyed and developed. Trees are regarded both as the
object and medium of data compression. We mainly concentrate on binary trees but
also make some remarks on k-ary trees; these (ordered, rooted) trees cover most cases
of practical importance.
Let us now formulate the tree compression problem abstractly. Given a tree, the
task is to map it as compactly as possible to memory, which is seen as a string of bits,
a set of memory locations (words), or a set of memory blocks (pages). The range of
the mapping depends on the application in question. In traditional tree compression
the only operations performed are encoding of a tree to a bit string and decoding the
bit string back to a tree.
The benefits gained through data compression of large-to-very-large trees are
obvious, since compression reduces storage and data transfer requirements. On the
other hand, tree compression has some severe disadvantages. Above all,
compression makes all the normal tree operations (children, parent, search, delete,
insert, etc.) more expensive. In most compression methods the only way to
perform these operations is to decode the compressed tree, carry out the
operation, and encode the tree again.
In the tree optimization problem, a term adopted from the work of Jacobson [4],
the task is to maintain the functionality of a tree in the compressed form. That is, we
want to perform some tree operations as efficiently as done in the uncompressed case
(where an operation is a simple matter of pointer manipulation). We are mainly
concerned with applications where the trees are manipulated in the internal memory of
a computer. So, the mappings are to bit strings or memory words only. For
applications concerning external memories, see for example [5,6].
The compression and optimization of trees is usually performed in two phases: the
compression of the structure (linearization of the tree structure) and the compression
of the data stored in it (linearization of data). Our general policy is to handle the data
and the structure separately. This enables us to compress the plain data by using any
of the known methods and independently find an efficient coding method for the tree
structure irrespective of the form and contents of the data items stored in the nodes.
We shall not consider normal data compression methods but assume that the reader is
familiar with e.g. arithmetic coding [7] and Huffman coding [2]; for a recent textbook
on data compression, see [8]. We want to emphasize that the separation of the data
from the structure will not always give an optimal compression result (see the
applications in sections 6.1 and 8.2) and may not even be possible in some cases.

Appropriate linearization methods for binary trees are presented in chapters 2 and
3. We shall see that only about 2n bits are needed for the structure of any tree on n
nodes; this result is asymptotically optimal. In chapter 4 we attempt to find the
information theoretic optimum. The idea is to represent the structure of a tree by a
single natural number (called the rank of the tree) from the interval $1, \ldots, B_n$, where
$B_n$ stands for the $n$th Catalan number, giving the number of different binary trees on n
nodes. Chapter 5 deals with the encoding of k-ary trees.
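To make the gap between a 2n-bit codeword and the information theoretic optimum concrete, the following Python sketch computes $B_n$ from the standard closed form $\binom{2n}{n}/(n+1)$ and the number of bits needed to store a rank. The function names are illustrative and not taken from the paper, and ranks are assumed to be stored as fixed-width binary integers.

```python
from math import comb


def catalan(n: int) -> int:
    """Number of distinct binary trees on n nodes: C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def rank_bits(n: int) -> int:
    """Bits needed to tell apart the catalan(n) possible ranks.

    Assumption: a rank in 1..B_n is stored as the integer rank - 1
    in a fixed-width binary field.
    """
    return (catalan(n) - 1).bit_length()


if __name__ == "__main__":
    for n in (1, 5, 10, 100):
        print(n, catalan(n), rank_bits(n), 2 * n)
```

For n = 10 there are 16796 binary trees, so a rank fits in 15 bits, compared with 20 bits for a 2n-bit codeword.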
Typical application areas of the tree compression methods include the
representation of graphics as pixel trees [9-12] and program files as syntax trees
[13,14]. Both applications are of practical importance, and the tree methods support
excellent compression results. These applications are examined in greater detail in
chapter 6.
Chapter 7 contains a sketch of the basic ideas presented in [4] showing that tree
traversal is possible in asymptotically optimal space.
One can separate the concepts of concrete and abstract optimization. In concrete
optimization the data structure to be compressed is given while in abstract
optimization only the desired operations are defined. As data optimization
applications we study trie compaction [15-17] and the design of implicit data
structures [18,19] in chapter 8. Trie compaction is an example of concrete
optimization where the purpose of the data optimization is to implement the search
operation within the same time bound as for the uncompressed trie. The trie
compaction problem contains several NP-complete sub-problems. We consider
heuristics for solving one of them. On the other hand, the construction of an implicit
dictionary is an example of abstract optimization. Given only a constant amount of
extra space, the goal is to perform the operations search, insert, and delete as
efficiently as possible.
The paper closes with some concluding remarks in chapter 9.
2. Encoding with Fixed Length Codewords
Several encoding methods for binary trees are presented in the literature; see [20]
for a treatment of general tree types and [21] for a survey on binary trees. When
forming a one-to-one correspondence between binary trees and integers, these
methods use different kinds of number sequences as intermediate representations. The
present chapter and the next are devoted to encoding methods whose intermediate
phases have reasonable space requirements. We also suggest that these intermediate
phases can be used as compressed representations for tree structures. In fact, there are
several methods for forming a one-to-one correspondence between trees and integers
which are not presented in this paper because they employ space intensive
intermediate representations, e.g. permutations (see [21]).
We start by introducing an encoding method presented by Zaks [22]. Consider
the tree in figure 1a. Label all the nodes by 1 and all the missing subtrees by 0 as in
figure 1b. We obtain the codeword, called Zaks' sequence, by reading the labels in
preorder. (Visit first the root, then recursively traverse the left subtree in preorder, and
then the right subtree in preorder.) Hence, Zaks' sequence related to the tree in figure
1a is 111100100100111001000.
Fig. 1a. A binary tree. Fig. 1b. The numbering of nodes and leaves related to Zaks' sequences.
We have the following characterization for feasible Zaks' sequences. A bit string is
a Zaks' sequence if and only if the following three conditions hold:
i) the string begins with 1,
ii) the number of 0's is one greater than the number of 1's,
iii) no proper prefix of the string has property ii).
The length of a Zaks' sequence is 2n + 1 for a tree with n nodes [22].
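The encoding and the three conditions can be sketched in Python as follows. This is a minimal illustration, assuming a tree is written as a nested tuple (left, right) with None for a missing subtree; that representation and the names zaks and is_zaks are chosen here and are not part of the paper. The example tree is the shape obtained by decoding the sequence 111100100100111001000 quoted above.

```python
LEAF = (None, None)

# Shape decoded from the Zaks' sequence quoted in the text (10 nodes).
EXAMPLE = (((LEAF, LEAF), LEAF), ((LEAF, LEAF), None))


def zaks(tree) -> str:
    """Preorder labels: 1 for a node, 0 for a missing subtree."""
    if tree is None:
        return "0"
    left, right = tree
    return "1" + zaks(left) + zaks(right)


def is_zaks(bits: str) -> bool:
    """Check conditions i)-iii) in a single left-to-right scan."""
    if not bits or bits[0] != "1":          # condition i)
        return False
    excess = 0                              # (number of 0's) - (number of 1's)
    for i, bit in enumerate(bits):
        if bit not in ("0", "1"):
            return False
        excess += 1 if bit == "0" else -1
        if excess == 1:
            # Property ii) holds for this prefix; by iii) it must be
            # the whole string, not a proper prefix.
            return i == len(bits) - 1
    return False                            # condition ii) never reached


if __name__ == "__main__":
    code = zaks(EXAMPLE)
    print(code, len(code), is_zaks(code))
```

The scan keeps only one counter, so validity can be checked without rebuilding the tree.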
The children pattern sequence is a codeword closely related to Zaks' sequence. In
the children pattern method we label the nodes of a binary tree by 00, 01, 10 or 11
depending on whether the node has no children, only the right child, only the left child
or two children, respectively. The codeword is obtained by reading the labels in
preorder as in the Zaks' method. The children pattern sequence of the tree in figure 1a
is 11111100000010110000. Generally, the codeword obtained for a binary tree on n
nodes has length 2n.
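A minimal Python sketch of the children pattern method follows, again assuming the illustrative nested-tuple representation (left, right) with None for a missing subtree, which is not taken from the paper. The example tree is the shape decoded from the Zaks' sequence given earlier for the tree in figure 1a.

```python
LEAF = (None, None)

# Shape decoded from the Zaks' sequence quoted in the text (10 nodes).
EXAMPLE = (((LEAF, LEAF), LEAF), ((LEAF, LEAF), None))


def children_pattern(tree) -> str:
    """Preorder sequence of 2-bit labels.

    First bit: 1 if the node has a left child; second bit: 1 if it has
    a right child. So 00 = no children, 01 = only right, 10 = only left,
    11 = two children.
    """
    left, right = tree
    code = ("1" if left is not None else "0") + ("1" if right is not None else "0")
    for child in (left, right):
        if child is not None:
            code += children_pattern(child)
    return code


if __name__ == "__main__":
    print(children_pattern(EXAMPLE))
```

Unlike Zaks' sequence, no label is spent on missing subtrees, which is why exactly two bits per node suffice.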
Yet another method for representing a binary tree with 2n bits is to use balanced
parentheses. A pair of parentheses corresponds to the root and the children are
recursively represented in the same way inside the parentheses.
Decoding a tree structure from a given Zaks', children pattern, or balanced
parenthesis sequence is relatively straightforward, and for this reason we
exclude it from our treatment.
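For readers who want to see one such decoder, here is a hedged Python sketch for Zaks' sequences. It is a plain recursive-descent reader written for illustration, not an algorithm given in the paper, and it returns trees in an assumed nested-tuple form (left, right) with None for a missing subtree.

```python
def decode_zaks(bits: str):
    """Rebuild a nested-tuple tree from a Zaks' sequence.

    Raises ValueError if the string ends early or has trailing bits.
    """
    pos = 0

    def parse():
        nonlocal pos
        if pos >= len(bits):
            raise ValueError("sequence ends before the tree is complete")
        bit = bits[pos]
        pos += 1
        if bit == "0":          # missing subtree
            return None
        if bit != "1":
            raise ValueError("sequence must contain only 0 and 1")
        left = parse()          # preorder: left subtree first
        right = parse()
        return (left, right)

    tree = parse()
    if pos != len(bits):
        raise ValueError("trailing bits after a complete tree")
    return tree


if __name__ == "__main__":
    print(decode_zaks("111100100100111001000"))
```

The recursion depth equals the height of the tree, so a very deep tree would call for an explicit stack instead.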
In the above methods we traverse the trees in preorder. Lee et al. [23] have used
level-by-level order, i.e. first the root, then the children of the root from left to right,
then their children from left to right, and so on. The length of these level-by-level
sequences naturally equals the length of the codewords obtained by the methods
described above. Lee et al. [23] have found the codewords so obtained useful for
some special purposes. Moreover, in chapter 7 we shall see that the level-by-level
sequence allows tree traversal in compressed trees.
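As an illustration of level-by-level order, the sketch below emits the 2-bit children pattern labels while visiting the nodes breadth-first. Whether this matches the exact codeword of Lee et al. [23] is an assumption; the nested-tuple tree representation (left, right), with None for a missing subtree, is also chosen here for illustration only.

```python
from collections import deque

LEAF = (None, None)

# Shape decoded from the Zaks' sequence quoted in the text (10 nodes).
EXAMPLE = (((LEAF, LEAF), LEAF), ((LEAF, LEAF), None))


def level_order_pattern(tree) -> str:
    """Children pattern labels read level by level, left to right."""
    code = []
    queue = deque([tree])
    while queue:
        left, right = queue.popleft()
        code.append("1" if left is not None else "0")
        code.append("1" if right is not None else "0")
        # Children join the queue in left-to-right order.
        queue.extend(child for child in (left, right) if child is not None)
    return "".join(code)


if __name__ == "__main__":
    print(level_order_pattern(EXAMPLE))
```

The output has the same length, 2n bits, as the preorder children pattern sequence; only the order of the labels differs.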

We end this chapter by considering three types of binary trees which form a
hierarchy of the number of bits needed to represent their structure by using encodings
like Zaks' sequence.
A binary tree is said to be regular if each node has either two children or no
children at all. Consider now the children pattern sequence of a regular binary tree.
We do not need labels 01 and 10. Thus, we may label a node having two children with
label 1 and a node having no children with label 0. It follows that the children pattern
sequence of a regular binary tree has only n bits. It is in fact easy to prove that there
are exactly as many regular binary trees on 2n + 1 nodes as there are arbitrary binary
trees on n nodes. Hence, the space requirement of n bits is asymptotically optimal (cf.
chapter 4).
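The one-bit-per-node code and the counting claim can be checked with a short Python sketch. The nested-tuple representation (left, right), with None for a missing subtree, and the function names are assumptions made for illustration; count_regular simply enumerates shapes by the sizes of the two subtrees.

```python
from functools import lru_cache
from math import comb

LEAF = (None, None)


def regular_pattern(tree) -> str:
    """One bit per node in preorder: 1 = two children, 0 = no children."""
    left, right = tree
    if (left is None) != (right is None):
        raise ValueError("not a regular binary tree")
    if left is None:
        return "0"
    return "1" + regular_pattern(left) + regular_pattern(right)


@lru_cache(maxsize=None)
def count_regular(m: int) -> int:
    """Number of regular binary trees with exactly m nodes."""
    if m < 1 or m % 2 == 0:
        return 0
    if m == 1:
        return 1
    # Root plus a left subtree of k nodes and a right subtree of m - 1 - k nodes.
    return sum(count_regular(k) * count_regular(m - 1 - k) for k in range(1, m - 1))


if __name__ == "__main__":
    print(regular_pattern(((LEAF, LEAF), LEAF)))
    for n in range(6):
        print(n, count_regular(2 * n + 1), comb(2 * n, n) // (n + 1))
```

The two printed columns agree for the small values tried, consistent with the statement that regular binary trees on 2n + 1 nodes are counted by the same numbers as arbitrary binary trees on n nodes.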
The (almost) complete tree structure is an example of an even more drastic
instance of the above phenomenon. A binary tree is said to be complete if all the
internal nodes have two children and all the paths from leaves to the root are of equal
length. In such a tree the number of nodes is of the form $2^n - 1$. A binary tree is almost
complete if it can be made complete by inserting leaves at the right-hand end of
the bottom level. The only thing needed to describe the shape of the tree is the number
of nodes! Hence, $\log n$ bits are needed when representing the shape of an almost
complete tree on n nodes. This property is extensively used in implicit data structures
(cf. section 8.2).
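The following Python sketch shows why the node count alone fixes the shape. It assumes the familiar array layout in which nodes are numbered level by level from 0, so that the children of node i are 2i + 1 and 2i + 2; this numbering is a common convention and is not spelled out in the text above.

```python
def children(i: int, n: int) -> list:
    """Children of node i in the almost complete tree on n nodes.

    Nodes are numbered 0..n-1 level by level, left to right (assumed layout).
    """
    return [c for c in (2 * i + 1, 2 * i + 2) if c < n]


def parent(i: int):
    """Parent of node i, or None for the root."""
    return None if i == 0 else (i - 1) // 2


def shape_bits(n: int) -> int:
    """Bits needed to write the node count n in binary (n >= 1)."""
    return n.bit_length()


if __name__ == "__main__":
    n = 10
    for i in range(n):
        print(i, parent(i), children(i, n))
    print(shape_bits(n))
```

No pointers or structure bits are stored: every navigation step is index arithmetic on n, which is what makes the layout attractive for implicit data structures.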
One criterion for comparing different methods is how easy it is to detect from the
encoded string whether or not there are any regularities in the tree. In Zaks' sequence
the code information related to a node is in different parts of the string, while in the
children pattern method the two bits describing a node are always together. Suppose
we are compressing a regular tree without knowing its degree of regularity. By using
the children pattern method together with arithmetic coding we obtain a compression
result much better than 2 bits per node, provided that the compression model reads the
string as a sequence of 2-bit integers. Even better compression gain is obtained when the
children patterns are output in the level-by-level order and the tree to be compressed
is almost complete. The resulting string has a long sequence of 1's followed by
another sequence of 0's.
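The effect of reading the string as 2-bit symbols can be estimated without implementing an arithmetic coder. The sketch below computes the zeroth-order empirical entropy of the 2-bit symbols, which is the rate an ideal arithmetic coder with a static order-0 model would approach; the choice of this simple model is an assumption made here, not a detail from the paper.

```python
from collections import Counter
from math import log2


def entropy_bits_per_node(pattern: str) -> float:
    """Zeroth-order empirical entropy of the 2-bit symbols, in bits per node."""
    symbols = [pattern[i:i + 2] for i in range(0, len(pattern), 2)]
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Children pattern of a regular tree on 5 nodes: only 11 and 00 occur.
    print(entropy_bits_per_node("1111000000"))
    # Children pattern sequence quoted in the text for the tree in figure 1a.
    print(entropy_bits_per_node("11111100000010110000"))
```

In a regular tree only the symbols 11 and 00 occur, so the estimate cannot exceed 1 bit per node, compared with the 2 bits per node of the raw children pattern sequence.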
3. Encoding with Varying Length Codewords
In this chapter we introduce two encoding methods based on rotations. (For the
various ways of using rotations in maintaining data structures, see e.g. [24].) These
methods assign an integer to each node of the tree in question and the codeword is
obtained by traversing the tree in symmetric order (traverse first the left subtree in
symmetric order, then visit the root, and then traverse the right subtree in symmetric
order). The number of bits needed in the resulting codeword varies because the size of
the integers assigned to the nodes depends on the shape of the tree. Sometimes fewer
than 2n bits are needed for representing a given tree's shape.
