What are the contributions in this paper?

A survey of tree-based data compression methods can be found in this paper, where the authors focus on binary trees and k-ary trees.

How do the authors eliminate pointers in a linear quadtree?

In linear quadtrees, pointers are eliminated by storing only the black pixels, using an encoding which reflects the successive quadrant subdivisions.

What is the alternative to simple sequential storage?

A good alternative to simple sequential storage is to use pixel trees which try to divide the picture into uniform areas, where adjacent pixels have the same colour, and to hierarchically organize these areas [9,10,12,38].

What is the way to save space and allow efficient traversal in binary trees?

A straightforward method for saving space and allowing efficient traversal in binary trees is to "thread" the tree so that isomorphic subtrees are stored only once.

Why is the cost of a search in each of these structures dominated by the largest?

Due to the double exponential growth of the sizes, the cost of a search in each of these structures is dominated by that of the largest.

What is the heuristic for calculating the harmonic decay property?

When the resulting matrix fulfilling the harmonic decay property is compressed by the first-fit method, a structure with space requirement O(n log log n) is obtained.

What is the simplest method to encode the production label of a rule?

One of the simplest methods encodes the production label of a rule, whose left-hand side non-terminal occurs on the left-hand side of r rules by log2 r bits that will uniquely specify the production which has been used in the substitution of the non-terminal.

What is the common way to search for a string in a trie?

It can be (1) null, (2) a pointer to an auxiliary table containing the strings currently represented by the trie, or (3) a pointer to another node in the trie.

How can the authors make a tree traversal more efficient?

The authors omit the details concerning the use of the directories; however, the results show that it is possible to construct a structure which uses 2n + o(n) space and in which a tree traversal requires O(log n) bit-accesses, or O(1) time if O(log n) consecutive bits can be manipulated at unit cost.

What is the problem of minimizing the number of sets needed for binary search?

Another problem is to minimize the worst-case binary search time under the restriction that the number of sets does not exceed a given bound.

(Open Access) Tree compression and optimization with applications (1990) | Jyrki Katajainen

International Journal of Foundations of Computer Science Vol. 1 No. 4 (1990), 425-447



World Scientific Publishing Company

TREE COMPRESSION AND OPTIMIZATION WITH APPLICATIONS

Dedicated to the memory of Markku Tamminen (1945-1989)

JYRKI KATAJAINEN

Department of Computer Science, University of Turku,

Lemminkäisenk. 14, SF-20520 Turku, Finland

and

ERKKI MÄKINEN

Department of Computer Science, University of Tampere

P.O. Box 607, SF-33101 Tampere, Finland

Received 20 October 1989

Revised 1 March 1990

Different methods for compressing trees are surveyed and developed. Tree compression can be

seen as a trade-off problem between time and space in which we can choose different strategies

depending on whether we prefer better compression results or more efficient operations in the

compressed structure. Of special interest is the case where space can be saved while preserving

the functionality of the operations; this is called data optimization. The general compression

scheme employed here consists of separate linearization of the tree structure and the data stored

in the tree. Also some applications of the tree compression methods are explored. These

include the syntax-directed compression of program files, the compression of pixel trees, trie

compaction and dictionaries maintained as implicit data structures.

Keywords: data optimization, linearization of tree structure, linearization of data, syntax-directed compression, pixel tree, trie, implicit dictionary.

1. Introduction

Data compression is a standard operation for which there are well-known utilities,

e.g. compact and compress in UNIX. Traditionally, input consists of a stream of

tokens which can be bits, characters or words of fixed or varying length.

Concatenation of consecutive tokens is the structure in the input. However, the data to

be compressed is often stored within a structure of some special form, e.g. list, tree,

graph, etc. The plain data is usually worthless or at least of little value outside its

context. Thus, besides the data itself, we must also store, and compress if possible, the

structure in which the data is stored. In this paper we suppose that the data structure to

be compressed and then manipulated is a tree. For a general survey of data

compression the reader is referred to the recent work by Lelewer and Hirschberg [1].

UNIX is a trademark of AT&T Bell Laboratories.

Trees are widely used structures for maintaining data. They are also used as

auxiliary data structures when compressing data (see e.g. [2,3]). Often it is also

necessary to reduce the space needed for storing a tree itself. In this paper different

tree compression methods are surveyed and developed. Trees are regarded both as the

object and medium of data compression. We mainly concentrate on binary trees but

also make some remarks on k-ary trees; these (ordered, rooted) trees cover most cases

of practical importance.

Let us now formulate the tree compression problem abstractly. Given a tree, the

task is to map it as compactly as possible to memory, which is seen as a string of bits,

a set of memory locations (words), or a set of memory blocks (pages). The range of

the mapping depends on the application in question. In traditional tree compression

the only operations performed are encoding of a tree to a bit string and decoding the

bit string back to a tree.

The benefits gained through data compression of large-to-very-large trees are

obvious since compression reduces storage and data transfer requirements. On the

other hand, there are some severe disadvantages of tree compression. Above all,

compression makes all the normal tree operations (children, parent, search, delete,

insert, etc.) more expensive. In most compression methods there is no way to

perform these operations other than to decode the compressed tree, carry out the

operation, and encode the tree again!

In the tree optimization problem, a term adopted from the work of Jacobson [4],

the task is to maintain the functionality of a tree in the compressed form. That is, we

want to perform some tree operations as efficiently as done in the uncompressed case

(where an operation is a simple matter of pointer manipulation). We are mainly

concerned with applications where the trees are manipulated in the internal memory of

a computer. So, the mappings are to bit strings or memory words only. For

applications concerning external memories, see for example [5,6].

The compression and optimization of trees is usually performed in two phases: the

compression of the structure (linearization of the tree structure) and the compression

of the data stored in it (linearization of data). Our general policy is to handle the data

and the structure separately. This enables us to compress the plain data by using any

of the known methods and independently find an efficient coding method for the tree

structure irrespective of the form and contents of the data items stored in the nodes.

We shall not consider normal data compression methods but assume that the reader is

familiar with e.g. arithmetic coding [7] and Huffman coding [2]; for a recent textbook

on data compression, see [8]. We want to emphasize that the separation of the data

from the structure will not always give an optimal compression result (see the

applications in sections 6.1 and 8.2) and may not even be possible in some cases.

Appropriate linearization methods for binary trees are presented in chapters 2 and

3. We shall see that only about 2n bits are needed for the structure of any tree on n

nodes; this result is asymptotically optimal. In chapter 4 we attempt to find the

information theoretic optimum. The idea is to represent the structure of a tree by a

single natural number (called the rank of the tree) from the interval $1, \ldots, B_n$, where

$B_n$ stands for the $n$th Catalan number giving the number of different binary trees on $n$

nodes. Chapter 5 deals with the encoding of k-ary trees.
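To make the gap between the roughly 2n-bit codewords and the information theoretic optimum concrete, the following sketch computes the Catalan numbers from the standard closed form and the number of bits a fixed-width rank would need. The function names `catalan` and `rank_bits` are illustrative and do not come from the paper.

```python
from math import comb


def catalan(n):
    """Number of different binary trees on n nodes (the n-th Catalan number)."""
    return comb(2 * n, n) // (n + 1)


def rank_bits(n):
    """Bits needed to store a rank from 1..B_n as the fixed-width number rank - 1."""
    return (catalan(n) - 1).bit_length()


if __name__ == "__main__":
    for n in (1, 10, 100):
        # Columns: n, bits for the rank, bits for a 2n-bit codeword.
        print(n, rank_bits(n), 2 * n)
```

For n = 10 there are 16796 trees, so 15 bits suffice for the rank, compared with 20 or 21 bits for the codewords of chapter 2. The saving grows only logarithmically in n, which is why 2n bits are asymptotically optimal.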

Typical application areas of the tree compression methods include the

representation of graphics as pixel trees [9-12] and program files as syntax trees

[13,14]. Both applications are of practical importance, and the tree methods support

excellent compression results. These applications are examined in greater detail in

chapter 6.

Chapter 7 contains a sketch of the basic ideas presented in [4] showing that tree

traversal is possible in asymptotically optimal space.

One can separate the concepts of concrete and abstract optimization. In concrete

optimization the data structure to be compressed is given while in abstract

optimization only the desired operations are defined. As data optimization

applications we study trie compaction [15-17] and the design of implicit data

structures [18,19] in chapter 8. Trie compaction is an example of concrete

optimization where the purpose of the data optimization is to implement the search

operation within the same time bound as for the uncompressed trie. The trie

compaction problem contains several NP-complete sub-problems. We consider

heuristics for solving one of them. On the other hand, the construction of an implicit

dictionary is an example of abstract optimization. Given only a constant amount of

extra space, the goal is to perform the operations search, insert, and delete as

efficiently as possible.

The paper closes with some concluding remarks in chapter 9.

2. Encoding with Fixed Length Codewords

Several encoding methods for binary trees are presented in the literature; see [20]

for a treatment of general tree types and [21] for a survey on binary trees. When

forming a one-to-one correspondence between binary trees and integers, these

methods use different kinds of number sequences as intermediate representations. The

present chapter and the next are devoted to encoding methods whose intermediate

phases have reasonable space requirements. We also suggest that these intermediate

phases can be used as compressed representations for tree structures. In fact, there are

several methods for forming a one-to-one correspondence between trees and integers

which are not presented in this paper because they employ space intensive

intermediate representations, e.g. permutations (see [21]).

We start by introducing an encoding method presented by Zaks [22]. Consider

the tree in figure 1a. Label all the nodes by 1 and all the missing subtrees by 0 as in

figure 1b. We obtain the codeword, called Zaks' sequence, by reading the labels in

preorder. (Visit first the root, then recursively traverse the left subtree in preorder, and

then the right subtree in preorder.) Hence, Zaks' sequence related to the tree in figure

1a is 111100100100111001000.

Fig. 1a. A binary tree. Fig. 1b. The numbering of nodes and leaves related to Zaks' sequences.
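The labelling and preorder reading can be sketched in a few lines. The representation below (a node is a pair `(left, right)`, a missing subtree is `None`) and the names `LEAF`, `FIG_1A` and `zaks_sequence` are assumptions made for illustration. The tree shape is the one obtained by decoding the codeword quoted in the text, which is assumed to match figure 1a.

```python
# A node is a pair (left, right); a missing subtree is None.
LEAF = (None, None)

# Shape decoded from the codeword quoted in the text (assumed to be figure 1a).
FIG_1A = (((LEAF, LEAF), LEAF), ((LEAF, LEAF), None))


def zaks_sequence(tree):
    """Label nodes 1 and missing subtrees 0, and read the labels in preorder."""
    if tree is None:
        return "0"
    left, right = tree
    return "1" + zaks_sequence(left) + zaks_sequence(right)


if __name__ == "__main__":
    print(zaks_sequence(FIG_1A))  # 111100100100111001000
```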

We have the following characterization for feasible Zaks' sequences. A bit string is

a Zaks' sequence if and only if the following three conditions hold:

i) the string begins with 1,

ii) the number of 0's is one greater than the number of 1's,

iii) no proper prefix of the string has property ii).

The length of a Zaks' sequence is 2n + 1 for a tree with n nodes [22].
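The three conditions translate directly into a single left-to-right scan. The sketch below keeps a running count of 0's minus 1's; the surplus may reach one only at the very last position. The function name `is_zaks_sequence` is an illustrative choice.

```python
def is_zaks_sequence(bits):
    """Check conditions i)-iii) for a string over the alphabet {0, 1}."""
    if not bits or bits[0] != "1":
        return False  # condition i) fails
    surplus = 0  # number of 0's minus number of 1's seen so far
    for position, bit in enumerate(bits):
        if bit == "0":
            surplus += 1
        elif bit == "1":
            surplus -= 1
        else:
            return False  # not a bit string
        if surplus == 1:
            # Condition ii) holds for this prefix; by condition iii) the
            # prefix must be the whole string.
            return position == len(bits) - 1
    return False  # condition ii) never holds


if __name__ == "__main__":
    print(is_zaks_sequence("111100100100111001000"))  # True
    print(is_zaks_sequence("1000"))                   # False
```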

The children pattern sequence is a codeword closely related to Zaks' sequence. In

the children pattern method we label the nodes of a binary tree by 00, 01, 10 or 11

depending on whether the node has no children, only the right child, only the left child

or two children, respectively. The codeword is obtained by reading the labels in

preorder as in the Zaks' method. The children pattern sequence of the tree in figure 1a

is 11111100000010110000. Generally, the codeword obtained for a binary tree on n

nodes has length 2n.
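A minimal sketch of the children pattern method follows, using the same assumed representation as elsewhere in these examples: a node is a pair `(left, right)` and a missing subtree is `None`. The tree `FIG_1A` is the shape decoded from the Zaks' sequence quoted earlier, assumed to match figure 1a.

```python
LEAF = (None, None)
# Shape decoded from the Zaks' sequence in the text (assumed to be figure 1a).
FIG_1A = (((LEAF, LEAF), LEAF), ((LEAF, LEAF), None))


def children_pattern(tree):
    """Emit two bits per node in preorder: left child present, right child present."""
    left, right = tree
    code = ("1" if left is not None else "0") + ("1" if right is not None else "0")
    for child in (left, right):
        if child is not None:
            code += children_pattern(child)
    return code


if __name__ == "__main__":
    print(children_pattern(FIG_1A))  # 11111100000010110000
```

Unlike Zaks' sequence, missing subtrees are never visited, which is where the saving of one bit comes from.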

Yet another method for representing a binary tree with 2n bits is to use balanced

parentheses. A pair of parentheses corresponds to the root and the children are

recursively represented in the same way inside the parentheses.
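One caveat when applying parentheses to binary trees: if both children are simply written inside the parent's pair, a node with only a left child and a node with only a right child produce the same string. The sketch below therefore uses a different, well-known variant, not necessarily the one the authors have in mind, in which the left subtree goes inside the pair and the right subtree follows it. This is one-to-one and still uses exactly 2n symbols. The tuple representation and the name `paren_sequence` are assumptions.

```python
LEAF = (None, None)
# Shape decoded from the Zaks' sequence in the text (assumed to be figure 1a).
FIG_1A = (((LEAF, LEAF), LEAF), ((LEAF, LEAF), None))


def paren_sequence(tree):
    """Encode a binary tree as '(' + left + ')' + right; empty tree is ''."""
    if tree is None:
        return ""
    left, right = tree
    return "(" + paren_sequence(left) + ")" + paren_sequence(right)


if __name__ == "__main__":
    print(paren_sequence((LEAF, None)))  # (())
    print(paren_sequence((None, LEAF)))  # ()()
    print(len(paren_sequence(FIG_1A)))   # 20
```

Writing `(` as 1 and `)` as 0 gives a 2n-bit codeword.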

To decode a tree structure from a given Zaks', children pattern, or balanced

parenthesis sequence is a relatively straightforward task and for this reason we

exclude it from our treatment.
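For readers who want to see how simple the decoding is, here is a hedged sketch for Zaks' sequences only. It is a plain recursive-descent reading of the bits, not an algorithm taken from the paper. The output representation (a node is a pair `(left, right)`, a missing subtree is `None`) and the name `decode_zaks` are assumptions.

```python
def decode_zaks(bits):
    """Rebuild a tree from a Zaks' sequence by reading it in preorder."""
    position = 0

    def parse():
        nonlocal position
        if position >= len(bits):
            raise ValueError("sequence ends too early")
        bit = bits[position]
        position += 1
        if bit == "0":
            return None  # missing subtree
        left = parse()
        right = parse()
        return (left, right)

    tree = parse()
    if position != len(bits):
        raise ValueError("trailing bits after a complete tree")
    return tree


if __name__ == "__main__":
    print(decode_zaks("100"))  # (None, None)
```

Very deep trees would need an explicit stack instead of recursion, since Python limits the recursion depth.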

In the above methods we traverse the trees in preorder. Lee et al. [23] have used

level-by-level order, i.e. first the root, then the children of the root from left to right,

then their children from left to right, and so on. The length of these level-by-level

sequences naturally equals the length of the codewords obtained by the methods

described above. Lee et al. [23] have found the codewords so obtained useful for

some special purposes. Moreover, in chapter 7 we shall see that the level-by-level

sequence allows tree traversal in compressed trees.
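A level-by-level codeword can be produced with a queue instead of recursion. The sketch below assumes that the two-bit children pattern labels are the symbols being emitted; the exact codeword format of Lee et al. is not given here, so this is an illustration of the traversal order rather than of their method. The tuple representation and the name `level_order_pattern` are assumptions.

```python
from collections import deque

LEAF = (None, None)
# Shape decoded from the Zaks' sequence in the text (assumed to be figure 1a).
FIG_1A = (((LEAF, LEAF), LEAF), ((LEAF, LEAF), None))


def level_order_pattern(tree):
    """Emit the children pattern of every node, level by level, left to right."""
    queue = deque([tree])
    labels = []
    while queue:
        left, right = queue.popleft()
        labels.append(("1" if left is not None else "0")
                      + ("1" if right is not None else "0"))
        for child in (left, right):
            if child is not None:
                queue.append(child)
    return "".join(labels)


if __name__ == "__main__":
    print(level_order_pattern(FIG_1A))  # 11111011001100000000
```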

We end this chapter by considering three types of binary trees which form a

hierarchy of the number of bits needed to represent their structure by using encodings

like Zaks' sequence.

A binary tree is said to be regular if each node has either two children or no

children at all. Consider now the children pattern sequence of a regular binary tree.

We do not need labels 01 and 10. Thus, we may label a node having two children with

label 1 and a node having no children with label 0. It follows that the children pattern

sequence of a regular binary tree has only n bits. It is in fact easy to prove that there

are exactly as many regular binary trees on 2n + 1 nodes as there are arbitrary binary

trees on n nodes. Hence, the space requirement of n bits is asymptotically optimal (cf.

chapter 4).
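Both claims can be checked on small cases. The sketch below encodes a regular tree with one bit per node and counts trees by brute-force recurrences, confirming for small n that regular trees on 2n + 1 nodes are as numerous as arbitrary binary trees on n nodes. All names and the tuple representation (a node is a pair `(left, right)`, a missing subtree is `None`) are illustrative assumptions.

```python
from functools import lru_cache

LEAF = (None, None)


def regular_pattern(tree):
    """One bit per node in preorder: 1 for two children, 0 for none."""
    left, right = tree
    if left is None and right is None:
        return "0"
    if left is None or right is None:
        raise ValueError("tree is not regular")
    return "1" + regular_pattern(left) + regular_pattern(right)


@lru_cache(maxsize=None)
def count_binary(n):
    """Arbitrary binary trees on n nodes: choose the size of the left subtree."""
    if n == 0:
        return 1
    return sum(count_binary(k) * count_binary(n - 1 - k) for k in range(n))


@lru_cache(maxsize=None)
def count_regular(m):
    """Regular binary trees on m >= 1 nodes: both subtrees must be nonempty."""
    if m == 1:
        return 1
    return sum(count_regular(k) * count_regular(m - 1 - k) for k in range(1, m - 1))


if __name__ == "__main__":
    print(regular_pattern(((LEAF, LEAF), LEAF)))  # 11000
    print([count_regular(2 * n + 1) for n in range(6)])  # [1, 1, 2, 5, 14, 42]
    print([count_binary(n) for n in range(6)])           # [1, 1, 2, 5, 14, 42]
```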

The (almost) complete tree structure is an example of an even more drastic

instance of the above phenomenon. A binary tree is said to be complete if all the

internal nodes have two children and all the paths from leaves to the root are of equal

length. In such a tree the number of nodes is of the form $2^k - 1$. A binary tree is almost

complete if it can be made complete by inserting leaves to the right-hand side end of

the bottom level. The only thing needed to describe the shape of the tree is the number

of nodes! Hence, $\lceil \log n \rceil$ bits are needed when representing the shape of an almost

complete tree on n nodes. This property is extensively used in implicit data structures

(cf. section 8.2).
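To illustrate why the node count alone fixes the shape, the sketch below rebuilds the level-by-level children pattern sequence of an almost complete tree from n. It assumes the usual heap-style numbering, in which the nodes are numbered 1..n level by level and node i has children 2i and 2i + 1 whenever those numbers do not exceed n. The function name is illustrative.

```python
def almost_complete_level_pattern(n):
    """Children patterns, level by level, of the almost complete tree on n nodes."""
    labels = []
    for i in range(1, n + 1):
        has_left = 2 * i <= n
        has_right = 2 * i + 1 <= n
        labels.append(("1" if has_left else "0") + ("1" if has_right else "0"))
    return "".join(labels)


if __name__ == "__main__":
    n = 6
    print(almost_complete_level_pattern(n))  # 111110000000
    # Storing n itself in binary takes this many bits:
    print(n.bit_length())  # 3
```

The same index arithmetic is what lets implicit data structures navigate such a tree without storing any pointers.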

One criterion for comparing different methods is how easy it is to detect from the

encoded string whether or not there are any regularities in the tree. In Zaks' sequence

the code information related to a node is in different parts of the string, while in the

children pattern method the two bits describing a node are always together. Suppose

we are compressing a regular tree without knowing its degree of regularity. By using

the children pattern method together with arithmetic coding we obtain a compression

result much better than 2 bits per node provided that the compression model reads the

string as a sequence of 2-bit integers. Even better compression gain is obtained when the

children patterns are output in the level-by-level order and the tree to be compressed

is almost complete. The resulting string has a long sequence of 1's followed by

another sequence of 0's.
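A rough way to see the effect without implementing an arithmetic coder is to compute the order-0 empirical entropy of the 2-bit symbols, which is approximately the per-node cost a static order-0 arithmetic coder would approach on a long input. This is a simplification introduced here, not the authors' model, and an adaptive or higher-order model could do better on the runs described above. The function name is an assumption.

```python
from collections import Counter
from math import log2


def order_zero_entropy(pattern):
    """Empirical entropy, in bits per node, of a children pattern sequence
    read as a sequence of 2-bit symbols."""
    symbols = [pattern[i:i + 2] for i in range(0, len(pattern), 2)]
    total = len(symbols)
    counts = Counter(symbols)
    return -sum((c / total) * log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # A regular tree only ever uses the symbols 11 and 00,
    # so the estimate cannot exceed 1 bit per node.
    print(order_zero_entropy("1111000000"))
    # A tree using three of the four symbols stays below 2 bits per node.
    print(order_zero_entropy("11111100000010110000"))
```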

3. Encoding with Varying Length Codewords

In this chapter we introduce two encoding methods based on rotations. (For the

various ways of using rotations in maintaining data structures, see e.g. [24].) These

methods assign an integer to each node of the tree in question and the codeword is

obtained by traversing the tree in symmetric order (traverse first the left subtree in

symmetric order, then visit the root, and then traverse the right subtree in symmetric

order). The number of bits needed in the resulting codeword varies because the size of

the integers assigned to the nodes depends on the shape of the tree. Sometimes fewer

than 2n bits are needed for representing a given tree's shape.


References

Compilers: Principles, Techniques, and Tools

A method for the construction of minimum-redundancy codes

Arithmetic coding for data compression

The Quadtree and Related Hierarchical Data Structures

An effective way to represent quadtrees
