How data compression works: exploring LZ78

By Dhanesh Budhrani

Dhanesh Budhrani

In this post we are going to explore LZ78, a lossless data-compression algorithm created by Lempel and Ziv in 1978. As an example, the GIF format is based on LZ78. LZ78 takes advantage of a dictionary-based data structure to compress our data. In this case, it makes use of a trie data structure, as it’s more efficient for this compression technique.

The motivation behind this approach was to get rid of the parameterization that was required to optimize LZ77’s performance. For instance, in LZ77 if our search buffer was too small, the resulting encoding would require more space, although the compression time would be lower. On the opposite side, if our search buffer was too big, the compression time would take longer, but the required space would be lower. Obviously, there is not a universal set of parameters that would perform optimally in every case: we need to optimize these paremeters depending on the pattern of our input data. All in all, one of the main motivations behind LZ78 was to create a universal compression algorithm that does not require any knowledge on the input.

Compression

Before getting into the details of the compression process, we should define the trie data structure that will help us store our dictionary of string patterns (also known as phrases):

  • It is a non-binary tree.
  • The root node represents an empty string.
  • Every node is marked with its dictionary index.
  • Every edge contains the character that should be added to get the value of the child node.
  • In order to get the value of a node, we just need to traverse from our target node to the root node, reading the resulting string from top to bottom.

With this data structure in mind, we may define the compression process:

  1. We read the next character of the input string.
  2. We check whether the current node (starting at the root node) has any outbound edge that contains the character read in step 1.
  • If so, we follow the outbound edge and set the current node as the found node. Then, we go back to step 1.
  • Otherwise, we create an outbound edge with the character read in step 1, leading to a new node. The new node is then marked with an incremental index. We create an encoding tuple with the following structure (pn, c)i where pn represents the index of the parent node, c represents the new character read in step 1 and i represents the index of the new node. After appending the new tuple to the encoding stream, we set the current node as the root node and return to step 1.

Let’s illustrate this process with an example, attempting to compress the following string:

a b a b c b a b a b a a

Initially, our trie data structure only contains the root node, which represents an empty string. We start by reading the first character: ‘a’, which is not available in any of the outbound edges of the current node. We then create a new edge from the root node to a new node, which is marked with the index 1. We create an encoding tuple with the following values: pn = 0 (root node), c = ‘a’, i = 1. Therefore, we append (0,a)1 to the encoding stream. We set the current node as the root node and read the next character (indicated onwards using the square brackets []).

a [b] a b c b a b a b a a
LZ78 encoding: (0,a)1

The root node does not have any outbound edge labelled with ‘b’, so we create it along with a new node, marked with the index 2. We create an encoding tuple with the following values: pn = 0, c = ‘b’, i = 2. Therefore, we append (0,b)2 to the encoding stream, We set the current node as the root node and read the next character.

a b [a] b c b a b a b a a
LZ78 encoding: (0,a)1, (0,b)2

The root node does have an outbound edge labelled with ‘a’, so we follow it through and find ourselves in node 1. We read the next character ‘b’ and check whether node 1 has any outbound edge labelled with ‘b’. That’s not the case, so we create a new outbound edge from node 1, leading to a new node, which is marked with the index 3. We create an encoding tuple with the following values: pn = 1, c = ‘b’, i = 3. Therefore, we append (1,b)3 to the encoding stream. We set the current node as the root node and read the next character.

a b a b [c] b a b a b a a
LZ78 encoding: (0,a)1, (0,b)2, (1,b)3

The root node does not have an outbound edge labelled with ‘c’, so we create it along with a new node, marked with the index 4. We create an encoding tuple with the following values: pn = 0, c = ‘c’, i = 4. Therefore, we append (0,c)4 to the encoding stream. We set the current node as the root node and read the next character.

a b a b c [b] a b a b a a
LZ78 encoding: (0,a)1, (0,b)2, (1,b)3, (0,c)4

The root node does have an outbound edge labelled with ‘b’, so we follow it through and find ourselves in node 2. We read the next character ‘a’ and check whether node 2 has any outbound edge labelled with ‘a’. That’s not the case, so we create a new outbound edge from node 2, leading to a new node, which is marked with the index 5. We create an encoding tuple with the following values: pn = 2, c = ‘a’, i = 5. Therefore, we append (2,a)5 to the encoding stream. We set the current node as the root node and read the next character.

a b a b c b a [b] a b a a
LZ78 encoding: (0,a)1, (0,b)2, (1,b)3, (0,c)4, (2,a)5

The root node does have an outbound edge labelled with ‘b’, so we follow it through and find ourselves in node 2. We read the next character ‘a’ and verify that node 2 does have an outbound edge labelled with ‘a’, leading to node 5. We follow it through and read the next character ‘b’. We see that node 5 does not have any outbound edge labelled with ‘b’, so we create it along with a new node, marked with the index 6. We create an encoding tuple with the following values: pn = 5, c = ‘b’, i = 6. Therefore, we append (5,b)6 to the encoding stream. We set the current node as the root node and read the next character.

a b a b c b a b a b [a] a
LZ78 encoding: (0,a)1, (0,b)2, (1,b)3, (0,c)4, (2,a)5, (5,b)6

The root node does have an outbound edge labelled with ‘a’, so we follow it through and find ourselves in node 1. We read the next character ‘a’ and verify that node 1 does not have an outbound edge labelled with ‘a’, so we create it along with a new node, which is marked with the index 7. We create an encoding tuple with the following values: pn = 1, c = ‘a’, i = 7. Therefore, we append (1,a)7 to the encoding stream. We’ve fully read the input string, so the final encoding stream looks as follows:

a b a b c b a b a b a a
LZ78 encoding: (0,a)1, (0,b)2, (1,b)3, (0,c)4, (2,a)5, (5,b)6, (1,a)7

The following image represents the final state of the trie:

LZ78 trie of ababcbababaa
LZ78 trie of ababcbababaa

Decompression

Let’s see how LZ78 uses its encoded form to reproduce the original string. LZ78 is categorized as a lossless data-compression algorithm, which means that we should be able to fully recover the original string. As opposed to LZ77, LZ78 does allow us to start decompressing from a random tuple.

In order to illustrate the decompression process, let’s attempt to decompress the obtained encoding in the previous section, aiming to obtain the original string. Therefore, our encoding stream in this example would be the following:

(0,a)1, (0,b)2, (1,b)3, (0,c)4, (2,a)5, (5,b)6, (1,a)7

As mentioned before, in order to reproduce the original string that corresponds to a node, we simply need to access the node, traverse the trie until we reach the root node and reverse the obtained string (or read it from top to bottom). With this in mind, we may easily decompress each of the nodes to their corresponding values:

  • node 1: a
  • node 2: b
  • node 3: (node 1) + b = ab
  • node 4: c
  • node 5: (node 2) + a = ba
  • node 6: (node 5) + b = bab
  • node 7: (node 1) + a = aa

Following the order of the encoding stream, we may decompress it to the following value:

Fully decompressed string: a b a b c b a b a b a a

If you check the original to-be-compressed string in the previous section, you will see that they are the same!