Compression (not definitive but a guide to some forms)

posted on Sep, 10 2005 @ 08:39 AM
I had to write an essay for school and chose Data Compression as the topic. If anyone is interested in the very basics, hopefully this will explain some of the techniques to you:



* LZ78 was the best lossless data compression algorithm at its conception *

Introduction
With the advent of mass computer communication and archiving in the 1970s, compression techniques were seen as key to the efficient storage of data passed between systems. It is my argument that the algorithm named LZ78 was the best data compression algorithm in use at the time of its creation. The LZ78 algorithm is so called because it takes its name from the initials of the mathematicians Abraham Lempel and Jacob Ziv and the year, 1978, in which they officially released it. They did much of their work in the 1970s and are considered the fathers of modern data compression. Their initial algorithms are still the basis for many of today's data compression techniques, including the popular ZIP and RAR formats.

In order to argue the statement "LZ78 was the best lossless data compression algorithm at its conception" we must examine other techniques used at the time and those under development. Whilst there are a number of unique and different compression algorithms, those considered here are all 'lossless' and can usually be grouped under the following five techniques, each with its advantages and disadvantages: Equation based, Delta Encoding, Run Length Encoding, Self-Delimiting and Dictionary based compression. Being 'lossless' means that when decoded, the decompressed data is identical to the data before compression, whereas lossy compression involves non-essential data being intentionally cut from the original, so that the decompressed data, whilst not identical, is of sufficient quality to be fit for purpose. This is often acceptable to humans, who can tolerate and interpret approximate data, but it is not suitable for computers, which need exact and unambiguous data. As a result, lossy compression is unlikely to be used for the transmission of instructions, text and computer data, as the loss of quality is unacceptable. For example, the sentence "Don't send money please" could end up being received as "Do send money please" or "Don send money please".

Equation Based
Equation based compression routines convert the data to numerical values and, in effect, plot them on a graph of location against value (such as the ASCII code of each character). An equation can then be calculated for a line that passes through all the points. Decompression is carried out by applying the locations, in order, to the equation; the solutions are output in the appropriate locations within the file. However, this technique depends on the data being convertible to numerical values. For example, to compress a text document the text must first be converted to ASCII, Unicode or a similar numeric coding system so that the values can be calculated upon. This presents issues when the data to be sent is of a format that is not really suitable for transformation into numerical values, e.g. commands. Also, not all series of data are suitable for being modelled by an equation, as some data may be small in size or totally random. Therefore, to get the most out of this lossless compression method, it is best applied to large files of a regular nature. As a lossy compression method, however, the algorithm performs well, as the quality of the output can be adjusted by varying the accuracy of the equation. One downside of the method is that it consumes a relatively large amount of processing time and system memory to calculate the values from the equation. If processing time and system memory are a concern, Delta Encoding may be a better alternative.

Delta Encoding

Delta Encoding also works by converting the data to numerical values. Compression is achieved by calculating and outputting the difference between the current value and the preceding one. The difference is often smaller than the value itself; however, issues arise when the differences are large. Assume we have a series in which the differences can all be represented in 8 bits: if the values themselves are 16 bits or longer, the series can be compressed, but if the values are only 8 bits then no advantage is gained. In addition, if the series contained an anomalous value, such as a 2-bit number within a series of 8-bit numbers, then more data than necessary would be sent, because it would be more efficient to send the 2-bit value itself than a difference that could be as large as 252 and so still needs a full 8 bits. Finally, if there were a series of repeated numbers, this is not the most efficient compression method, as a series of zeroes will simply be output as the differences. To address that situation it may be better to use Run Length Encoding, described next; a minimal sketch of Delta Encoding is shown below.
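The following is a minimal Python sketch of the Delta Encoding process just described (my own illustration, not any standard implementation). The function names and the implicit starting value of zero are assumptions made for the example:

def delta_encode(values):
    encoded = []
    previous = 0                      # assume an implicit starting value of zero
    for v in values:
        encoded.append(v - previous)  # transmit only the change from the previous value
        previous = v
    return encoded

def delta_decode(deltas):
    decoded = []
    previous = 0
    for d in deltas:
        previous += d                 # rebuild each original value from the running total
        decoded.append(previous)
    return decoded

# A slowly changing series compresses to small differences:
series = [100, 101, 103, 104, 104, 107]
print(delta_encode(series))                           # [100, 1, 2, 1, 0, 3]
print(delta_decode(delta_encode(series)) == series)   # True

Note how the run of identical values (104, 104) produces a difference of zero, which is exactly the situation that Run Length Encoding, described next, handles better.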


Run Length Encoding

Run Length Encoding (RLE) is where a series of repeated characters, which may be ASCII codes, is condensed in order to compress the data. Where a character repeats, a human could write "aaaaaaaaaa" as "a x 10" and would understand the meaning of the expression; Run Length Encoding does much the same, using very little processing power and memory. During coding, the read buffer looks at a character and the characters that follow it. If the next characters are the same, an escape character (marker) is entered along with the number of repetitions and the character itself. This can be written as <marker> α ß, where <marker> is the escape character, α the number of repetitions and ß the character to be repeated. However, if only two identical characters occur, sending them without compression would mean sending two bytes, whereas sending them using this method would mean sending three bytes. Therefore, this method actually hinders the reduction of file size where a character repeats fewer than three times. This can be overcome by coding the implementation so that it sends runs of three characters or fewer without compression, but that adds to the inefficiency of the algorithm. Despite this, the algorithm is flexible and can easily be applied to images (ones with few colours or shades are particularly suited to Run Length Encoding) and, rather than repetitions of individual characters, the code could be adjusted to allow for repetitions of entire words or sequences. Finally, RLE only compresses data that contains repetition. Within the English language, for example, a letter is rarely repeated more than three times in a word, although the language does make use of some characters far more than others. Where that is the case, Self-Delimiting may be a more appropriate alternative. A sketch of the basic RLE process is shown below.
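Below is a minimal Python sketch of the RLE idea just described. The '#' escape character and the rule of only compressing runs longer than three characters are my own illustrative choices, and a real decoder would also need to cope with literal '#' characters and multi-digit counts:

ESCAPE = "#"

def rle_encode(text):
    out = []
    i = 0
    while i < len(text):
        run = 1
        while i + run < len(text) and text[i + run] == text[i]:
            run += 1                                  # count the length of the run
        if run > 3:
            out.append(ESCAPE + str(run) + text[i])   # escape, repetitions, character
        else:
            out.append(text[i] * run)                 # short runs are cheaper sent as-is
        i += run
    return "".join(out)

print(rle_encode("aaaaaaaaaabbc"))                    # "#10abbc"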

Self-Delimiting

Self-Delimiting is where the frequency of each character is gathered from the entire data source and the probability of each character appearing is stored. This can be done in sections, but that requires restarting the process and in effect creates a new document appended to the already encoded one. Once the probabilities have been gathered, the character with the highest probability of occurrence is given the shortest code; the codes get progressively longer until the two least likely characters are given the longest codes. To illustrate this, assume we have a document containing only the letters A, B, C, D and E, with the following probabilities of occurrence:

Letter   Probability
A        0.4
B        0.2
C        0.2
D        0.1
E        0.1

Based upon this, the letter with the highest probability is given the shortest code and each subsequent letter is assigned a progressively longer code, as in the diagram below:

*Here would be a diagram PM me if you want it*

Here we can see that the most frequently used letter, A, is given the shortest binary code of just 1, while the least frequently used letters, D and E, are given the longest codes. This system can be very effective where some letters are used much more frequently than others. However, in a random document, where the characters are used more or less equally, this technique provides little compression; depending on the lengths of the binary codes, data can actually be added and the file enlarged. In addition, the table of codes must be sent with the data, and with small files sending the table alone can add more data than is saved! A sketch of how such variable-length codes can be assigned is shown below; these drawbacks lead us to a better compression technique known as Dictionary Coding.
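The essay above does not name a particular construction, but the well-known Huffman algorithm is one standard way of assigning such variable-length codes. Below is a minimal Python sketch using the probabilities from the table above; the exact codes produced depend on how ties between equal probabilities are broken, but the least likely letters, D and E, always end up with the longest codes:

import heapq

def build_codes(probabilities):
    # Each heap entry: (combined probability, tie-breaker, {letter: code so far})
    heap = [(p, i, {letter: ""}) for i, (letter, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)   # the two least likely groups...
        p2, _, group2 = heapq.heappop(heap)
        merged = {letter: "0" + code for letter, code in group1.items()}
        merged.update({letter: "1" + code for letter, code in group2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))   # ...are merged into one group
        counter += 1
    return heap[0][2]

codes = build_codes({"A": 0.4, "B": 0.2, "C": 0.2, "D": 0.1, "E": 0.1})
print(codes)   # the frequent letters get short codes; D and E get the longest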

Dictionary Coding

The basis of Dictionary Coding is that it creates and stores a code for any word it comes across that is not already in its custom-built dictionary, and can then reference that code in future. In practice the 'words' need not be whole words; they can be fragments of what we know as words and can include other characters, including spaces. The best-known dictionary techniques are those developed by Lempel and Ziv, namely the LZ77 and LZ78 algorithms. Each differs in some way, yet both are based upon the idea of using a dictionary from which subsequent identical words can be referenced.

The LZ77 algorithm uses a two-buffer system for reading the data and linking back to it. The first buffer is the "Look-Ahead Buffer", which reads the data that is about to be encoded. It is preceded by the "Search Buffer", which holds the data that has just been encoded. When the first character in the Look-Ahead buffer is read, if it matches a character in the Search buffer then the next character is read in both buffers; if there is no match, the character is sent as a single uncompressed character. If there is a match, the process continues until the longest possible match is identified. When the longest match is located in the Search buffer, a 'pointer' is inserted instead of the word (note that a 'word' here is not a literal word but a collection of characters). The pointer must refer back to a position in the Search buffer, although the repetition itself may continue into the Look-Ahead buffer. The pointer includes an escape character, the offset of the word (the distance in characters between the start of each occurrence), the length of the match and the following character. Like the RLE method, this obviously adds data when a repetition is shorter than the pointer used to encode it. However, depending on the word sizes stored in the dictionary, it can be very efficient at compression, especially when the type of data is known in advance so that the buffer sizes can be optimised. The technique can be very memory demanding, however, as the buffers grow. Whilst not intensive on the processor, it may take longer than other methods, as data must be retrieved from main memory because the buffers are often too big to fit within the processor's limited cache. The method is also wasteful, as the algorithm constantly looks back into areas it has already dealt with, and the size of the buffers can sometimes cause problems of its own. Given the sequence ABCDEFGHI repeated, with a Search buffer of 8 characters and a Look-Ahead buffer of 7 characters, the graphical representation would be as follows:

*Here would be a diagram PM me if you want it*

Here there would be no compression: when the algorithm examines the second A in the Look-Ahead buffer it finds no match, because the first A has just been pushed out of the Search buffer; by the time the second B is examined, the first B has been pushed out in turn, and so on for every character. This shows that some sequences are incompressible with certain buffer sizes, as the sketch below also demonstrates. The drawback can be overcome by using a global (in the sense of document-wide) dictionary of words.
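To make the buffer behaviour concrete, here is a minimal Python sketch of an LZ77-style encoder. The (offset, length, next character) triple format and the default buffer sizes of 8 and 7 characters (matching the example above) are illustrative assumptions rather than a definitive implementation:

def lz77_encode(data, search_size=8, lookahead_size=7):
    output = []                    # list of (offset, length, next character) triples
    pos = 0
    while pos < len(data):
        best_offset, best_length = 0, 0
        start = max(0, pos - search_size)
        for candidate in range(start, pos):            # try every position in the Search buffer
            length = 0
            while (length < lookahead_size
                   and pos + length < len(data)
                   and data[candidate + length] == data[pos + length]):
                length += 1                            # the match may run on into the Look-Ahead buffer
            if length > best_length:
                best_offset, best_length = pos - candidate, length
        next_index = pos + best_length
        next_char = data[next_index] if next_index < len(data) else ""
        output.append((best_offset, best_length, next_char))
        pos = next_index + 1
    return output

print(lz77_encode("ABABABABC"))
# [(0, 0, 'A'), (0, 0, 'B'), (2, 6, 'C')] - one pointer covers most of the data
print(len(lz77_encode("ABCDEFGHIABCDEFGHI")))
# 18 - every character is sent as a literal, as in the failed example above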

LZ78 takes exactly this global approach: it creates a dictionary of words and single characters based upon the document itself, rather than upon a search buffer as in LZ77. The algorithm starts with an empty dictionary. As each character is read, the algorithm checks whether it has an entry; if not, the character is added to the dictionary and sent as normal. From then on, whenever a character or word with an entry is found, the code for that entry is output in its place, and the entry together with its following character is added to the dictionary as a new entry. This growth eventually has to be limited, or the process will continue until the document ends or the system runs out of memory trying to hold the dictionary. This has the advantage over LZ77 that the dictionary and its references have global scope, meaning that if a word appears at both the beginning and the end of the data it needs only one code throughout. Additionally, this method does not require the dictionary to be sent separately: because characters are still transmitted unaltered when they are first added, the dictionary can be recreated during decompression in exactly the same way it was created during compression. The process is the most memory demanding of all the methods mentioned and can use considerable processor time during compression, as each word is compared with all those in the dictionary (even though a binary tree structure is used to speed up searches). A simplified sketch of this process is shown below.
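Below is a simplified, textbook-style LZ78 sketch in Python. The (dictionary index, next character) pair format and the absence of any limit on dictionary growth are simplifications for illustration; the point made above, that the decoder can rebuild the same dictionary and so it never has to be transmitted, still holds:

def lz78_encode(data):
    dictionary = {}            # phrase -> dictionary index
    output = []                # list of (index of longest known phrase, next character)
    phrase = ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch       # keep extending the longest known phrase
        else:
            output.append((dictionary.get(phrase, 0), ch))   # index 0 means "no phrase"
            dictionary[phrase + ch] = len(dictionary) + 1    # learn the new, longer phrase
            phrase = ""
    if phrase:                 # flush any phrase left over at the end of the data
        output.append((dictionary[phrase], ""))
    return output

print(lz78_encode("ABABABA"))
# [(0, 'A'), (0, 'B'), (1, 'B'), (3, 'A')] - later repetitions reuse earlier entries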

Summary
Having seen some of the different compression methods, it is clear that each has different characteristics, and some place more demanding requirements on the system than others. However, in order to argue whether LZ78 was the best lossless data compression algorithm at its conception, we need to compare LZ78 with the other algorithms and techniques using the following widely recognised criteria:

- Complexity
- Memory requirements
- Compression efficiency

In terms of complexity, LZ78 is one of the more complex algorithms described. Creating the dictionary at each end involves searching through the dictionary and assigning codes where necessary, and with larger dictionaries locating matches involves more computation. By comparison, its predecessor, LZ77, is not as complex, owing to the limited size of its buffers and its simpler referencing.

Equation based algorithms, by contrast, tend to be more complex still, owing to the work involved in devising the equation for the data during compression and in calculating values from it during decompression.

Delta Encoding is the simplest of the algorithms mentioned. Since decoding only involves adding each received difference to the previously calculated value, coding and decoding are simple and relatively trivial processes for a computer.

Self-Delimiting, whilst assigning codes to characters is not easy, is a simpler process to apply and decode, making the algorithm less complex than Dictionary Coding. However, it is more complex than Delta Encoding and Run Length Encoding, which are very simple in comparison to all the other methods discussed.

Having discussed memory requirements while describing the compression techniques, it can be seen that Dictionary Coding, and specifically LZ78, uses the most memory and is the most intensive in its access to memory, followed by Self-Delimiting as the next most memory intensive technique. Equation based compression uses varying amounts of memory depending on the complexity of the equation to be generated or evaluated. Both Run Length Encoding and Delta Encoding use very little memory.

Concerning compression efficiency, it is difficult to evaluate the different techniques fairly, as they are all designed for different situations. For instance, data of a repetitive nature would be better compressed using Run Length Encoding than Dictionary Coding, whereas data in which words are repeated at random intervals throughout a transmission would be compressed better by Dictionary Coding than by Run Length Encoding. Therefore, compression efficiency is only a valid measure of the quality of an algorithm when the algorithms are applied to the same data.

Conclusion
The objective of this paper was to argue that "LZ78 was the best lossless data compression algorithm at its conception". LZ78 was not designed to operate independently, but in combination with other algorithms; without them, LZ78 would be very limited, resource hungry and inefficient in certain compression situations.

Therefore, it is concluded that although the creation of LZ78 greatly improved the efficiency and quality of data compression, it was not 'the best' lossless data compression algorithm. It did, however, make a very significant and important contribution to improving data compression when combined with other data compression methods, and many data compression programs used today are based on a variant of LZ78.





References

Books

Salomon, D., Data Compression – The Complete Reference, 2nd Edition (Berlin: Springer, 1999), ISBN: 0-387-982-80-9
Sayood, K., Introduction to Data Compression, 2nd Edition (San Francisco: Morgan Kaufmann, 2000), ISBN: 1-55860-558-4


Journals

Welch, T. A., 'A Technique for High-Performance Data Compression', Computer, 1984, pp. 8-18.
Ziv, J. and Lempel, A., 'Compression of Individual Sequences Via Variable-Rate Coding', IEEE Transactions on Information Theory, Vol. 24 (1978), pp. 530-536.

University of Surrey Library.


Websites (all sites cached on 21/08/05):

www.data-compression.com...
www.dcs.ed.ac.uk...
en.wikipedia.org...
en.wikipedia.org...
www.faqs.org...
www.ics.uci.edu...
www.ics.uci.edu...
michael.dipperstein.com...
michael.dipperstein.com...
michael.dipperstein.com...
michael.dipperstein.com...
www.prepressure.com...
www.prepressure.com...
www.prepressure.com...
www.prepressure.com...
www.vectorsite.net...
www.wolframscience.com...


[edit on 10/9/05 by Infidellic]


