Assignment #8: Huffman Encoder and Decoder

Due: Friday, December 3, 2PM

Overview of the Lab

For this lab, you will write a program to compress or decompress a text file using Huffman encoding and decoding as described in lecture. Your program should have a graphical user interface which will prompt the user for the name of an input file and the name of an output file. There should be two buttons labelled `Encode' or `Decode' which will initiate the encoding or decoding process. If the program is encoding, the input file is a text file and the output file is a binary file. The opposite should be true of the program is decoding. I recommend that you make your program a stand-alone program that implements JFrame, so you do not run into any security problems opening files.

Your encoding scheme should work by making two passes through the file. In the first pass, you will simply collect the frequency data about how many times each ASCII character has appeared in the file. Then your program will build the Huffman Tree for these character and store the tree in the compact table form described in lecture and in your book. The first thing you will write to the output file is the table describing the Huffman tree. Then, in the second pass of the input file, you will read each character and write its code into the output file. You may discover that you don't actually need to create the tree as an intermediate step.

File I/O

I have provided for your the MyFileReader and MyFileWriter class that can just open up and read from or write to a file, one byte at a time. Note that the read method in MyFileWriter actually returns an integer. If the file being read is a text file, the int returned should always be in the range from 0 through 127. (As usual, you should check of any unexpected behavior and handle it gracefully).

A word of warning about casting between bytes and integers. When a byte is cast as an integer, it will create a number in the range from -128 through 127. In order to get the correspinding number in the range from 0 through 255, you should add 256 if the resulting number is negative after the cast.

Make sure to close the file when you are done with it.

There is no mechanism to write a single bit to a file or read a single bit from a file (things you will want to do for your assignment). Instead, you will have to accumulate bits to be written to the file until 8 have been accumulated. Then the 8 bits should be writted as a byte to the file. Similarly for reading. These utilities are best written within the MyFileReader and MyFileWriter class.

Writing bits creates a special problem, however, since the encoded file may not end on a byte boundary. You need, therefore, some special mechanism to indicate to the decoder that the remaining bits are not actually part of the file. We will have an ASCII charachter which we will designate as the end of file marker. This will be the ASCII code 1 which you should assign as a final, static vaiable LAST_CHAR. (We are assuming here that the ASCII code 1 will never appear in the input file, which is reasonable since it is a fairly obscure unprintable character). Your encoder should set the frequency in the file of this character to be 1. Then, when it is done writing out the codes for all the characters in the input text file, output the code for LAST_CHAR. The decoder will decode each characer. If the character is LAST_CHAR, it will ignore any remaining bits in the encoded file. It is an error if the end of the encoded file is reached before the code for LAST_CHAR is read.

Remember when closing the file in the encoding process, your FileWriter needs to output any remaining bits which are being kept to accumulate a full byte.

The Huffman Table

The Huffman tree should be stored in a compact table form as described in lecture. You should have a leaf in your Huffman tree for each of the 128 ASCII characters, even if some of them never appear in the input file. They will just have a frequency of zero. This means that the Huffman tree will have exactly 255 nodes which should be named from 0 through 254. The leaves of the tree should have names in the range from 0 through 127. The node representing ASCII character x should have name x.

The table which describes the tree will have exactly 255 bytes, each of which will describe a single node. If the bytes are numbered in order from 0 through 254, then the node with name x should be byte x in the table. Thus, the first 128 bytes will represent the 128 leaves of the tree and next 127 bytes will represent the internal nodes. The decoder will assume that the first 255 bytes in the encoded file describe the Huffman tree and the remaining represent the encoded characters.

Each byte will indicate the parent of the node and its child type (i.e. whether it is a left or a right child). The child type for node x, will be stored in the low order bit of byte x in the table. This bit will be 0 if the node is a left child and 1 if it is a right child. The parent information will be stored in the seven higher order bits of byte x. Note that 7 bits can only represent numbers in the range from 0 through 127, but there are 255 nodes in the tree. However, any node which is a parent will be in the range from 128 through 254, so the first bit will always be a 1. Thus, you will need to add 128 to the number represented by the 7 high order bits from byte x to get the parent of node x. (Alternatively, you can shift right and mask in a 1 for the high-order bit).

For example, suppose that node 5 is the left child of node 241. Then byte 5 in the table will be `11100010'. The rightmost 0 represents the fact that node 5 is a left child of its parent. The seven leftmost bits (`1110001') represent 113. When we add 128 to 113, we get 241, the parent of node 5.

The root of the tree should indicate that its parent is node 255 (which does not exist since the nodes are numbered from 0 through 254). This is how you can distinguish the root. The child type of the root does not matter. Thus, the byte representing the root should look like `1111111b' in binary, where b is either 0 or 1.

Remember that it is good programming practice to have all of these constants (128, 255, etc.) named as final static integers in your methods and appear only as variable names in your programming.

Compression factor

In addition to creating the encoded output file, you should write the compression rate to the graphical user interface. That is, you should write the number of bytes in the input file (I), the number of bytes in the output file (O), and the percentage of compression: 100(I - O)/I. Note that for short files, the output file may be even longer than the input file due to the overhead of printing out the table. This could be partially addressed by only creating the Huffman tree for the characters which appear in the input file. For simplicity, we will not add this feature for this assignment.

Java Files

Class Files for Completed Huffman Encoder and Decoder

Here are the .class files for a Huffman encoder and decoder that I have written. You can use them to test your code. You can run them through the Command Prompt by typing `java HuffmanEncoder' or `java HuffmanDecoder'. Make sure all the classfiles are in the directory from which you issue that command. In my version, the compression or decompression is initiated when you hit return in the output file text field. You can find all the files by clicking here .