For this lab, you will write a program to compress or decompress a text file using Huffman encoding and decoding as described in lecture. Your program should have a graphical user interface which will prompt the user for the name of an input file and the name of an output file. There should be two buttons labelled `Encode' or `Decode' which will initiate the encoding or decoding process. If the program is encoding, the input file is a text file and the output file is a binary file. The opposite should be true of the program is decoding. I recommend that you make your program a stand-alone program that implements JFrame, so you do not run into any security problems opening files.
Your encoding scheme should work by making two passes through the file. In the first pass, you will simply collect the frequency data about how many times each ASCII character has appeared in the file. Then your program will build the Huffman Tree for these character and store the tree in the compact table form described in lecture and in your book. The first thing you will write to the output file is the table describing the Huffman tree. Then, in the second pass of the input file, you will read each character and write its code into the output file. You may discover that you don't actually need to create the tree as an intermediate step.A word of warning about casting between bytes and integers. When a byte is cast as an integer, it will create a number in the range from -128 through 127. In order to get the correspinding number in the range from 0 through 255, you should add 256 if the resulting number is negative after the cast.
Make sure to close the file when you are done with it.
There is no mechanism to write a single bit to a file or read a single bit from a file (things you will want to do for your assignment). Instead, you will have to accumulate bits to be written to the file until 8 have been accumulated. Then the 8 bits should be writted as a byte to the file. Similarly for reading. These utilities are best written within the MyFileReader and MyFileWriter class.
Writing bits creates a special problem, however, since the encoded file may not end on a byte boundary. You need, therefore, some special mechanism to indicate to the decoder that the remaining bits are not actually part of the file. We will have an ASCII charachter which we will designate as the end of file marker. This will be the ASCII code 1 which you should assign as a final, static vaiable LAST_CHAR. (We are assuming here that the ASCII code 1 will never appear in the input file, which is reasonable since it is a fairly obscure unprintable character). Your encoder should set the frequency in the file of this character to be 1. Then, when it is done writing out the codes for all the characters in the input text file, output the code for LAST_CHAR. The decoder will decode each characer. If the character is LAST_CHAR, it will ignore any remaining bits in the encoded file. It is an error if the end of the encoded file is reached before the code for LAST_CHAR is read.
Remember when closing the file in the encoding process,
your FileWriter
needs to output any remaining bits which are being kept
to accumulate a full byte.
The table which describes the tree will have exactly 255 bytes, each of which will describe a single node. If the bytes are numbered in order from 0 through 254, then the node with name x should be byte x in the table. Thus, the first 128 bytes will represent the 128 leaves of the tree and next 127 bytes will represent the internal nodes. The decoder will assume that the first 255 bytes in the encoded file describe the Huffman tree and the remaining represent the encoded characters.
Each byte will indicate the parent of the node and its child type (i.e. whether it is a left or a right child). The child type for node x, will be stored in the low order bit of byte x in the table. This bit will be 0 if the node is a left child and 1 if it is a right child. The parent information will be stored in the seven higher order bits of byte x. Note that 7 bits can only represent numbers in the range from 0 through 127, but there are 255 nodes in the tree. However, any node which is a parent will be in the range from 128 through 254, so the first bit will always be a 1. Thus, you will need to add 128 to the number represented by the 7 high order bits from byte x to get the parent of node x. (Alternatively, you can shift right and mask in a 1 for the high-order bit).
For example, suppose that node 5 is the left child of node 241. Then byte 5 in the table will be `11100010'. The rightmost 0 represents the fact that node 5 is a left child of its parent. The seven leftmost bits (`1110001') represent 113. When we add 128 to 113, we get 241, the parent of node 5.
The root of the tree should indicate that its parent is node 255 (which does not exist since the nodes are numbered from 0 through 254). This is how you can distinguish the root. The child type of the root does not matter. Thus, the byte representing the root should look like `1111111b' in binary, where b is either 0 or 1.
Remember that it is good programming practice to have all of these constants (128, 255, etc.) named as final static integers in your methods and appear only as variable names in your programming.