Huffman coding explained
Taken from Wikipedia
In computer science and information theory, Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file), where the table has been derived in a particular way from the estimated probability of occurrence of each possible value of the source symbol. David A. Huffman developed the algorithm while he was a Ph.D. student at MIT and published it in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes."
Huffman coding uses a specific method for choosing the representation of each symbol, resulting in a prefix code (sometimes called a "prefix-free code"): the bit string representing any particular symbol is never a prefix of the bit string representing any other symbol. The code expresses the most common source symbols using shorter strings of bits than are used for less common source symbols. Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code.
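The prefix property described above can be checked mechanically. The following is a minimal Python sketch, not part of the original text: the function names and the four-symbol code table are hypothetical, chosen only to illustrate the idea. It tests whether a code table is prefix-free, computes the average code length under given symbol probabilities, and decodes a bit string by reading bits until a complete codeword is seen.

```python
def is_prefix_free(codes):
    """Return True if no codeword is a prefix of a different codeword."""
    words = list(codes.values())
    for i, a in enumerate(words):
        for j, b in enumerate(words):
            if i != j and b.startswith(a):
                return False
    return True


def average_length(codes, probabilities):
    """Expected number of bits per symbol under the given probabilities."""
    return sum(probabilities[sym] * len(codes[sym]) for sym in codes)


def decode(bits, codes):
    """Decode a string of '0'/'1' characters using a prefix-free code table."""
    inverse = {word: sym for sym, word in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        # Because the code is prefix-free, the first match is the only match.
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(out)


# Hypothetical code table and probabilities, for illustration only.
example_codes = {"a": "0", "b": "10", "c": "110", "d": "111"}
example_probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

print(is_prefix_free(example_codes))                 # True
print(average_length(example_codes, example_probs))  # 1.75
print(decode("010110111", example_codes))            # abcd
```

With these assumed probabilities, a fixed-length code would need 2 bits per symbol, while the variable-length table averages 1.75 bits, because the most common symbol gets the shortest codeword.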
Huffman coding is such a widespread method for creating prefix codes that the term "Huffman code" is widely used as a synonym for "prefix code," even when the code in question was not produced by Huffman's algorithm.
The technique works by creating a binary tree of nodes. Initially, all nodes are leaf nodes, each of which contains the symbol itself, the weight (frequency of appearance) of the symbol, and optionally a link to a parent node, which makes it easy to read the code (in reverse) starting from a leaf node. Internal nodes contain a weight (the sum of their children's weights), links to two child nodes, and the optional link to a parent node. By convention, bit '0' represents following the left child and bit '1' represents following the right child. A finished tree has up to n leaf nodes and n-1 internal nodes, where n is the number of symbols in the alphabet. A Huffman tree that omits unused symbols produces optimal code lengths.
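One possible way to represent such nodes is sketched below in Python. The class name `Node`, the helper names `join` and `code_from_leaf`, and the three-symbol tree are all assumptions made for illustration. The tree here is assembled by hand, not by Huffman's procedure, so that the sketch shows only the node layout and how a code is read in reverse by following parent links from a leaf.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    weight: float
    symbol: Optional[str] = None      # set only on leaf nodes
    left: Optional["Node"] = None     # followed on bit '0'
    right: Optional["Node"] = None    # followed on bit '1'
    parent: Optional["Node"] = field(default=None, repr=False)


def join(left, right):
    """Create an internal node whose weight is the sum of its children's."""
    parent = Node(weight=left.weight + right.weight, left=left, right=right)
    left.parent = parent
    right.parent = parent
    return parent


def code_from_leaf(leaf):
    """Walk parent links up to the root, then reverse the collected bits."""
    bits = []
    node = leaf
    while node.parent is not None:
        bits.append("0" if node.parent.left is node else "1")
        node = node.parent
    return "".join(reversed(bits))


# Hand-built example tree with n = 3 leaves and n - 1 = 2 internal nodes.
a, b, c = Node(5, "a"), Node(2, "b"), Node(1, "c")
inner = join(b, c)
root = join(a, inner)

print(code_from_leaf(a), code_from_leaf(b), code_from_leaf(c))  # 0 10 11
print(root.weight)                                              # 8
```

The parent link is optional, as the text notes: codes can equally be produced by walking down from the root and accumulating bits, in which case the `parent` field can be dropped.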
The process begins with the leaf nodes, each containing the probability of the symbol it represents. A new node is then created whose children are the two nodes with the smallest probabilities, and whose probability is equal to the sum of its children's probabilities. The two merged nodes are no longer considered on their own; the new node takes their place. The procedure is repeated until only one node, the root of the Huffman tree, remains.
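The merging procedure above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the function name `huffman_codes` is hypothetical, and the use of a binary heap (Python's `heapq`) to find the two smallest nodes is an implementation choice not stated in the text. The sketch also assumes that symbols are not tuples, since tuples are used here to represent internal nodes, and it assigns the code "0" to the only symbol when the input contains just one.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(weights):
    """Map each symbol to a bit string; `weights` maps symbol -> weight."""
    tiebreak = count()  # unique sequence number, so trees are never compared
    # A tree is either a bare symbol (leaf) or a (left, right) tuple.
    heap = [(w, next(tiebreak), sym) for sym, w in weights.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}  # assumption: a lone symbol gets one bit

    # Repeatedly merge the two lowest-weight nodes until one root remains.
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (t1, t2)))

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):      # internal node
            walk(tree[0], prefix + "0")  # left child: bit '0'
            walk(tree[1], prefix + "1")  # right child: bit '1'
        else:                            # leaf node
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


text = "abracadabra"
codes = huffman_codes(Counter(text))
total_bits = sum(len(codes[ch]) for ch in text)
print(codes)
print(total_bits)  # 23
```

For the string "abracadabra" the weights are a=5, b=2, r=2, c=1, d=1. Ties between equal weights can be broken in different ways, so the individual codewords may differ between implementations, but the total encoded length for this input is 23 bits in every case.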