ml3m's journey

A Study That Started from Curiosity About How Data Is Compressed

Huffman Encoding Study

How is data compressed, and what does compression actually involve? Which computations are performed, which algorithms are at work, and what happens inside the box that big data enters so that we can access it in a lighter form?

My curiosity was piqued while watching ThePrimeAgen's Project: DOOM, streamed over SSH on Twitch. Watching him cut traffic costs by implementing his own encoding techniques sparked a deep interest in understanding data compression.

Exploring Compression Techniques

I began my exploration with Huffman encoding and discovered that it works best when combined with other techniques, such as:

- Run-Length Encoding (RLE)
- Burrows-Wheeler Transform (BWT)
- Move-To-Front (MTF) encoding

Through my tests, I noted that plain Huffman encoding is the most sensible choice for scrambled data with no long runs, since it exploits only symbol frequencies; the run-based techniques above need repetition before they pay off.
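
To make the mechanics concrete, here is a minimal Go sketch of Huffman coding (an illustrative toy, not the implementation from my repository): count symbol frequencies, keep the nodes in a min-heap, repeatedly merge the two least frequent nodes, and read each symbol's code off its root-to-leaf path.

```go
package main

import (
	"container/heap"
	"fmt"
)

// node is one vertex of the Huffman tree: leaves hold a symbol,
// internal nodes hold the combined frequency of their children.
type node struct {
	freq        int
	sym         byte
	left, right *node
}

// pq is a min-heap of nodes ordered by frequency.
type pq []*node

func (p pq) Len() int           { return len(p) }
func (p pq) Less(i, j int) bool { return p[i].freq < p[j].freq }
func (p pq) Swap(i, j int)      { p[i], p[j] = p[j], p[i] }
func (p *pq) Push(x any)        { *p = append(*p, x.(*node)) }
func (p *pq) Pop() any {
	old := *p
	n := old[len(old)-1]
	*p = old[:len(old)-1]
	return n
}

// build counts symbol frequencies, then repeatedly merges the two
// least frequent nodes until a single root remains.
func build(data []byte) *node {
	freq := map[byte]int{}
	for _, b := range data {
		freq[b]++
	}
	h := pq{}
	for sym, f := range freq {
		h = append(h, &node{freq: f, sym: sym})
	}
	heap.Init(&h)
	for h.Len() > 1 {
		a := heap.Pop(&h).(*node)
		b := heap.Pop(&h).(*node)
		heap.Push(&h, &node{freq: a.freq + b.freq, left: a, right: b})
	}
	return heap.Pop(&h).(*node)
}

// codes reads each symbol's code off its root-to-leaf path:
// 0 for a left edge, 1 for a right edge.
func codes(n *node, prefix string, out map[byte]string) {
	if n.left == nil && n.right == nil {
		out[n.sym] = prefix
		return
	}
	codes(n.left, prefix+"0", out)
	codes(n.right, prefix+"1", out)
}

func main() {
	table := map[byte]string{}
	codes(build([]byte("abracadabra")), "", table)
	fmt.Println(table) // 'a', the most frequent symbol, gets the shortest code
}
```

Frequent symbols sit near the root and get short codes; that asymmetry is the entire source of the compression.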

Performance and Space Comparison

I conducted tests on various input sizes to compare the speed and space efficiency of the different encoding techniques. The compression ratio is the encoded size in bits divided by the original size at 8 bits per character; for example, a 100-character input encoded in 340 bits gives 340 / 800 = 0.425. Below are the results:

Testing with Input Size: 100 Characters

String: qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrttttttttttyyyyyyyyyyuuuuuuuuuuiiiiiiiiiioooooooooopppppppppp

| Encoding Technique | Time (µs) | Encoded Size (bits) | Compression Ratio |
| --- | --- | --- | --- |
| Simple Encoding | 107.458 | 800 | 1.00000 |
| Huffman Encoding | 35.209 | 340 | 0.42500 |
| BWT + Huffman Encoding | 50.125 | 354 | 0.44250 |
| RLE + Huffman Encoding | 19.542 | 84 | 0.10500 |
| BWT + RLE + Huffman Encoding | 44.042 | 135 | 0.16875 |
| BWT + MTF + RLE + Huffman Encoding | 97.583 | 126 | 0.15750 |
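
This first string is ten runs of ten identical characters, which is exactly the shape run-length encoding is built for: each run collapses to a (count, symbol) pair before Huffman ever sees it, which is why RLE + Huffman wins this round. A minimal sketch, assuming run lengths are capped to fit in one byte:

```go
package main

import "fmt"

// rleEncode collapses each run into a (count, symbol) pair. Runs
// longer than 255 are split so the count always fits in one byte.
func rleEncode(data []byte) []byte {
	var out []byte
	for i := 0; i < len(data); {
		j := i
		for j < len(data) && data[j] == data[i] && j-i < 255 {
			j++
		}
		out = append(out, byte(j-i), data[i])
		i = j
	}
	return out
}

func main() {
	// Two ten-character runs collapse to two pairs (4 bytes total).
	fmt.Println(rleEncode([]byte("qqqqqqqqqqwwwwwwwwww")))
	// Output: [10 113 10 119]
}
```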

Testing with Input Size: 100 Characters

String: abababababababababababababababababababababababababababababababababababababababababababababababababab

| Encoding Technique | Time (µs) | Encoded Size (bits) | Compression Ratio |
| --- | --- | --- | --- |
| Simple Encoding | 50.083 | 800 | 1.00000 |
| Huffman Encoding | 11.833 | 100 | 0.12500 |
| BWT + Huffman Encoding | 39.833 | 152 | 0.19000 |
| RLE + Huffman Encoding | 45.042 | 300 | 0.37500 |
| BWT + RLE + Huffman Encoding | 31.000 | 20 | 0.02500 |
| BWT + MTF + RLE + Huffman Encoding | 68.708 | 12 | 0.01500 |
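
This string has no runs at all, so RLE alone gains little, but BWT manufactures the runs: sorting every rotation of the string groups identical characters together in the last column, and MTF then turns that locality into a stream of small numbers that Huffman can code cheaply. A naive sketch of both transforms (the O(n² log n) rotation sort is for illustration only; real implementations use suffix arrays):

```go
package main

import (
	"fmt"
	"sort"
)

// bwt returns the last column of the sorted rotation matrix plus the
// row index of the original string (needed to invert the transform;
// with repeated rotations, as here, the index is ambiguous, which is
// fine for this demo).
func bwt(s string) (string, int) {
	n := len(s)
	rot := make([]string, n)
	for i := 0; i < n; i++ {
		rot[i] = s[i:] + s[:i]
	}
	sort.Strings(rot)
	last := make([]byte, n)
	idx := 0
	for i, r := range rot {
		last[i] = r[n-1]
		if r == s {
			idx = i
		}
	}
	return string(last), idx
}

// mtf replaces each byte with its position in a recency list, so
// recently seen symbols become small numbers (cheap for Huffman).
func mtf(data []byte) []byte {
	alphabet := make([]byte, 256)
	for i := range alphabet {
		alphabet[i] = byte(i)
	}
	out := make([]byte, len(data))
	for k, b := range data {
		pos := 0
		for alphabet[pos] != b {
			pos++
		}
		out[k] = byte(pos)
		copy(alphabet[1:pos+1], alphabet[:pos]) // shift right, then...
		alphabet[0] = b                         // ...move b to the front
	}
	return out
}

func main() {
	t, idx := bwt("abababab")
	fmt.Println(t, idx)         // "bbbbaaaa": two long runs appear
	fmt.Println(mtf([]byte(t))) // [98 0 0 0 98 0 0 0]: mostly zeros
}
```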

Additional Tests

I conducted further tests with varying input sizes, including strings of 26, 52, 1, 99, and 6 characters. The results consistently showed that while simple encoding maintained the original size, Huffman encoding and its combinations significantly reduced the size, especially for repetitive data.

Analysis

From the results, it is evident that:

  1. Efficiency of Encoding Techniques: Huffman encoding, when combined with other techniques like RLE and BWT, can drastically reduce the size of data, especially for repetitive patterns. For instance, RLE combined with Huffman encoding achieved a compression ratio of 0.00379 for highly repetitive data.

  2. Performance Trade-offs: While Huffman encoding is efficient in terms of size, it can be slower than simple encoding for small inputs, and encoding time varies significantly with the structure of the input data. A minimal harness for reproducing these measurements is sketched after this list.

  3. Optimal Use Cases: The choice of encoding technique should depend on the nature of the data. For random data, Huffman encoding is more effective, while for repetitive data, RLE combined with Huffman encoding yields better results.
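
To make the measurement methodology concrete, here is a minimal Go sketch of a timing-and-ratio harness. The measure helper is my own illustrative construction, not the harness behind the tables above; in particular it assumes byte-aligned output, while the tables count raw bits.

```go
package main

import (
	"fmt"
	"time"
)

// measure reports wall-clock time and compression ratio for one
// encoder, mirroring the columns in the tables above. For inputs this
// small, a real comparison should average many runs (e.g. testing.B).
func measure(name string, input []byte, encode func([]byte) []byte) {
	start := time.Now()
	out := encode(input)
	elapsed := time.Since(start)
	ratio := float64(len(out)) / float64(len(input))
	fmt.Printf("%-20s %12v %8d bits %9.5f\n", name, elapsed, len(out)*8, ratio)
}

func main() {
	input := []byte("qqqqqqqqqqwwwwwwwwww")
	// The identity function stands in for "simple encoding"; plug in
	// real encoders (such as the sketches above) to compare pipelines.
	measure("simple encoding", input, func(b []byte) []byte { return b })
}
```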

Conclusion

This investigation into data compression has been enlightening. The interplay between different encoding techniques reveals the complexity and efficiency of data handling in modern computing. The ability to visualize the Huffman tree using the Go library I discovered added a fascinating layer to my understanding.

Moving forward, I plan to delve deeper into JPEG compression and explore how photo matrices are compressed. This journey into data compression has not only satisfied my curiosity but has also opened new avenues for exploration in the realm of data science and computer science.

Tree Photo

GitHub Repository