A Study That Started with Curiosity About How Data Is Compressed
Huffman Encoding Study
How is data compressed? What does compression actually involve? What computations are done, what algorithms are involved, and what goes on inside the box that big data enters and leaves in a lighter, still accessible form?
My curiosity was piqued while watching ThePrimeAgen's Project: DOOM, streamed via an SSH server on Twitch. I watched him cut traffic costs by implementing his own encoding techniques, and it sparked a deep interest in understanding data compression.
Exploring Compression Techniques
I began my exploration with Huffman encoding and discovered that it works best when combined with various techniques, such as:
- BWT (Burrows-Wheeler Transform)
- LZW (Lempel-Ziv-Welch)
- MTF (Move-To-Front)
- RLE (Run-Length Encoding)
Through my research, I noted that Huffman encoding depends only on how often each symbol occurs, not on the order of the symbols, so it holds up even when characters of different types are scrambled; run-based techniques like RLE, by contrast, only pay off when the data contains long runs.
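To make the core idea concrete, here is a minimal Go sketch of building a Huffman code table, assuming byte symbols and at least two distinct characters in the input; it is not the implementation I benchmarked, just the essential steps: count frequencies, repeatedly merge the two rarest subtrees, and read each symbol's code off its path from the root.

```go
package main

import (
	"container/heap"
	"fmt"
)

// node is a leaf (carrying a symbol) or an internal node with two children.
type node struct {
	freq        int
	symbol      byte
	left, right *node
}

// nodeHeap is a min-heap of subtrees ordered by total frequency.
type nodeHeap []*node

func (h nodeHeap) Len() int           { return len(h) }
func (h nodeHeap) Less(i, j int) bool { return h[i].freq < h[j].freq }
func (h nodeHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *nodeHeap) Push(x any)        { *h = append(*h, x.(*node)) }
func (h *nodeHeap) Pop() any {
	old := *h
	n := old[len(old)-1]
	*h = old[:len(old)-1]
	return n
}

// buildTree merges the two least frequent subtrees until one tree remains.
func buildTree(freq map[byte]int) *node {
	h := &nodeHeap{}
	for sym, f := range freq {
		heap.Push(h, &node{freq: f, symbol: sym})
	}
	for h.Len() > 1 {
		a := heap.Pop(h).(*node)
		b := heap.Pop(h).(*node)
		heap.Push(h, &node{freq: a.freq + b.freq, left: a, right: b})
	}
	return heap.Pop(h).(*node)
}

// buildCodes records the 0/1 path to every leaf as that symbol's code.
func buildCodes(n *node, prefix string, table map[byte]string) {
	if n.left == nil && n.right == nil {
		table[n.symbol] = prefix
		return
	}
	buildCodes(n.left, prefix+"0", table)
	buildCodes(n.right, prefix+"1", table)
}

func main() {
	input := "abracadabra" // any sample with at least two distinct bytes
	freq := map[byte]int{}
	for i := 0; i < len(input); i++ {
		freq[input[i]]++
	}
	table := map[byte]string{}
	buildCodes(buildTree(freq), "", table)

	bits := 0
	for sym, code := range table {
		fmt.Printf("%q -> %s\n", sym, code)
		bits += freq[sym] * len(code)
	}
	fmt.Printf("encoded: %d bits vs %d bits plain\n", bits, 8*len(input))
}
```

Frequent symbols end up with short codes and rare symbols with long ones, which is the entire trick; note that the resulting tree depends only on the frequency table, never on the order of the characters.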
Performance and Space Comparison
I conducted tests on various input sizes to compare the speed and space efficiency of the different encoding techniques. In the tables below, the compression ratio is the encoded size divided by the plain 8-bits-per-character size, so lower is better. Here are the results:
Testing with Input Size: 100 Characters
String: qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrttttttttttyyyyyyyyyyuuuuuuuuuuiiiiiiiiiioooooooooopppppppppp
| Encoding Technique | Time (µs) | Encoded Size (bits) | Compression Ratio |
|---|---|---|---|
| Simple Encoding | 107.458 | 800 | 1.00000 |
| Huffman Encoding | 35.209 | 340 | 0.42500 |
| BWT + Huffman Encoding | 50.125 | 354 | 0.44250 |
| RLE + Huffman Encoding | 19.542 | 84 | 0.10500 |
| BWT + RLE + Huffman Encoding | 44.042 | 135 | 0.16875 |
| BWT + MTF + RLE + Huffman Encoding | 97.583 | 126 | 0.15750 |
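A quick sanity check on the Huffman row: the string has 10 distinct symbols, each appearing 10 times, and a Huffman tree over 10 equally frequent symbols assigns six 3-bit and four 4-bit codes, so the encoded size is (6 × 3 + 4 × 4) × 10 = 340 bits, exactly the value in the table. RLE + Huffman wins here because the ten runs of ten collapse into just ten (count, symbol) pairs before Huffman ever sees the data. Here is a minimal Go sketch of that run-length step, assuming runs are emitted as (count, symbol) byte pairs; the encoder used in my tests may differ in detail.

```go
package main

import "fmt"

// rle collapses each run of equal bytes into a (count, symbol) pair.
// Counts are stored in a single byte, so runs are capped at 255.
func rle(s string) []byte {
	var out []byte
	for i := 0; i < len(s); {
		j := i
		for j < len(s) && s[j] == s[i] && j-i < 255 {
			j++
		}
		out = append(out, byte(j-i), s[i])
		i = j
	}
	return out
}

func main() {
	s := "qqqqqqqqqqwwwwwwwwww" // two runs of ten
	fmt.Printf("%d bytes -> %d bytes: %v\n", len(s), len(rle(s)), rle(s))
}
```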
Testing with Input Size: 100 Characters
String: abababababababababababababababababababababababababababababababababababababababababababababababababab
| Encoding Technique | Time (µs) | Encoded Size (bits) | Compression Ratio |
|---|---|---|---|
| Simple Encoding | 50.083 | 800 | 1.00000 |
| Huffman Encoding | 11.833 | 100 | 0.12500 |
| BWT + Huffman Encoding | 39.833 | 152 | 0.19000 |
| RLE + Huffman Encoding | 45.042 | 300 | 0.37500 |
| BWT + RLE + Huffman Encoding | 31.000 | 20 | 0.02500 |
| BWT + MTF + RLE + Huffman Encoding | 68.708 | 12 | 0.01500 |
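The last row is striking: 12 bits for a 100-character string. The pipeline works because BWT sorts the rotations of the input so that equal characters cluster into long runs, MTF then turns those runs into streams of mostly zeros, RLE collapses the zeros, and Huffman squeezes what is left. Below is a naive Go sketch of the two transforms, assuming a '$' sentinel byte that never appears in the input; production BWT implementations use suffix arrays instead of materializing and sorting every rotation.

```go
package main

import (
	"fmt"
	"sort"
)

// bwt returns the last column of the sorted rotation matrix of s + "$".
// The '$' sentinel marks where the original string ended.
func bwt(s string) string {
	s += "$"
	rots := make([]string, len(s))
	for i := range rots {
		rots[i] = s[i:] + s[:i]
	}
	sort.Strings(rots)
	last := make([]byte, len(s))
	for i, r := range rots {
		last[i] = r[len(r)-1]
	}
	return string(last)
}

// mtf replaces each byte with its current index in a self-organizing list,
// so recently seen symbols map to small numbers and runs map to zeros.
func mtf(s string) []byte {
	var alphabet [256]byte
	for i := range alphabet {
		alphabet[i] = byte(i)
	}
	out := make([]byte, len(s))
	for i := 0; i < len(s); i++ {
		idx := 0
		for alphabet[idx] != s[i] {
			idx++
		}
		out[i] = byte(idx)
		copy(alphabet[1:idx+1], alphabet[:idx]) // shift the front right by one
		alphabet[0] = s[i]                      // move the symbol to the front
	}
	return out
}

func main() {
	s := "abababababababab"
	b := bwt(s)
	fmt.Printf("BWT: %q\n", b)      // equal characters are grouped into runs
	fmt.Printf("MTF: %v\n", mtf(b)) // runs become mostly small numbers
}
```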
Additional Tests
I ran further tests with inputs of 26, 52, 1, 99, and 6 characters. The results were consistent: simple encoding kept the original size, while Huffman encoding and its combinations reduced it significantly, especially for repetitive data.
Analysis
From the results, it is evident that:
- Efficiency of encoding techniques: Huffman encoding, combined with techniques like RLE and BWT, can drastically shrink repetitive data; RLE combined with Huffman encoding achieved a compression ratio of 0.00379 on highly repetitive data.
- Performance trade-offs: although Huffman encoding wins on encoded size, it can be slower than simple encoding for small inputs, and encoding time varies considerably with the structure of the input.
- Optimal use cases: the choice of technique should follow the shape of the data. Plain Huffman encoding holds up best on scrambled data without long runs, while repetitive data benefits most from putting RLE (and the BWT + MTF pipeline) in front of Huffman encoding.
Conclusion
This investigation into data compression has been enlightening. The interplay between different encoding techniques reveals the complexity and efficiency of data handling in modern computing. The ability to visualize the Huffman tree using the Go library I discovered added a fascinating layer to my understanding.
Moving forward, I plan to delve deeper into JPEG compression and explore how images, stored as pixel matrices, are compressed. This journey into data compression has not only satisfied my curiosity but also opened new avenues for exploration in data science and computer science.