ml3m's journey

A Study That Started from Curiosity About How Data Is Compressed

Huffman Encoding Study

How is data compressed, and what does compression actually involve? Which computations are performed, which algorithms are at work, and what happens inside the box that big data enters so that we can access it in a lighter form?

My curiosity was piqued while watching ThePrimeAgen's Project: DOOM, streamed over SSH on Twitch. Watching him cut traffic costs by implementing his own encoding techniques sparked a deep interest in understanding data compression.

Exploring Compression Techniques

I began my exploration with Huffman encoding and discovered that it works best when combined with other techniques, such as:

- Run-Length Encoding (RLE)
- Burrows-Wheeler Transform (BWT)
- Move-To-Front (MTF) encoding

Through my tests, I noted that plain Huffman encoding is the most sensible choice for scrambled data with no long runs, since it exploits only symbol frequencies; the run-based techniques above need repetition before they pay off.
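
To make the mechanics concrete, here is a minimal Go sketch of Huffman coding (an illustrative toy, not the implementation from my repository): count symbol frequencies, keep the nodes in a min-heap, repeatedly merge the two least frequent nodes, and read each symbol's code off its root-to-leaf path.

```go
package main

import (
	"container/heap"
	"fmt"
)

// node is one vertex of the Huffman tree: leaves hold a symbol,
// internal nodes hold the combined frequency of their children.
type node struct {
	freq        int
	sym         byte
	left, right *node
}

// pq is a min-heap of nodes ordered by frequency.
type pq []*node

func (p pq) Len() int           { return len(p) }
func (p pq) Less(i, j int) bool { return p[i].freq < p[j].freq }
func (p pq) Swap(i, j int)      { p[i], p[j] = p[j], p[i] }
func (p *pq) Push(x any)        { *p = append(*p, x.(*node)) }
func (p *pq) Pop() any {
	old := *p
	n := old[len(old)-1]
	*p = old[:len(old)-1]
	return n
}

// build counts symbol frequencies, then repeatedly merges the two
// least frequent nodes until a single root remains.
func build(data []byte) *node {
	freq := map[byte]int{}
	for _, b := range data {
		freq[b]++
	}
	h := pq{}
	for sym, f := range freq {
		h = append(h, &node{freq: f, sym: sym})
	}
	heap.Init(&h)
	for h.Len() > 1 {
		a := heap.Pop(&h).(*node)
		b := heap.Pop(&h).(*node)
		heap.Push(&h, &node{freq: a.freq + b.freq, left: a, right: b})
	}
	return heap.Pop(&h).(*node)
}

// codes reads each symbol's code off its root-to-leaf path:
// 0 for a left edge, 1 for a right edge.
func codes(n *node, prefix string, out map[byte]string) {
	if n.left == nil && n.right == nil {
		out[n.sym] = prefix
		return
	}
	codes(n.left, prefix+"0", out)
	codes(n.right, prefix+"1", out)
}

func main() {
	table := map[byte]string{}
	codes(build([]byte("abracadabra")), "", table)
	fmt.Println(table) // 'a', the most frequent symbol, gets the shortest code
}
```

Frequent symbols sit near the root and get short codes; that asymmetry is the entire source of the compression.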

Performance and Space Comparison

I conducted tests on various input sizes to compare the speed and space efficiency of the different encoding techniques. The compression ratio is the encoded size in bits divided by the original size at 8 bits per character; for example, a 100-character input encoded in 340 bits gives 340 / 800 = 0.425. Below are the results:

Testing with Input Size: 100 Characters

String: qqqqqqqqqqwwwwwwwwwweeeeeeeeeerrrrrrrrrrttttttttttyyyyyyyyyyuuuuuuuuuuiiiiiiiiiioooooooooopppppppppp

| Encoding Technique | Time (µs) | Encoded Size (bits) | Compression Ratio |
| --- | --- | --- | --- |
| Simple Encoding | 107.458 | 800 | 1.00000 |
| Huffman Encoding | 35.209 | 340 | 0.42500 |
| BWT + Huffman Encoding | 50.125 | 354 | 0.44250 |
| RLE + Huffman Encoding | 19.542 | 84 | 0.10500 |
| BWT + RLE + Huffman Encoding | 44.042 | 135 | 0.16875 |
| BWT + MTF + RLE + Huffman Encoding | 97.583 | 126 | 0.15750 |
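
This first string is ten runs of ten identical characters, which is exactly the shape run-length encoding is built for: each run collapses to a (count, symbol) pair before Huffman ever sees it, which is why RLE + Huffman wins this round. A minimal sketch, assuming run lengths are capped to fit in one byte:

```go
package main

import "fmt"

// rleEncode collapses each run into a (count, symbol) pair. Runs
// longer than 255 are split so the count always fits in one byte.
func rleEncode(data []byte) []byte {
	var out []byte
	for i := 0; i < len(data); {
		j := i
		for j < len(data) && data[j] == data[i] && j-i < 255 {
			j++
		}
		out = append(out, byte(j-i), data[i])
		i = j
	}
	return out
}

func main() {
	// Two ten-character runs collapse to two pairs (4 bytes total).
	fmt.Println(rleEncode([]byte("qqqqqqqqqqwwwwwwwwww")))
	// Output: [10 113 10 119]
}
```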

Testing with Input Size: 100 Characters

String: abababababababababababababababababababababababababababababababababababababababababababababababababab

| Encoding Technique | Time (µs) | Encoded Size (bits) | Compression Ratio |
| --- | --- | --- | --- |
| Simple Encoding | 50.083 | 800 | 1.00000 |
| Huffman Encoding | 11.833 | 100 | 0.12500 |
| BWT + Huffman Encoding | 39.833 | 152 | 0.19000 |
| RLE + Huffman Encoding | 45.042 | 300 | 0.37500 |
| BWT + RLE + Huffman Encoding | 31.000 | 20 | 0.02500 |
| BWT + MTF + RLE + Huffman Encoding | 68.708 | 12 | 0.01500 |
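
This string has no runs at all, so RLE alone gains little, but BWT manufactures the runs: sorting every rotation of the string groups identical characters together in the last column, and MTF then turns that locality into a stream of small numbers that Huffman can code cheaply. A naive sketch of both transforms (the O(n² log n) rotation sort is for illustration only; real implementations use suffix arrays):

```go
package main

import (
	"fmt"
	"sort"
)

// bwt returns the last column of the sorted rotation matrix plus the
// row index of the original string (needed to invert the transform;
// with repeated rotations, as here, the index is ambiguous, which is
// fine for this demo).
func bwt(s string) (string, int) {
	n := len(s)
	rot := make([]string, n)
	for i := 0; i < n; i++ {
		rot[i] = s[i:] + s[:i]
	}
	sort.Strings(rot)
	last := make([]byte, n)
	idx := 0
	for i, r := range rot {
		last[i] = r[n-1]
		if r == s {
			idx = i
		}
	}
	return string(last), idx
}

// mtf replaces each byte with its position in a recency list, so
// recently seen symbols become small numbers (cheap for Huffman).
func mtf(data []byte) []byte {
	alphabet := make([]byte, 256)
	for i := range alphabet {
		alphabet[i] = byte(i)
	}
	out := make([]byte, len(data))
	for k, b := range data {
		pos := 0
		for alphabet[pos] != b {
			pos++
		}
		out[k] = byte(pos)
		copy(alphabet[1:pos+1], alphabet[:pos]) // shift right, then...
		alphabet[0] = b                         // ...move b to the front
	}
	return out
}

func main() {
	t, idx := bwt("abababab")
	fmt.Println(t, idx)         // "bbbbaaaa": two long runs appear
	fmt.Println(mtf([]byte(t))) // [98 0 0 0 98 0 0 0]: mostly zeros
}
```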

Additional Tests

I conducted further tests with varying input sizes, including strings of 26, 52, 1, 99, and 6 characters. The results consistently showed that while simple encoding maintained the original size, Huffman encoding and its combinations significantly reduced the size, especially for repetitive data.

Analysis

From the results, it is evident that:

  1. Efficiency of Encoding Techniques: Huffman encoding, when combined with other techniques like RLE and BWT, can drastically reduce the size of data, especially for repetitive patterns. For instance, RLE combined with Huffman encoding achieved a compression ratio of 0.00379 for highly repetitive data.

  2. Performance Trade-offs: While Huffman encoding is efficient in terms of size, it can be slower than simple encoding for small inputs, and encoding time varies significantly with the structure of the input data. A minimal harness for reproducing these measurements is sketched after this list.

  3. Optimal Use Cases: The choice of encoding technique should depend on the nature of the data. For random data, Huffman encoding is more effective, while for repetitive data, RLE combined with Huffman encoding yields better results.
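
To make the measurement methodology concrete, here is a minimal Go sketch of a timing-and-ratio harness. The measure helper is my own illustrative construction, not the harness behind the tables above; in particular it assumes byte-aligned output, while the tables count raw bits.

```go
package main

import (
	"fmt"
	"time"
)

// measure reports wall-clock time and compression ratio for one
// encoder, mirroring the columns in the tables above. For inputs this
// small, a real comparison should average many runs (e.g. testing.B).
func measure(name string, input []byte, encode func([]byte) []byte) {
	start := time.Now()
	out := encode(input)
	elapsed := time.Since(start)
	ratio := float64(len(out)) / float64(len(input))
	fmt.Printf("%-20s %12v %8d bits %9.5f\n", name, elapsed, len(out)*8, ratio)
}

func main() {
	input := []byte("qqqqqqqqqqwwwwwwwwww")
	// The identity function stands in for "simple encoding"; plug in
	// real encoders (such as the sketches above) to compare pipelines.
	measure("simple encoding", input, func(b []byte) []byte { return b })
}
```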

Conclusion

This investigation into data compression has been enlightening. The interplay between different encoding techniques reveals the complexity and efficiency of data handling in modern computing. The ability to visualize the Huffman tree using the Go library I discovered added a fascinating layer to my understanding.

Moving forward, I plan to delve deeper into JPEG compression and explore how photo matrices are compressed. This journey into data compression has not only satisfied my curiosity but has also opened new avenues for exploration in the realm of data science and computer science.

Tree Photo

GitHub Repository