Where would a beginner start with data compression? What are some good books for it?
Mostly the title. I have experience with Python, and I was thinking of learning more about data compression. How should I proceed? And what are some good books I could read, covering both the specifics and the abstract concepts of data compression, data management, and data in general?
First thing to learn would be Huffman coding. Don't know where to go after that though.
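For anyone who wants to see what Huffman coding looks like in Python, here is a minimal sketch. It assumes the input is a string and that codes can be returned as strings of "0" and "1" rather than packed bits; the function name `huffman_codes` is made up for illustration. It builds the code table by repeatedly merging the two least frequent subtrees.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a {symbol: bitstring} table built with Huffman's algorithm."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}

    # Heap entries are (weight, tiebreak, {symbol: code_so_far}).
    # The unique tiebreak integer keeps the dicts from ever being compared.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)

    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1

    return heap[0][2]
```

Calling `huffman_codes("abracadabra")` gives the frequent symbol "a" a one-bit code and the rarer symbols three-bit codes, so the 11 characters take 23 bits in total. A real compressor would also need to store the table (or the frequencies) and pack the bits into bytes.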
The simplest place to start with data compression would be run-length encoding, which is essentially the most basic type of compression imaginable. After that, look into how some of the more popular compression formats like .zip work.
Sure, thanks! Do you know of any long-term roadmaps on this?
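On the run-length encoding suggestion above: a rough Python sketch, assuming the input is a `bytes` object and that a list of (count, byte) pairs is an acceptable output format. The names `rle_encode` and `rle_decode` are just illustrative; real formats pack the pairs into bytes and cap the run length.

```python
from itertools import groupby


def rle_encode(data):
    """Collapse each run of identical bytes into a (count, byte_value) pair."""
    return [(len(list(group)), value) for value, group in groupby(data)]


def rle_decode(pairs):
    """Expand (count, byte_value) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for count, value in pairs)
```

For example, `rle_encode(b"aaabccdddd")` returns `[(3, 97), (1, 98), (2, 99), (4, 100)]`, and `rle_decode` turns that back into the original bytes. Note that input without long runs gets bigger, not smaller, which is a good first lesson about compression in general.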
My local library has Sayood's Introduction to Data Compression and Salomon/Motta's Handbook of Data Compression, both of which I would recommend. Sayood is a great medium-paced read that's usable as a self-taught course, although it's a little heavy on the mathematics. Salomon/Motta is an awfully dry reference that is unusable for learning the basics, but it's an incredible overview of compression strategies and has a section for basically every algorithm in common use.
The second step after RLE (which @tesseractcat mentioned) is LZW. The algorithm is simple and any tutorial should help; here's one, and this is the one I used when I was interested in data compression.
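To give a feel for how simple LZW is, here is a sketch of the textbook version in Python. It assumes byte input, an unbounded dictionary, and output as a plain list of integer codes (real implementations such as GIF use variable-width codes and reset the table). The function names are illustrative only.

```python
def lzw_compress(data):
    """Encode bytes as a list of integer codes (textbook LZW, unbounded table)."""
    table = {bytes([i]): i for i in range(256)}
    current = b""
    out = []
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate          # keep extending the match
        else:
            out.append(table[current])   # emit the longest known prefix
            table[candidate] = len(table)
            current = bytes([value])
    if current:
        out.append(table[current])
    return out


def lzw_decompress(codes):
    """Rebuild the bytes by growing the same table the encoder grew."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the entry being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)
```

Compressing the classic test string `b"TOBEORNOTTOBEORTOBEORNOT"` yields 16 codes for 24 input bytes, and `lzw_decompress` recovers the input without ever being sent the table.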
Then you can branch pretty much anywhere; it's not simple after that (at least it wasn't for me). Somebody mentioned Huffman coding here, and you can also try reading some arithmetic coding implementations. Matt Mahoney has a good book that is free online, called Data Compression Explained. His whole website about data compression is a gold mine. Make sure you read it, and learn from the common pitfalls and myths about compression too.
Make a .zip parser! Then move on to .tar, .gz, .xz, etc.
Do you know of any books that might explain data compression in a more abstract way? Ones that explain the concepts that go into data compression?
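On the .zip parser idea above, a possible first step is just listing the archive's entries by hand with `struct`. This is a sketch under several assumptions: the whole archive fits in memory, it is not ZIP64 or multi-disk, the archive comment does not happen to contain the end-of-central-directory signature, and names are decoded as UTF-8 (the format defaults to CP437 unless a flag bit says otherwise). The function name `list_zip_entries` is made up.

```python
import struct

# Fixed-size parts of the two records, little-endian.
EOCD = struct.Struct("<4sHHHHIIH")           # end of central directory, 22 bytes
CDH = struct.Struct("<4sHHHHHHIIIHHHHHII")   # central directory header, 46 bytes


def list_zip_entries(blob):
    """Return (name, method, compressed_size, uncompressed_size) per entry."""
    eocd_pos = blob.rfind(b"PK\x05\x06")
    if eocd_pos < 0:
        raise ValueError("end of central directory record not found")
    _, _, _, _, total, _cd_size, cd_offset, _ = EOCD.unpack_from(blob, eocd_pos)

    entries = []
    pos = cd_offset
    for _ in range(total):
        fields = CDH.unpack_from(blob, pos)
        if fields[0] != b"PK\x01\x02":
            raise ValueError("bad central directory signature")
        method, csize, usize = fields[4], fields[8], fields[9]
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        name_start = pos + CDH.size
        name = blob[name_start:name_start + name_len].decode("utf-8", "replace")
        entries.append((name, method, csize, usize))
        # Variable-length name, extra field and comment follow the fixed part.
        pos = name_start + name_len + extra_len + comment_len
    return entries
```

Feeding it the bytes of an archive made with Python's `zipfile` module is an easy way to check it; compression method 8 in the output means DEFLATE, which is the natural next thing to implement.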
I've never read books on it but I know some good resources to get started:
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
https://www.youtube.com/watch?v=JsTptu56GM8
https://www.youtube.com/watch?v=goOa3DGezUA
https://www.youtube.com/watch?v=9rIy0xY99a0
If you want the theory behind compression, you're looking for information theory. I'm a fan of Pierce's book on Information Theory but there are a lot of excellent textbooks to choose from.
A grounding in information theory will help you understand how compression works and what its limitations are. Depending on the book you choose, you will probably come to see a lot of different approaches to compression.
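One concrete taste of those limitations is Shannon entropy. The sketch below assumes byte input and treats each byte as independent of its neighbours (an order-0 model); the function name is illustrative. It estimates the average number of bits per byte that any coder working one symbol at a time, such as Huffman, cannot beat.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(data):
    """Order-0 Shannon entropy of a byte string, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())
```

For `b"abracadabra"` this comes out at about 2.04 bits per symbol, so roughly 22.4 bits for the 11 characters. Keep in mind the independence assumption: `b"abab" * 1000` scores a full 1 bit per symbol here, yet it is obviously very compressible once a model looks at context, which is what the more advanced algorithms do.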
Colt McAnlis did a series called Compressor Head that has great explanations of a bunch of basic compression algorithms. He also wrote a book but honestly start with the videos, they're great.