Programming Challenge: Text compression
In an effort to make these weekly, I present a new programming challenge.
The challenge this week is to compress some text using a prefix code. Prefix codes associate each letter with a given bit string, such that no encoded bitstring is the prefix of any other. These bit strings are then concatenated into one long integer which is separated into bytes for ease of reading. These bytes can be represented as hex values as well. The provided prefix encoding is as follows:
char | value | char | value |
---|---|---|---|
' ' | 11 | 'e' | 101 |
't' | 1001 | 'o' | 10001 |
'n' | 10000 | 'a' | 011 |
's' | 0101 | 'i' | 01001 |
'r' | 01000 | 'h' | 0011 |
'd' | 00101 | 'l' | 001001 |
'~' | 001000 | 'u' | 00011 |
'c' | 000101 | 'f' | 000100 |
'm' | 000011 | 'p' | 0000101 |
'g' | 0000100 | 'w' | 0000011 |
'b' | 0000010 | 'y' | 0000001 |
'v' | 00000001 | 'j' | 000000001 |
'k' | 0000000001 | 'x' | 00000000001 |
'q' | 000000000001 | 'z' | 000000000000 |
Challenge
Your program should accept a lowercase string (including the ~ character), and should output the formatted compressed bit string in binary and hex. Your final byte should be 0 padded so that it has 8 bits as required. For your convenience, here is the above table in a text file for easy read-in.
Example
Here is an example:
$> tildes ~comp
10010100 10010010 01011010 10111001 00000010 11000100 00110000 10100000
94 92 5A B9 02 C4 30 A0
Bonuses
- Print the data compression ratio for a given compression, assuming the original input was encoded in 8 bit ASCII (one byte per character).
2. Output the ASCII string corresponding to the encoded byte string in addition to the above outputs. - @onyxleopard points out that many bytes won't actually be valid ASCII. Instead, do as they suggested and treat each byte as an ordinal value and print it as if encoded as UTF-8.
- An input prefixed by 'D' should be interpreted as an already compressed string using this encoding, and should be decompressed (by inverting the above procedure).
Previous Challenges (I am aware of prior existing ones, but it is hard to collect them as they were irregular. Thus I list last week's challenge as 'Week 1')
Week 1
The encoded bytes are not guaranteed to be valid ASCII… E.g., in your example, there are only 3 bytes of 8 that are in the ordinal range [0..128):
For the second bonus, I decided to treat each byte as an ordinal value and print it as if encoded as UTF-8.
Python3 solution
Edit: The unicode thing is probably really bad if you end up printing control characters that have particular semantics depending on your terminal. (I should probably remove that…)
Good catch here, I wrote up the bonuses a little quickly so that I could post before the work day started. I'll fix this now. And nice solution btw :)
Thanks. I always like to try doing a problem myself before posting it—even things that seem simple can be harder than you think if you’ve never done them before. Text-encoding is always harder/hairier than it should be.
That's really nice, thanks for this challenge. I started learning Rust, so this is great opportunity to try it out!
I still don't have the decoding bonus, but maybe I'll finish it later.
Out of curiosity, I piped The Sorrows of Young Werther (without unsupported characters, which took it from 251K to 239K) into this encoder and it saved 115577 bytes (52.707966%). I'm surprised how good this encoding is, @gpl!
This encoding is taking into account relative frequency of the English alphabet in English (e.g. common words like ’the’ and 'a' get compressed rather well—uncommon words like 'syzygy' don’t.
Also, if I’m reading your code correctly (I’m not fluent in Rust), it looks like in your
create_bytes
function you’re throwing away invalid characters. Lossy compression is fine and all, but for text, it’s probably not going to be excusable if you decompress a novel and all the punctuation is missing ;P.This program uses the default output format in C++. This has the side effect that the program will happily print out complete garbage every time it encodes a line.
I did something special-ish for decoding by using a trie to deduce letters from the input, but other than that (and abusing the char to int correspondence), it's mostly normal stuff.
Does 3(?)/3 bonuses
C++