
Wednesday, April 2, 2025

Compression, entropy and polymorphism

The Modexp blog has a great collection of compression algorithms from 8-bit computers, demos, and viruses. I noticed that most of them are variations on the Lempel-Ziv theme. This raised a few questions for me. Is it possible to make compression "polymorphic", so that it would be impossible to create a signature for the compressed data itself? And can the same algorithms be used for the opposite task, entropy normalization? (Compressed text has high entropy, and antivirus software often uses entropy as an indicator that a file is compressed and requires deeper analysis.)

Huffman as per the textbook (the boring part)

The decompression function for the Huffman algorithm is not much more complicated than for Lempel-Ziv. If we already have the code tree, the function fits in two lines:

/* walk the tree, one bit per branch, until we reach a leaf */
for (node = tree; node->left; node = getbit() == 0 ? node->left : node->right)
	;
*out++ = node->value;
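
Both snippets assume bit-level I/O helpers. Here is a minimal sketch of getbit/putbit as I imagine them (the buffer and cursor names are my assumption, not code from the post), MSB first:

        static unsigned char *bitbuf;   /* points at the in/out buffer */
        static size_t bitpos;           /* bit offset into bitbuf */

        static int getbit(void) {
                int b = (bitbuf[bitpos >> 3] >> (7 - (bitpos & 7))) & 1;
                bitpos++;
                return b;
        }

        static void putbit(int b) {     /* assumes the buffer is zeroed */
                if (b)
                        bitbuf[bitpos >> 3] |= 0x80 >> (bitpos & 7);
                bitpos++;
        }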

Since the tree has no nodes with only a left or only a right branch, we can serialize it: a 1-bit for an internal node, and a 0-bit plus the value for a leaf. If our tree has 256 symbols, the serialized form takes (256*9 + 255) / 8 = 320 bytes. And then comes the most boring part: building the code tree as described in any textbook:

        /* count frequencies */
        int i;
        unsigned int freq[256];
        bzero(freq, sizeof(freq));      /* needs <strings.h> */
        for (i = 0; i < l; i++)
                freq[in[i]]++;

        /* make the list of nodes sorted by freq, lowest first
           (nested functions are a GCC extension) */
        node_t *insert(node_t *list, node_t *node) {
                if (list == NULL)
                        return node;
                node_t *p = list, *prev = NULL;
                while (p != NULL) {
                        if (node->freq < p->freq) {
                                if (prev == NULL) {
                                        node->next = list;
                                        list = node;
                                } else {
                                        node->next = prev->next;
                                        prev->next = node;
                                }
                                return list;
                        }
                        prev = p;
                        p = p->next;
                }
                /* largest frequency so far: append at the tail */
                node->next = NULL;
                prev->next = node;
                return list;
        }
        node_t *list = NULL;
        for (i = 0; i < 256; i++)
                if (freq[i])
                        list = insert(list, alloc_node(i, freq[i], NULL, NULL));

        /* make the tree out of the sorted list:
           get L, get R, put <L, R, L.freq + R.freq> */
        node_t *tree = NULL, *left, *right;
        while (list) {
                left = list;
                list = list->next;

                if (list == NULL) {
                        tree = left;    /* the last node is the root */
                        break;
                }
                right = list;
                list = list->next;
                /* insert the new internal node back into the sorted list */
                list = insert(list, alloc_node(0, left->freq + right->freq, left, right));
        }

        /* serialize the tree: 1-bit for a node, 0-bit + 8-bit value for a leaf */
        void save_tree(node_t *n) {
                if (n->left == NULL) {
                        putbit(0);
                        unsigned char s = n->value;
                        for (i = 0; i < 8; i++) {       /* MSB first */
                                putbit(s >> 7);
                                s <<= 1;
                        }
                } else {
                        putbit(1);
                        save_tree(n->left);
                        save_tree(n->right);
                }
        }
        save_tree(tree);
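
For completeness, the matching deserializer just mirrors save_tree. A sketch, assuming the same node_t, alloc_node and getbit as above (frequencies are not needed after decoding, so they are left at zero):

        node_t *load_tree(void) {
                if (getbit() == 1) {            /* 1-bit: internal node */
                        node_t *left = load_tree();
                        node_t *right = load_tree();
                        return alloc_node(0, 0, left, right);
                }
                unsigned char s = 0;            /* 0-bit: leaf, read the value */
                for (int i = 0; i < 8; i++)
                        s = (s << 1) | getbit();
                return alloc_node(s, 0, NULL, NULL);
        }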

Randomized Huffman encoding

The code above is a very naive and inefficient implementation of the classic algorithm. To make it faster, tables should be used, and the tree's maximum depth should be limited (there is both an approximate method, dividing each frequency by two, and an exact algorithm called package-merge). I strongly suggest reading the article on bitstreams by ryg and "Huffman revisited" by Cyan. What matters from an algorithmic perspective, though, is not the actual code value but only its length in bits: \sum 2^{-len} \leq 1 (the Kraft-McMillan inequality). This means we can swap two codes of equal length, and the compression ratio will not change. And this is very easy to do: in the function that builds the tree, we randomly swap the left and right branches:

        node_t *tree = NULL, *left, *right;
        while (list) {
                // ....
                right = list;
                list = list->next;

                /* flip a coin: swapping subtrees keeps every code length,
                   so the compressed size stays exactly the same */
                if (random() > (RAND_MAX / 2)) {
                        node_t *tmp = left;
                        left = right;
                        right = tmp;
                }
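
Decoding needs nothing extra, since the serialized tree already records the swapped structure. To reproduce a particular encoding for testing, the generator can be seeded and the seed stored alongside the output, the same trick the entropy section below relies on; a minimal sketch:

        /* needs <stdlib.h> and <time.h> */
        unsigned int seed = (unsigned int)time(NULL);   /* or any stored value */
        srandom(seed);  /* replaying with the same seed gives the same tree */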

The same compression, different encoding:

$ ./a.out "tell me o muse of that ingenious hero who travelled far and wide"
Input:
tell me o muse of that ingenious hero who travelled far and wide
Len: 64
Compressed len: 53
y..e.&.......s.L..3....3>...*~_..t]..R..g....L.oq.*.h
....Y.U...;..f]-...Y...W~K..`.O.5p.P.F....g.m..?6~..`
s,.a..I...3..v7...f6.g]...\..%.j..z.....\.\.:.`Y.(n..
..sY.... .....YV..n.....hq:......-..k.+..g.+....*.Y..
...Y....l,....YW+.i.,......k..0....f.9.....J."....{.y
....Y.U...u..o..m.s...._........?P....T.M.M.x..;.j.v.
|.f.Ns..7..9.A.'#.i7M.B.l....Z..A.....o..W#)........Y
s,.a......&..g;gCfm3Ng\...T..%vc.EK.-...T...;..i.(h..
.*.r-....H...w..5.s...O.."\..%.{..J.-.Q...\.*Gpy.8H..
.........H...g..5.u...O}..\....x..:...pZ....J.s..X...
y.....i. ..........2.F....21I......tK...2}...0.....I.
.*.r-.E..H...g....m......BT.2%nr.F{..c1...T.+..Y.8N..
...[l.....k=.@.V.)n.....,..!.Z...=.nj...#_...(..Y.9.{
~g;....s:......g#.n4..F..p.ZQ...[..&h......j.c....y.;
y..2.3....cl.u....w....wz/~....Q.`....X.~...^..-.L.[.
|.s.L... ..y...&...2.F..hY2.....y(....*.2u2.....?..i.

Inverting compression

For a long time, antivirus programs have used heuristics to look for encrypted or compressed fragments in files. These heuristics can be used, for example, to enable emulation and wait for the unpacking procedure to complete. This led me to the question of how to reduce the entropy of encrypted data. One obvious method is base32: since every five bits of payload are effectively padded with three zero bits, the entropy cannot exceed five bits per byte.
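
As a quick illustration (a sketch of mine, not code from the post), spreading the ciphertext over 5-bit groups, one group per output byte, caps the per-byte entropy at five bits:

        /* out must hold at least (n * 8 + 4) / 5 bytes */
        size_t expand5(const unsigned char *src, size_t n, unsigned char *out) {
                size_t i, nbits = n * 8, k = 0;
                for (i = 0; i < nbits; i += 5) {
                        unsigned v = 0;
                        for (size_t j = i; j < i + 5 && j < nbits; j++)
                                v = (v << 1) | ((src[j >> 3] >> (7 - (j & 7))) & 1);
                        out[k++] = (unsigned char)v;    /* top 3 bits always zero */
                }
                return k;
        }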

Another option is inverse compression. We take the encrypted text and decode it using the Huffman algorithm; the entropy will decrease. What the "unpacked" text looks like depends on the frequency table, which would normally be computed during compression (if the text had actually been compressed). However, that would require storing the table somewhere, which is not very elegant. Fortunately, there is an algorithm for generating a random set of probabilities with a specified entropy (Peter Swaszek, Sidharth Wali, "Generating probabilities with a specified entropy"):

H_{n-1} = \frac{H_n + p \times \log_2 p + (1 - p) \times \log_2 (1 - p)}{1 - p}

Since we know H_n (the entropy value we are aiming for), we need to find the roots of this equation. From them, we get ranges from which we can randomly select the probability p_i. To solve the equation, I first find the minimum of the function and then locate the roots by bisection. Since the sum of all probabilities cannot exceed one, we scale the obtained values so they form a simplex:

p_k = p_k \times (1 - \sum_{i=k+1}^{n}p_i)
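
The root-finding step might look like this. A sketch under my assumptions: g is the feasibility boundary obtained from the recursion above (the remaining n-1 symbols must still be able to reach the required entropy, i.e. H_{n-1} \leq \log_2(n-1)), and we bisect on an interval where g changes sign:

        #include <math.h>       /* link with -lm */

        /* boundary H_n - h(p) == (1 - p) * log2(n - 1); valid for 0 < p < 1 */
        static double g(double p, double H, int n) {
                double q = 1.0 - p;
                return H + p * log2(p) + q * log2(q) - q * log2((double)(n - 1));
        }

        /* plain bisection; assumes g(lo) and g(hi) differ in sign */
        static double bisect(double lo, double hi, double H, int n) {
                for (int i = 0; i < 64; i++) {
                        double mid = (lo + hi) / 2;
                        if ((g(lo, H, n) < 0) == (g(mid, H, n) < 0))
                                lo = mid;
                        else
                                hi = mid;
                }
                return (lo + hi) / 2;
        }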

Once we have the vector of probabilities, we use the classic Huffman algorithm: sort the probabilities in ascending order, build the code tree, and instead of compressing the text, decompress it. Naturally, its length will increase, but in terms of entropy and symbol frequencies it will correspond to the random model. To decode this text, we will need either the saved frequency table or the seed of the pseudorandom generator used to produce it; the text is then simply compressed back with the standard algorithm.
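
To feed the generated probabilities into the frequency-based tree builder from the first section, they have to become integers first. A minimal sketch (the scale factor 65536 and the vector name p are my choices, not from the post):

        unsigned int freq[256];
        for (int i = 0; i < 256; i++)   /* +1 keeps symbols whose p[i] rounds to 0 */
                freq[i] = (unsigned int)(p[i] * 65536.0) + 1;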

We set the target entropy to 4.1:

Simplex entropy 4.100000
Input:
.Y.mD.!.....4..&e......H...A..Ex
P...#.cn...........V.....nz...`F
....&n.`........]x@..j......]XK.
[..CO...........N...{...&....=..
..`......Q...Hn..w..8.....[..8.&
....I5.LN.+O_.k/.n...,.,.....|..
....l.bd..'..DPY....L4.....?Q...
;.....8.^bY...2..e../....^.8.C..
Input size: 4096
Input entropy 7.949748
Output:
.............'..................
.F.......h..5.....-.............
...........{.....4..............
.g...........................]..
.....N............K.............
.....................%..........
.....Y.\..........1.=...........
.......$....................y...
Output size: 7592
Output entropy 4.290585
Compressed size: 4096
Decoded successfully

Not exact, but close enough. Alternatively, we could train the model on plain text (compile with -DTEXT):

Simplex entropy 4.376084
Input:
- ...}P...+....k.~..i..7.......H
....jXUL......z.....E..6..7Dx\Zp
....p.c.J.$.>.3....5..k-.D.q...z
8c.."oHx..p.b....)..............
..r....Ka._...a..*.D....... ..`.
j...ev._cD ~X..%5Ajce..K........
b.@..?o<R.....9.n....y.[+.v../..
.B.<sB.{.{...,.$ Yc.f.G.=....o..
Input size: 4096
Input entropy 7.958878
Output:
d,eeelIi .,onn,sypei ioyhrPn.fi
rbths   huiei  iielhl rssohhieei
eepypolt.egte-pueo.ne shaouiMen,
thriwecrili lmd, eplheahiciwltwn
eeNs la-twisesgcdhdags"M.sls  cn
o watn i el.r.oee huse nemii ngl
ovbe. ets ieP nere-ilsiWwcdpaois
h ees ags.lfaso.y  noon yeiybcns
Output size: 7449
Output entropy 4.395597
Compressed size: 4096
Decoded successfully

huff-random.c, huff-ent.c on GitHub
