The Modexp blog has a great collection of compression algorithms from 8-bit computers, demos, and viruses. I noticed that most of them are variations on the Lempel-Ziv theme. This raised a few questions for me: Is it possible to make compression "polymorphic", so that it would be impossible to create a signature for the compressed data itself? And can the same algorithms be used for the opposite task, entropy normalization? (Compressed text has high entropy, and antivirus software often uses entropy as an indicator that a file is compressed and requires deeper analysis.)
Huffman as per the textbook (the boring part)
The decompression function for the Huffman algorithm is not much more complicated than for Lempel-Ziv. If we already have the code tree, it fits in two lines:
/* walk the tree bit by bit until we hit a leaf, then emit its symbol */
for (node = tree; node->left; node = getbit() == 0 ? node->left : node->right)
    ;
*out++ = node->value;
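Here getbit() (and the putbit() used later for serialization) is whatever bit I/O the codec uses; a minimal MSB-first sketch over in-memory buffers, with names and globals that are my assumptions rather than part of the original code:
static const unsigned char *bits_in;  /* current input byte */
static int in_pos;                    /* bits already consumed from *bits_in */
static unsigned char *bits_out;       /* current output byte; buffer zeroed */
static int out_pos;

int getbit(void) {
    int b = (*bits_in >> (7 - in_pos)) & 1;
    if (++in_pos == 8) { in_pos = 0; bits_in++; }
    return b;
}

void putbit(int b) {
    if (b) *bits_out |= 0x80 >> out_pos;
    if (++out_pos == 8) { out_pos = 0; bits_out++; }
}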
Since the tree has no nodes with only a left or only a right branch, we can serialize it compactly: a 1-bit for an internal node, and a 0-bit followed by the 8-bit value for a leaf. If our tree has 256 symbols, the serialized form takes (256*9 + 255) / 8 = 320 bytes. And then comes the most boring part: building the code tree as described in any textbook:
/* count frequencies */
int i, j;
unsigned int freq[256];
bzero(freq, sizeof(freq));
for (i = 0; i < l; i++)
freq[in[i]]++;
/* make the list of nodes sorted by freq, lowest first */
node_t *insert(node_t *list, node_t *node) {
if (list == NULL) {
    node->next = NULL;
    return node;
}
node_t *p = list, *prev = NULL;
while (p != NULL) {
if (node->freq < p->freq) {
if (prev == NULL) {
node->next = list;
list = node;
} else {
node->next = prev->next;
prev->next = node;
}
return list;
}
prev = p;
p = p->next;
}
node->next = NULL;
prev->next = node;
return list;
}
node_t *list = NULL;
for (i = 0; i < 256; i++)
if (freq[i])
list = insert(list, alloc_node(i, freq[i], NULL, NULL));
/* make the tree out of sorted list */
/* get L, get R, put <L, R, L.freq + R.freq> */
node_t *tree = NULL, *left, *right;
while (list) {
left = list;
list = list->next;
if (list == NULL) {
tree = left;
break;
}
right = list;
list = list->next;
/* insert new node */
list = insert(list, alloc_node(0, left->freq + right->freq, left, right));
}
/* serialize the tree */
void save_tree(node_t *n) {
    if (n->left == NULL) {
        /* leaf: 0-bit, then the 8 value bits, MSB first */
        int i;
        putbit(0);
        unsigned char s = n->value;
        for (i = 0; i < 8; i++) {
            putbit(s >> 7);
            s <<= 1;
        }
    } else {
        /* internal node: 1-bit, then both subtrees, left first */
        putbit(1);
        save_tree(n->left);
        save_tree(n->right);
    }
}
save_tree(tree);
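For completeness, the decoder can rebuild the tree by replaying the same recursion; a sketch, assuming the getbit() reader and the alloc_node(value, freq, left, right) allocator used above:
/* mirror of save_tree(): 1-bit = internal node (recurse twice),
   0-bit = leaf followed by 8 value bits, MSB first */
node_t *load_tree(void) {
    if (getbit()) {
        node_t *l = load_tree();   /* left subtree is serialized first */
        node_t *r = load_tree();
        return alloc_node(0, 0, l, r);
    }
    int i;
    unsigned char v = 0;
    for (i = 0; i < 8; i++)
        v = (v << 1) | getbit();
    return alloc_node(v, 0, NULL, NULL);
}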
Randomized Huffman encoding
The code above is a very naive and inefficient implementation of the classic algorithm. For it to work faster, tables should be used, and the tree's maximum depth should be limited (for this there is both an approximate method, dividing each frequency by two, and an exact algorithm called package-merge). I strongly suggest reading the article on bitstreams by ryg and Huffman revisited by Cyan. But what's important from an algorithmic perspective is that the compression ratio depends not on the actual code values, only on their lengths in bits: the lengths must satisfy \sum_i 2^{-len_i} \leq 1 (the Kraft-McMillan inequality). This means we can swap two codes of equal length, and the compression ratio will not change. And this can be done very easily: in the function that builds the tree, we randomly swap the left and right branches:
node_t *tree = NULL, *left, *right;
while (list) {
    // ....
    right = list;
    list = list->next;
    /* coin flip: swap the subtrees, changing code values but not lengths */
    if (random() > (RAND_MAX / 2)) {
        node_t *tmp = left;
        left = right;
        right = tmp;
    }
    // ....
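Note that the serialized tree travels with the compressed data, so the decoder never needs to know which branches were swapped; to get a different encoding on every run it is enough to seed the PRNG with something volatile. A sketch, assuming POSIX srandom():
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

void seed_rng(void) {
    /* different seed per run => different (but equally optimal) code tree */
    srandom((unsigned)time(NULL) ^ (unsigned)getpid());
}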
The same compression, different encoding:
$ ./a.out "tell me o muse of that ingenious hero who travelled far and wide"
Input: tell me o muse of that ingenious hero who travelled far and wide
Len: 64
Compressed len: 53
y..e.&.......s.L..3....3>...*~_..t]..R..g....L.oq.*.h
....Y.U...;..f]-...Y...W~K..`.O.5p.P.F....g.m..?6~..`
s,.a..I...3..v7...f6.g]...\..%.j..z.....\.\.:.`Y.(n..
..sY.... .....YV..n.....hq:......-..k.+..g.+....*.Y..
...Y....l,....YW+.i.,......k..0....f.9.....J."....{.y
....Y.U...u..o..m.s...._........?P....T.M.M.x..;.j.v.
|.f.Ns..7..9.A.'#.i7M.B.l....Z..A.....o..W#)........Y
s,.a......&..g;gCfm3Ng\...T..%vc.EK.-...T...;..i.(h..
.*.r-....H...w..5.s...O.."\..%.{..J.-.Q...\.*Gpy.8H..
.........H...g..5.u...O}..\....x..:...pZ....J.s..X...
y.....i. ..........2.F....21I......tK...2}...0.....I.
.*.r-.E..H...g....m......BT.2%nr.F{..c1...T.+..Y.8N..
...[l.....k=.@.V.)n.....,..!.Z...=.nj...#_...(..Y.9.{
~g;....s:......g#.n4..F..p.ZQ...[..&h......j.c....y.;
y..2.3....cl.u....w....wz/~....Q.`....X.~...^..-.L.[.
|.s.L... ..y...&...2.F..hY2.....y(....*.2u2.....?..i.
Inverting compression
For a long time, antivirus programs have used heuristics to look for encrypted or compressed fragments in files. They can, for example, trigger emulation and wait for the unpacking procedure to complete. This led me to the question of how to reduce the entropy of encrypted data. One obvious method is base32: by adding three zero bits to every five bits, the entropy cannot exceed five bits per byte.
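A minimal sketch of that expansion (this is just the bit-padding idea, not RFC 4648 base32; the name enc5 and the caller-provided output buffer, which must hold ceil(len*8/5) bytes, are my assumptions):
#include <stddef.h>

/* expand every 5 input bits into one output byte (top 3 bits zero),
   so each output byte carries at most 5 bits of entropy */
size_t enc5(const unsigned char *in, size_t len, unsigned char *out) {
    size_t i, n = 0;
    unsigned int acc = 0;
    int bits = 0;
    for (i = 0; i < len; i++) {
        acc = (acc << 8) | in[i];
        bits += 8;
        while (bits >= 5) {
            bits -= 5;
            out[n++] = (acc >> bits) & 0x1f;
        }
    }
    if (bits) /* flush the zero-padded tail */
        out[n++] = (acc << (5 - bits)) & 0x1f;
    return n;
}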
Another option is inverse compression. We take the encrypted text and decode it with the Huffman algorithm, and the entropy decreases. What the "unpacked" text looks like depends on the frequency table, which would normally be computed during compression (as if the text were being compressed). However, that would require storing the table somewhere, which is not very elegant. Fortunately, there is an algorithm for generating a random set of probabilities with a specified entropy (Peter Swaszek, Sidharth Wali, "Generating probabilities with a specified entropy").
Since we know H_n (the target entropy, H_n = -\sum_i p_i \log_2 p_i), we need to find the roots of the equation from the paper. From these we get the ranges from which each probability p_i can be randomly selected. To solve the equation, I first find the minimum of the function and then locate the roots on either side using bisection. And since all probabilities must sum to one, the obtained values are scaled to form a simplex.
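The two generic pieces of that step can be sketched safely (the entropy equation itself is the paper's and is not reproduced here; bisect() assumes f is monotone on the bracket, which is exactly why the minimum has to be located first):
#include <math.h>

/* find x in [lo, hi] with f(x) == y, assuming f is monotone there;
   [lo, hi] is one side of the extremum located beforehand */
double bisect(double (*f)(double), double y, double lo, double hi) {
    int rising = f(hi) > f(lo);
    while (hi - lo > 1e-12) {
        double mid = (lo + hi) / 2;
        if ((f(mid) < y) == rising)
            lo = mid;
        else
            hi = mid;
    }
    return (lo + hi) / 2;
}

/* scale the sampled values onto the probability simplex (sum == 1) */
void normalize(double *p, int n) {
    double s = 0;
    int i;
    for (i = 0; i < n; i++) s += p[i];
    for (i = 0; i < n; i++) p[i] /= s;
}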
Once we have the vector of probabilities, we use the classic Huffman algorithm: sort the probabilities in ascending order, build the code tree, and instead of compressing the text, decompress it. Naturally, its length will increase, but in terms of entropy and symbol frequencies it will correspond to the random model. To decode this text, we need either the saved frequency table or the seed of the pseudorandom generator used to produce it; then the text is simply compressed back with the standard algorithm.
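The expansion pass itself is just the two-line tree walk from the top of the post, run in a loop over the ciphertext bits (bits_left() is an assumed helper reporting whether input remains):
/* entropy-lowering pass: treat the ciphertext as a Huffman bitstream
   and "decode" it against the randomly generated tree */
while (bits_left()) {
    node_t *node = tree;
    while (node->left)
        node = getbit() == 0 ? node->left : node->right;
    *out++ = node->value;
}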
We set the target entropy to 4.1:
Simplex entropy 4.100000
Input:
.Y.mD.!.....4..&e......H...A..Ex
P...#.cn...........V.....nz...`F
....&n.`........]x@..j......]XK.
[..CO...........N...{...&....=..
..`......Q...Hn..w..8.....[..8.&
....I5.LN.+O_.k/.n...,.,.....|..
....l.bd..'..DPY....L4.....?Q...
;.....8.^bY...2..e../....^.8.C..
Input size: 4096
Input entropy 7.949748
Output:
.............'..................
.F.......h..5.....-.............
...........{.....4..............
.g...........................]..
.....N............K.............
.....................%..........
.....Y.\..........1.=...........
.......$....................y...
Output size: 7592
Output entropy 4.290585
Compressed size: 4096
Decoded successfully
Not exact, but close enough. Alternatively, we can train the model on plain text (compile with -DTEXT):
Simplex entropy 4.376084
Input:
- ...}P...+....k.~..i..7.......H
....jXUL......z.....E..6..7Dx\Zp
....p.c.J.$.>.3....5..k-.D.q...z
8c.."oHx..p.b....)..............
..r....Ka._...a..*.D....... ..`.
j...ev._cD ~X..%5Ajce..K........
b.@..?o<R.....9.n....y.[+.v../..
.B.<sB.{.{...,.$ Yc.f.G.=....o..
Input size: 4096
Input entropy 7.958878
Output:
d,eeelIi .,onn,sypei ioyhrPn.fi rbths huiei iielhl rssohhieei eepypolt.egte-pueo.ne shaouiMen, thriwecrili lmd, eplheahiciwltwn eeNs la-twisesgcdhdags"M.sls cn o watn i el.r.oee huse nemii ngl ovbe. ets ieP nere-ilsiWwcdpaois h ees ags.lfaso.y noon yeiybcns
Output size: 7449
Output entropy 4.395597
Compressed size: 4096
Decoded successfully
huff-random.c and huff-ent.c are on GitHub.