One of the interesting features in the XZ backdoor is string searching using a trie, and for the same reason, it runs catastrophically slowly, which led to its discovery. But we were taught that tries are supposed to enable fast search. How do bitmap tries work, and how can you make search fast?
Let’s start with simple tries. Each node has a flag indicating whether it is a terminal node, and a set of pointers equal to the number of characters in the alphabet (I use lowercase letters):
struct trie {
/* basic trie stuff, term flag anf edges */
unsigned term:1;
struct trie *c[32];
/* we will use these later */
unsigned bitmap;
unsigned size;
};
int match0(struct trie *node, char *s)
{
if (*s == 0) {
if (node->term)
return 1;
return 0;
}
unsigned b = *s - 'a';
if (node->c[b] == NULL)
return 0;
return match0(node->c[b], s + 1);
}
It takes up a lot of space. We could store only the non-empty pointers and somehow determine which character each one corresponds to. This kind of structure is called a trie with bitmap:
/* now we keep only non-NULL pointers and bitmap to find their index */
void compress(struct trie *node)
{
for (int i = 0, p = 0; i < 32; i++)
if (node->c[i]) {
node->bitmap |= 1 << i;
node->c[p++] = node->c[i];
compress(node->c[i]);
}
}
/* match with bitmap trie */
int match1(struct trie *node, char *s)
{
if (*s == 0) {
if (node->term)
return 1;
return 0;
}
unsigned b = 1 << (*s - 'a');
if ((node->bitmap & b) == 0)
return 0;
unsigned next = __builtin_popcount(node->bitmap & (b - 1));
return match1(node->c[next], s + 1);
}
Now we need to store the resulting tree. A node consists of a bitmap, edges, and a terminal flag. We can store them in a single array. This structure is called an array-mapped trie (Phil Bagwell "Fast And Space Efficient Trie Searches", 2000)
/* array mapped trie */
void encode(struct trie *node)
{
uint32_t base = nmax;
n[base] = node->bitmap;
/* we use single array for masks and indexes */
/* one could allocate more "slots" for the mask */
nmax = nmax + 1 + node->size;
for (int i = 0; i < node->size; i++) {
/* note that nmax is a global */
n[base + 1 + i] = nmax | (node->c[i]->term << 31);
encode(node->c[i]);
}
}
void traverse2(int i)
{
int bits = 0;
unsigned mask = n[i];
for (int j = 0; j < 31; j++) {
if ((mask >> j) & 1) {
int next = i + 1 + bits;
putchar(j + 'a');
if (n[next] >> 31)
putchar('\'');
/* clean term flag */
traverse2(n[next] & 0x7fffffff);
bits++;
}
}
}
And finally, let’s try to build an AMT similar to the way it’s done in the XZ backdoor. We’ll store the bitmaps in one array, and the indices/offsets in another.
unsigned encodex(struct trie *node)
{
unsigned mask = node->bitmap;
/* find index in mask array (n), we keep only unique masks */
int im = -1;
for (int i = 0; i < nmax; i++)
if (n[i] == mask) {
im = i;
break;
}
if (im == -1) {
im = nmax;
n[nmax++] = mask;
}
/* fill array of edges (x) */
uint32_t base = xmax;
xmax += node->size;
for (int i = 0; i < node->size; i++) {
/* use relative values */
/* one could achieve some LZ-style compression by finding
overlapping values and adjusting relative indexes */
uint32_t next = xmax - base;
/* i-th edge: term:1 | next offset:15 | mask index:16 */
x[base + i] = encodex(node->c[i]) | (next << 16) |
((unsigned)node->c[i]->term << 31);
}
return im;
}
int _match(int mi, int ni, char *s)
{
unsigned bp = 1 << (*s - 'a');
unsigned ma = n[mi];
if ((ma & bp) == 0)
return 0;
unsigned next = ni + __builtin_popcount(ma & (bp - 1));
if ((x[next] >> 31) && s[1] == 0)
return 1;
return _match(x[next] & 0xffff, ni + ((x[next] >> 16) & 0x7fff), s + 1);
}
int match(char *s)
{
return _match(0, 0, s);
}
Of course, this isn’t the only way to store tries — there are plenty of different structures, like HAMT, LOUDS-tries, Fast Succinct Tries, DAFSA (Deterministic acyclic finite state automaton) and so on. In the end, we can even represent the tree as a parenthesis expression (in this case, the order in which we traverse the branches doesn’t matter, which gives us polymorphism for free). One important thing that q3k noticed right away is that if the tree is encoded, we can reconstruct the original string table.
/* serialize as S-expression */
char output[128], os = 0;
void serialize(struct trie *node)
{
// output[os++] = '(';
for (int i = 0; i < 32; i++)
if (node->c[i]) {
output[os++] = node->c[i]->term ?
toupper(i + 'a') : i + 'a';
serialize(node->c[i]);
}
output[os++] = ')';
}
...
printf("Unserialize:\n");
char buf[32], pos = 0;
/* simple automata to reconstruct string array from
serialized trie */
for (int i = 0; i < os; i++) {
char in = output[i];
switch (in) {
case '(':
/* ignore, it's redundant */
break;
case ')':
pos--;
break;
case 'A' ... 'Z':
buf[pos++] = tolower(in);
buf[pos] = 0;
printf("%s,", buf);
break;
default:
buf[pos++] = in;
break;
}
}
Serialize trie into S-expression:
arroW)))TiclE)))))))baR)sE))T)))caN)pE))RD)E))sE)tlE))))TS))))deaR))eP)R)))))
Unserialize:
arrow,art,article,bar,base,bat,can,cape,car,card,care,case,castle,cat,cats,
dear,deep,deer,
The same can be done with the AMT — at the very least, it can be fully unpacked and the indexes can be converted into pointers. Or it can be used for generation instead of string recognition.
The XZ backdoor doesn't try to reconstruct strings, and when it needs to find a function name, it scans the GNU hash table (apparently, it originally used fast lookup with hashing and Bloom filters). At the same time, it repeatedly checks address ranges and whether a memory region is mapped. For this construct to work efficiently, it would’ve made more sense to check once that the segments are mapped correctly, and then either find all the symbols in the table at once using the Aho–Corasick algorithm (by storing the automaton’s transition table or a trie), or replace the recognizer automaton with a generator, use fast hash-based lookup and memmem/ACA searches, and clean up memory afterward.
This is yet another example in the case of this backdoor where the author seems uncertain about which technique to use and how exactly to implement it. A similar situation occurs elsewhere — for instance, the backdoor imports the __tls_get_addr function in order to find the base address of the dynamic loader. It’s possible the author intended to retrieve it from the Thread Control Block. However, since the backdoor runs from an ifunc constructor — which is executed by the dynamic loader itself — it would’ve been enough to just inspect __builtin_return_address(0) or directly import _r_debug instead of __tls_get_addr.
Still, the ideas are interesting and inspiring — thank you, "Jia" (whoever you may be), for the food for thought.
trie.c, trie with bitmap/AMT on github
Basic AMT (masks and edges in the same array): arrow't'icle'bar'se't'can'pe'r'd'e'se'tle't's'dear'ep'r' size is 308 bytes 0000000f 00000005 00000018 00000023 00000040 00020000 00000007 000a0000 0000000a 8000000f 00004000 0000000c 00400000 8000000e 00000000 00000100 00000011 00000004 00000013 00000800 00000015 00000010 80000017 00000000 00000001 0000001a 000e0000 8000001e 0000001f 80000022 00000000 00000010 80000021 00000000 00000000 00000001 00000025 000ea000 8000002b 0000002c 8000002f 00000034 8000003d 00000000 00000010 8000002e 00000000 00000018 80000032 80000033 00000000 00000000 00080010 80000037 00000038 00000000 00000800 0000003a 00000010 8000003c 00000000 00040000 8000003f 00000000 00000010 00000042 00000011 00000045 00000048 00020000 80000047 00000000 00028000 8000004b 8000004c 00000000 00000000 AMT, XZB-style arrow't'icle'bar'se't'can'pe'r'd'e'se'tle't's'dear'ep'r' masks and edges in the separate arrays: size is 224 bytes (18 masks, 38 edges) masks: 0000000f 00020000 000a0000 00004000 00400000 00000000 00000100 00000004 00000800 00000010 00000001 000e0000 000ea000 00000018 00080010 00040000 00000011 00028000 edges: 00040001 000d000a 0012000a 00200009 00010002 00020003 80040006 00010004 80010005 00010007 00010008 00010009 80010005 0001000b 80030005 00030009 80040005 80010005 0001000c 80050005 00050009 8006000d 0008000e 800c000f 80010005 80020005 80020005 80020005 00020008 00010009 80010005 80010005 00010010 00020001 00030011 80010005 80020005 80020005 OK
P.S. Funny enough, I thoroughly dug through the internet and couldn’t find a single AMT implementation that even remotely resembles the one used in the backdoor. Even the hardworking Chinese folks — they figured out ifunc, resolver and audit parts, but mistook the tree for a radix tree, and ended up with GPT code, which still doesn’t match the idea.
No comments:
Post a Comment