Friday, December 9, 2016

Bit-at-a-time huffman decoding on an AVR

Optimized inner loop of bit-at-a-time Huffman decoder is indistinguishable from magic:

     106:       22 0f           add     r18, r18
     108:       33 1f           adc     r19, r19
     10a:       37 1b           sub     r19, r23
     10c:       11 97           sbiw    r26, 0x01       ; 1
     10e:       fd 01           movw    r30, r26
     110:       54 91           lpm     r21, Z
     112:       75 0f           add     r23, r21
     114:       4f 5f           subi    r20, 0xFF       ; 255
     116:       48 30           cpi     r20, 0x08       ; 8
     118:       58 f4           brcc    .+22            ; 0x130
     11a:       37 17           cp      r19, r23
     11c:       a0 f7           brcc    .-24            ; 0x106

Tree format

A header of 8-bit codeword counts per bit length in reverse order followed by a table of fixed-length values in codebook order. The header specifies the canonical huffman code; explicit codewords are not stored. The file header points at the beginning of the value table; codeword counts are found by decrementing that pointer (hence reverse order).

Example: Given this code:
0 = 4
10 = 5
110 = 1
111 = 2

The tree would be stored as {2,1,1,4,5,1,2} with its file pointer pointing at the first value byte (4).

Decoding the tree

The bit-at-time decoder works by reconstructing the last code in the codebook for each bit length and comparing it to the decoded codeword so far. It stops when the decoded codeword is less than the last codeword for that bit length. The result is an index into the value table computed from the codeword and the number of codewords one shorter than the resulting codeword.

This can be accomplished with the following psuedocode:

codeword = 0, max_codeword = 0, num_codewords_so_far = 0, n = 0
while True:
  max_codeword = max_codeword << 1
  codeword = (codeword << 1) + getbit(n)
  num_codewords_this_length = codewords_of_length(n)
  if codeword < max_codeword + num_codewords_this_length:
    return num_codewords_so_far + codeword - max_codeword
  num_codewords_so_far += num_codewords_this_length
  max_codeword += num_codewords_this_length
  n = n + 1

This can be simplified algebraically by subtracting num_codewords_so_far from both max_codeword and codeword just before each comparison in the loop. This doesn't affect the outcome of the comparison.

adj_codeword = 0, adj_max_codeword = 0, num_codewords_so_far = 0, n = 0
while True:
  num_codewords_this_length = codewords_of_length(n)
  adj_max_codeword = (adj_max_codeword << 1) - num_codewords_so_far
  adj_codeword = (adj_codeword << 1) + getbit(n) - num_codewords_so_far
  if adj_codeword < adj_max_codeword + num_codewords_this_length:
    return num_codewords_so_far + adj_codeword - adj_max_codeword
  num_codewords_so_far += num_codewords_this_length
  adj_max_codeword += num_codewords_this_length
  n = n + 1

However after this transformation adj_max_codeword always equals num_codewords_so_far, so we can eliminate the redundant variable and greatly simplify the loop.

adj_codeword = 0, num_codewords_so_far = 0, n = 0
while True:
  num_codewords_this_length = codewords_of_length(n)
  adj_codeword = (adj_codeword << 1) + getbit(n) - num_codewords_this_length
  num_codewords_so_far += num_codewords_this_length
  if adj_codeword < num_codewords_so_far:
    return adj_codeword
  n = n + 1

Assembly code

r18 = bits from bitstream
r19 = adjusted codeword (index into the codebook)
r20 = number of bits shifted from r18
r21 = number of codewords at this length
r23 = sum number of codewords up to this length
r26:r27 = pointer to table of codewords per length (reverse order)

It works out to 16 cycles per bit (15 when it terminates; and some more when r18 runs out if bits).

The add/adc consumes one bit from the bitstream by shifting: r18 bit 7 is shifted into r19 bit 0. The subtraction at 10a subtract the number of total codewords in the codebook previous so that r19 is an index into the codebook. If it is less than the total number of codewords at the length, the complete codeword is decoded. Otherwise the offset points at a longer codeword and more bits are needed.

The branch at 118 is a subroutine that gets more bits from the bitstream loading them into r18 and setting r20=0. That routine jumps to 11a when it's done.

Examples

Lengths = 0, 2, 0, 3 (not reversed)
Codebook:
00
01
1000
1001
1010

Say the input bits are 1001. After each loop:
r19 = 0 * 2 + 1 - 0 = 1 not less than 0
r19 = 1 * 2 + 0 - 0 = 2 not less than 2
r19 = 2 * 2 + 0 - 2 = 2 not less than 2
r19 = 2 * 2 + 1 - 2 = 3 is less than 5

The loop terminates and r19 = 3 is the number of codewords in the codebook before 1001. This is the offset into the table of values indexed by codeword position in the codebook.

Same but for input 01:
r19 = 0 * 2 + 0 - 0 = 0 not less than 0
r19 = 0 * 2 + 1 - 0 = 1 is less than 1

After the loop exits r19 = 1 is the number of codewords in the codebook before 01.