Hixie's Natural Log: A crash course in UTF-8 mathematics

2003-09-23 13:49 UTC A crash course in UTF-8 mathematics

Imagine you have a file you know is in UTF-8, and that you are viewing the raw bytes of this file in a text editor which displays high-bit bytes as octal sequences. For example, you could have the string:

Escamillo\342\200\231s supporters

How can you work out what the corresponding Unicode codepoint is?

Write out each digit of each octal sequence like this:

3	4	2	2	0	0	2	3	1
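If you would rather not pick the digits out by eye, the same step can be sketched in Python. This is only an illustration: the variable names are made up for this example, and it assumes your editor always writes a high-bit byte as a backslash followed by exactly three octal digits.

```python
import re

# The raw text as the editor shows it; a raw string keeps the backslashes literal.
text = r"Escamillo\342\200\231s supporters"

# Assumption: every escape is a backslash plus exactly three octal digits.
escapes = re.findall(r"\\([0-7]{3})", text)

# Flatten the escapes into the individual digits written out above.
digits = [d for e in escapes for d in e]

print(escapes)  # ['342', '200', '231']
print(digits)   # ['3', '4', '2', '2', '0', '0', '2', '3', '1']
```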

Then, below each digit, write out the corresponding binary using the table below, remembering to pad the results so that the first digit corresponds to two bits and the second and third digits correspond to three bits each.

Octal	Binary
0	0
1	1
2	10
3	11
4	100
5	101
6	110
7	111

So now your notes look like:

3	4	2	2	0	0	2	3	1
11	100	010	10	000	000	10	011	001
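The padding rule is easy to get wrong by hand, so here is a minimal Python sketch of it. The function name `octal_escape_to_bits` is hypothetical, and the sketch assumes each escape is exactly three octal digits with a value no greater than 377 (that is, a single byte).

```python
def octal_escape_to_bits(escape):
    """Expand a three-digit octal escape such as '342' into eight bits.

    The first digit contributes two bits and the other two contribute
    three bits each, matching the padding rule described above.
    """
    if len(escape) != 3 or escape[0] not in "0123":
        raise ValueError("expected three octal digits, at most 377")
    widths = (2, 3, 3)
    # int(d, 8) raises ValueError for anything that is not an octal digit.
    return "".join(
        format(int(d, 8), "0{}b".format(w)) for d, w in zip(escape, widths)
    )


print([octal_escape_to_bits(e) for e in ("342", "200", "231")])
# ['11100010', '10000000', '10011001']
```

Because the widths add up to eight, each escape comes out as one whole byte.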

Rewrite the binary string in groups of eight (i.e. in bytes):

11100010 10000000 10011001

Here, a vague understanding of how UTF-8 works helps. Count how many of the most-significant bits of the first byte are on before you reach the first zero bit (they are bracketed below). This tells you how many bytes your character takes. In this case, we have three:

[111]00010 10000000 10011001

...which is lucky since we do indeed have three bytes. The two other bytes each start with the bits 10 (a single high bit followed by a zero), which means they are continuation bytes. To get the actual bits that form your character, take from each byte the bits that follow its first zero bit (bracketed below): the low four bits of the first byte and the low six bits of each continuation byte.

1110[0010] 10[000000] 10[011001]
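The counting and bit-picking can be sketched in Python as follows. Both function names are made up for this example. The sketch assumes it is handed exactly one multi-byte sequence as a list of eight-character bit strings; it does not handle plain ASCII bytes, and it does not reject overlong or otherwise invalid encodings.

```python
def leading_ones(byte_bits):
    """Count the 1 bits at the start of an 8-character bit string."""
    return len(byte_bits) - len(byte_bits.lstrip("1"))


def payload_bits(byte_strings):
    """Return the payload bits of one multi-byte UTF-8 sequence."""
    length = leading_ones(byte_strings[0])
    if length < 2 or length != len(byte_strings):
        raise ValueError("lead byte does not match the number of bytes")
    # Lead byte: skip the run of ones and the zero that ends it.
    parts = [byte_strings[0][length + 1:]]
    for cont in byte_strings[1:]:
        if not cont.startswith("10"):
            raise ValueError("expected a continuation byte starting with 10")
        # Continuation byte: skip the '10' marker, keep the low six bits.
        parts.append(cont[2:])
    return "".join(parts)


print(payload_bits(["11100010", "10000000", "10011001"]))
# 0010000000011001
```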

Take these bits and stick them together:

0010000000011001

...and then group them in fours:

0010 0000 0001 1001

In this case we happen to have a multiple of four bits. When you don't, start grouping at the least significant end (the right-hand side) and then pad the most significant end with zero bits.

Next, you convert each of these nibbles to hexadecimal:

0010	0000	0001	1001
2	0	1	9
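The grouping, padding and hexadecimal conversion can be sketched together in Python. The function name `bits_to_hex` is hypothetical, and the input is assumed to be a string containing only the characters 0 and 1.

```python
def bits_to_hex(bits):
    """Group a bit string into nibbles from the right and convert to hex."""
    # Round the length up to a multiple of four, padding zeros on the left
    # (the most significant end).
    padded = bits.zfill(-(-len(bits) // 4) * 4)
    nibbles = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    return "".join(format(int(n, 2), "X") for n in nibbles)


print(bits_to_hex("0010000000011001"))  # 2019
print(bits_to_hex("11111010001"))       # 7D1 (11 bits, padded to 12)
```

The second call shows the inconvenient case: eleven bits get one zero added at the most significant end before they are grouped.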

And finally you look up your character, in this case U+2019, in the Unicode names list, which gives us "RIGHT SINGLE QUOTATION MARK".
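If you have Python to hand, you can check the manual working against its built-in UTF-8 decoder and its copy of the Unicode names. This is a cross-check rather than a replacement for the method above; the variable names are illustrative, and the names returned by `unicodedata` depend on the Unicode database bundled with your Python version.

```python
import unicodedata

# The three bytes from the example, written as octal literals.
raw = bytes([0o342, 0o200, 0o231])

# Let Python's own UTF-8 decoder do the work worked through by hand above.
char = raw.decode("utf-8")

print("U+{:04X}".format(ord(char)), unicodedata.name(char))
# U+2019 RIGHT SINGLE QUOTATION MARK
```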