Hixie's Natural Log

2003-09-23 13:49 UTC A crash course in UTF-8 mathematics

Imagine you have a file you know is in UTF-8, and that you are viewing the raw bytes of this file in a text editor which displays high-bit bytes as octal sequences. For example, you could have the string:

Escamillo\342\200\231s supporters

How can you work out what the corresponding Unicode codepoint is?

Write out each digit of each octal sequence like this:

342 200 231

Then, below each digit, write out the corresponding binary using the table below, remembering to pad the results so that the first digit corresponds to two bits and the second and third digits correspond to three bits each.

Octal Binary
0 0
1 1
2 10
3 11
4 100
5 101
6 110
7 111

So now your notes look like:

3  4   2     2  0   0     2  3   1
11 100 010   10 000 000   10 011 001

Rewrite the binary string in groups of eight (i.e. in bytes):

11100010 10000000 10011001
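The digit-by-digit conversion above can be sketched in Python. This is only an illustration of the padding rule; the function name `octal_escape_to_bits` is made up for this example and assumes each escape is exactly three octal digits.

```python
def octal_escape_to_bits(escape: str) -> str:
    """Convert a three-digit octal escape such as '342' to an 8-bit string."""
    if len(escape) != 3:
        raise ValueError("expected exactly three octal digits")
    first, second, third = (int(digit, 8) for digit in escape)
    if first > 3:
        raise ValueError("value does not fit in one byte")
    # The first digit is padded to two bits, the other two to three bits each.
    return f"{first:02b}{second:03b}{third:03b}"


print(" ".join(octal_escape_to_bits(e) for e in ["342", "200", "231"]))
# prints: 11100010 10000000 10011001
```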

Here, a vague understanding of how UTF-8 works helps. Count how many consecutive bits are set at the most-significant end of the first byte, stopping at the first zero. This tells you how many bytes your character takes. In this case, we have three:

[111]00010 10000000 10011001

...which is lucky since we do indeed have three bytes. The two other bytes each start with a single high bit followed by a zero, which means they are continuation bytes. To get the actual bits that form your character, take from each byte only the bits to the right of its first zero, discarding the leading set bits and that zero itself.

1110[0010] 10[000000] 10[011001]
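As a rough sketch of that rule in Python: the helper below counts the leading set bits of the first byte, checks the continuation bytes, and returns the bits that survive from each byte. The name `payload_chunks` is hypothetical, and the sketch only handles multi-byte sequences (two to four bytes), not plain ASCII.

```python
def payload_chunks(byte_strings):
    """Return the character-carrying bits of each byte in one UTF-8 sequence."""
    lead = byte_strings[0]
    # Number of leading 1 bits in the first byte = length of the sequence.
    length = len(lead) - len(lead.lstrip("1"))
    if not 2 <= length <= 4 or length != len(byte_strings):
        raise ValueError("lead byte does not match the number of bytes given")
    # Skip the leading ones and the zero that terminates them.
    chunks = [lead[length + 1:]]
    for cont in byte_strings[1:]:
        if not cont.startswith("10"):
            raise ValueError("continuation bytes must start with 10")
        chunks.append(cont[2:])
    return chunks


chunks = payload_chunks(["11100010", "10000000", "10011001"])
print(chunks)           # prints: ['0010', '000000', '011001']
print("".join(chunks))  # prints: 0010000000011001
```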

Take these bits and stick them together:

0010000000011001

...and then group them in fours:

0010 0000 0001 1001

In this case we happen to have a multiple of four bits, but sometimes you don't have such a convenient number. In general, start counting at the least significant end (the right-hand side) and then pad the most significant end with zero bits.
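The grouping and padding step might look like this in Python. The helper name `to_nibbles` is an assumption of this sketch; the second call uses the 11 payload bits of a two-byte sequence to show the padding case.

```python
def to_nibbles(bits: str):
    """Split a bit string into groups of four, padding on the left with zeros."""
    # Round the length up to the next multiple of four.
    padded_length = -(-len(bits) // 4) * 4
    padded = bits.zfill(padded_length)
    return [padded[i:i + 4] for i in range(0, len(padded), 4)]


print(to_nibbles("0010000000011001"))  # prints: ['0010', '0000', '0001', '1001']
print(to_nibbles("00011101001"))       # prints: ['0000', '1110', '1001']
```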

Next, you convert each of these nibbles to hexadecimal:

2 0 1 9

And finally you look up your character, here U+2019, in the Unicode names list, which gives us "RIGHT SINGLE QUOTATION MARK".
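If Python is to hand, the last two steps can be checked with the standard `unicodedata` module, and the whole hand calculation can be cross-checked against the built-in UTF-8 decoder. The variable names are just for this sketch.

```python
import unicodedata

nibbles = ["0010", "0000", "0001", "1001"]

# Convert each nibble to one hexadecimal digit.
hex_digits = "".join(f"{int(nibble, 2):X}" for nibble in nibbles)
codepoint = int(hex_digits, 16)

print(f"U+{hex_digits}", unicodedata.name(chr(codepoint)))
# prints: U+2019 RIGHT SINGLE QUOTATION MARK

# Cross-check: decoding the original three bytes gives the same character.
decoded = b"\342\200\231".decode("utf-8")
print(decoded == chr(codepoint))  # prints: True
```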