2003-09-23 13:49 UTC A crash course in UTF-8 mathematics
Imagine you have a file you know is in UTF-8, and that you are viewing the raw bytes of this file in a text editor which displays high-bit bytes as octal sequences. For example, you could have the string:
Escamillo\342\200\231s supporters
How can you work out what the corresponding Unicode codepoint is?
Write out each digit of each octal sequence like this:
3 | 4 | 2 | 2 | 0 | 0 | 2 | 3 | 1 |
Then, below each digit, write out the corresponding binary using the table below. Pad each result with leading zeros so that the first digit of each octal sequence takes up two bits and the second and third digits take up three bits each, giving eight bits per sequence.
Octal | Binary |
---|---|
0 | 0 |
1 | 1 |
2 | 10 |
3 | 11 |
4 | 100 |
5 | 101 |
6 | 110 |
7 | 111 |
So now your notes look like:
3 | 4 | 2 | 2 | 0 | 0 | 2 | 3 | 1 |
11 | 100 | 010 | 10 | 000 | 000 | 10 | 011 | 001 |
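The padding rule is easy to get wrong by hand, so here is a minimal Python sketch of it. The function name `octal_digits_to_bits` is made up for illustration, and it assumes each escape is exactly three octal digits whose first digit is at most 3 (which is always true for a single byte).

```python
def octal_digits_to_bits(octal: str) -> list:
    """Convert a three-digit octal escape into padded binary, digit by digit."""
    widths = [2, 3, 3]  # first digit gets two bits, the other two get three each
    return [f"{int(digit, 8):0{width}b}" for digit, width in zip(octal, widths)]

for escape in ("342", "200", "231"):
    print(escape, octal_digits_to_bits(escape))
# 342 ['11', '100', '010']
# 200 ['10', '000', '000']
# 231 ['10', '011', '001']
```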
Rewrite the binary string in groups of eight (i.e. in bytes):
11100010 10000000 10011001
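If you would rather let a machine do the steps so far, something like the following Python sketch should work. It assumes the editor writes every high-bit byte as a backslash followed by exactly three octal digits; the helper name `octal_escapes_to_bytes` is hypothetical, and it deliberately ignores the plain ASCII text around the escapes.

```python
import re

def octal_escapes_to_bytes(text: str) -> bytes:
    """Collect the bytes written as backslash plus three octal digits."""
    return bytes(int(digits, 8) for digits in re.findall(r"\\([0-7]{3})", text))

raw = octal_escapes_to_bytes(r"Escamillo\342\200\231s supporters")
bits = " ".join(f"{byte:08b}" for byte in raw)
print(bits)  # 11100010 10000000 10011001
```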
Here, a vague understanding of how UTF-8 works helps. Look at the first byte and count how many of its most significant bits are on before you reach the first zero. This tells you how many bytes your character takes. In this case, we have three:
11100010 10000000 10011001
...which is lucky since we do indeed have three bytes. The two other bytes each start with a single high bit followed by a zero, which means they are continuation bytes. To get the actual bits that form your character, drop the leading one bits and the zero that follows them from each byte, and keep whatever is left. Here that leaves 0010 from the first byte, then 000000 and 011001 from the two continuation bytes.
11100010 10000000 10011001
Take these bits and stick them together:
0010000000011001
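The counting and stripping steps can be sketched in Python as below. The names `leading_ones` and `payload_bits` are invented for this example, and the code assumes it is handed exactly one well-formed UTF-8 sequence; it does no validation.

```python
def leading_ones(byte: int) -> int:
    """Count the 1 bits at the most significant end of a byte, before the first 0."""
    count = 0
    mask = 0x80
    while mask and byte & mask:
        count += 1
        mask >>= 1
    return count

def payload_bits(sequence: bytes) -> str:
    """Return the character's bits from one UTF-8 sequence as a binary string."""
    parts = []
    for byte in sequence:
        skip = leading_ones(byte) + 1  # the leading 1s plus the 0 that ends them
        parts.append(f"{byte:08b}"[skip:])
    return "".join(parts)

seq = bytes([0o342, 0o200, 0o231])
print(leading_ones(seq[0]))  # 3
print(payload_bits(seq))     # 0010000000011001
```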
...and then group them in fours:
0010 0000 0001 1001
In this case we happen to have a multiple of four bits. When you don't have such a convenient number, start grouping at the least significant end (the right hand side) and then pad the most significant end with zero bits to complete the last group.
Next, you convert each of these nibbles to hexadecimal:
0010 | 0000 | 0001 | 1001 |
2 | 0 | 1 | 9 |
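Here is a small Python sketch of the grouping and padding, in case the inconvenient case is unclear. The function name `bits_to_hex` is hypothetical. The second call uses an example that is not from the text above: the 11 bits left over from the two-byte sequence \303\251, which needs one zero of padding.

```python
def bits_to_hex(bits: str) -> str:
    """Group bits into nibbles from the right, zero-padding on the left."""
    pad = -len(bits) % 4  # zeros needed to reach a multiple of four
    padded = "0" * pad + bits
    nibbles = [padded[i:i + 4] for i in range(0, len(padded), 4)]
    return "".join(f"{int(nibble, 2):X}" for nibble in nibbles)

print(bits_to_hex("0010000000011001"))  # 2019
print(bits_to_hex("00011101001"))       # 0E9
```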
And finally you look up your character, here U+2019, in the Unicode names list, which gives us "RIGHT SINGLE QUOTATION MARK".
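To check your hand calculation, you can ask Python's own UTF-8 decoder and its standard `unicodedata` module for the same answer. This is only a cross-check, not a replacement for understanding the steps; the name returned depends on the Unicode database shipped with your Python version, though this character's name has been stable.

```python
import unicodedata

raw = bytes([0o342, 0o200, 0o231])  # the three bytes from the example
char = raw.decode("utf-8")
codepoint = f"U+{ord(char):04X}"
name = unicodedata.name(char)
print(codepoint)  # U+2019
print(name)       # RIGHT SINGLE QUOTATION MARK
```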