2004-01-03 00:48 UTC Unicode decoder tools
To help with debugging of Unicode and UTF-8 related problems, I've written two tools:
The first is a UTF-8 decoder. Paste in some UTF-8 bytes (as hexadecimal, decimal, octal, or binary numbers, as a hex dump, or as raw bytes in the form of Windows-1252 or ISO-8859-1 characters) and this script will tell you what the characters are, including UTF-8 decoding diagnostics.
For example, if you are viewing a UTF-8 encoded file in a raw Emacs buffer, and your buffer contains \342\200\253\330\263\331\204\330\247\331\205, and you want to know what on earth that is, you just need to select that exact string, paste it into the script's input field, and click the submit button. It will then tell you the characters are:
202B RIGHT-TO-LEFT EMBEDDING
0633 ARABIC LETTER SEEN
0644 ARABIC LETTER LAM
0627 ARABIC LETTER ALEF
0645 ARABIC LETTER MEEM
This can be very useful, especially since the first one above (the RLE) is not a visible character! The script also includes some other useful information, such as the binary representation of each input byte and the entities you would use to include the characters in a US-ASCII HTML or XML file.
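To make the decoding step concrete, here is a minimal Python sketch of the same idea. It is not the script's actual source: the function names `decode_octal_escapes` and `describe` are hypothetical, it only handles the Emacs-style octal input form, and it takes character names from Python's `unicodedata` module rather than from the script's own data.

```python
import re
import unicodedata


def decode_octal_escapes(text):
    """Turn a string of \\ooo octal escapes into raw bytes (hypothetical helper)."""
    # Each escape must fit in one byte; a value above \377 raises ValueError.
    return bytes(int(digits, 8) for digits in re.findall(r"\\([0-7]{1,3})", text))


def describe(raw):
    """Decode UTF-8 bytes and return (hex code point, character name) pairs."""
    chars = raw.decode("utf-8")  # raises UnicodeDecodeError on malformed input
    return [(f"{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in chars]


sample = r"\342\200\253\330\263\331\204\330\247\331\205"
for code, name in describe(decode_octal_escapes(sample)):
    print(code, name)
```

Unlike the real tool, this sketch just raises an exception on malformed input instead of printing per-byte diagnostics.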
The second is a little script that simply searches the Unicode NamesList.txt file for the characters you specify, giving the information for each character you selected.
For example, if you enter ☺ into the input field and submit the form, it will tell you, amongst other things:
Character number 1 is decimal 9786, hex 0x263A, octal \23072, binary 10011000111010
U+263A WHITE SMILING FACE = have a nice day!
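The numeric part of that output is easy to reproduce. The following Python sketch is an illustration only, not the script's source: `identify` is a hypothetical name, and it relies on Python's `unicodedata` module, which knows the formal character name but not the informal NamesList.txt annotations such as "have a nice day!".

```python
import unicodedata


def identify(ch):
    """Return the code point of one character in several bases, plus its name."""
    cp = ord(ch)
    return {
        "decimal": cp,
        "hex": f"0x{cp:04X}",
        "octal": f"\\{cp:o}",
        "binary": f"{cp:b}",
        "name": unicodedata.name(ch, "<unnamed>"),
    }


info = identify("\u263A")  # WHITE SMILING FACE
print(info)
```

Getting the aliases and cross-references as well would mean parsing NamesList.txt itself, which is what the real tool does.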
Full source code is of course available.