loge.hixie.ch

Hixie's Natural Log

2004-01-03 00:48 UTC Unicode decoder tools

To help with debugging of Unicode and UTF-8 related problems, I've written two tools:

utf8-decoder

Paste in some UTF-8 bytes (as hexadecimal, decimal, octal, or binary numbers, as a hex dump, or as raw bytes in the form of Windows-1252 or ISO-8859-1 characters) and this script will tell you what the characters are, along with UTF-8 decoding diagnostics.

For example, if you are viewing a UTF-8 encoded file in a raw Emacs buffer, and your buffer contains \342​\200​\253​\330​\263​\331​\204​\330​\247​\331​\205, and you want to know what on earth that is, you just need to select that exact string, paste it into the script's input field, and click the submit button. It will then tell you the characters are:

202B	RIGHT-TO-LEFT EMBEDDING
0633	ARABIC LETTER SEEN
0644	ARABIC LETTER LAM
0627	ARABIC LETTER ALEF
0645	ARABIC LETTER MEEM

This can be very useful, especially since the first one above (the RLE) is not a visible character! The script also includes some other useful information, such as the binary representation of each input byte and the entities you would use to include the characters in a US-ASCII HTML or XML file.
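For anyone who wants to reproduce the example above locally, here is a minimal Python sketch. It is not the utf8-decoder source: the function names `bytes_from_octal_escapes` and `describe` are made up for illustration, it only handles the backslash-octal input form, and it relies on Python's standard `unicodedata` module for the character names rather than on whatever data the script itself uses.

```python
import re
import unicodedata


def bytes_from_octal_escapes(text):
    """Collect Emacs-style \\ooo octal escapes into a bytes object.

    Anything between the escapes (for instance invisible characters
    picked up when copying from a web page) is ignored. Only escapes
    in the range \\000 to \\377 are matched, so every value fits a byte.
    """
    return bytes(int(octal, 8) for octal in re.findall(r"\\([0-3][0-7]{2})", text))


def describe(raw):
    """Decode UTF-8 bytes and return (hex code point, name, entity) rows."""
    rows = []
    for char in raw.decode("utf-8"):  # raises UnicodeDecodeError on bad input
        code = ord(char)
        name = unicodedata.name(char, "<no name>")
        rows.append((f"{code:04X}", name, f"&#x{code:X};"))
    return rows


sample = r"\342\200\253\330\263\331\204\330\247\331\205"
raw = bytes_from_octal_escapes(sample)

# Hex and binary representation of each input byte.
for byte in raw:
    print(f"{byte:#04x} {byte:08b}")

# Code point, character name, and a numeric character reference
# usable in a US-ASCII HTML or XML file.
for code, name, entity in describe(raw):
    print(code, name, entity, sep="\t")
```

Unlike the real tool, this sketch gives no diagnostics: malformed UTF-8 simply raises `UnicodeDecodeError`.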

character-identifier

This little script will simply search for the characters you specify in the Unicode NamesList.txt file, giving the information for each character you selected.

For example, if you enter ☺ into the input field and submit the form, it will tell you, amongst other things:

Character number 1 is decimal 9786, hex 0x263A, octal \23072, binary 10011000111010

U+263A	WHITE SMILING FACE
	= have a nice day!
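The numeric part of that output is easy to approximate in Python. The sketch below is an illustration only, not the character-identifier source: `identify` is a hypothetical helper, and it takes names from the standard `unicodedata` module instead of searching NamesList.txt, so annotation lines such as "= have a nice day!" are not available to it.

```python
import unicodedata


def identify(text):
    """Return description lines for each character in text."""
    lines = []
    for index, char in enumerate(text, start=1):
        code = ord(char)
        lines.append(
            f"Character number {index} is decimal {code}, hex 0x{code:04X}, "
            f"octal \\{code:o}, binary {code:b}"
        )
        lines.append(f"U+{code:04X}\t{unicodedata.name(char, '<no name>')}")
    return lines


# U+263A is the character from the example above.
print("\n".join(identify("\u263a")))
```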

Full source code is of course available.