Finally understanding Unicode and UTF-8
Summary
Unicode maps integers that fit in 32 bits (4 bytes), also called code points or runes, to characters. UTF-8 is a way of storing those code points using fewer than 4 bytes for most characters.
Hex 61 (decimal 97) is the Unicode code point for a, e5 (229) is å and 16a1 (5793) is ᚡ. Unicode is how most modern programming languages represent strings: Java, .NET (C#, VB.NET), Go, and Python 3, for example. Code points are usually written as four hexadecimal digits (two bytes) prefixed by the letter u, or eight digits (four bytes) prefixed by U. In Python 3 this will display ᚡ:
print('\u16a1')
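The same character with the longer U form (eight hex digits) displays the same rune:
print('\U000016a1')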
The lowest byte of that 32-bit integer (the code point) covers most characters used by Western European languages. The first 128 code points (hex 00 to 7f) are the same as ASCII: hex 61 is both the Unicode code point and the ASCII code for a. The next 128 code points (hex 80 to ff) are the same as ISO-8859-1, also called Latin-1: e5 (decimal 229) is both the Unicode code point and the ISO-8859-1 code for å.
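You can check these code points in Python 3 with ord, which returns the code point of a character (a quick sketch):
hex(ord('a'))   # '0x61' – same as the ASCII code
hex(ord('å'))   # '0xe5' – same as the ISO-8859-1 code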
The first two bytes (code points up to ffff, the Basic Multilingual Plane) cover characters for almost all modern languages. It is extremely rare to need more than that; code points only go up to 10ffff, so the highest byte is never used at all. A rare exception, sad kitty U0001F640, needs three bytes. It broke WordPress when I put it in this post – that’s how common characters above two bytes are!
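In Python 3 the kitty looks like this (a quick sketch; ord gives back the code point as an integer):
cat = '\U0001F640'
hex(ord(cat))   # '0x1f640' – three bytes of code point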
An encoding is a mapping between Unicode code points and bytes. If you store the code points directly as 4-byte integers you have UTF-32. So 00 00 00 61 is UTF-32 for Unicode code point 61, which is a.
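You can see this in Python 3 with the big-endian UTF-32 codec (a quick sketch; the plain "utf-32" codec also prepends a byte order mark):
'a'.encode('utf-32-be')   # b'\x00\x00\x00a' – four bytes, the last is 61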
English speakers will usually only need one byte, and other language users two, so there are more efficient encodings. The most common Unicode encoding is UTF-8.
The first 128 values of UTF-8 (hex 00 to 7f) map directly to Unicode code points, and hence to ASCII codes: 61 is UTF-8 for Unicode code point 61, which is the character a. If you only ever use values up to 127, then UTF-8, Unicode code points, and ASCII are all the same. This makes confusion easy.
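For example, in Python 3:
'a'.encode('utf8')    # b'a' – a single byte with value 61
'a'.encode('ascii')   # b'a' – the same byte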
Above 127, UTF-8 uses between two and four bytes for each code point. c3 a5 is UTF-8 for Unicode code point u00e5, which is å. In Python 3:
bytes([0xc3, 0xa5]).decode("utf8")  # 'å'
This means UTF-8 is not compatible with ISO-8859-1: the same å is one byte (e5) in ISO-8859-1 but two bytes (c3 a5) in UTF-8.
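You can see the mismatch by encoding and decoding both ways (a quick Python 3 sketch):
'å'.encode('utf8')                      # b'\xc3\xa5'
'å'.encode('latin-1')                   # b'\xe5'
bytes([0xc3, 0xa5]).decode('latin-1')   # 'Ã¥' – classic mojibake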
When you receive a string of bytes, you also need to know its encoding to interpret it as Unicode. Luckily it is quite easy to test for valid UTF-8. In Go you use the Valid function of unicode/utf8. In Python you try to .decode("utf8") and catch the UnicodeDecodeError.
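A minimal sketch of that check in Python 3 (the is_utf8 helper name is just for illustration):
def is_utf8(data: bytes) -> bool:
    # Invalid UTF-8 raises UnicodeDecodeError
    try:
        data.decode("utf8")
        return True
    except UnicodeDecodeError:
        return False

is_utf8(bytes([0xc3, 0xa5]))  # True – valid UTF-8 for å
is_utf8(bytes([0xe5]))        # False – a lone ISO-8859-1 byte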
In summary (all values are hex):
| UTF-8 | UTF-32 | Unicode code point | ASCII | ISO-8859-1 | Character |
|---|---|---|---|---|---|
| 61 | 00 00 00 61 | 61 (decimal 97) | 61 | 61 | a |
| c3 a5 | 00 00 00 e5 | e5 (decimal 229) | None | e5 | å |
| e1 9a a1 | 00 00 16 a1 | 16a1 (decimal 5793) | None | None | ᚡ |