Unicode in Python 2: Decode in, encode out

August 10, 2011 software python unicode

Summary

In Python 2, managing strings involves converting between encoded strings and Unicode by following three rules: first, decode all input strings using `.decode()`, specifying the encoding like 'utf8' while ignoring invalid bytes; second, work internally with Unicode strings, using a 'u' prefix (e.g., `u'Something'`) and ensuring Unicode is maintained during operations like concatenation; third, encode all output strings back to bytes using `.encode()` with specified encoding before output. This process involves identifying the appropriate encoding, with common ones being utf-8, latin-1, and cp1250. By contrast, Python 3 simplifies this as it inherently handles all strings as Unicode.

In Python 2 you need to convert between encoded strings and unicode. It’s easy if you follow these three simple rules:

Decode all input strings

name = input_name.decode('utf8', 'ignore')

You need to decode all input text: filenames, file contents, console input, database contents, socket data, etc. If you are using Django, it already does this for you, as much as it can.
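As a minimal sketch of decoding at the boundary, the snippet below reads a file as raw bytes and decodes it straight away. The helper name `read_text` and the temporary demo file are illustrative assumptions, not something from the original post, and the sketch assumes the file really is UTF-8. It is written to run on both Python 2 and Python 3.

```python
import io
import os
import tempfile


def read_text(path, encoding='utf8'):
    """Read a file as raw bytes, then decode to unicode right away."""
    with io.open(path, 'rb') as handle:  # 'rb' returns undecoded bytes
        raw = handle.read()
    return raw.decode(encoding)


# Demo: write known UTF-8 bytes to a temporary file, then read them back.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b'na\xc3\xafve')  # five characters, the third is U+00EF
os.close(fd)

text = read_text(demo_path)  # unicode from here on
os.remove(demo_path)
```

Everything downstream of `read_text` only ever sees unicode, which is the point of the first rule.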

The trick is figuring out which encoding you have. Here are Python’s supported encodings. The main ones I try are utf_8, latin_1 and cp1250.
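One way to try several encodings in turn could look like the sketch below. The function name `decode_with_fallback` and the ordering of the candidates are my assumptions, not the author's code. latin_1 is placed last because it maps every possible byte to a character and therefore never fails.

```python
def decode_with_fallback(raw, encodings=('utf_8', 'cp1250', 'latin_1')):
    """Return (text, encoding) using the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError('no candidate encoding could decode the input')


# Valid UTF-8 is accepted by the first candidate.
utf8_text, utf8_guess = decode_with_fallback(b'caf\xc3\xa9')

# A lone 0xe9 byte is not valid UTF-8, so the next candidate is used.
legacy_text, legacy_guess = decode_with_fallback(b'caf\xe9')
```

A clean decode does not prove the guess was right: single-byte encodings such as cp1250 and latin_1 accept almost any input, so the result is still worth checking by eye.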

Inside, work only with unicode

That means prefixing strings with ‘u’:

my_thing = u'Something'

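Be careful when concatenating; always use a u, because Python 2 implicitly decodes any byte string mixed with unicode as ASCII, which fails on non-ASCII bytes: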

my_other = part_one + u'-' + part_two
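To see why mixing types is risky, here is a small sketch with made-up values (`raw_name` and `part_two` are placeholders). It is written to run on both Python 2 and Python 3, so it catches the error each version raises.

```python
raw_name = b'Zo\xc3\xab'  # UTF-8 bytes as they might arrive from input
part_two = u'Smith'

# Mixing undecoded bytes with unicode fails: Python 2 tries an implicit
# ASCII decode (UnicodeDecodeError); Python 3 refuses outright (TypeError).
try:
    broken = raw_name + u'-' + part_two
    mixing_failed = False
except (UnicodeDecodeError, TypeError):
    mixing_failed = True

# Decode first, then every operand is unicode and concatenation is safe.
part_one = raw_name.decode('utf8')
my_other = part_one + u'-' + part_two
```

In Python 2 the failing line works by accident whenever the bytes happen to be pure ASCII, which is why this kind of bug tends to surface only once real-world names show up.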

Encode all output strings

output_name = name.encode('utf8')
print(output_name)

Strings (bytes) have an encoding. The ‘ignore’ parameter to decode tells it to silently throw away any bytes that aren’t valid UTF-8 instead of raising an error.
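The effect of the error-handling argument can be compared on a small made-up input. The byte string below is an illustrative assumption: valid UTF-8 with one stray 0xff byte in the middle.

```python
raw = b'caf\xc3\xa9 \xff ok'  # valid UTF-8 plus one stray byte (0xff)

ignored = raw.decode('utf8', 'ignore')    # stray byte silently dropped
replaced = raw.decode('utf8', 'replace')  # stray byte becomes U+FFFD

try:
    raw.decode('utf8')  # default 'strict' mode raises on the stray byte
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

'ignore' keeps the program running at the cost of losing data without notice. 'replace' at least leaves a visible marker where the bad byte was.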

Unicode does not have an encoding. Hence you need to decode the byte string, specifying the encoding, to get unicode. When you want to output, you need to encode your unicode back to a byte string, again specifying the encoding.

In Python 3, all strings are Unicode, and happiness fills the land.

Graham King
