Unicode in Python 2: Decode in, encode out
Summary
In Python 2 you need to convert between encoded byte strings (str) and unicode. It’s easy if you follow these three simple rules:
Decode all input strings
name = input_name.decode('utf8', 'ignore')
You need to decode all input text: filenames, file contents, console input, database contents, socket data, etc. If you are using Django, it already does this for you, as much as it can.
The trick is figuring out which encoding you have. Python’s documentation lists its supported encodings. The main ones I try are utf_8, latin_1 and cp1250.
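One way to act on this is to try candidate encodings in order and keep the first one that decodes cleanly. The sketch below is only an illustration: the helper name decode_with_fallback and the default order are assumptions, not part of any library. latin_1 goes last because it maps every possible byte to a character and so never fails; anything listed after it would never be tried.

```python
def decode_with_fallback(raw, encodings=('utf_8', 'cp1250', 'latin_1')):
    """Return (text, encoding) for the first encoding that decodes raw cleanly.

    Hypothetical helper: the name and the default order are illustrative.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    raise ValueError('none of the candidate encodings could decode the input')


# Valid UTF-8 bytes are accepted by the first candidate.
text, used = decode_with_fallback(b'caf\xc3\xa9')

# A lone \xe9 is not valid UTF-8, so the next candidate is tried.
legacy, legacy_used = decode_with_fallback(b'caf\xe9')
```

A clean decode does not prove the guess was right, only that the bytes happen to be legal in that encoding. If you know where the data came from, trust that over trial and error.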
Inside, work only with unicode
That means prefixing string literals with ‘u’:
my_thing = u'Something'
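Be careful when concatenating. Always use a u, because mixing a byte string with unicode makes Python 2 decode the bytes as ASCII behind your back, which fails as soon as they contain a non-ASCII character: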
my_other = part_one + u'-' + part_two
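To see why this matters, here is a small sketch of the safe and the unsafe concatenation side by side. The variable names raw_part and mixed_ok are made up for the example. The except clause catches both error types so the snippet behaves the same on Python 2 and Python 3.

```python
raw_part = b'caf\xc3\xa9'            # bytes as they arrive from outside
part_one = raw_part.decode('utf8')   # decoded once, at the boundary
part_two = u'menu'

# unicode + unicode: always safe
my_other = part_one + u'-' + part_two

try:
    broken = raw_part + u'-' + part_two   # bytes + unicode
    mixed_ok = True
except (UnicodeDecodeError, TypeError):
    # Python 2 implicitly decodes raw_part as ASCII and raises
    # UnicodeDecodeError; Python 3 refuses to mix the types (TypeError).
    mixed_ok = False
```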
Encode all output strings
output_name = name.encode('utf8')
print(output_name)
The ‘ignore’ parameter to decode tells it to throw away any bytes that aren’t valid UTF-8. That loses data silently, so use it only when dropping the bad bytes is acceptable. Strings (bytes) have an encoding.
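As a quick illustration of what the error-handling argument changes, the sketch below decodes the same invalid input three ways. ‘ignore’, ‘replace’ and the default ‘strict’ are standard codec error handlers; the sample bytes and variable names are made up for the example.

```python
raw = b'caf\xe9 au lait'   # latin_1 bytes, so \xe9 is not valid UTF-8

ignored = raw.decode('utf8', 'ignore')     # bad byte is silently dropped
replaced = raw.decode('utf8', 'replace')   # bad byte becomes U+FFFD

try:
    raw.decode('utf8')                     # default handler is 'strict'
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With ‘replace’ you can at least see afterwards that something was lost, which makes it a gentler choice than ‘ignore’ when you cannot fix the input.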
Unicode does not have an encoding. Hence you need to decode the byte string, specifying the encoding, to get unicode. When you want to output, you need to encode your unicode back to a byte string, again specifying the encoding.
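Put together, the whole round trip can be sketched as one small function. make_greeting is a hypothetical name, and UTF-8 as the default encoding is an assumption about the input.

```python
def make_greeting(raw_name, encoding='utf8'):
    """Hypothetical example: bytes in, bytes out, unicode in between."""
    name = raw_name.decode(encoding)       # decode in
    message = u'Hello, ' + name + u'!'     # work only with unicode
    return message.encode(encoding)        # encode out


greeting = make_greeting(b'Zo\xc3\xab')    # UTF-8 bytes for a name with an e-diaeresis
```

The function never sees a mix of bytes and unicode: bytes exist only on its first and last line.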
In Python 3, all strings are Unicode, and happiness fills the land.