Understanding Python and Unicode

So you are getting wrong symbols in your webpage, maybe missing some chars, or you are getting runtime exceptions like these:

TypeError: decoding Unicode is not supported
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

And you definitely don’t understand why. It is time to learn how to fix it forever!

The information in this post might not be precise, but it is short and direct so that you can understand it at least as well as I do. Because you don’t want to read all those Unicode HOWTOs, do you? (But you should!)

———————————-

Python supports the string type (str) and the unicode type. A string is a sequence of chars (bytes), while a unicode is a sequence of “pointers”. The unicode is an in-memory representation of the text, and every symbol in it is not a char but a number (a code point, usually written in hex) that selects a symbol in a map. So a unicode variable does not have an encoding, because it does not contain chars.
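
A minimal sketch of the “sequence of numbers” idea, using the word “adiós” as sample text. The escape \xf3 is used instead of a literal “ó” so the snippet does not depend on the encoding of the source file, and it is written to run on both Python 2 and Python 3 (where the unicode type is simply called str). The variable names are only illustrative.

```python
# u"adi\xf3s" is the text "adiós"; 0xf3 is the number (code point) of "ó".
text = u"adi\xf3s"

# Each element of a unicode object is a number that selects a symbol.
code_points = [hex(ord(symbol)) for symbol in text]
print(code_points)  # ['0x61', '0x64', '0x69', '0xf3', '0x73']
```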

Why is encoding needed? In order to print text or write text to a file, you cannot write unicode directly because it would be a sequence of numbers. You have to replace those numbers with chars. The way you choose to do that conversion is the encoding. You can use the ASCII “alphabet” to encode the unicode, or you can use UTF-8, etc. The problem is that ASCII is not rich enough to express symbols like ‘ñáéíóú’. You need an encoding such as UTF-8 for that unicode string, because otherwise you will lose some chars.
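
To make this concrete, here is a small sketch that encodes the same text with UTF-8 and with ASCII. It assumes the sample word “adiós” and is written to run on both Python 2 and Python 3; the flag name ascii_failed is just for illustration.

```python
text = u"adi\xf3s"  # the text "adiós"

# UTF-8 can represent every symbol; "ó" becomes the two bytes 0xc3 0xb3.
utf8_bytes = text.encode("utf-8")

# ASCII has no char for "ó", so the encoding step fails.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(ascii_failed)  # True
```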

Python manages the string type as a plain sequence of chars; it does not record which encoding they are in. In all the installations I’ve seen, the chars end up encoded in UTF-8, but other encodings are possible. So if you work all the time with string types, not mixing them with unicode types, things mostly work and you will rarely have problems. However, this is not always possible. If you want to write into a file you have to encode the text. The same applies if you want to send HTML to the browser. Both the file and the browser need chars, not a sequence of numbers, so encoding is needed.

Django, for example, gives you all the variables sent over HTTP as unicode variables. This means that using request.POST or request.GET you will get unicode variables. From this point you can start mixing unicode variables with string variables without noticing, until you start getting runtime exceptions like the ones above. Django also uses unicode when it retrieves data from the database with the ORM.

To avoid problems you should either work with strings all the time or work with unicode all the time. If you use unicode all the time, Django will know how to render HTML by using the DEFAULT_CHARSET setting (which defaults to UTF-8). I’ve decided to work with unicode all the time.

Now some unicode management statements:

x = "hello" # this is a string
y = u"hello" # this is unicode built from the default encoding
z = unicode("hello", 'utf-8') # this is unicode built from utf-8 encoding

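Notice that for plain ASCII text like “hello”, y and z are the same. With non-ASCII chars, y is built using the encoding declared for the source file, so y and z are only guaranteed to match when that encoding is ‘utf-8’. The way to declare the encoding in Python code is this line at the beginning of the file: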

# -*- coding: utf-8 -*-

When I say “built from utf-8 encoding” I mean that in order for the system to recognize which chars you are processing, it needs to know how they were encoded.

h = x.decode('utf-8') # this is unicode built from utf-8
j = y.encode('utf-8') # this is a string encoded using utf-8

decode is the way to transform a string into unicode. encode is the opposite.
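
A minimal round-trip sketch of decode and encode, assuming the bytes came from somewhere that uses UTF-8 (a file, a socket). The b prefix marks a byte string so that the snippet also runs on Python 3; on Python 2 it is the plain string type. The names raw, text and back are only illustrative.

```python
raw = b"adi\xc3\xb3s"        # chars as they would sit in a UTF-8 file
text = raw.decode("utf-8")   # string -> unicode
back = text.encode("utf-8")  # unicode -> string again

print(len(raw))   # 6, because "ó" takes two bytes in UTF-8
print(len(text))  # 5, one entry per symbol
```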

a = "adiós" # a string containing non-ascii chars
unicode(a) # this raises exception

When you try to convert a string into unicode, you can pass in the encoding or not. If you don’t, Python will try the ASCII encoding. As the variable ‘a’ contains non-ASCII chars, it cannot be decoded properly and an exception is raised. This behaviour can be changed with another optional parameter:

unicode(a, errors='ignore') # this produces u"adis"

But if the string had contained just ASCII chars, this would have worked and you wouldn’t have noticed. If you don’t pay attention to this, you end up with runtime exceptions faced by your users.
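
The effect of the errors parameter can be sketched as follows. Note that this uses the decode method with an error handler instead of the unicode(...) call above, only so that the snippet runs on both Python 2 and Python 3; the ‘replace’ handler is shown as an extra option and the variable names are illustrative.

```python
raw = b"adi\xc3\xb3s"  # UTF-8 chars for "adiós"

# Default (strict) handling raises on the first non-ASCII char.
try:
    raw.decode("ascii")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("ascii", "ignore")    # silently drops what it cannot decode
replaced = raw.decode("ascii", "replace")  # marks each bad byte with U+FFFD

print(ignored)  # adis
```

Both handlers lose information, so they hide the problem rather than fix it; passing the right encoding is the real solution.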

Best practices are:

Always define string literals as unicode sequences: x = u"some_constant"
Join strings with the % operator:
joined = u"whatever=%s,ok=%s" % (the_value, the_other_value)
Don’t think that a symbol is always one byte. In UTF-8 some symbols need 2 or more bytes, so avoid things like this on strings:
third_char = some_string[2]
Don’t cast unicode to string with the str function (use encode):
my_hopefully_str = str(some_var)
If some_var is unicode, str will try to encode it using ASCII, and that will raise an exception if it contains non-ASCII chars. The str function is still handy for serializing objects such as lists or dicts, even if they contain non-ASCII chars, because those chars are written out as escape sequences (hex numbers).
Don’t compare string to unicode:
"adiós" is not equal to u"adiós"
Some functions are not able to work with unicode because they need chars:
hashlib.md5(my_string).hexdigest()
base64.b64encode(my_string)
urllib.urlencode(dictionary_containing_unicode_values)
These functions need a string. Make sure you encode the unicode into a string before calling any of them. If you pass in unicode, they will try to use the ASCII encoding to transform it into a string.
Python’s SimpleCookie object can’t work with unicode either. So when you set cookies like this:
request.COOKIES[k] = v # v has to be a string, not unicode.
The same applies when you use the ‘set_cookie’ method.
If you pass in unicode you will get this exception: translate() takes exactly one argument (2 given)
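
As a sketch of the “encode before calling” advice, here is how the hashing and base64 calls from the list look with an explicit encoding step. UTF-8 is an assumption; use whatever encoding the receiving side expects. It runs on both Python 2 and Python 3, and the variable names are only illustrative.

```python
import base64
import hashlib

text = u"adi\xf3s"              # unicode "adiós", e.g. a value from request.POST
encoded = text.encode("utf-8")  # explicit step: unicode -> string

digest = hashlib.md5(encoded).hexdigest()  # hex digest, 32 chars long
b64 = base64.b64encode(encoded)

# Going back: decode base64 first, then decode the chars into unicode.
restored = base64.b64decode(b64).decode("utf-8")
```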

When you make a string or unicode joining variables like this:

x = var1 + "_" + var2

If var1 or var2 is unicode, the system will try to decode all the other parts. Remember that decoding a string which contains non-ASCII chars is no fun. Joining with the % operator is the same, but the syntax makes you aware of the problem because you are writing that ‘u’ at the beginning.
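
One way to stay safe when joining is to decode every string part yourself before it meets a unicode part. A minimal sketch, assuming var2 holds UTF-8 chars; the sample values are made up, and the snippet runs on both Python 2 and Python 3.

```python
var1 = u"adi\xf3s"      # unicode, e.g. a value from request.POST
var2 = b"se\xc3\xb1or"  # string with UTF-8 chars (assumed encoding)

# Decode explicitly instead of letting the join guess ASCII.
joined = u"%s_%s" % (var1, var2.decode("utf-8"))
```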

Here are some utility functions to use when you want to make sure that a variable is a string or that it is unicode:

# -*- coding: utf-8 -*-

def __if_number_get_string(number):
    # Numbers become their string form; anything else passes through untouched.
    if isinstance(number, (int, long, float)):
        return str(number)
    return number

def get_unicode(strOrUnicode, encoding='utf-8'):
    # Return unicode, decoding strings with the given encoding.
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode
    # errors='ignore' silently drops chars that cannot be decoded.
    return unicode(strOrUnicode, encoding, errors='ignore')

def get_string(strOrUnicode, encoding='utf-8'):
    # Return a string, encoding unicode with the given encoding.
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode.encode(encoding)
    return strOrUnicode

I don’t know the historical reasons to work with unicode types, because string types with the chars encoded as UTF-8 are very powerful and free of surprises. Maybe it is because some human languages don’t fit into UTF-8. However, this is the way it is nowadays.