Carlos Ble

Understanding Python and Unicode

So you are getting wrong symbols in your web page, maybe some chars are missing, or you are getting these runtime exceptions:

  • TypeError: decoding Unicode is not supported
  • UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

And you definitely don't understand why. It is time to learn how to fix it forever!

The information in this post might not be precise, but it is short and direct so that you can understand it at least as well as I do. Because you don't want to read all those Unicode HOWTOs, do you? (But you should!)

----------------------------------

Python 2 supports the string type (str) and the unicode type. A string is a sequence of bytes that stand for chars in some encoding, while a unicode is a sequence of "pointers" (code points). The unicode is an in-memory representation of the sequence, and every symbol in it is not a char but a number (usually written in hex) intended to select a char in a map. So a unicode variable does not have an encoding, because it does not contain encoded chars.

Why is encoding needed? In order to print text or write text to a file, you cannot write unicode directly because it is just a sequence of numbers. You have to replace those numbers with bytes. The way you choose to do that conversion is the encoding. You can use the ASCII "alphabet" to encode the unicode, or you can use UTF-8, etc. The problem is that ASCII is not rich enough to express symbols like 'ñáéíóú'. You need an encoding such as UTF-8 for that unicode string because otherwise you will lose some chars.
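
To make this concrete, here is a minimal sketch of one symbol and the bytes it turns into. It only uses literals and methods that exist in both Python 2.7 and Python 3 (u"..." is unicode, b"..." is an encoded string), and the variable names are made up for illustration:

```python
# -*- coding: utf-8 -*-
# Illustrative sketch: one code point, several possible byte sequences.

text = u"\xf1"  # a single code point, U+00F1 (the letter n with tilde)

code_point = ord(text)              # the number behind the symbol: 241 (0xF1)
as_utf8 = text.encode("utf-8")      # two bytes in UTF-8
as_latin1 = text.encode("latin-1")  # one byte in Latin-1

# ASCII only covers code points 0..127, so it cannot express this symbol.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The same symbol becomes different bytes depending on the encoding you pick, and ASCII cannot encode it at all.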

Python manages the string type as a plain sequence of bytes; it does not record which encoding they use. In all the installations I've seen, the source files and the terminal use UTF-8, so string literals end up encoded in UTF-8. So if you work all the time with string types, not mixing them with unicode types, most things will work and you will rarely see these exceptions. However, this is not always possible. If you want to write into a file, you have to encode the text. The same goes for sending HTML to the browser. Both the file and the browser need bytes, not a sequence of numbers, so encoding is needed.
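
As a rough illustration of the file case, the sketch below writes unicode text through io.open, which does the encoding for you, and then reads the raw bytes back. The temporary directory and the file name greeting.txt are just assumptions for the example; io.open exists in both Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical file location, only for this example.
path = os.path.join(tempfile.mkdtemp(), "greeting.txt")

# Writing: we hand over unicode and name the encoding used on disk.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"adi\xf3s")

# Reading in binary mode shows what is really stored: encoded bytes.
with io.open(path, "rb") as handle:
    raw = handle.read()

# Reading in text mode with the same encoding gives the unicode back.
with io.open(path, "r", encoding="utf-8") as handle:
    text = handle.read()
```

The file holds six bytes for a five-symbol word, because the accented char takes two bytes in UTF-8.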

Django, for example, gives you all the variables sent over HTTP as unicode variables. This means that using request.POST or request.GET you will get unicode variables. From this point on you can start mixing unicode variables with string variables without noticing, until you start getting runtime exceptions like the ones above. Django also uses unicode when it retrieves data from the database with the ORM.

To avoid problems you should either work with strings all the time or work with unicode all the time. If you use unicode all the time, Django will know how to render HTML by using the DEFAULT_CHARSET setting (which defaults to UTF-8). I've decided to work with unicode all the time.

Now some unicode management statements:

x = "hello"  # this is a string
y = u"hello" # this is unicode built from the source file's encoding
z = unicode("hello", 'utf-8') # this is unicode built from utf-8 encoding

Notice that if the source file's encoding is 'utf-8', both y and z are built the same way. With another source encoding, literals containing non-ascii chars would come out different or fail to decode (pure ASCII text like "hello" is the same either way). A way to express the encoding in python code is using this line at the beginning of the file:

# -*- coding: utf-8 -*-

When I say "built from utf-8 encoding" I mean that in order for the system to recognize which chars you are processing, it needs to know how they were encoded.

h = x.decode('utf-8') # this is unicode built from utf-8
j = y.encode('utf-8') # this is a string encoded using utf-8

decode is the way to transform a string into unicode. encode is the opposite.
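
A small round trip may help here. This is only a sketch with made-up variable names, written so that it runs on Python 2.7 and Python 3. It also shows what happens when you decode with the wrong encoding, which is one common source of the wrong symbols in a web page:

```python
# -*- coding: utf-8 -*-

raw = b"adi\xc3\xb3s"         # the word as UTF-8 encoded bytes
text = raw.decode("utf-8")     # string -> unicode
back = text.encode("utf-8")    # unicode -> string

lengths = (len(raw), len(text))  # bytes versus symbols

# Decoding with the wrong encoding does not fail here, it silently
# produces the wrong symbols (two of them instead of one accented char).
mojibake = raw.decode("latin-1")
```

Decoding and encoding with the same encoding gives you back the original bytes, while decoding with a different one gives you garbage without any exception.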

a = "adiós" # a string containing non-ascii chars
unicode(a) # this raises exception

When you try to convert a string into unicode, you can pass in the encoding or not. If you don't, python will try to decode it as ASCII. As the variable 'a' contains non-ascii chars, it is not possible to decode it properly and it raises an exception. This behaviour can be changed with another optional parameter:

unicode(a, errors='ignore') # this produces u"adis"

But if the string had contained only ascii chars, this would have worked and you wouldn't have noticed. If you don't pay attention to this, you end up with runtime exceptions faced by your users.
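
The same behaviour can be reproduced with the decode method, which takes the error handling mode as its second argument. This is a sketch that runs on Python 2.7 and Python 3; the variable names are only illustrative:

```python
# -*- coding: utf-8 -*-

raw = b"adi\xc3\xb3s"  # UTF-8 bytes containing a non-ascii char

# Strict decoding (the default) raises as soon as it meets a byte over 127.
try:
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# "ignore" drops the offending bytes, "replace" marks each one with U+FFFD.
ignored = raw.decode("ascii", "ignore")
replaced = raw.decode("ascii", "replace")

# A pure ASCII string decodes fine, which is why the bug can stay hidden.
harmless = b"hello".decode("ascii")
```

Both "ignore" and "replace" lose information, so they hide the problem rather than fix it. Decoding with the right encoding is the real solution.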

Best practices are:

  1. Always define string literals as unicode sequences:  x = u"some_constant"
  2. Join strings with the % operator:
    joined = u"whatever=%s,ok=%s" % (the_value, the_other_value)
  3. Don't assume that a char is always a byte. In an encoded string some chars need 2 or more bytes, so avoid things like:
    third_char = some_string[2]
  4. Don't cast unicode to string with the str function (use encode):
    my_hopefully_str = str(some_var)
    If some_var is unicode, str will try to encode it using ASCII, and that will raise an exception if it contains non-ascii chars. The str function is still fine for serializing objects such as lists even if they contain non-ascii chars, because the result shows them as escaped hex numbers.
  5. Don't compare string to unicode:
    "adiós" is not equal to u"adiós"
  6. Some functions are not able to work with unicode because they need bytes:
    hashlib.md5(my_string).hexdigest()
    base64.b64encode(my_string)
    urllib.urlencode(dictionary_containing_unicode_values)
    These functions need a string. Make sure you encode the unicode into a string before calling any of them. If you pass in unicode, they will try to use the ASCII encoding to transform it into a string.
    Python's SimpleCookie object can't work with unicode either. So when you set cookies like this:
    request.COOKIES[k] = v # v has to be a string, not unicode.
    The same applies when you use the 'set_cookie' method.
    If you pass in unicode you will get this exception: translate() takes exactly one argument (2 given)
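
Points 3 and 6 above can be sketched like this. The code is only an illustration with made-up names; it runs on Python 2.7 and Python 3 and uses hashlib.md5 and base64.b64encode from the standard library:

```python
# -*- coding: utf-8 -*-
import base64
import hashlib

name = u"adi\xf3s"                 # unicode, 5 symbols
name_bytes = name.encode("utf-8")  # string, 6 bytes

# Encode first, then hand the bytes to functions that need bytes.
digest = hashlib.md5(name_bytes).hexdigest()
encoded = base64.b64encode(name_bytes)
decoded_again = base64.b64decode(encoded).decode("utf-8")

# Indexing: position 3 in unicode is the whole accented char, while
# position 3 in the encoded string is only the first half of it.
fourth_symbol = name[3]
fourth_byte = name_bytes[3:4]
```

Encoding explicitly right before the call keeps the rest of the code working with unicode only.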

When you build a string or unicode by joining variables like this:

x = var1 + "_" + var2

If var1 or var2 is unicode, the system will try to decode all the other parts using ASCII. Remember that decoding a string which contains non-ascii chars is no fun. Joining with the % operator behaves the same, but the syntax makes you aware of the problem because you are writing that 'u' at the beginning.

Here are some utility functions to use when you want to make sure that a variable is a string or that it is unicode:
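
Here is a hedged sketch of that failure and of the safe alternative. On Python 2 the mixed join raises UnicodeDecodeError because of the implicit ASCII decoding; on Python 3 mixing the two types raises TypeError instead, so the example catches both. The variable names are invented for the example:

```python
# -*- coding: utf-8 -*-

text = u"adi\xf3s"          # unicode
raw = text.encode("utf-8")  # string with non-ascii bytes

# Mixing the two types in a join blows up at runtime.
try:
    joined = text + raw
    failure = None
except (UnicodeDecodeError, TypeError) as exc:
    failure = type(exc).__name__

# Safe version: decode explicitly, then join unicode with unicode.
safe = u"%s_%s" % (text, raw.decode("utf-8"))
```

Decoding every incoming string once, at the point where it enters your code, avoids having to remember which variables are safe to join.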

# -*- coding: utf-8 -*-

def __if_number_get_string(number):
    # Numbers are turned into strings; anything else passes through untouched.
    converted_str = number
    if isinstance(number, (int, float)):
        converted_str = str(number)
    return converted_str

def get_unicode(strOrUnicode, encoding='utf-8'):
    # Decode a string into unicode, silently dropping undecodable bytes.
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode
    return unicode(strOrUnicode, encoding, errors='ignore')

def get_string(strOrUnicode, encoding='utf-8'):
    # Encode unicode into a string; strings are returned as they are.
    strOrUnicode = __if_number_get_string(strOrUnicode)
    if isinstance(strOrUnicode, unicode):
        return strOrUnicode.encode(encoding)
    return strOrUnicode

I don't know the historical reasons for working with unicode types, because string types with the chars encoded as UTF-8 are very powerful and free of surprises. Maybe it is because some human languages don't fit into UTF-8. However, this is the way it is nowadays.

  • http://www.galotecnia.com Esaú Rodríguez

    Carlos, thank you very much for this post. IMHO one of the worst things in Python/Django is charset errors due to the slight difference between the string and unicode definitions. Now you've turned on the light!

  • http://sanacl.wordpress.com Luis Cañas Díaz

    Very useful article. Thanks :)

  • Carlos

    Thanks for this post, encoding is always a headache xD. What if I want to store data from a form. Maybe it has been copy-pasted from a document and it contains “strange” symbols… any suggestion?

  • http://www.mavencharts.es Oscar Moreno

    Hi Mate!! great info, very clear. There are some changes of str and unicode types in python 3, no more u"your_string" and str is unicode by default.

    “The biggest difference with the 2.x situation is that any attempt to mix text and data in Python 3.0 raises TypeError, whereas if you were to mix Unicode and 8-bit strings in Python 2.x, it would work if the 8-bit string happened to contain only 7-bit (ASCII) bytes, but you would get UnicodeDecodeError if it contained non-ASCII values. This value-specific behavior has caused numerous sad faces over the years.”

    This is also a good read:

    http://docs.python.org/py3k/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit

    Cheers!!

  • http://carlosble.com Carlos Ble

    When you post a form, Django gives you unicode variables. It tries to decode the sent data using the encoding info that goes into the HTTP headers (I guess), so you get unicode in your controller (ok, view in Django).

  • http://www.tvprogramy.com.pl Antony Concepcion

    excellent post, very informative. I wonder why the other experts of this sector do not notice this. You must continue your writing. I’m confident, you have a huge readers’ base already!

  • Mauro_Brazil

    Hi there. First, thanks for the post. Excellent explanation.

    I did not understand something:


    a = "adiós" # a string containing non-ascii chars
    unicode(a) # this raises exception

    You said that the example above raises an exception because it tried to "encode" the a variable with ASCII. But isn't unicode just a representation (not the char itself)? So why does python's unicode function try to encode it?

  • http://www.carlosble.com Carlos Ble

    Hi Mauro,
    I meant, decode. The unicode function will try to decode the string, trying with ascii which fails.
    Hope it helps.

  • Mauro_Brazil

    Ok ! It’s OK now !

    Tks a lot !

  • Thales Angelino

    Thanks Carlos, your post helped me a lot!

  • http://www.carlosble.com/ Carlos Ble

    Excellent! that is part of the blog’s motivation, to help people. Thank you :-)