Defying Classification

by Malcolm Tredinnick

Fri 13 Jul 2007

Django Unicode Changes Post-mortem

Posted at 19:46 +1000

It's been a little over a week since I merged a large change into Django's trunk code to make it Unicode compatible. Prior to doing this, I had wondered how easy people were going to find working with the very slight changes this induced. Where were the hidden traps? How did various design decisions stand up in practice?

It's a little early to draw definitive conclusions, but here are a few observations from a week of general population usage, combined with feedback from the early adopters.

Design Decisions

(In what follows, I say "I" or "my" a lot. However, although I got to make the final decisions, it was often only after listening to a lot of opinions from people on the mailing lists, including posts from last year that were still sitting in the archives, or on IRC or IM. This was in no way a solo effort and I'm not intending to grab credit that isn't mine.)

Right from the start, my strong desire was to be as backwards compatible as possible with existing code. There were some mutterings on the lists that we should only support Unicode, but that isn't very practical. Lots of existing code was going to be impacted by this change, so continuing to support bytestrings (Python's normal str object) as much as possible was a must, as far as I was concerned. I think with some judicious convenience methods and showing people how it worked, we managed to win over most of the loyal opposition. In any case, it's not a decision I regret.

From lots of bug fixing I had done prior to biting off this chunk of work, I felt comfortable saying that trunk had enough problems with non-ASCII data that it didn't work reliably. So backwards compatibility for non-ASCII data wasn't a goal. However, the intention was that people using ASCII data both before and after the changes should be affected as little as possible.

This approach, to a large extent, has worked out as planned. There have been a couple of people reporting problems where they were previously lucky enough to be using non-ASCII data (ISO-8859-1 in both cases) without failures. Sadly, that incompatibility can't be helped and a quick re-encoding to UTF-8 solves the problems. Python's support for encoding data into different representations is superb and very easy to work with once you get used to it.
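As a rough sketch of that re-encoding step: the sample bytes below are invented for illustration, and the b"" literal assumes a Python recent enough to accept it. The idea is simply to decode using the encoding the data was really written in and then encode the result as UTF-8.

```python
# Bytes as an ISO-8859-1 application might have stored them
# (made-up sample data: "Caf" followed by e-acute as the single byte 0xE9).
latin1_bytes = b"Caf\xe9"

# Step 1: decode with the encoding the data was actually written in.
text = latin1_bytes.decode("iso-8859-1")

# Step 2: encode the resulting Unicode string as UTF-8.
utf8_bytes = text.encode("utf-8")
```

Afterwards, utf8_bytes holds the same text with the accented character stored as the two-byte UTF-8 sequence 0xC3 0xA9.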

The only place I thought there would be even potential problems was where something absolutely relied on the type of a string-like object being str, as opposed to being a basestring (str or unicode), in Python terms. In retrospect, this has turned out to be correct, but almost immediately a couple of people found the one place that matters: the names of function keyword arguments must have str type. The effect is that if data is pulled from the database or form input and you want to pass it as key/value pairs to a function (the f(**data) style of call), the keys initially have the wrong type (unicode). I don't see this as a real problem, since it's trivial to convert back in the few cases you need to and it's not a really common case, but it's one compulsory change that people have to make and I was really hoping for no backwards incompatibilities. Even now, I can't see how we could have realistically avoided that, though. The alternative would introduce inconsistencies about whether strings returned from Django functions were bytestrings or Unicode strings. We've gone with the "always return Unicode" approach.
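A minimal sketch of the "convert back" step for keyword arguments follows. The create_entry function and the data dictionary are hypothetical stand-ins, not anything from Django. Only the keys are converted, which is safe because keyword names have to be valid (ASCII) identifiers anyway; the values stay as Unicode strings.

```python
def create_entry(title=None, author=None):
    """Hypothetical function that takes its input as keyword arguments."""
    return (title, author)

# Hypothetical data as it might come back from a form or the database:
# both keys and values are Unicode strings.
data = {u"title": u"Post-mortem", u"author": u"Malcolm"}

# Convert only the keys to str so they can be used as keyword names.
kwargs = dict((str(key), value) for key, value in data.items())

result = create_entry(**kwargs)
```

On Python versions where Unicode keys are already accepted as keyword names, the conversion is harmless and leaves the call unchanged.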

It doesn't appear that any of the other places where we agonised over which way to go have real downsides. So all the design thinking has paid off in terms of a fairly transparent transition. It should also be mentioned that all the early testing from people using systems such as Windows in a native Russian environment (via some advance publicity from Ivan Sagalaev) turned up some things I would never have guessed would happen. By the time we unleashed these changes on the masses, most of the obvious and many of the subtle problems had already been found and fixed.

The Trap Of Convenience

There is one trap that a few people are falling into now that international character support has finally arrived. They have forgotten that they are still just using Python (CPython, in fact, since there are slight differences in string handling between IronPython, CPython and Jython) under the covers and they have to still code in that language and obey the rules.

This means that we are seeing requests for help where somebody is trying to return a Unicode object from a __str__ method. Python will try to convert that to a str instance using the default encoding (normally ASCII), usually with unexpected results, such as a UnicodeEncodeError as soon as non-ASCII data shows up. The problem is easily fixed, by using a __unicode__ method instead of __str__, but it's surprising how often I've seen this. Unicode handling in Python 2.x is not transparent! Bytestrings and Unicode strings are not equivalent on this level and you have to remember to use the right special method on classes to return the right type.
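Here is one way the split between the two special methods might look on a plain class. The Author class is hypothetical, and the choice of UTF-8 for the bytestring form is an assumption made for the sketch. The version check exists only so that the same code also runs on Python 3, where __unicode__ has no special meaning.

```python
import sys

class Author(object):
    """Hypothetical class whose name attribute is a Unicode string."""

    def __init__(self, name):
        self.name = name

    def __unicode__(self):
        # The Unicode representation: return the text untouched.
        return self.name

    if sys.version_info[0] < 3:
        def __str__(self):
            # Python 2: __str__ must return a bytestring, so encode
            # explicitly (UTF-8 is an assumption for this sketch).
            return self.__unicode__().encode("utf-8")
    else:
        # Python 3: str is already Unicode, so reuse the same method.
        __str__ = __unicode__

author = Author(u"Ren\xe9")
```

The point is that the non-ASCII text is only ever returned from __unicode__, and any conversion to bytes is an explicit encode rather than something left to the default encoding.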

There have been a couple of other problems that are similar to this: basically somebody trying out non-ASCII data without doing all the necessary porting steps. Usually the errors are revealing enough that you can easily spot the mistake. Sometimes, all we can say is "please double-check your code."

Pleasant Surprises

From a development perspective, things went both as well and as badly as I expected. Most changes were routine; a couple were much fiddlier than I initially guessed, but ultimately solvable after a number of walks around the block to think through the design. Getting delayed translations working smoothly took approximately forever, it seemed. Working around a subtle Python 2.3 Unicode bug is another one that took ages to fix in all the places it might occur in the code (in fact, we're still nailing a couple of them).

In amongst all this, the really pleasant surprise was how easy it was to interact with all the database backends. I was expecting lots of pain here, since we need to be able to interact with legacy databases and databases where the user may not have control over the server's encoding. In the end, though, almost every backend understands that the client (Django in this case) and the server can have different encodings and will both report the client encoding and allow it to be set. Most backends had automatic conversion to Unicode for retrieved strings, as well. Even the Oracle portion was easier than I initially feared, although the maintainers of cx_Oracle really need to join the 21st century and allow for converter callbacks to be attached to the retrieval functions so that we can always and naturally convert strings to Unicode (rather than having to pull the results and iterate over them doing encoding work in Python).
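To give a feel for the "pull the results and iterate" workaround, here is a sketch that uses no database API at all. The rows list is an invented stand-in for whatever a cursor returns, decode_row is a hypothetical helper, and UTF-8 is assumed as the client encoding.

```python
def decode_row(row, encoding="utf-8"):
    """Return a copy of row with every bytestring decoded to Unicode.

    Hypothetical helper; non-string values pass through unchanged.
    """
    return tuple(
        value.decode(encoding) if isinstance(value, bytes) else value
        for value in row
    )

# Invented rows standing in for the result of a cursor fetch.
rows = [(1, b"Z\xc3\xbcrich"), (2, b"Oslo")]

# The extra pass over the results that a converter callback would avoid.
decoded = [decode_row(row) for row in rows]
```

A converter callback in the driver would make this second pass unnecessary, which is the complaint above.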

It's considered commonplace to rant against poor design, but hang around the industry and the real world long enough and you learn that almost nothing is perfect, yet almost everything can be worked around after a bit of thinking. So it's a nice surprise when a whole group of related libraries work almost perfectly with respect to functionality you need. It doesn't increase stress levels too much to have to work around the very few rough edges that remain.

Topics: software/django