Tue 22 Apr 2008
Software Internationalisation
Posted at 23:19 +1000
Michael Trier and Brian Rosner released another episode of This Week In Django and I was asked to play the role of guest speaker. The focus was on internationalisation in Django (and in software in general).
I realised after we recorded it that I neglected to mention a few things that are probably of note for developers trying to write international software. Consequently, here's a small follow-up with some additional things I thought of last night.
This isn't entirely Django-specific, since most of these techniques apply to software in general.
Firstly... thanks!
In general, I'm not a podcast listener, since I find them a bit of a slow way to receive information. Sometimes, though, if it's an interesting speaker or topic, I'll set aside the time to listen in the evening instead of reading in bed or dump it on the iPod when I'm going for a walk. For This Week In Django I've tended to make an exception and listen regularly (not so much for the ones I'm in, since listening to myself speak feels weird). In part, it's interesting to hear what other people are thinking about and I can let it play along in the background whilst I'm doing other work. It's a well produced show and Michael and Brian deserve a lot of credit for doing it week in and week out, chasing down things to talk about and doing a real service to people who find that format useful.
I'm always impressed by how fast Michael turns around the editing on these episodes. Less than 24 hours after we recorded a couple of hours of conversation over Skype, he has a nicely edited version available, in three formats, including screenshots in the AAC version and links in the show notes. There's always lots of dead air and inadvertent speaking over each other that needs to be edited out of something like this, so it's not just a matter of Michael dropping the recorded version into a link and then kicking back with a beer. This week would have been slightly challenging, too, since I was having a few hardware issues from what seems to be a bad USB bus and I dropped off the call a few times and even had to switch computers mid-stream, so it took a bit longer to record than the 100 minutes in the final show (plus it goes on forever from the looks of it; I may have spoken too much).
Writing Software For The International Audience
We All Make Assumptions
During the show, I was asked what developers should be thinking about when trying to write software that can be used in languages other than their own. I can't remember the full answer, but I do know a few things I forgot to mention that all fall into the same general category:
We often implicitly build in cultural assumptions as well as linguistic assumptions when we're writing code.
How many times have you seen a form asking for first and last name that requires both and then combines them as first name, followed by a space, followed by the last name? That's a small problem, since it's not a failsafe way to combine names in a number of situations. We also tend to confuse "first" and "last" name with "given" and "family" name, when they aren't necessarily equivalent. There are a couple of ways around this: don't require both entries, so that somebody can use just one for a combined name, or have a single "name" box and don't try to impose your own impression of when it's appropriate to use one or the other. Trying to build in something that is customisable by the data entry person for all cases would lead to a form that is almost unmanageable in complexity, so that's probably not worth attempting.
Address entry is another area where it's not hard to mess up, although websites in general are getting better at this. There are still sites around that require you to enter a state or province and a postal or zip code. Both of these are problematic, since a number of small countries and administrative regions don't use those concepts (either at all or maybe just in practice). The year and a bit I spent living in Hong Kong was eye-opening here. Hong Kong doesn't use either (it functioned as the country name for practical purposes, although it's an administrative district of China), and a couple of times I needed to register on sites that required information that simply didn't exist in my address.
One that I still see a fair bit today (and it's not an internationalisation issue as much as a cultural or domain-specific assumption) is websites that require a company name, such as for a conference registration. The company I trade under is my own name. That happens to make things a bit cheaper and is appropriate for one-man operations where I live. It looks kind of stupid when a name tag has my name printed twice because I was forced to enter the company name. Similarly, if you don't work for a company, perhaps because you're between jobs, does this mean you're not welcome at, say, OSCON (to pick one high profile place that makes this blunder)?
Finally, dates can lead to some interesting cases. I mentioned in the Django show that there's clearly an issue with dates of the form 4/5/08, since being able to parse that correctly relies on knowledge of whether you're using North American style dates or "pretty much everywhere else" style. But that's just a simple example of date format complexities.
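To make the 4/5/08 ambiguity concrete, here is a small Python sketch. The helper name parse_short_date and its day_first flag are my own inventions for illustration; the point is only that the same text yields two different dates, so the code has to be told (or has to ask) which convention applies.

```python
from datetime import datetime


def parse_short_date(text, day_first):
    """Parse a short numeric date; the caller must say which order applies.

    Hypothetical helper: nothing in the text itself tells us the order.
    """
    fmt = "%d/%m/%y" if day_first else "%m/%d/%y"
    return datetime.strptime(text, fmt).date()


# The same input read under the two conventions.
us_reading = parse_short_date("4/5/08", day_first=False)
other_reading = parse_short_date("4/5/08", day_first=True)

print(us_reading.isoformat())     # 2008-04-05
print(other_reading.isoformat())  # 2008-05-04
```

Showing dates back to the user in an unambiguous form (a spelled-out month, for instance) is one way to let them catch a wrong guess.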
The subtle side to dates, however, is validation with respect to timezones. A few years ago whilst travelling in the US, I needed to make a last minute accommodation booking. There were problems with paying with my credit card on US sites (they required a US or similar billing address), so I was using an Australian site that let me book US hotels. Except that it insisted I couldn't book the night I wanted since it was in the past. Except, it wasn't. This was a really last minute booking and although it was already, let's call it Saturday in Australia, it was still Friday in the US, where I was wanting to check in. I needed a booking for the Friday night. The current date is not a universal constant. Most people remember that timezones change things around a bit with times of day (well, most people do), but they forget that dates are important, too. Particularly relevant if you live far east of the prime meridian, such as in New Zealand or Australia.
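Here is a rough Python sketch of that booking check. The function name and the fixed UTC offsets are assumptions for illustration (real code would want named timezones so daylight saving is handled), but it shows how "is this date in the past?" gets a different answer depending on whose calendar you consult.

```python
from datetime import date, datetime, timedelta, timezone

# Illustrative fixed offsets only; they ignore daylight saving on purpose
# so the sketch stays self-contained.
AUSTRALIA_EAST = timezone(timedelta(hours=10))
US_PACIFIC = timezone(timedelta(hours=-7))


def is_in_past(check_in, now_utc, tz):
    """True if check_in falls before "today" as seen in timezone tz."""
    today_there = now_utc.astimezone(tz).date()
    return check_in < today_there


# 03:00 UTC on Saturday 5 April 2008: Saturday afternoon in eastern
# Australia, still Friday evening on the US west coast.
now_utc = datetime(2008, 4, 5, 3, 0, tzinfo=timezone.utc)
friday_night = date(2008, 4, 4)

# Validating against the website's own calendar rejects the booking...
print(is_in_past(friday_night, now_utc, AUSTRALIA_EAST))  # True
# ...while validating against the hotel's calendar accepts it.
print(is_in_past(friday_night, now_utc, US_PACIFIC))      # False
```

The design choice being illustrated is to compare against the date at the place where the event happens, not at the place where the server or the developer lives.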
There are lots of other cultural assumptions made in software, but these should give you some idea of the things to watch out for. Sometimes it's a case of being careful to use terminology that is familiar to your users — phrases like "desktop" and "trash can" aren't universally understood — although that's often a localisation issue (the translators should try to pick appropriate translations). Still, not being overly technical in vocabulary isn't a bad goal in general.
Aside: The Huge List Of Places In Django
At some point in the past, Django started accepting contributions from far and wide for a directory called localflavor/. These were form entry and data validation functions for all sorts of regional-specific pieces of information. Things like US Social Security numbers, Australian postcodes, provinces in Spain, districts in Japan.
Not a bad thing, per se. It does add to the administrative burden for translators, however. The reason, again, is somewhat cultural. In this case, the cultural choice of written script. Not everybody uses the Roman alphabet. Japan, China, Korea, Georgia, Russia, Iran, India, ... there's a huge list of countries using other alphabets. Once India and China make a list together, you're talking about a significant portion of the world's population. Therefore, every single place name needs to be marked for translation so that translators in all the non-Roman script locales can translate, or, at least, transliterate, the names into their display script.
The point here is that even strings which don't really "translate" in a strict sense into any other language are still on the table when thinking about internationalisation. Because how is that string going to be displayed on a website in Japan?
Some Things Are Annoying Technical Problems
Okay, so you're a translator of a piece of software. Say, translating Django. And you see the string "April". Translate it as the fourth month of the year, right? Not necessarily. Translation files tend to combine multiple occurrences of a string into one, so that you only translate each string once. However, this can lose some context. That's what makes it a technical problem: we sometimes need a way to keep the occurrences distinct, for cases like the one I'm about to describe.
In Django, the string "April" is used both as the full name of the fourth month and as the Associated Press abbreviation for the fourth month. From what I gather (the AP style guide isn't freely available) AP style only shortens months longer than five characters. Now, this portion of AP style doesn't make a lot of sense in foreign languages, but the string needs to be translated in any case and it makes sense to use some kind of abbreviation for the month to make it consistent with the other abbreviations used in that place. Except the string for the full name of the fourth month and the string for the abbreviation have been combined (they're the same string when viewed without any context), so translators cannot separate them.
In some cases, this problem can be overcome by using a different string for one of the occurrences. The strings passed to ugettext() calls are just message identifiers; they don't have to be the untranslated messages themselves. Django can't do this because we also want to be able to operate without the internationalisation support loaded. Which means the message identifiers have to be the untranslated strings. Silly technology. Sabotaging us again.
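A toy model of the trade-off, in Python. This is not how gettext or Django is implemented; the catalog dictionary, the function names, the made-up identifiers and the French strings are all just for illustration. It shows two things: identical identifiers collapse into one entry, and distinct identifiers fix that only as long as the translation machinery is switched on.

```python
# Toy catalogue keyed by message identifier, loosely like a PO file.
catalog = {}


def add_translation(msgid, translation):
    catalog[msgid] = translation


def translate(msgid, i18n_enabled=True):
    """Return the translation, or the identifier itself as a fallback."""
    if i18n_enabled and msgid in catalog:
        return catalog[msgid]
    return msgid


# Same identifier for both uses: the second entry replaces the first, so
# the full month name and the abbreviation cannot differ.
add_translation("April", "avril")
add_translation("April", "avr.")
print(translate("April"))  # avr.

# Distinct (made-up) identifiers keep the two uses apart...
add_translation("month.full.april", "avril")
add_translation("month.ap.april", "avr.")
print(translate("month.full.april"))  # avril

# ...but without translation support the raw identifier leaks out.
print(translate("month.full.april", i18n_enabled=False))  # month.full.april
```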
Testing International Software
One item I should have mentioned in the podcast yesterday but which completely slipped my mind was how to test internationalisation. For something like Django, it's not too hard for a developer to set things up and temporarily change the preferred language in their browser (or in the shell when running a desktop application) and view the standard framework in an alternate language. For example, here's a picture of the Django admin interface in Georgian.
The question is, how to do this for an application you are writing for yourself? One trick of the trade here is to make a dummy translation. Do all the necessary work up to the point where you think a translator could start to do the work. That's a technical matter and doesn't require any translation abilities. Then you create the PO file (the file with all the message strings in it), or the equivalent for other software. The trick, then, is to run through this file (possibly with the help of a script) and make the translated version of every string be the original with, say, "XX" added to the front and "YY" appended to the back. So the translation of "user" becomes "XXuserYY". The --msgstr-prefix and --msgstr-suffix options to xgettext can be useful here sometimes. But a quick script isn't hard to write, either.
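As a rough idea of what such a quick script might look like, here is a minimal Python sketch. It is deliberately naive: it only fills in entries whose msgid sits on a single line and whose msgstr is empty, and it leaves the header, multi-line entries and plural forms untouched. The function name and the sample PO text are mine, purely for illustration.

```python
import re

# Matches a single-line msgid entry; "msgid_plural" lines do not match.
MSGID_LINE = re.compile(r'^msgid "(.*)"$')


def dummy_translate(po_text, prefix="XX", suffix="YY"):
    """Fill empty single-line msgstr entries with prefix + msgid + suffix."""
    out = []
    pending = None  # text of the msgid seen on the previous line, if any
    for line in po_text.splitlines():
        match = MSGID_LINE.match(line)
        if match:
            pending = match.group(1)
            out.append(line)
        elif line == 'msgstr ""' and pending:
            # The msgid text is already escaped, so it is reused verbatim.
            out.append('msgstr "%s%s%s"' % (prefix, pending, suffix))
            pending = None
        else:
            # Header (empty msgid), continuation lines, comments, plurals.
            pending = None
            out.append(line)
    return "\n".join(out) + "\n"


SAMPLE = "\n".join([
    'msgid ""',
    'msgstr ""',
    '"Content-Type: text/plain; charset=UTF-8\\n"',
    '',
    'msgid "user"',
    'msgstr ""',
    '',
])

print(dummy_translate(SAMPLE))
```

Anything with format placeholders or plural forms deserves a look by hand afterwards, since a script this simple knows nothing about them.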
Now you can view your application in the locale that you used for this dummy translation and if you see any strings that aren't prefixed and suffixed appropriately, that string was not marked for translation somehow. Either it's an unmarked string in your source, or it's a string that came from somewhere else that you are displaying without change. This won't catch all problems (for example, plural forms still need careful attention), but it can show up a number of obvious problems before you get too far in.
You can take this further and make sure you run in a locale that is quite different from your own in order to check things like date display issues and data entry issues. You don't have to speak German to understand that Germans use day/month/year order for dates and that one fifth is written as "0,2", rather than "0.2" (by the way, don't try this with Django just yet; not handling European decimal input is a known bug).
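For a feel of what handling "0,2" involves, here is a tiny Python sketch. It is a hypothetical helper, not Django's API and not a full locale implementation: it assumes the caller already knows which characters the user's locale uses as the decimal and grouping separators.

```python
from decimal import Decimal


def parse_decimal(text, decimal_sep=",", group_sep="."):
    """Parse a number written with the given (assumed) separators."""
    cleaned = text.strip().replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


print(parse_decimal("0,2"))      # 0.2
print(parse_decimal("1.234,5"))  # 1234.5
# The same helper with English-style separators.
print(parse_decimal("0.2", decimal_sep=".", group_sep=","))  # 0.2
```

Running your forms with input like this is exactly the kind of check a "quite different" test locale makes easy.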
A Final Thought
Internationalisation is both easy and hard. Easy enough to understand and appreciate, hard to do well. However, that isn't an excuse for not trying. Writing software that tries hard to be appropriate for an international audience and to work well with translators is welcomed by the users. If it's not perfect but you've made a good start, the users will help you out with bug reports. They can appreciate the benefits to themselves of helping you bridge the last mile there. If you don't try at all, users will not even make it to step number one; your software fails for a broad user base right out of the box and they aren't really going to be motivated to help you.
Now, of course, all generalisations are false, but hopefully you can appreciate the concept here. Meet your users, your translators and your contributors halfway if you're a project lead. Lay the foundation so that other people can contribute to help improve things (unsurprisingly, this is not unique simply to internationalisation).
In the not too distant past when I wrote more software in C than I do now, there was an equivalency with endian issues and portable function usage in programs. If you tried hard to write code that was portable and which, when saving data or sending over the network, tried to be byte-order neutral, you would rapidly find yourself receiving patches from Solaris users and PowerPC users to fix the little problems they noticed. If you made no effort, you got the big raspberry from those users (except, maybe a brave few who helped out anyway), since you'd thumbed your nose at them in the initial code.
Get the ball rolling and others will help out.