Defying Classification

by Malcolm Tredinnick

Wed 26 Mar 2008

More Django Queryset Notes

Posted at 21:44 +1100 (last edited: 26 Mar 2008, 23:35)

Following on from some positive feedback to my recent post about writing portions of the ORM layer in Django, I'm encouraged to do another post about one particular implementation detail that took some time to puzzle through.

Sorry, general non-technical readers. This is another heavy Python post. Sorry, technical readers. This is another "how Malcolm thinks" post. It's not pretty. The good news is, there's no SQL in this post.

The Problem And Requirements

Django's querysets have a few features that allow them act as an efficient data structure, both in performance terms and in how they appear in code.

Firstly, they are lazy. Until you try to access the data represented by the queryset, it doesn't talk to the database. So you can build up a queryset incrementally. Or you can create one and pass it around to another function and, if it's not used, no performance penalty is paid. Secondly, they act as caches once you do access the data that a queryset represents. Repeated accesses incur no extra overhead. A queryset is a snapshot of data at the moment you first tried to access the data.

Finally, if we can get by with only retrieving an initial portion of the data from the database at any given time, the queryset will do that. The main place this is useful is when using a queryset as a boolean value. It should evaluate to True when it contains some data and False otherwise. To evaluate if the query contains some data, we retrieve the first few rows. We don't retrieve just the first row, because fetching one row at a time is very inefficient and usually you're going to be accessing the rest of the data anyway if you decide it does hold data. However, we don't automatically pull in all the data, so if your queryset contains 10,000 rows of data in the database, we can evaluate it as a boolean by only looking at the first few rows.

Realise that I'm talking about what querysets do in the queryset-refactor branch of Django's code here. The trunk code doesn't handle booleans efficiently at the moment. Trunk retrieves all the results at once. This also comes up when you create an iterator over the queryset: trunk retrieves all the results from the database and then creates an iterator over that list. The branch creates the iterator and only retrieves results as it needs them.

Cache Consistency And Parallel Iterators

If you're going to provide an iterator method on a Python object, it's probably a good idea to make sure it can support parallel iterators. That is, the following code should return the same results for both it1 and it2:

it1 = iter(MyObject)
it1.next()
it2 = iter(MyObject)
it2.next(), it2.next()
it1.next()

If you're looking at queryset-refactor's implementation of Queryset.__iter__ and wondering why it's a bit more involved than you might expect, this is why. Maybe you'll never write code that is quite this explicit in using parallel iteration, but it's not too hard to have outer and inner loops going over the same data at times, for example. Sure, it's not a bad idea to avoid this, but we shouldn't break if somebody does try it.

What The Profile Said

I was spending some time profiling the new code and going through the slightly tedious process of trimming slow pieces. That's still ongoing work, but some progress has certainly been made. One of the things that stuck out when comparing trunk code to the branch and their respective profile's, was how frequently Queryset.get() was being called to retrieve a single object and how much slower it was in the branch.

Hmm ... one of the most popular functions for the test suite — installing content types for each test calls get_or_create(), which calls get() — and it's massively slower. That was unexpected and necessary to fix.

The get() method looked like this:

# Original Queryset.get() implementation
def get(self, *args, **kwargs):
    clone = self.filter(*args, **kwargs)
    if not clone._order_by:
        clone._order_by = ()

    obj_list = list(clone)      # <-- Looks innocent, right?

    if len(obj_list) < 1:
        raise self.model.DoesNotExist, "..."
    assert len(obj_list) == 1, "..."
    return obj_list[0]

For brevity, I've trimmed comments and some long error strings and isolated the interesting line. It took me more than a couple of minutes to remember something that I'd worked out a long time ago debugging a similar issue and never used much since: Python's list() call is quite a complex beast.

From Sequence To List In Python

This is worth knowing if you're writing a class that implements both __iter__ and __len__ and are calling list() on instances of that type.

Python can obviously handle a sequence being passed to list(). When doing so, it would like to know, if possible and up-front, how much memory to allocate to the list. This avoids the need to append some elements, then extend the list's memory allocation, then append, then grab more memory, etc. So Python tries to work out if the sequence it has been given supports determining its own length. In Python 2.4 there is an internal attribute called __length_hint__ that was introduced, but it's marked as internal use only and wasn't in 2.3 in any case. I couldn't use it. So Python will call __iter__ and then call __len__ before consuming any of the iterator to see if it can tell how much space will be required for the pointers in the list structure (at the C level).

This was a bit of a disaster in the version of Queryset.__iter__ and Queryset.__len__ I had written, because the length method called the iterator method to pull in the results and then compute the length. And because of the parallel iteration requirement and because I was lazy, it was careful to use the normal code paths if it thought any of the original results had been retrieved (i.e. if somebody else had called __iter__ first). In fact, here's the old implementation:

# Original (slow) implementation
def __len__(self):
    return len(list(iter(self)))

Thus, Python's list() call fell right into the worst performance case for this implementation and we were calling list() in one of the most commonly called methods in my profiling test.

Updated: This isn't as clear as I intended because the problem isn't immediately visible in the code shown (it's partly hidden in __iter__). So some clarification: the problem here is that both __iter__ and __len__ were slow, independent of if the other one had been called or not. Since Python called both, it was at least twice as slow as it should have been. This is special for this particular class and caused by the caching requirement and my naïve implementation.

The Solution

The solution here was to take advantage of the fact that __len__ and __iter__ were methods on the same class and loosen the separation between their implementations a little bit. I decided they could both poke at the internal data structures in the class, since they were the only two methods needing the particular structures they were accessing and they were placed one after another in the code, so keeping them in sync should be both visually and technically reasonably simple.

Therefore, __len__ now doesn't use the iterator on the class to retrieve the results incrementally, since it knew it needed all of them. It pulls them all in at once and sets things up internally so that the __iter__ method knows that somebody else has already populated the cache and it should just read from the cache.

This introduced a nice final tweak: Since __len__ retrieves all the results (to count the number of rows in a queryset without pulling them back into Python, we have the count() method), it populates the cache — the queryset's requirements are to only retrieve results once, remember. Thus the get() method no longer needs to call list(), since a call to len() will tell us immediately whether or not we have multiple results (not having exactly one result is an error) and pull the results into the cache if we want to use them.

Putting all this together and rearranging a few lines of code so that the non-error path is fastest instead of wading through all the error checks, we now have:

# New implementation
def get(self, *args, **kwargs):
    clone = self.filter(*args, **kwargs)

    num = len(clone)
    if num == 1:
        return clone._result_cache[0]

    if not num:
        raise self.model.DoesNotExist("...")
    raise self.model.MultipleObjectsReturned("...")

The combined savings when viewing this method in isolation are about 35%. That doesn't equate to a 35% speedup overall, since get() is one method amongst many, but the combined changes did shave about 10% off the runtime of the test suite at the time. It also means that calling list() on a queryset isn't a massive performance hit that should be warned against. It's not that much slower than iterating over all the results in the sequence (there's some overhead, but it used to really suck).

Notice that nothing has really changed here except for some line rearrangements and replace a call to list() with a call to len(). The get() method still uses normal query construction techniques and things like that. It just uses the most efficient path to get at the single result, if there is one, and verify that it's the only result. It does this by using the fact that it's a method on the QuerySet class and so can be justified in using internal data structures like _result_cache in the above fragment.

Note: Motivation For This Post

The timing here is somewhat inspired by the most recent "This Week In Django" podcast. Michael Trier and Brian Rosner did a fairly comprehensive roundup of some of the many changes that dropped into Django during the recent code sprint at PyCon (and other worldwide locations). I don't want to ever post a "things that aren't right" article about TWID, since Michael and Brian do an excellent job of distilling all the available material each week and drawing attention to some of the high points. Small errors creep in, but that's because they're covering a lot of material that is sometimes only hours old at the time they record the show and they cannot read minds (yes, I don't know what their no doubt feeble excuse for that latter problem is, either... but it's what we have to work with).

For people who like podcasts as a way of gathering information and want a summary of Django developments each week, it's a nice technical coverage. Recommended.

This week the Michael and Brian show mentioned a couple of optimisations that I'd dropped into Django's queyrset-refactor branch and through no fault of their own, the explanation went a little off the rails. So this is some further explanation from the twisted mind that created that code. The get() method is not cutting any corners.

Topics: software/design, software/django