Defying Classification

by Malcolm Tredinnick

Tue 11 Mar 2008

Queryset Implementation

Posted at 23:49 +1100

I haven't done a Django-related post in a while. Mostly because I've been a bit depressed by the whole thing lately for various reasons and writing tutorials isn't going to fix that. So, instead, here's a bit of a brain-dump of stuff I've been working on.

Django Queryset Algorithm Design

Online Algorithms

Algorithms that solve a particular problem can be split into two classes. Those that need to know all the input data up front, before they can start working, and those which can construct the solution whilst seeing the input come in incrementally. The latter class are called online algorithms.

If you've spent time in a computer science course and taken a compiler class, you've seen some good examples of online algorithms. Context-free grammars are often parsed using LR(k) or, less often, LALR(k) algorithms. They process the input using a limited amount of lookahead and no backtracking, so can work on the input as it is streamed in. However, these are reasonably sterile examples.

More interesting cases are when somebody manages to take an existing algorithm that works in excellent time and/or space constraints and converts it to an online version without losing the performance characteristics. Often, something will be traded slightly, such as requiring a little more storage space, but the idea is not to blow things out of control. Graph planarity testing (which I think is only an online case if you first pre-process the input data into a tree structure) and suffix tree construction are two good examples of tricky algorithms to handle massive input data structures.

The relevant point here is that anything can be converted into an online algorithm if you are prepared to save all the input data and just rerun the algorithm over the result at each step. The problem of course is that space now increases by O(n) (where n is the input size) and runtime blows out even more (an O(n) time algorithm becomes O(n^2) and other complexities are even worse). Keeping a grip on the runtime and space requirements needs a little more thinking.

The Online Algorithm In Practice

Constructing a queryset in Django requires an online algorithm. Querysets act via chained methods: you call a method on a queryset and it returns a new queryset, on which you call the next method and so on. So a line of code like

SomeModel.objects.filter(...).filter(...).order_by(...).distinct()

is four distinct method calls at the query construction level and when, for example, you're processing the first filter() call, you have no idea what is coming in the future.

The other constraint with querysets is that Django returns a new queryset after each method call. This isn't a bad idea and it's fairly common behaviour in chained method APIs. It means you can write code like this and have it behave as expected:

qs1 = SomeModel.objects.filter(....).order_by(...)
qs2 = qs1.exclude(....)
qs3 = qs1.select_related()

# Each queryset can be evaluated in isolation.
print list(qs1)
print list(qs2)
print list(qs3)

On the implementation side, it means that you have to pay attention to how expensive the copying or cloning operation becomes. In a language like Python that uses copy-by-reference by default, it requires a fair bit of attention to making sure that the right data structures are copied sufficiently deeply so that when we modify the copy, we don't end up changing the behaviour of the original in detrimental ways. Changing the behaviour of the original isn't strictly forbidden, as I'll discuss below, but you need to know the effects and have verified they won't harm the result.

So there's some fun to be had when writing the internal algorithms, trying to get the trade-off between data structures that support the online development of the query and data structures that can be copied fairly quickly. I guess there's also a third axis there, which is the overall code complexity and how that helps keep bugs out (or hides bugs in the worse cases). I'm willing to claim I've found a solution to this trade-off in the current queryset code I've written in Django (the stuff on the queryset-refactor branch), but it's far from the only solution and it's not my ideal solution. The main problem is that the ideal solution from a theoretical, algorithmic perspective fails pretty badly on the "overall code complexity" axis and that's not a good result in Open Source software that will, ideally, have many people willing to debug it. Write code that only you can understand and only you will end up debugging it.

Some Challenges

There are a lot of situations where calling a method on a queryset requires some rearrangement of the existing proto-query. Two cases that crop up very frequently and will serve as good illustrations are table alias management and inserting something into the "where clause" of an SQL query.

Alias Management

Adding a filter to a queryset — which means calling the filter() or exclude() methods, amongst other things — means turning an expression such as tag__parent__name='fred' into a series of table joins and a constraint on the query. We might be able to use some existing table joins here, for example if there was already a filter against tag__parent__public=True, we'd already have the joins in place to get to the tag table and then to the parent table (which might be another copy of the tag table in the query).

Since the idea is not to have any tables in the query that we don't need, table reuse is very important. Every extra join costs a bit of performance at the server. We also need to make the joins between tables as efficient as possible: only using inner joins except when we have to use a left outer join because we can "prove" that the result might contain null rows on the left of the join.

That's all fine. It's possible with a bit of care to keep that stuff under control. It rapidly becomes a good idea to internally work only with table aliases and not worry too much about the tables themselves because you often end up with the same table appearing more than once in a query — by necessity — and we should make sure to give each version of the table a different name. Again, has to be done, but it's not unmanageable.

The real excitement starts when you want to merge two querysets together. Something like

Entry.objects.filter(public=True) | Entry.objects.filter(tag__name='fred')

In the general case here, each of these starts out life as a separate queryset and then we merge them into one queryset before querying the database, to improve performance. The problem is that each side of this "or" join will have its ow n list of aliases and they will clash. To avoid problems with maximum alias lengths and inscrutable queries, every queryset starts constructing aliases that look like T1, T2, T3, etc. So both sides are likely to have a T1 alias referring to a different table. Thus, merging queries requires the ability to relabel aliases so that every T1 on the right side, say, becomes T5 and everywhere that T1 is referred to is relabelled appropriately.

The same problem arises when creating nested queries, since the inner query shouldn't have aliases that clash with the outer query (unintentionally; it can use aliases from the outer query when required). There are some other cases as well when relabelling is required. So the ability to rename aliases across an entire queryset is important and not just a fringe case. Thus I needed to come up with a way to keep aliases identifiable in all the places they occurred.

Some of the past week was spent chasing down a couple of these renaming problems and finding more efficient ways to identify what needed relabelling and whether or not we were using the smallest number of aliases (table joins) required. In the end, I actually punted on one case where an extra table can creep into a particular type of update query. It gives the correct result and doesn't happen very often, so it wasn't worth the extra complications and overhead that would be paid by every single query to identify it. Not the most pure solution, but one of those necessary trade-offs to code simplicity (or non-complexity... I doubt anybody's going to call the code I've written "simple", unfortunately). It's always very frustrating to spend a few days (three or four in this case) chasing around some algorithms and trying things out, only to decide it's just not worth it and having to put in something that is less than ideal. C'est la vie.

There's one remaining problem with alias renaming, concerning alias names in nested queries, which are only needed in a couple of cases. That's actually going to be quite easy to fix when I get around to it. So I'm fairly happy with the end result in this particular sub-issue.

The "Where" Tree

Inserting constraints into a where clause is another area that hides some surprisingly annoying subtleties. There are two main difficulties. The first is the online nature of the filter addition. The where-clause is stored internally as a tree where leaves are the constraints and the internal nodes are the connectors (AND or OR) between the constraints. A subtree is then a bracketed section of the resulting where-clause. This goes well until you are trying to exclude some results and left outer joins make an appearance (which is fairly common).

In an ideal world, I would like to write this sort of query:

SELECT ...
FROM tableA LEFT OUTER JOIN tableB
    ON (tableA.id = tableB.id AND tableB.name != 'fred');

That becomes ugly very quickly once you start trying to merge in extra filters that may involve both conjunctions and disjunctions simultaneously in the final result (at the time you add them, you don't know that there's a different sort of connector coming up). Trying to move the right constraints into the right join condition was becoming really fiddly and not particularly robust or easy to understand. So the above query is now written like this:

SELECT ...
FROM tableA LEFT OUTER JOIN tableB
    ON (tableA.id = tableB.id)
WHERE (NOT tableB.name = 'fred') OR tableB.name is NULL;

I've added parentheses just to emphasise the scope of the NOT modifier here. The extra comparison against NULL is because otherwise those rows would be left out. SQL does not use boolean logic; it's a ternary system. There's true, false and NULL. In formal logic, we say it doesn't obey the "law of the excluded middle", or that (not not A) != A. In practice, it means that just because something doesn't contain "fred" in the name field doesn't mean it is "not equal to 'fred'". It might be NULL! Yes, if your brain hurts right now, that's normal. You get used to this logic after a while, but having some decent test coverage saves your bacon a lot, too. (If you're wondering why the extra comparison is needed in the second example an not the first, it's because SQL filters on join conditions before doing the join and filter on where conditions after doing the join, so null rows only appear when filtering in the where-clause.)

The real trouble starts once somebody adds a new constraint to this proto-query and they now want to also exclude the cases where name could be NULL, probably indirectly. So we need to remove the existing NULL comparison... possibly. What we really need to do is remove it if and only if we added it as part of the automatic transformation, not if it was added explicitly. I don't want to accidentally return a bad result if somebody specifies a contradictory query (which would be always empty) for some reason.

Right now, I'm not sure that problem is completely solved. There are some cases where it's possible the extra constraint is being inserted unnecessarily. I've been playing around this evening with a different approach to this problem so that I can tag the extra constraint and identify it when something else is added, if needs be. It's not quite working yet, but I'm gradually losing in. Ideally I'd like to be able to prove that this situation simply cannot arise, but that's proving too hard for my tiny brain at the moment, so building in a little paranoia to the algorithm seems a safer approach, at least until I can solve this one way or the other.

The Copying Problem

The new query class in the queryset-refactor branch is a lot more involved than the existing (trunk) code. It's a lot more powerful. The downside of this is that it has a lot more in the way of internal structures. Copying the queryset class, which means copying the query class it carries around with it, is not cheap. Because the tree representing the where-clause can be rearranged by additional filters, we need to make a fresh copy of that each time. We need to make sure that the mapping from alias names to tables and joins is copied afresh, because relabelling aliases in one query shouldn't accidentally relabel aliases in another query (which could cause collisions).

A couple of months ago I did a fairly intensive optimisation pass over the new code, tweaking the algorithms in a bunch of places. Right now, I'm doing a another in-depth pass to tweak the data structures. Mostly this involves working out which data structures can be shared and which need deep, independent copies each time. A couple of cases look like they can be split up so that the mutable information is copied deeply and the stuff it refers to can be passed around by reference.

Earlier I hinted that changing the data in one queryset that led to changing the query data in another queryset wasn't bad per se. This is an important point for optimisation purposes. For example, it doesn't matter too much if a query holds more joins than it needs, providing it knows not to include those joins in the final query. Thus, the reference count of how many times the query uses an alias is a property of the query (aliases with a refcount of zero aren't in the final output for that query). But the join information itself (the table involved, the columns to join) can be shared by all copies. The join information is a much more complex data structure than the reference counts. Working out how to split those two pieces of information up and what effect it will have on performance is work in-progress.

Fortunately, it turns out that the test suite for Django's core is quite a good torture test for the queryset code. It sets up a lot (tens) of models that are each used for just a short period of time in a single test. So there's not a lot of reuse of querysets or models and some of the real optimisations only kick in once you are reusing the same model in multiple querysets. So I can run the core tests and profile them and see the real "initial overhead" of creating a queryset. If I can make that stuff fast (it's already not too shabby, but is measurably slower than trunk), then the repeated use optimisations will only help things. The difference here is noticeable when I run the same profiles whilst mirroring my blog, which has only a few models and lots of chance to reuse the same queries.

Some Interesting Finds

Finally, a couple of tidbits of information I've discovered along this adventure.

  1. As part of trying to find a sweet spot in the data structure management, somewhere between string fragments and managing everything in relational algebra format, I've been reading quite a lot of presentations and source code about how the various databases optimise queries. The good news is that most of the popular Open Source databases seem to work very similarly. Not too surprising, since a lot of this is solved technology, but nice to see. I don't entirely understand everything MySQL is trying to do, but I suspect that's a consequence of the playing off of the main server against the different storage backends. Plus, I've just found more presentations about Oracle optimisations and PostgreSQL optimisations than MySQL ones.

  2. SQLite really is a very nifty little piece of software. Five years ago, it was pretty minimalistic. But these days, it supports a very wide section of the genuinely useful SQL constructions. Small, fast and correct, all in one package is enormously helpful when trying to debug query results. I'll tend to debug a complex query in SQLite and then look at execution plans in PostgreSQL and MySQL to check I haven't made any bozo error.

  3. The decision of the Django Oracle team in Boulder, Colorado to support Oracle 9i and later has turned out well. Oracle 9 appears to be the first version that added support for all the standard SQL syntax for join constructs. So creating the output strings is much more portable than if we had to write things like tableB(+), which I remember from having used Oracle 8 in the past.

Topics: software/design, software/django