Tue 13 Jun 2006
SQL Puzzle: A Solution
Posted at 11:02 +1000
I previously offered up an SQL puzzle as food for thought. In this post, I want to discuss a couple of solutions.
Background
Recall that we are wanting to select all classes that contain a particular subset of students. To make the following understandable, it suffices to remember that we have three tables: Classes, Students and Reln_Class_Student. The first two have id fields and the latter has class_id and student_id foreign keys to the obvious tables. The original post has all the details.
I am going to go through how one might arrive at the solution for this problem. If you just care about the final answer I came up with, feel free to skip ahead.
If you want to follow along at home, I have made up a sample SQL file. This was created from a PostgreSQL dump which I edited to make it more generic. Hopefully, it should be pretty portable now, although the ALTER TABLE commands at the end might need some adjusting for your local system.
A faltering start
First off, there is an obvious attempt that falls short:
select distinct id
from Classes as C, Reln_Class_Student as R
where C.id = R.class_id
and R.student_id in (253, 289);
This will retrieve all the classes that contain at least one of the nominated students, not all classes that contain all of the students. Note that for all these queries, I am using a couple of student id values from the above data set that I know are in three classes together. Further, throughout this discussion, we will retrieve class and student id values as a proxy for whatever you might actually want to retrieve here. You might replace id with name or account_number in real life.
Thinking about what goes wrong with this query helps suggest a fix. Sometimes a class id will be selected exactly once by the where clause (if only one of the two students is in the class) and sometimes twice. We only want the latter case in our final answer. So we need to count the number of times each class id is selected.
A solution
A solution that does the job and (as can be seen in the email thread referenced later) fits in kind of nicely with Django's query structure is
select distinct id
from Classes as C, Reln_Class_Student as R
where C.id = R.class_id and
(select count(*)
from Reln_Class_Student as R1
where R1.class_id = R.class_id
and R1.student_id in (253, 289)) = 2;
This gives the right answer by checking that each R.class_id value occurs the right number of times. Obviously, if we are to generalise this, we need to remember to replace the value 2 with the size of the subset we are searching for, but that's pretty easy. Problem solved. Everybody's happy; well, not really...
Thinking about it the a few days later for no particular reason, I had a feeling this query wasn't the best. It just felt awkward. Eventually I realised that this was a standard problem, I had seen it before, and the machinery to carry out what I was doing in the nested query already existed in SQL. If I could write my database queries in English, it would be: I want to group the results of the first query by the class id and then filter on the groups having two elements. Woah.. lights go on! The word having is important, leading to an even better query.
The final(?) solution
select id
from Classes as C, Reln_Class_Student as R
where C.id = R.class_id
and R.student_id in (253, 289)
group by C.id
having count(*) = 2;
Remembering the somewhat infrequently used HAVING modifier in SQL makes everything easy. And fast, too. Timing it just now for the admittedly small data set I've been using, my original solution takes about 15 milliseconds on my desktop machine, the final solution takes around 0.3 milliseconds pretty consistently. Fifty times faster is nothing to be sneezed at, although looking at the output from PostgreSQL's query optimiser makes me think that if the dataset was going to be huge, there could be some changes in the way the execution happened, so I would want to test it on larger datasets.
[A final note for the SQL junkies: if I was just selecting the id value here, there is no need to join on the Classes table, which makes things faster still.]
Why?
You didn't think that was interesting? You fell asleep long before reading this far? Fair enough, too. This SQL stuff can be boring unless you are really interested in databases. But there is a point: In a day or two, I want to describe how to turn a query like this into something that can be used in the Django framework without relying too much on database specifics. By putting the details of how we get the SQL here, it will make the next post a bit shorter.
Finally a comment about the motivation for this post: In mid-May, a variation of this problem came up on the Django mailing lists. You can see the relevant thread in the Google Groups archives. After some initial confusion about what problem Jesper Noehr was trying to solve, we had a discussion on IRC and together we came up with the first of the two correct solution described above. It worked in the context and because we were looking at the problem from the point of view of how to do it in Django, we did not concentrate enough on making the database query nice and then porting it. Only when I was able to step away from the problem for a little bit did it become clear we had missed something obvious.
Topics: software/databases, software/django