Sat 13 Dec 2008
ETags And Modification Times In Django
Posted at 13:13 +1100 (last edited: 14 Dec 2008, 10:40)
When I see somebody who wants to demo for me their latest "REST based" web service, the first thing I ask is; "do you support ETags?" — Sam Ruby, 2006
This is the second part in a series on implementing fine-grained HTTP controls and developing RESTful services with Django. The first instalment was last Tuesday.
Today's piece covers producing and consuming ETags, last modification times, and saving cycles whilst still ensuring everybody is happily using the latest data. This topic is really a lot easier than it looks from the length of this piece. I'm trying to be comprehensive and motivate the code, but if you only want the answers, skip ahead to anything that looks like Python code, by all means.
ETags
ETags or, more formally, Entity Tags, are a way to distinguish between versions of the same resource. There are two primary situations where ETags play a role:
Conditional GET: Determining if a GET request really needs the response to contain all the data, or if we can tell it "nothing has changed since last time."
Avoiding conflicting updates: If you PUT an update to a resource and I do the same thing, the server needs to ensure that we are both really updating what we think we are. If we start from the same document and your update lands first, my update shouldn't be accepted until I confirm I am updating the latest version.
There are other cases, such as pre-emptively ensuring that a POST creates a record that doesn't already exist. However, since, as mentioned in the first post, I'm going to skip the theoretical side, I'll take it as read that the reader knows why they want to use ETags and is here to understand how to do so in Django.
Although I'm discussing only ETags in the following section, those who've played in this space probably realise that modification times (the Last-Modified header, for example) also have a significant role here. I'll touch on those towards the end. Most of the principles are the same.
HTTP Headers
So that we're all on the same page, here's a quick summary of the HTTP headers that are used with ETags. All header names are case-insensitive, but I'll use the "traditional" capitalisation here.
- ETag: Used in a response to specify the ETag value of the entity.
- If-None-Match: Used to create a conditional GET request. Retrieve the resource if the current version does not match the given ETag value. In other words, "give me this if it isn't the version I already have." Can also be used in a POST to ensure that the POST does not happen if there's already a resource with that ETag (I don't cover that much here, but it's a possible use-case).
- If-Match: Used in a PUT request to indicate that the update should only happen if the latest version is the same as the sender's.
Although I won't refer to it in any of the following sections, the If-Range header is also theoretically relevant when working with ETags. However, client and server support for range requests isn't particularly good (which is a huge shame), so you don't see that being used a lot in practice. See the specification if you're interested.
Django's Default ETag Handling
If you are using Django's CommonMiddleware and have set USE_ETAGS in your configuration file (by default, USE_ETAGS=False), Django will add an ETag HTTP header to every outgoing response. However, this is only done if the response does not already have an ETag header.
Further, if you're using the ConditionalGetMiddleware, once the response from a particular view has been created, its ETag is compared to the ETag on the incoming request, if any. If they match, a 304 HTTP response code ("Not Modified") is returned, rather than the full response.
The net effect here is that your code still has to do all the work to compute the response, but should it match what the recipient already has, some network bandwidth is saved. The full pile of response data doesn't have to be sent back, only the headers necessary to support a 304 response.
The default behaviour is useful, as far as it goes. It provides some minimal conditional GET support and is cache friendly. Tim Bray wrote a push-back piece last year stating his opinion, amongst other things, that Django's ETags were entirely useless. I'd take some issue with that, since at the level the default ETag generation is operating, it cannot do anything other than examine the content; it simply doesn't have any domain-specific knowledge. So what we have by default is better than nothing and it's only the default, not the be-all and end-all.
Don't get hung up too much on this issue, though, whether you agree with Tim that it's less than useless, or don't mind the "here's something for almost free" approach. It's not particularly important either way and Tim's more detailed observations about ETags in that article are the same as what you'll see below. For fine-grained services, we can do much better. With better domain knowledge and knowledge of how we're creating the content for any particular view, we have more control and should utilise it. In particular, we can create our own ETags, bail out early to return 304 responses and handle ETags on PUT requests to avoid conflicting updates.
ETag Checking Versus Caching
Django's default ETag setting and conditional GET handling are done as pretty much the last part of the whole request-response path. Django's internal cache handling (not HTTP caching, but memcached or file-based cache handling) is done at the other end, if you're wondering. With the exception of some incoming middleware (request) modifications, once the cache middleware is hit, if it finds something matching the request in the cache, that is returned immediately. The view isn't processed.
This is, of course, more or less what you would intuitively expect from caching: it avoids the need to regenerate the response. There are no real surprises there. However, I wanted to mention it in case anybody was a bit whip-lashed trying to work out if ETags were useful, given they were only generated by default right at the very end of the process, and wondering if caching contained similar behaviour.
Creating ETags
There's no set of rules for how to create an ETag. If you read the specification, you'll see that it talks about "strong" and "weak" ETags and that's really all. Here, I'm only concerned with strong ETags. That is, two different versions of the same resource must have different ETags.
When the CommonMiddleware creates an ETag for your content, it does so by creating an MD5 hash of the content body. Given that Django doesn't know what makes two copies of your response different, it can really only get the bytes and compare them. Remember that two different resources can have the same ETag, just not two versions of the same resource, so although there's a very small chance of the same MD5 hash being generated for two different pieces of content, the chances really are tiny of it happening for one particular resource.
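To make the content-hash approach concrete, here is a minimal sketch in plain Python. It is not Django's actual middleware code; the function name content_etag is made up for illustration, and wrapping the hash in double quotes simply follows HTTP's entity-tag syntax.

```python
import hashlib


def content_etag(body: bytes) -> str:
    """Return a quoted ETag derived purely from the bytes of a response body."""
    # No domain knowledge is used here: only the bytes are examined.
    return '"%s"' % hashlib.md5(body).hexdigest()


first = content_etag(b"<h1>Hello</h1>")
again = content_etag(b"<h1>Hello</h1>")
changed = content_etag(b"<h1>Hello, again</h1>")

print(first == again)    # identical bytes give the same tag
print(first == changed)  # different bytes give a different tag
```

The catch is visible in the signature: the whole body has to exist before the tag can be computed.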
That's all fine for Django trying to handle every possible case that can be thrown at it, but if you're developing a specific application, you no doubt already know how to quickly tell if two responses are going to be different. For example, if somebody is requesting an entry in a blog, you don't need to know all the data on that page. All you're likely to need to know is the last time any of that data was updated: the last edited time of the blog entry object and maybe the last time any related data used to populate the page was updated. For the general case, come up with whatever is the minimal information you need to know to determine if there's been a change. That is the information you use to generate an ETag. You could smash it together in a string and then hash the string, for example.
By way of a concrete example, for my blog (and, if you're reading this via some re-syndication site, I live here), the page for any particular entry could have an ETag determined using only the time of the last edit (save) of that entry and the last modification time of the template file. The Flickr thumbnails in the sidebar are generated with a piece of static Javascript that's part of the template, so the data I send to your browser doesn't change, even though the displayed images do when your browser renders the page.
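As a rough sketch of the "smash it together and hash it" idea for a blog entry page, assuming the only inputs that matter are the entry's last edit time and the template file's modification time. The function name entry_etag and its parameters are hypothetical; in a real project the template time might come from something like os.path.getmtime.

```python
import hashlib
from datetime import datetime


def entry_etag(last_edited: datetime, template_mtime: float) -> str:
    """Build a quoted ETag from the few values that determine the page."""
    # Join the minimal identifying information into one string, then hash it.
    raw = "%s|%r" % (last_edited.isoformat(), template_mtime)
    return '"%s"' % hashlib.md5(raw.encode("utf-8")).hexdigest()


edited = datetime(2008, 12, 13, 13, 13)
before = entry_etag(edited, 1229134380.0)
same = entry_etag(edited, 1229134380.0)
after_template_change = entry_etag(edited, 1229134999.0)

print(before == same)                   # nothing changed, same tag
print(before == after_template_change)  # template touched, new tag
```

Nothing here renders the page, so the tag is cheap to compute before deciding whether any real work is needed.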
Early ETag Checking
Hmm... I'm over 1000 words into a piece about implementation details and I still haven't written any code. Vacation is over now. Suppose you have a function that knows, for a particular view, how to quickly determine an ETag value. For example:
def etag_for_blog_entry(request):
    # Get the relevant entry, however you like.
    # I'm calling a utility function to save space.
    entry = get_object_from_request(request)
    return "%s:%s" % (entry.last_edited.ctime(), entry.version)
I'm assuming there's some kind of version field (or revision number) in the model, just because I want to duck the problem of multiple saves in the same second to a database that only returns times to the nearest second. But the details are relatively unimportant. What is important is that, given an incoming request, I can quickly determine the ETag. Then I can write:
from django import http

def show_entry(request, ...):
    etag = etag_for_blog_entry(request)
    if etag == request.META.get("HTTP_IF_NONE_MATCH"):
        return http.HttpResponseNotModified()

    # (We really should also check modification time headers.
    # I'm being lazy in this example).

    # Otherwise, do normal processing.
    ...

    # Finally, add in the ETag to the new response we're returning.
    response = http.HttpResponse(....)
    response.headers["ETag"] = etag
    return response
This uses a quick bail-out if we don't need to do the work and, if we do have to produce a new page, makes sure to add in the correct ETag value at the end (which the Django middleware won't override).
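The straight string comparison in the view is the simplest thing that works. On the wire, an If-None-Match header can also be "*" or a comma-separated list of quoted tags, some with a weak "W/" prefix. Here is a hedged sketch of a slightly more tolerant check. The helper names are made up, the current ETag is assumed to be in quoted form already, and splitting on commas is a simplification that would misread a tag containing a comma.

```python
def parse_etags(header_value):
    """Split an If-None-Match style header value into individual entity tags."""
    if header_value is None:
        return []
    # Simplification: assumes no commas inside the quoted tag values.
    return [tag.strip() for tag in header_value.split(",") if tag.strip()]


def strip_weak(tag):
    """Drop a leading W/ so weak and strong forms of a tag compare equal."""
    return tag[2:] if tag.startswith("W/") else tag


def if_none_match_hits(header_value, current_etag):
    """True if the client already holds the current version (so send a 304)."""
    tags = parse_etags(header_value)
    if "*" in tags:
        return True  # "*" matches any existing representation
    current = strip_weak(current_etag)
    return any(strip_weak(tag) == current for tag in tags)


print(if_none_match_hits('"abc", "def"', '"def"'))
print(if_none_match_hits('"abc"', '"def"'))
```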
This type of pattern is easy enough to abstract. Ivan Sagalaev has already done exactly that. I rediscovered his work whilst researching these articles (and, just in case I was asleep at the wheel, Ivan emailed me yesterday, effectively saying "look, over there! My ticket."). I'll soon commit something to Django that looks like the patch in ticket #5791. There was broad consensus that it was a good idea when Ivan brought it up in the past and we haven't gotten around to dropping in the support yet. I (or somebody else committing to Django) should fix that. There are a couple of tidy-ups to do, one related bug that Ivan found to fix first, and maybe some generalisation we can do (to make it also usable for PUTs and POSTs, as in the next section). The design is sound.
The idea behind Ivan's patch is to add a decorator to view functions. The decorator is given functions to call to compute ETags and last modification times, checks them against the incoming request, and bails out early when nothing has changed. So your main view only contains the processing code. In a later article in this series I'll have a lot more to say about possibilities for adding wrappers to handle a bunch of this.
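To show the shape of such a decorator without depending on the exact code in the ticket, here is a self-contained sketch. Everything in it is hypothetical: FakeRequest and FakeResponse are tiny stand-ins for Django's request and response objects, and conditional is an illustrative name rather than Ivan's API. It only handles ETags, not modification times.

```python
import functools


class FakeResponse(dict):
    """Stand-in for a response: headers as dict items, plus body and status."""

    def __init__(self, body="", status=200):
        super().__init__()
        self.body = body
        self.status = status


class FakeRequest:
    """Stand-in for a request carrying a META dictionary of headers."""

    def __init__(self, **meta):
        self.META = meta


def conditional(etag_func):
    """Wrap a view so a matching If-None-Match short-circuits with a 304."""
    def decorator(view):
        @functools.wraps(view)
        def wrapper(request, *args, **kwargs):
            etag = etag_func(request, *args, **kwargs)
            if etag == request.META.get("HTTP_IF_NONE_MATCH"):
                response = FakeResponse(status=304)  # the view is never called
            else:
                response = view(request, *args, **kwargs)
            response.setdefault("ETag", etag)  # keep a tag the view set itself
            return response
        return wrapper
    return decorator


calls = []


@conditional(lambda request: '"v1"')
def show_entry(request):
    calls.append("rendered")
    return FakeResponse("entry body")


fresh = show_entry(FakeRequest())
cached = show_entry(FakeRequest(HTTP_IF_NONE_MATCH='"v1"'))
print(fresh.status, cached.status, calls)
```

The second call never reaches the view body, which is the whole point: the expensive work is skipped, not just the bytes on the wire.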
The Lost Update Problem (ETags for PUT)
This is the second use of ETags that I mentioned earlier. As a producer of data, you might well think of it as "conflicting updates" — your update conflicts with mine. In a broader sense, from the third-party observer perspective, the problem appears as "lost updates". You updated first, I updated second and overrode your changes, so your update was lost. Confusion reigns.
This problem and some HTTP-based solutions have been understood for quite a long time. I mean, every single version control system has to handle it, for a start. At the HTTP level, this note from 1999 provides a really clear explanation of the problem, approaches to solution and how HTTP ETags can be used to ensure no lost updates. I don't really need to expand on what's written there for the motivation behind using ETags.
In code, handling this is similar to the GET situation. We'll consider a PUT here, although there are cases where POST handling might need to do the same to ensure there is not an existing version. If you have a Django view handling a PUT request, it could do something like this:
def put_to_entry(request, ....):
    etag = etag_for_blog_entry(request)
    if etag != request.META.get("HTTP_IF_MATCH"):
        # Your version is old and smelly. Please try again.
        return http.HttpResponse(status=412)

    # Handle the update and then return an OK (with no body)
    ...
    return http.HttpResponse()
A few things of note here.
- I am using the same ETag checking function that I used for GET requests. That is a good goal, actually, since we should be generating the ETag exactly the same way each time and the best way to do that is to use exactly the same code each time.
- This function requires an ETag on the request. The tag might be optional in some protocols, so adjustments would need to be made for circumstances. In a lot of cases, though, allowing updates without requiring the submitter to confirm they were updating the current version sounds like a recipe for disaster.
- It happens that Django has an HttpResponseNotModified class and I used it in the conditional GET case. For response code 412 ("Precondition Failed"), there's no specific class, but passing in the status code to the HttpResponse constructor is just as easy. We have no plans to add any more classes to the HttpResponse sub-class family in Django.
- The default response code for HttpResponse is 200 and the final line of this function uses that.
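To see the lost update protection play out end to end, here is a small in-memory simulation with no Django involved. The EntryStore class is entirely hypothetical and uses a version counter as its ETag; the status codes mirror the ones in the view above.

```python
class EntryStore:
    """Hypothetical in-memory resource with a version counter as its ETag."""

    def __init__(self, text):
        self.text = text
        self.version = 1

    def etag(self):
        return '"%d"' % self.version

    def get(self):
        """Return the current text along with its ETag."""
        return self.text, self.etag()

    def put(self, new_text, if_match):
        """Apply a conditional update and return an HTTP-style status code."""
        if if_match != self.etag():
            return 412  # Precondition Failed: the caller's copy is stale
        self.text = new_text
        self.version += 1
        return 200


store = EntryStore("first draft")
_, your_tag = store.get()  # we both start from the same version
_, my_tag = store.get()

your_status = store.put("your edit", if_match=your_tag)
my_status = store.put("my edit", if_match=my_tag)
print(your_status, my_status, store.text)
```

Your update lands first and succeeds. Mine is rejected with a 412 because my tag no longer matches, so I have to fetch the latest version and try again rather than silently overwriting your changes.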
Modification Times
In order to simplify an already fairly lengthy piece, I've concentrated on ETags so far. However, be aware that pretty much everything here maps across nicely to last modification times. A response can include a Last-Modified header, a request can send the If-Modified-Since header and so on. In some circumstances, modification time might be easier to determine and sufficiently unique (that is, you're doing less than one update per second, guaranteed). You can use modification times without ETags, ETags without modification times or both together.
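For the modification time side, the standard library already knows how to write and read HTTP dates. The following is a sketch under stated assumptions: the helper names http_date and modified_since are made up, and last modification times are assumed to be timezone-aware datetimes.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime


def http_date(moment):
    """Format an aware datetime as an HTTP date, for a Last-Modified header."""
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


def modified_since(if_modified_since, last_modified):
    """True if the resource changed after the time the client sent."""
    if if_modified_since is None:
        return True  # no condition supplied, so send the full response
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return True  # unparseable header: treat the request as unconditional
    if client_time.tzinfo is None:
        client_time = client_time.replace(tzinfo=timezone.utc)
    # HTTP dates only have one-second resolution, so drop microseconds.
    return last_modified.replace(microsecond=0) > client_time


saved = datetime(2008, 12, 13, 2, 13, tzinfo=timezone.utc)
header = http_date(saved)
print(header)
print(modified_since(header, saved))
print(modified_since(header, saved + timedelta(seconds=5)))
```

The one-second resolution is exactly why modification times alone are only safe when updates are guaranteed to be less frequent than that.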
If both ETags and modification times are used together in a request (say, a request includes both If-Match and If-Unmodified-Since headers), all the preconditions must be satisfied. Django currently gets this very slightly wrong in the ConditionalGetMiddleware, since if the ETag matches, a 304 response is returned regardless of the value of an If-Modified-Since header. That's the bug Ivan found, that I mentioned earlier. It will be fixed soon-ish now that it's been found (possibly "found again"; I wouldn't guarantee that nobody noticed this before).
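The "all preconditions must be satisfied" rule, as stated above, fits in a few lines. This sketch assumes the individual header checks have already been done elsewhere and only combines their outcomes; the function name may_send_304 is hypothetical. Note that the later HTTP specification, RFC 7232, instead tells servers to ignore If-Modified-Since whenever If-None-Match is present, so this reflects the rule as described in this piece.

```python
def may_send_304(etag_matches, not_modified_since):
    """Combine the outcomes of the If-None-Match and If-Modified-Since checks.

    Each argument is True or False when the client sent that header,
    or None when the header was absent from the request.
    """
    supplied = [result for result in (etag_matches, not_modified_since)
                if result is not None]
    # A 304 needs at least one condition, and every supplied one must hold.
    return bool(supplied) and all(supplied)


print(may_send_304(True, None))   # only the ETag was sent, and it matched
print(may_send_304(True, False))  # ETag matched but the time check failed
```

The second case is the one the bug is about: a matching ETag on its own is not enough when a modification time condition was also sent and failed.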
Future Posts In This Series
I sat down a few days ago and worked out where I'm going in this series so that it doesn't go on without end or wobble off in an incomplete state. The coming posts will be:
- Output types. Generating different versions for different requests.
- Authentication and authorisation. Primarily, consuming HTTP Auth.
- Session concepts. Alternatives to server side sessions. Aiming towards server-side statelessness. This might have to come before the authentication post.
- URI resolving.
- Patterns and Idioms. Coming up with a coherent programming style and set of utility functions for RESTful implementations. This is kind of my take on a REST API, but not quite so bold. It was surprising, when I read Java's JAX-RS specification the other day (JSR-311), how similar a natural way of doing this in Django is to what the Java guys have come up with and I'm not above being inspired by a few ideas from there that I hadn't thought about.
That's a few weeks' worth of posting right there, but I'll try to keep them coming fairly regularly. Again, any topics I've missed that people feel might be important can be suggested via e-mail.