Hash and cache smash

Working around the egregious Firefox cache collision defect.  Properly.

It just occurred to me that the subject line of this post won’t work nearly so well if you’re Australian, or American, since natives of those nations pronounce "cache" in a novel colonial way; it should of course rhyme with "cash", which is how the Queen would pronounce it, if she knew what a cache was.

Anyway… those of you who have to deal with the complex and complicated world of efficiently serving up HTTP responses may know about an issue with Firefox (or possibly Mozilla, depending on how you like to classify it).  Probably the most relevant bug report  is here as Mozilla bug 290032, but here’s a summary:

  • Firefox stores cached data by hashing the URL to a value.
  • The hash algorithm is weak, and generates duplicate values for similar URLs.
  • The cache can only store one cached URL per has value.

The upshot is that some URLs are always reloaded from the server and never cached.  Now, this may not seem like a very big deal, and if you’re just a plain old Firefox user, you may not care if there’s a little bit more network traffic.  But for anyone running a high-traffic server, it can be a serious issue.  Every time a user reloads data from your server rather than their cache, you have a server hit and a bandwidth hit.  Multiply that by a large number of users and it becomes a problem worth considering.

Let’s look at the problem in more detail.  First, we could do with a way to generate hashes from URLs so that we can do some tests.  Here, in the Programming Language Of The Gods, is an implementation of the Firefox/Mozilla cache hash algorithm:

def hash(url):
        #Hash starts as zero
        h=0
        #Iterate through the characters of the URL
        for c in url:
                #Take the ASCII value of each character
                cv=ord(c)
                #Rotate the hash value left four bits,
                #as a 32-bit value
                bits1=(h&0xF0000000) >> 28
                bits2=(h&0x0FFFFFFF) << 4
                h = bits1 | bits2
                #XOR the ASCII character value with the
                #hash
                h = h ^ cv
        return h

There are various plavces online where you can find the fault with this algorithm summed up as:
the Firefox disk cache hash functions can generate collisions for URLs that differ only slightly, namely only on 8-character boundaries and the hash algorithm generates collisions for urls that are "similar enough". From the bug "similiar enough" seems to mean every 4 characters (or perhaps 8) the urls are the same.  In fact, it’s more complex than that.  Let’s take two example URLs provided by sembiance at stackoverflow.

ben@quad-14:~/working/hashing$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:58:18)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from hashcheck import hash
>>> print "0x%08x" % hash("http://worldofsolitaire.com/decks/paris/5/12c.png")
0x40d8c8e9
>>> print "0x%08x" % hash("http://worldofsolitaire.com/decks/paris/5/13s.png")
0x40d8c8e9
>>>

Yow.  Both URLs generate the same hash, though they differ by two adjacent characters.  This means that only one of them can be cached by Firefox; if pages on the site regularly use both images, then at least one of them will be reloaded more often than it should. If you care to do the binary mathematics of the hash, you’ll see that the reason for the clash is that differences between two characters in the two URLs can be cancelled out by differences in later characters.

Someone else who suffers from this bug: Google, or more specifically, Google Maps.  Like most mapping sites that are based on "image tiles" (which includes Google Maps, Bing Maps, OpenStreetMap, NearMap, etc), there is a URL to load any individual 256×256 pixel "tile" image from their servers.  Since there are many[1] possible tiles, there are many very similar URLs to load them.  Let’s consider a couple of Google URLs that clash:

>>> print "0x%08x"%hash("http://khm2.google.com/kh/v=52&x=118346&y=80466&z=17")
0x198cc05f
>>> print "0x%08x"%hash("http://khm2.google.com/kh/v=52&x=118350&y=80470&z=17")
0x198cc05f
>>>

Because these two URLs clash, as with the other example above, only one of them can be cached by Firefox at any one time.  I picked these two also because they’re URLs for two images that are quite likely to appear together; they’re only 4 tiles away from each other horizontally and vertically, so it’s entirely possible to have them displayed together.

But if you crank up Firebug and watch Google URLs, you’ll see that they use a trick to work around the issue; they append a substring (as the s parameter) to their URLs.  This adds a substring of "Galileo" to each URL. This is sometimes (incorrectly) referred to as a "safeword" and described as a security measure that Google use to restrict access to their site; it’s nothing of the sort, as you can see by testing the URLs without the s parameter, or even with it set to something completely different.  The use of the substring is just to make simiular URLs sufficiently different that they don’t generate the same hash value in the Firefox cache system.

The length of the substring is generated from the x and y values using the algorithm <i>length=((3*x)+y)%8</i> (where % means modulo).  So if we use the URLs with the s parameter added, we get:

>>> print "0x%08x"%hash("http://khm2.google.com/kh/v=52&x=118346&y=80466&z=17&s=")
0xcc05d095
>>> print "0x%08x"%hash("http://khm2.google.com/kh/v=52&x=118350&y=80470&z=17&s=")
0xcc05d095
>>>

Oops. What happened?  The answer is that in both cases, the s parameter value is the same; 0 characters from the string "Galileo":

>>> ((3*118346)+80466)%8
0
>>> ((3*118350)+80470)%8
0

So there are still clashes. Google’s extra parameter reduces the number of collisions, but certainly doesn’t elimate them.  In fact, it’s extremely difficult to completely eliminate collisions[2] between URLs, but we can do better than Google’s approach.
At NearMap, we’ve been testing out a new way to generate substring, which uses the individual characters of the variant parts of the URL: the x and y and also the nmd (date) parameter that Nearmap have (and Google don’t). Here’s a Python sample implementation:

substring="Vk52edzNRYKbGjF8Ur0WhmQlZs4wgipDETyL1oOMXIAvqtxJBuf7H36acCnS9P"      #Nonrepeating A-Z, a-z and digits
def differenceEngine(s,a):
        """Use string a to return a deterministic but pseudo-random substring from within substring s"""
        result=""
        #Take the characters of a, which MUST all be digits
        offset=0
        for c in a:
                try:
                        v=int(c)
                except ValueError:
                        #This exception fired if the character is not a digit, which is wrong, but
                        #we should cope without crashing.
                        continue
                offset += v
                #Use the value of this digit as an offset into the string, and take
                #the character at that position.
                p=s[offset%len(s)]
                result += p
        return result

def substringFor(x,y,nmd=None):
        """Return the substring parameter for a URL, given the x, y and nmd values.
        The x and y should be integers, the nmd should be a string - if no date is
        included in the URL, pass an empty string or None."""
        arg=str(x)+str(y)+str((3*x)+y)
        if nmd:
                arg += nmd
        return differenceEngine(substring,arg)

To test this, I took the logs for a day’s worth of traffic on the NearMap site and analyzed the has collisions.  With the Google "Galileo" algorithm, around 10% of all URLs collided.  With this alternative approach, the collision rate is around 0.02%.

Okay, so this isn’t a simple and clear chunk of code that you can use to modify your own URLs, but it demonstrates some principles:

  • The hash algorithm is based on characters, so if you use the ‘add a substring parameter’ approach, base that substring on the characters of the URL, not parameter values.
  • Use as many different characters as possible in your substring so that it varies as much as possible; "Galileo" doesn’t allow for very many different substrings.
  • Build yourself a test setup (feel free to use the hash method above) that you can pass URLs through to spot collisions.  Use that to tune your URL construction code to minimize the collision rate.

[1]Many? The standard tiling system that all these sites use divides the world up into square tiles. At zoom level 0, the whole world is shown in one tile. At zoom level 1, the world is shown in 4 tiles, at zoom level 2, 8 tiles, and so on. So at each zoom level z, the number of tiles is 2z squared. By the time you’re at zoom level 21, where Google tend to stop, there are more than 4×1012 tiles. Which is many, by most standards.
[2]Actually, it’s impossible, since to completely avoid them, you’d need to modify all the URLs used in the world. Which you can’t. What I’m really aiming at here is to try and avoid collisions across the set of similar URLs used on one site.
[3]Paolo Emilio Sfondrati was the Secretary of the Inquisition at the time that Galileo was denounced for heresy and brought to trial. So probably Sfondrati would be an invalid string to replace Galileo