SQLAlchemy Magic

So I wrote a post over on the Mobify blog about reducing the memory requirements of SQLAlchemy when doing big queries. It’s interesting writing for a company blog instead of my own – I’d be a lot more quirky and use funnier examples if it were just for me. On the other hand, it’s good sometimes to have the time to dig deep into a subject and work out proper examples and actual stats, to make a better post.

Anyway, if you have to use SQLAlchemy to pull big data sets into Python code, you might find it interesting.


More Grimacing

Following on from my previous post about grimace, a Python package that allows generation of regular expressions using fluent syntax, I’ve added some more tricks.

In the previous post, I showed how one can use grimace to write regular expressions in this form:


RE().any_number_of().digits().followed_by().dot().then().at_least_one().digit()

All good, but do we really need to invoke each method and have all those parentheses cluttering up the line? Because grimace is written using functional principles, each method returns a new RE object, based on the previous one but modified. Thus, in theory, we must call a method because there is work to be done, not just an attribute to be returned.

Python descriptors will save us. A descriptor is a class that defines one or more of the __get__, __set__ or __delete__ methods. Like properties in C#, they allow code to be executed when a descriptor is accessed as though it were an attribute.

Previously in grimace, we had methods like:

def start(self):
    """Return a new RE with the start anchor '^' appended to the element list"""
    return RE(self, '^')

Now we can define an Extender class and use it to rewrite start().

class Extender(object):
    """An Extender is a descriptor intended for use with RE objects, whose __get__ method returns a
    new RE based on the RE on which it was invoked, but with the elements extended with a given
    element, set when the Extender is initialized. It exists so that methods like start() and end()
    can be invoked as attributes or methods."""
    def __init__(self, element=None):
        self.element = element

    def __get__(self, instance, owner):
        if isinstance(instance, RE):
            if self.element is not None:
                return RE(instance, self.element)
            else:
                return RE(instance)
        return None

#... and here's how start() is now defined in the RE class...
class RE(object):
    start = Extender('^')

So far so good. We can now use grimace as in this example:

RE().any_number_of.digits.followed_by.dot.then.at_least_one.digit

But what happens if a poor programmer gets confused and invokes digits or dot as methods? We can fix this easily. The result of getting digits or dot or any other Extender is a new RE object, so executing digits() will get that new RE and then try to call it. All we need to do is make an RE instance callable, by adding this __call__ method:


def __call__(self): return self

That’s all we need. It means that the execution of digit() becomes: get the digit attribute, which is a new RE instance, then call it, which returns itself.

There is one more nicely functional trick. We can use an Extender to append more than just strings; we can also use it to append special objects, like the grimace Repeater that defines how many repetitions of a character are required. Because all objects in grimace are functional, they’re immutable, and that means any instance may be shared between any number of threads or other bits of code. So I can define not_a (which adds a Not instance to the regular expression, to invert the meaning of the next match) as:


not_a = Extender(Not())

That instance of Not() is then shared between every RE() object that uses it. Functional programming is an excellent tool.
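Putting the pieces together, here is a stripped-down, self-contained sketch of the whole pattern. The names mirror grimace, but this is a simplification written for illustration, not the real source:

```python
class Not(object):
    """Marker element; immutable, so one instance can be shared freely."""
    pass

NOT = Not()  # the single shared instance


class Extender(object):
    """Descriptor whose __get__ returns a new RE, extended with one element."""
    def __init__(self, element=None):
        self.element = element

    def __get__(self, instance, owner):
        if isinstance(instance, RE):
            if self.element is not None:
                return RE(instance, self.element)
            return RE(instance)
        return None


class RE(object):
    def __init__(self, previous=None, element=None):
        # Functional style: copy the previous element list, never mutate it
        self.elements = list(previous.elements) if previous else []
        if element is not None:
            self.elements.append(element)

    def __call__(self):
        # Invoked when someone writes .digit() instead of .digit
        return self

    start = Extender('^')
    end = Extender('$')
    digit = Extender(r'\d')
    not_a = Extender(NOT)
```

With this sketch, RE().start.digit.end and RE().start.digit().end build the same element list, and every RE that uses not_a shares the one NOT instance.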

(The documentation for grimace is now in the github wiki page)

Expressing Oneself, Fluently

I use regular expressions[1] in Python often enough that I know many of the character classes and syntax tricks.

I use regular expressions in Python seldom enough that I forget many of the character classes and syntax tricks.

This is annoying, but I have lived with it. Then I came across a little article in Dr Dobb’s Journal (which used to be great, many years ago, but is now a mere ghost of its former glory), in which Al Williams writes about (ab)using the C/C++ compiler to allow regular expressions to be written as:

start + space + zero_or_more + any_of("ABC") + literal(":") + group(digit + one_or_more)

rather than

^\s*[ABC]:(\d+)

I quite like the idea of fluent syntax (though I appreciate that it’s not necessarily so appealing if your native language isn’t English), but I spend marginally more time writing Python than C++ these days. Also, I like the idea of trying to build a fluent interface in a functional way. So, I started out by writing some examples of what I would want to be able to do:

start().end() would give the minimal empty string regex "^$". Easy enough.

Or how about

any_number_of().digits().followed_by().dot().then().at_least_one().digit()

You get the general idea: the fluent syntax describes the expression and results in a string that matches what it describes. One really good thing is that it avoids the “backslash plague” that can confuse those new to writing regular expressions in Python.

This now exists, and is on github at https://github.com/benlast/grimace and it will do the above and more. Time for some more intense code examples:

The grimace.RE object is our starting point; any method we call on it returns a new RE object. Let’s get the regex to match an empty string.

>>> from grimace import RE
>>> print RE().start().end().as_string()
^$

The as_string() call turns the generated expression into a string that can then be used as the argument to the standard Python re module. There’s also as_re() which will compile the regular expression for you and return the resulting pattern-matching object.


>>> #Extract the extension of a short DOS/FAT32 filename, using a group to capture that part of the string
>>> regex = RE().start().up_to(8).alphanumerics().dot().group(name="ext").up_to(3).alphanumerics().end_group().end().as_string()
>>> print regex
^\w{0,8}\.(?P<ext>\w{0,3})$
>>> #Use the re module to compile the expression and try a match
>>> import re
>>> pattern = re.compile(regex)
>>> pattern.match("abcd.exe").group("ext")
'exe'
>>>
>>> #We can do that even more fluently...
>>> RE().start().group(name="filename").up_to(8).alphanumerics().end_group() \
... .dot().group(name="ext").up_to(3).alphanumerics().end_group() \
... .end().as_re().match("xyz123.doc").groups()
('xyz123', 'doc')
>>>
>>> #The cool example that I wrote out as a use case
>>> print RE().any_number_of().digits().followed_by().dot().then().at_least_one().digit().as_string()
\d*\.\d+
>>>

You can also split a complex regex over several lines:

# My python module
import re
from grimace import RE

def is_legal_number(number):
    #Match a US/Canadian phone number - we put the RE() stuff in parentheses so that we don't
    #have to escape the ends of lines
    north_american_number_re = (RE().start()
      .literal('(').followed_by().exactly(3).digits().then().literal(')')
      .then().one().literal("-").then().exactly(3).digits()
      .then().one().dash().followed_by().exactly(4).digits().then().end()
      .as_string())
    number_re = re.compile(north_american_number_re)
    return number_re.match(number) is not None

There is more to do: control over greedy matching, and ways to express some of the more complex tricks like backreferences and multiple matching subexpressions. And I’ll also package it properly for installation via pip. But for now, it’s available and it works.

(The documentation for grimace is now in the github wiki page)

[1] I can’t write any article on regular expressions without quoting Jamie Zawinski: Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Wibbly Wobbly WBXML

Ok, so I couldn’t think of a title that’s as wilfully obscure as the usual ones.  Whatever.

For reasons of Commerce, I need to be able to generate WBXML messages within the guts of the mighty Python/Zope engine that powers the Mobile Phone Project[0].  What, I hear you ask, the blinking flip is WBXML?  Well, if you don’t know, you probably want to keep it that way, but you did ask.

WBXML is a binary encoding of XML.  XML is, of course, a textual encoding of data… some of which may originally have been binary.  So it’s sort of an extra level of complication added to something that’s already complicated, but hey, that’s what geeks do, isn’t it?  The reason that it’s a binary encoding is that XML is bulky.  Most of the time that bulk doesn’t matter that much; I’ll trade bandwidth, memory or CPU time for explicitness any day of the week.  But if you’re trying to pack XML over a slow, laggy, prone-to-being-interrupted-by-trees-or-birds wireless link to a phone, bulk is bad.  It’s even worse if you’re trying to pack an XML Service Indication (essentially, a pushed URL) into the tiny size of a single SMS message.  Hence the binaryness.

WBXML isn’t anything as simple as, say, a gzipped version of the XML stream.  Instead, it’s a carefully rigorous specification of how individual single byte values map to either XML or text strings.  For example, the XML <SI> maps to the binary value 0x05, and <INDICATION> maps to 0x06.  But it’s clever; if the HREF attribute of the INDICATION starts with “http://”, then the whole attribute-starting-http maps to 0x0C.  If the HREF starts with “http://www” then it’s mapped to 0x0D, saving another three bytes, and so on.  The more common the string, the more likely it is to have a fixed mapping.  There’s also a neat string-table option; commonly used strings can be folded into single-byte offsets into a string table (in effect, any repeated string longer than three bytes is worth string-table-izing).
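To make the flavour of that concrete, here is a toy sketch of the prefix-to-token substitution. The token values for the si and indication tags and the href prefixes are the ones mentioned above; everything else, including treating the input as plain text, is a gross simplification of real WBXML:

```python
# Illustrative sketch of WBXML-style token substitution (not a real encoder):
# the longest matching prefix wins, so common strings shrink to one byte.
TOKENS = [                       # (prefix, byte)
    ('href="http://www.', 0x0D),
    ('href="http://',     0x0C),
    ('<indication',       0x06),
    ('<si>',              0x05),
]

def tokenize(xml):
    """Replace known prefixes with single-byte tokens; pass other chars through."""
    out = bytearray()
    i = 0
    while i < len(xml):
        for prefix, byte in TOKENS:
            if xml.startswith(prefix, i):
                out.append(byte)
                i += len(prefix)
                break
        else:
            out.append(ord(xml[i]))
            i += 1
    return bytes(out)
```

So a Service Indication whose href begins with http://www. loses seventeen characters to one byte before the string table even gets involved.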

This is non-trivial stuff to knock up in a hurry, so it’s just as well that there’s the libwbxml open-source library to handle it all.  That library, however, is in C, and I’m working in Python.  There appear to be no published Python bindings to libwbxml, so it was time to dust off my ancient experience of #include <Python.h> and get to it.

Here’s the C code that allows a Python call to libwbxml’s xml2wbxml function:

static PyObject *wbxml_xml2wbxml(PyObject *self, PyObject *args) {
/*A WB_UTINY is an unsigned char, so we can allow conversion directly from the Python string*/
WB_UTINY *xml;
WB_UTINY *wbxml;
WB_ULONG wbxmllen;
int status;
WBXMLConvXML2WBXMLParams    params;
WB_UTINY *errstr;
PyObject *result;
 
    /* Verify and read a string arg (xml) */
    if (!PyArg_ParseTuple(args, "s", &xml))
        return NULL;
 
    /* Pass that to libwbxml2 */
    params.keep_ignorable_ws = FALSE;
    params.use_strtbl = TRUE;
    params.wbxml_version = WBXML_VERSION_11;
    status = wbxml_conv_xml2wbxml(xml,&wbxml,&wbxmllen,&params);
    if(status == WBXML_OK) {
        errstr = NULL;  /*we return None to mean no error*/
    } else {
        errstr = (WB_UTINY *)wbxml_errors_string(status);
    }
 
    /* Build the return tuple of wbxml, error.
    The wbxml string is binary, so we need to convert it with a z#
    rather than a z.*/
 
    result = Py_BuildValue("(z#z)",wbxml,wbxmllen,errstr);
 
    /* Free the buffer that came back from the converter (thanks, Bob!) */
    wbxml_free((void *)wbxml);
 
    return result;
}

For details, I recommend you to the excellent on-line reference to the Python/C API (link goes directly to the section that explains conversion of values between C and Python).

[edit – removed dead link to the source for this… I don’t have it anymore, sorry!]

Right now, I don’t need to convert WBXML back to XML, so there’s no link to the reverse wbxml2xml function.  Like they say, open-source software scratches one’s own itches.  And if you did want to send the resulting SI in one or more SMS messages to a mobile, you’d also need to wrap the binary WBXML in a WSP (Wireless Session Protocol) header, which is another topic entirely[1].

[0] I suppose I should start calling this the Mobile Phone Content Project, since that’s a more accurate name.  Maybe when the tv adverts start…
[1] I do have working code to do this, so if anyone needs to wrangle WSP, feel free to drop me an email (or ask via blog comment) and I’ll share what I know.

A Matter Of Questions

The PlayStation Project revealed.  Another case study of Python and open-source tools on another Interesting Problem.

A Little Bit Of Background
E3 thunders towards us on the calendar[0] and mortal development teams quail before the onrushing juggernauts of deadlines.  Apart from our team, for whom the E3 deadline was a while back.  Now we have a whole new set of amusing comedy deadlines for beta tests, but that’s not important right now.  E3 is a big deal for us because it’s where the PlayStation Project is now exposed to the nerveless, searching gaze of the Eye of Sauron… no, wait, I mean The Games Industry Press.  Same thing, smaller spiked helmets, as I understand it.

I’m a regular reader of Penny Arcade.  Not that I’m really a player in any sense these days; my favourite game is still Homeworld2, which is pretty much gathering dust on the shelves of most gamers’ rooms.  But I like Tycho’s writing, and Gabe’s style of art.  A while ago, they posted some comments from Geoffrey Zatkin, on the subject of new ideas for games.  Here’s a quote:

At PAX this year I was a judge for their “pitch your idea for a game” sit-in. I got to break a lot of hearts by telling the audience a very sad fact – that in my 8+ years as a professional game designer, not once has any boss of mine ever asked me for an idea for a new game. Not once. Again, unless you own the company, you get assigned a project (or jump ship to another company working on a game that sounds interesting). Sure, I’ve helped flesh out any number of games from concept to fully realized design. And that’s the hard part. Coming up with a good idea for a game is like coming up with a good idea for a novel. Everybody does that. But very few people have the discipline to sit down and write the book. The ideas are easy – the execution of the idea is the hard part.

But that’s what we’ve done; we came up with a new idea, something that genuinely hasn’t been done before[3], and we’ve done a massive great chunk of the work of executing that idea.  And it’s been interesting, in the best senses of the word.

It isn’t a first-person shooter.  It’s not a racing game.  No busty women swing over pits and solve puzzles.  It’s something a bit new, in a number of ways.  The game’s called Buzz![1], and it is, at heart, a music quiz.  There are nearly 1200 different clips of tracks involved, with a total of over 47,000 questions, in ten languages.  Our development partners do the clever 3D interface work (and a damn fine job they’ve done of it as well).  It’s been our job to gather, generate, edit, collate, audit, process and provide all of this data on which the quiz is based.  So, since this blog is (ostensibly) chiefly about techie things, I thought it might be interesting to explain the set of open-source software and tools that we’ve used to manage all of this.

Of Databases And Babel
Let’s start with the database engine (since that’s the core of it all).  The requirements that I drew up[2] were pretty much these:

  1. Open-source database.  There is no religious reason for this.  It’s a budget thing.
  2. That interfaces well with Python.  More on this below.
  3. And that supports Unicode properly.  I mean; has tools that support input and output of international character sets.  And that does Unicode via the Python interface.
  4. Supports transactions.
  5. Accessible from Windows tools (ODBC, Python)
  6. Runs on Linux.  All our serious development servers run Fedora.

Given the above, it was pretty much a question of MySQL or PostgreSQL.  I chose MySQL for two reasons.  First; I’d used it many times before.  Second; the various Python/PostgreSQL packages all seemed to lack in one way or another, especially when it got down to sorting out Unicode.  It may be that there are neat solutions to any/all of the issues I found, but there seemed to be no great advantage to swapping a database I was familiar with for another I wasn’t.  MySQL (as of version 4.1.1) also has excellent Unicode support and speaks unto Python via the truly excellent MySQLdb package, courtesy of Andy Dustman, to whom I shall one day build a small shrine.
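For the record, the Unicode plumbing mostly comes down to two connection arguments. A minimal sketch (hostname, credentials and database name are invented, and the argument names are as I remember them from MySQLdb, so check your version’s docs):

```python
import MySQLdb

# Hypothetical connection settings; the charset/use_unicode arguments are
# what make MySQLdb hand back unicode objects instead of raw byte strings.
conn = MySQLdb.connect(
    host="dbserver", user="quiz", passwd="secret", db="buzz",
    charset="utf8",      # talk to the server in UTF-8
    use_unicode=True,    # return unicode objects to Python
)
```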

None of this would be much use if the general query tools for MySQL (like the Query Browser) didn’t work properly with international strings on Windows.  All our desktop machines run XP, and it’s critical that display, edit and copy/paste of strings in any language work.  Well, as long as you choose the right tools, they do, but that’s the nature of Unicode work for you.

The Language That Gets Everywhere
Why, you might ask, purely because it would help me move on to the next point, do you need a database that links with Python?  Thanks for asking.  Early on in the whole development process, I gave a lot of thought to the jobs that we’d have to do.  There were some guiding principles I followed.

  1. Whatever we think we’re going to be doing, it probably won’t happen the way we think it will.  Columns will change meaning.  Entities will be discarded and new ones appear.  Be flexible.
  2. We’re going to be gathering data from a zillion different sources.  Any data we capture is going to be dirty and need auditing and cleaning.
  3. There’s going to be so much data that any manual task applied to the thousands of records we’ll have might mean that we’ll run out of time.  Automate.

It would be nice, one day, to work again on the sort of project for which BigDesignUpFront would be applicable (that doesn’t, of course, mean that I’d do it; I’m pro-iterative myself).  Unfortunately, when you’re starting to gather data well in advance of knowing how that data will be used, you need to come up with something that’ll handle Big Change.  Thus “the database” in our little world doesn’t really mean the MySQL repository in which the data’s kept.  It’s MySQL plus thirty or so Python scripts and modules that Do Things To Data.

It’s always seemed to me that there are two approaches to database work.  I shall, for the purposes of discussion[4], class them as Database-centric and Code-centric development.

The database-centric approach was typified for me by a database developer I worked with at breathe, several years ago.  His approach to starting the working day was to (a) sit down at PC, (b) fire up a SQL Server client.  And that was him done; everything (and I mean everything) else was done from within the client.  List a directory?  Do some file copying?  It could all, apparently, be done from within the Database That God Gave Him.  A nice enough guy, but a tad fixated.

Of course, there are lesser examples of the approach and there’s much to be said for keeping the business logic next to the data in funky stored procedures.  But I never liked the languages in which they were written, nor the way in which the rectangular nature of the data forced its way into every corner of the design.  I am a dyed-in-the-wool code-centric guy.  And, naturally, the code-centric approach works thusly: your database holds your data.  Operations on that data are carried out by code, which fetches the data (whether that be into simple structures, objects that relate to the schema or objects that relate to whatever the hell they want to), Does Stuff to the data and rewrites the data back to the database.

The big point for me, though, that favours code-centricity is that it’s proven in the past to be more flexible in the ways that matter to me.  Maybe it’s a sign that I came to databases after high level programming languages (and to those after assembler).  Whatever; I like the code paradigm.  And in this case, it meant that I didn’t even attempt to create a vast and all-encompassing ERD that encapsulated every last semantic of the data.  We just created a base set of simple tables that the data gathering team could begin to populate.  Over time and as prototypes of the game were created and revised, the schema grew to reflect the uses of the data; now it’s pretty complex.

So for this project, the phrase “the database” usually refers to the set of tables in the MySQL server plus the code that operates on it.  And that’s all in Python.

There were a couple of other options open to us.  We could have started in Access… but apart from the obvious scale issues, we needed a proper multi-user database that could be efficiently hosted on a remote server, with replication.  Also, I’m not a great fan of VB as a development language; for quick-and-dirty user interface jobs it’s great.  Slapping together forms to edit data?  Access can be the solution you need, using linked tables to get at the MySQL data where necessary.

We could have used Java for the main development, but working in a dynamic language has rather spoiled me for Java in odd ways.  Most of the supporting code for the database is all in a single Python module and the convenience of firing up a Python interpreter and being able to do ad-hoc processing at the command line was too good to pass up.  In fact, as we’ve approached the last few deadlines there have been many, many audits and sensibility checks all run from inside a Python interpreter (often Quasi).  The Java development cycle is just that bit too slow, and the definitions of objects a little too fixed to match that convenience.

The Swiss Army Chainsaw
There’re many audio files involved in this project, which means that sox has been a complete necessity at times.  Same rule applies as with the data; for a small company, even the simplest, fastest manual processing can be impossible when multiplied up by thousands of files.  Thus automation’s been key here, allowing all the samples to be crossfaded and normalised in batches.  For a lot of editing work, though, there is really no substitute for having the waveform visible on a PC and here, for once, proprietary tools have really won out.  Two of us in the company are musicians with home studios, and applications like Cool Edit or Wavelab have done a lot of the work.

So that’s it.  Perhaps it’s not an in-depth industry exposé of the use of open-source stuff to create the next Duke Nukem or Doom, but there’s a real project here resulting in a real game and that’s a snapshot of some of the ways we did it.  And are continuing to do it… there’s still work to do.  Back to Eclipse and QueryBrowser for me…

[0] RSS feeds and individual workloads being what they are, you may well be reading this after E3 2005 has come and gone.  In which case, try to imagine yourself back in the heady old days of May 2005.  Feel the period atmosphere.  Good, isn’t it?  Do they have flying cars yet in your time?
[1] Yes, the exclamation mark’s part of the name.  Like Yahoo!
[2] This was way back at the end of 2003.  It’s taken a long time to get this thing under way, and I might do another blog entry to explain why, and what happened along the way.
[3] True.  Yes, there have been quizzes.  Yes, there have been quizzes with clips of music (like the DVD Pepsi Challenge game, or the CD-based Spot The Intro).  But nobody’s ever done one with 1200 tracks on it.
[4] As opposed to the Porpoises of Discursion.  I had a small spelling checker issue that was too good to delete entirely…

Greater Than The Sum Of Its Parts

A little case study of Python and open-source tools on a big, complex and yet oddly routine sort of problem.

The Floofs project continues to grow, with distributors signing up in many different countries.  This means that we have the job of producing many, many different sizes of any given animation to suit their requests.  And every animation can have several different watermarked variants; web preview, WAP preview, distributors logo, no logo, etc., all of which may need to be regenerated if the source image is updated.  It all builds up into one huge asset management job: over 153 000 files at the last count.

So how does a geek approach a task like this?  Firstly from the standpoint that has stood us in good stead for many years and shall continue to be our watchword into the future. Do as little work as possible by automating everything.  The second principle, given that this is a self-funding project, is to avoid the use of fancy content-management “solutions” and build as much as possible from open-source software.

The overall job of finding that which has changed/is new, building that which needs to be built and uploading that which needs to be uploaded is an absolutely canonical task for make; no prizes for guessing that.  But the sheer complexity of the makefiles (and the need to keep several hundred of them up to date) seemed to imply a mammoth task of rule creation and macro generation.  Being (a) lazy and (b) a programmer at heart, I opted for a better solution.  Write a Python script that creates the Makefiles.

The overall Python script takes around twenty seconds to run per group of around nine animations.  Given that this is on a 1Gb dual-processor build server, that might give you an idea of the large number of targets and dependencies involved.  It turns out that it makes more sense to write very big explicit makefiles, in which all the dependencies and commands are laid out in full, than to play with clever Make rules; they save time for humans, but when it’s code writing code, there’s little to be gained.  Essentially, the script gathers a list of the source images and then builds a huge list of targets, dependencies and commands that’s finally sorted and spat out into a Makefile.
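As an illustration of the code-writing-code idea, here is a toy version of such a generator. The sizes, variant names and the resize_gif.py helper are all invented for the example:

```python
import os

SIZES = [(120, 120), (176, 208)]          # hypothetical target sizes
VARIANTS = ["web", "wap", "nologo"]       # hypothetical watermark variants

def make_rules(source_gifs):
    """Emit one fully explicit rule per target -- no pattern rules or macros,
    since the script, not a human, is doing the writing."""
    rules = []
    for src in sorted(source_gifs):
        base = os.path.splitext(os.path.basename(src))[0]
        for w, h in SIZES:
            for variant in VARIANTS:
                target = "out/%s_%s_%dx%d.gif" % (base, variant, w, h)
                cmd = "\tpython resize_gif.py --size %dx%d --variant %s %s %s" % (
                    w, h, variant, src, target)
                rules.append("%s: %s\n%s\n" % (target, src, cmd))
    # An "all" target that depends on every generated file, then the rules
    return ("all: " + " ".join(r.split(":")[0] for r in rules)
            + "\n\n" + "\n".join(rules))

# open("Makefile", "w").write(make_rules(["src/cat.gif", "src/dog.gif"]))
```

Even this toy emits six rules per source image; multiply by thousands of sources and a handful of variants and the case for explicit generated Makefiles over hand-maintained clever ones makes itself.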

In order to make the process more flexible, several of the commands that make will eventually invoke are themselves Python scripts.  Consider the job of resizing an animated GIF.  In theory it’s simple; take the GIF apart into component frames, resize each one, then reassemble.  In practise[0], it’s more complex than that.  GIF frame-sequence compression works best when pixels remain the same between frames, so the resizing process needs to try and ensure that happens even if the set of colours used between frames varies (most single-frame image resizing tools don’t work too well on this).  Also, GIF in-frame compression doesn’t work well with fuzzy edges and gradients, so anti-aliasing can result in big images.  But then again, non-antialiased images look terrible.  So there’s a set of Python scripts designed specifically to handle the seemingly easy job of resizing images without also making them twice the file size[2].

All of this image manipulation must be command-line; there are nothing like enough resources (whether you count time, money or people who grok PSP) to do the work manually in a GUI tool like Photoshop.  So it’d all collapse if it weren’t for gifsicle and ImageMagick.

The first is the best command-line GIF manipulation tool, bar none.  Runs on Windows and Unix.  Free.  And damn good at everything (except resizing, at which it does a non-anti-aliased quick and dirty job).  But for exploding, optimizing, commenting or running a soft polishing cloth over your GIFs, nothing comes close.

The second is the sort of toolset that free software zealots ought to parade down the street as a shining example[1].  ImageMagick tools can perform operations on images for which you’d normally expect to have to fire up the Gimp, PhotoShop or PSP, but from the command line.  Which means that once you’ve sorted out your commands and source materials, doing 153 000 images is as easy as doing one.  Its support for animated GIFs is not as good as for static images, but given that gifsicle can explode a GIF into separate frames and then reconstitute the original after those frames have been modified, the combination of the two is all you need.  Really.
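In practice the combination looks something like this. The sketch just builds the command strings for the explode/resize/reassemble pipeline; the gifsicle and ImageMagick flags are as I remember them, so check the man pages before trusting them:

```python
def resize_commands(src, dest, width, height, workdir="frames"):
    """Build the explode -> per-frame resize -> reassemble pipeline as a
    list of shell command strings (illustrative, not battle-tested)."""
    size = "%dx%d" % (width, height)
    return [
        # unoptimize first so every frame is a full image, then explode
        "gifsicle --unoptimize --explode %s -o %s/frame" % (src, workdir),
        # resize every exploded frame in place with ImageMagick
        "mogrify -resize %s %s/frame.*" % (size, workdir),
        # reassemble the frames and re-optimize the animation
        "gifsicle --optimize=3 %s/frame.* -o %s" % (workdir, dest),
    ]
```

Feed the output to os.system or a Makefile and doing the hundred-thousandth image costs no more human effort than the first.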

And finally, I’d be nowhere without the language with which the IT systems of Paradise are no doubt built; Python.  “Your mileage” (as the Americans like to say) “may vary”, but there are damn few languages that are so completely cross-platform, scalable, supported by decent IDEs and object-oriented.  The ftplib module’s been used to build all the uploaders.  The very funky paramiko module does the same for SFTP.  The only thing that let me down was… the damn PIL.  An imaging library that has some of the worst GIF support I’ve yet seen.  Yes, I know all about the GIF patent issues[3], but de-emphasising support for a de-facto standard because of ideological convictions doesn’t work in the real world.  GIFs are what we’re stuck with; one works with what one has, not what one would wish for in an ideal world.  Still, if that’s the only fly in the Python soup, then I’ll keep eating.

So there; that probably wasted less than five minutes of your day on a brief description of how we manage several hundred thousand images with one command line.  Now excuse me; I must go and type make and watch it do my job for me…

[0] In theory, there’s no difference between theory and practise, but in practise, there is.
[1] Though to be honest, I can live without any more free software zealots, thank you very much.
[2] Part of the secret is dead obvious; always scale down from a larger size to a smaller.  Always.
[3] The biggest issue being that they’re no longer an issue in any area.  And they never were a barrier to writing a decent GIF reader.