You Can’t Get There From Here

Joel Spolsky writes today on the subject of proposed changes to HTML4.  This is interesting, but he doth contradict himself, though indirectly. Actually, I’m being unfair to Joel – most of his requests are for client-side changes, not radical shifts in the nature of HTML and HTTP. But there are implications in there that get me thinking… Yesterday I read his post on How Microsoft Lost The API War, interesting and well-reasoned thoughts on Microsoft past and future, in which he said:

Microsoft grew up during the 1980s and 1990s, when the growth in personal computers was so dramatic that every year there were more new computers sold than the entire installed base. That meant that if you made a product that only worked on new computers, within a year or two it could take over the world even if nobody switched to your product.

So in many ways Microsoft never needed to learn how to get an installed base to switch from product N to product N+1. When people get new computers, they’re happy to get all the latest Microsoft stuff on the new computer, but they’re far less likely to upgrade. This didn’t matter when the PC industry was growing like wildfire, but now that the world is saturated with PCs most of which are Just Fine.

That’s what allowed Microsoft products to become de-facto standards; you upgraded Windows and lo! you got all this new stuff, some of which you wanted but a lot of which came with it and became part of The Great Installed Base.  Most notable amongst these; IE itself.

But IE, like Windows, is out there.  It works; why change?  There are (according to the user-agent logs I see from time to time) still plenty of people using older versions of IE.  In the future, people will use it on phones, or other embedded devices that don’t get updated once bought.  The time when a new version of HTML could be introduced and spread like a better breed of grass over the plains of the world is past.  Sure, it’ll happen, but on nothing like the super-fast-blink-and-you-miss-it days of Internet Time, when whole new iterations of the Web standard could (it seemed) come and go in a month.  Remember Push?

But the relative stability that we have now is, in many ways, a Good Thing.  Paradigm shifts are like major tectonic events to those of us who have to develop close to the ground, in amongst the details; from far enough away in space and time, they’re impressive, possibly even a necessary part of the processes that give us the digital equivalents of the Himalayas, great established mountain ranges of standards and proven ways to use them.  Up close, a shifting paradigm can throw your whole project out of line.  Anyone else still have a Microsoft J++ t-shirt?

All Sweeping Generalizations Are Wrong

There are, I have decided, three types of enthusiast.

The first is the Passionate, a lover of their subject, brimming with a desire to impart that same appreciation to any and all who listen.  They approach their topic with love and derive enjoyment from learning all about it.  These are Good People; often, even though I have no interest myself in that which they appreciate, it’s worth listening to them to catch the wave of their enthusiasm, to be borne up by the joy they take in it.  Many such people occupy Python mailing lists, oddly enough.

The second is the Keeper Of Hidden Knowledge.  This beast likes to guard information, to keep it secret and concealed to maximize its value.  As long as you’re willing to pay the price (which may be simply to acknowledge their tremendous authority), what you need may be vouchsafed to you.  These are… a Mixed Blessing at best, but probably inevitable, like taxes or Kylie Minogue.

And then there’s the third type, the Hierarch; the sort of person who, when encountering another aficionado of his (or, rarely, her) area of expertise, enters into the sort of dance tomcats do on first meeting.  It may sometimes appear overtly friendly, yet the purpose is not to bond but to establish superiority.  The possessor of the most profound and deep knowledge wins and gains (at least in his own eyes) higher status.  Unfortunately, this sort of animal is all too frequently encountered online.  Recently, I’ve been following a discussion (that I shall not dignify here with a link) on the subject of guitars.  Guitars are dear to me; I delight in their playing and (to a lesser extent) in the pleasure of owning such beautiful, elegant and well-crafted instruments.  Into this discussion (on the correct methods of setting up for best playability), enters a new subscriber, keen to discover hints as to the best way to adjust his Floyd-Rose II floating bridge.  Within a few hours of his initial post, the Hierarchs descend, each attempting to out-do the other in arcane and increasingly irrelevant references to finer and yet finer points of string-related detail, mixed with slighting comments on the low status of the “newbie” who dares to enter their hallowed grounds in search of advice.

Not to put too fine a point on it, it’s cretins like this who hang out on Linux-related IRC channels and laugh at those who don’t yet grok the deeper mysteries of hdparm.  You can also find them in builders’ merchants, where they’ll attempt to compensate for being paid pitiful wages by sneering gently at anyone who’s a little unsure of the exact difference between grades of nail, or possibly bolt.  A pox upon them all, I say.

Hail Tobor

A snippet.  A Zope robots.txt, as a Python Script, that returns different values for test and production sites.  Handy if your development site is also open to the rest of the Net.  The date and time are in there so I can see the last time the script ran; I recommend caching these via an AcceleratedHTTPCacheManager.

request = container.REQUEST
RESPONSE = request.RESPONSE
RESPONSE.setHeader('Content-type', 'text/plain')

#Return a string identifying this script.
host = request.SERVER_URL.lower()
if host.startswith('http://'):
    host = host[7:]
print "#Robots.txt for host %s" % host
print "#Generated " + str(DateTime())

#This is a list of elements that mark the host as being a
#development server.  Edit to put your own in, or set
#devServer according to any criteria you please.
devMarks = ['test', 'internal']
devServer = False
for m in devMarks:
    if host.find(m) >= 0:
        devServer = True

if devServer:
    #Running on the development server; request no indexing
    #at all.
    disallow = "/"
else:
    #Running on the production server;
    #allow indexing of everything.
    disallow = ""

print "\nUser-agent: *\nDisallow: %s\n" % disallow
#Add in prints for global disallows right here.

return printed
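
For a development host that matches one of the devMarks (a hypothetical test.mysite.com, say), the output looks something like this, give or take the exact timestamp format:

#Robots.txt for host test.mysite.com
#Generated 2004/06/21 10:15:00 GMT+1

User-agent: *
Disallow: /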

The Binds That Tie

Since Googling finds nothing on this subject, I thought I’d save the results of a little investigation into Zope, caching and Page Templates (also known as ZPT or PageTemplates).

The Zope AcceleratedHTTPCacheManager is, in combination with Apache or Squid, an excellent thing.  It’ll set headers on your HTML pages to allow them to be cached by clients, proxy servers or (and most usefully), an Apache installation running in reverse-proxy mode as a “front-end” to Zope.  Thus repeated requests for the same data are served by Apache and don’t load your Zope instance; considering the difference in speed, that can be crucial.

But it’s not clear (or at least, I haven’t been able to find out) what aspects of a “call” to a page template allow it to be cached, and which cause the template to be re-rendered.  Reading the source for AcceleratedHTTPCacheManager helped, in that a comment at the end pointed me to http://www.web-caching.com/proxy-caches.html.  You might also want to take a look at RFC2616 for the low-level dope on how caches work (you want section 13).  The big question we want to answer is; “what is the key that is used to look up data from the cache?”

The short answer, for HTTP caches, is that it’s done on the URL[1].  Assuming you’re a Python person, think of a cache like a big dict, with the URL as the key and the result of the page as the value.  Something to bear in mind here is that cookies are not part of the URL, so if the contents of your template are affected by the values of cookies, you may well not want to cache it using an external cache.  Similarly, many caches won’t cache anything that contains an Authorization header (it’s not part of the URL), so if you need to log into your site via the standard browser username/password dialogue, that might well mean your templates won’t be cached.
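
To make the dict analogy concrete, here’s a minimal sketch of what an HTTP cache does; renderTemplate() is a hypothetical stand-in for Zope doing the real work:

#A toy cache keyed on the URL, as an external HTTP cache sees it.
#Note that cookies and Authorization headers are not part of the
#key, which is exactly the problem described above.
cache = {}

def fetch(url):
    if url in cache:
        #Hit: the front-end answers; Zope never sees the request.
        return cache[url]
    #Miss: Zope renders the page and the result is remembered.
    result = renderTemplate(url)
    cache[url] = result
    return result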

Consider a typical GET request that contains a query, something like http://my.test.site/MyTemplate?parameter=value.  The parameter and the value are both part of the URL, so they’re part of the key.  However, if you use a POST request, then the parameters are not part of the URL.  This almost certainly means that your template will get invoked every time (which is the safest course of action).  Remember how Internet Explorer says something like “This page cannot be reloaded without resending the data”?  That’s why.  You might also (depending on how long you’ve been online) remember how Netscape used to say “Repost Form Data?” in the same situation.  That always took my award for the most obscure error message ever; three words, each of which meant nothing to the average user.  Incidentally, for an interesting summary of the rationales that led to GET and POST and the recommendations for which should be used, see this document; in short, it recommends using POST for requests to do critical things like ordering products or charging money, where a re-request might be a Bad Thing.  You’ll also learn the word “idempotence”; use it in conversation today!

As a fallback for misses on an AcceleratedHTTPCache, the RAMCacheManager can make much more use of the Zope environment in which the template is executed when it builds its key.  A text file accompanying the source says “The values in it are indexed by a tuple of a view_name, interesting request variables, and extra keywords passed to Cache.ZCache_set()”, which doesn’t help us very much.

Time for some empirical observation.  A ZopePageTemplate object is cacheable because it inherits from OFS.Cache.Cacheable.  In the ZopePageTemplate.py file (in lib/python/Products/PageTemplates), in the _exec method, I added a line to dump out the keyset used; this is visible when you run Zope via the runzope script in the bin subdirectory of the Zope instance.  For reference, here’s where it goes:

        # Retrieve the value from the cache.
        keyset = None
        if self.ZCacheable_isCachingEnabled():
            # Prepare a cache key.
            keyset = {'here': self._getContext(),
                      'bound_names': bound_names}
            #Add this line in to dump the keyset.
            print keyset
            result = self.ZCacheable_get(keywords=keyset)
            if result is not None:
                # Got a cached value.
                return result

The output you get is something like this, which came from invoking a Page Template on an instance of MyProduct with a URL like: http://www.mytest.site/Folder/Object/ScriptName/flash

{'bound_names': {'traverse_subpath': ['flash'], 'options': {'args': ()}, 'user': Anonymous User}, 'here': <MyProduct instance at ...>}

So, we can see that the cache key includes most of the stuff that will make a difference:

  • user; bear in mind that if you’re building a public-access website (as opposed to some Intranet product) most of your users will be Anonymous User
  • here; the object on which the page template is invoked
  • traverse_subpath; the route that acquisition took before finding the script name (in effect)

These are (as far as I can tell) in addition to the URL itself, so variations in query parameters or the path by which the template is invoked cause the cache to see the requests as different.

[1] Yes, I know – this is a summary, ok?  Strictly speaking, it’s done on the URIs, and with a bunch of caveats about case-sensitivity and the like; see RFC2616 3.2.3 for more…

Contains The Seeds Of Its Own Deconstruction

Unicode is both wonderful, and yet not.

It’s wonderful in that there is a worldwide (in effect) standard, widely supported, that allows for reasonably straightforward handling of strings in most any character set that one might need to consider.  The demons of complexity, however, crawl from their hiding places when it comes to dealing with the interface between, say, the Python implementation and the environment in which it might run; specifically, for me, when dealing with Unicode text files on Windows.

Let’s have some context: Windows is rather happy with Unicode, especially the more recent incarnations like XP.  Notepad will eat and spit out files in UTF8 and “Unicode” (which is actually UTF16) form, marking them with appropriate BOMs, Byte Order Marks.  These serve a dual function; they identify the exact encoding of a file and also allow Windows tools to recognise it as Unicode rather than plain text.  I have, as part of The Mobile Phone Project, to deal with such files.
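
For reference, the marks themselves are short, distinctive byte sequences at the start of the file; in Python they look like this:

import codecs

#The BOMs that Windows tools commonly write, as byte strings.
print repr(codecs.BOM_UTF8)      #'\xef\xbb\xbf'
print repr(codecs.BOM_UTF16_LE)  #'\xff\xfe'
print repr(codecs.BOM_UTF16_BE)  #'\xfe\xff'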

The question is then, how do we sensibly handle such files?  The codecs module provides a neat open() that returns a file-compatible wrapper to read them in, provided that you know the encoding of the file in advance.  But this is not always possible; we all know that given a possible error, a user somewhere, sometime will make it, and provide me a file that is in UTF16, not UTF8.

Well, just like Notepad or Word, we look at the BOM.  The codecs module provides a set of BOM constants, all defined as Python strings.  Why strings and not Unicode types?  My guess is that the typical use of these is to match the start of byte strings, to detect the encoding, so it makes sense to have them in the same form as byte strings.  However, this isn’t quite enough – a BOM itself will convert to a valid Unicode character, but not a useful one (it’s a zero-width non-breaking space), so it’s enough to mess up many parsers.  We need the conversion to discard the BOM.

Thus we can start to write some code:

import codecs

#Not all BOMs have an appropriate equivalent codec.  However,
#these are the BOMs encountered on Windows.
BOMmap = { codecs.BOM_UTF8    : 'utf_8',
           codecs.BOM_UTF16_LE: 'utf_16_le',
           codecs.BOM_UTF16_BE: 'utf_16_be' }

maxBOMlen = max(map(lambda x: len(x), BOMmap.keys()))

def mapBOMToEncoding(data):
    """Given a string in data, map any BOM found to an
    encoding name.  Return a tuple of encoding, length of BOM.
    Return None, 0 if no match occurred."""
    for b in BOMmap.keys():
        if data.startswith(b):
            return (BOMmap[b], len(b))
    return (None, 0)

This gives us a way to map a BOM found in a string to an encoding as well as the length of the BOM so that we can discard it.  Which makes for a simple function:

def encodedToUnicode(data, errors='strict'):
    """Given the data in a string, decode any BOM at the start to
    deduce the encoding, and return the Unicode string.  If the data
    does not hold a BOM, treat it as UTF_8.  UnicodeDecodeError exceptions
    may be thrown for invalid data.  The errors parameter is passed to the
    decode() function and may be the usual values ('strict','replace','ignore')."""
    (encoding, offset) = mapBOMToEncoding(data)
    if not encoding:
        #If no BOM match was found, try utf8, since that
        #will eat ASCII properly.
        encoding = 'utf_8'
        offset = 0

    return data[offset:].decode(encoding, errors)

The above is a function suitable for use with data read from a builtin file, as in:

#Note the "rb", necessary on Windows to prevent text mangling.
f = open('IThinkItsUnicode.txt', 'rb')
#read all the data in one go
data = f.read()
f.close()

udata = encodedToUnicode(data)

But, of course, not all files are suitable to be yanked into memory in one go.  What would be useful would be a wrapper for a file object that detects the BOM, such as is already provided by the codecs module.  Here’s a function that attempts to detect the BOM and return a codecs.StreamReader that decodes the data on the fly.

def openBOMFile(filename, errors='strict'):
    """A wrapper for codecs.open() that returns a codecs.StreamReader for the
    file, as determined by BOM.  No BOM means we assume UTF8.  The errors parameter
    is as passed to decode() methods and defaults to 'strict'.
    Opening the file via this method will cause the first few bytes to be read
    immediately.  The file must be rewindable (allow seeking backwards from the
    current tell())."""
    encoding = None
    offset = 0
    #The builtin open; we want the raw bytes, so binary mode.
    file = open(filename, 'rb')

    #Get the first few bytes of the file to check the BOM.
    data = file.read(maxBOMlen)
    if data:
        (encoding, offset) = mapBOMToEncoding(data)
    if not encoding:
        encoding = 'utf_8'

    #Seek to the first character after the BOM.
    #Do the seek even if the offset is zero, otherwise
    #we may lose the first byte of the file.
    file.seek(offset)
    (e, d, srf, swf) = codecs.lookup(encoding)

    #Generate and return an appropriate StreamReader.
    sr = srf(file, errors)
    sr.encoding = encoding
    return sr
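
Usage is then just like the builtin open(); a quick sketch, with a hypothetical filename:

#Open a file whose encoding we don't know in advance; the BOM
#(if any) picks the decoder, and is stripped before we see the data.
reader = openBOMFile('IThinkItsUnicode.txt')
print reader.encoding    #e.g. 'utf_16_le'
udata = reader.read()
reader.close()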

The codecs module provides a number of useful classes for reading and writing encoded files.  However, the StreamWriters don’t write the BOM, so here’s a quick example of writing a UTF8 file that Notepad will handle.
First, here’s the method that does the writing; returning a codecs.StreamWriter object, with the BOM written.

def writeUTF8File(filename, mode='wb', errors='strict'):
    """Open the file for writing, write the UTF8 BOM and
    then wrap the file in a codecs.StreamWriter so that
    it can be used to spit out Unicode."""

    #Ensure that the mode is binary.
    if not 'b' in mode:
        if mode.endswith('+'):
            mode = mode[0] + 'b+'
        else:
            mode = mode + 'b'
    file = open(filename, mode)

    #We only write the BOM if the mode implies truncation and
    #writing; we can't do anything if the file already exists.
    if mode.startswith('w'):
        file.write(codecs.BOM_UTF8)

    (e, d, srf, swf) = codecs.lookup('utf_8')

    #Generate and return an appropriate StreamWriter.
    sw = swf(file, errors)
    sw.encoding = 'utf_8'
    return sw

And here’s a call to it.  Be careful if you copy-and-paste this; the “Москва” in the comment is a Unicode string and by default, PythonWin won’t like it, which is why I used Unicode escapes to build the literal.  Cyrillic is useful for test data, since all the characters are outside the eight-bit range (which is not true of most Western European sets). Incidentally, I noticed that the version of this post that was grabbed for the Artima Python Buzz pages replaced the Unicode with ‘?’ characters, indicating that it can’t handle Unicode web content.

    f = writeUTF8File('d:\\temp\\test.utf8','wb')
    #write 'Moscow' in Cyrillic/Russian (Москва)
    f.write(u'\u041c\u043e\u0441\u043a\u0432\u0430\r')
    f.close()
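
And, to close the loop, the file we just wrote can be read straight back with the openBOMFile() function from above:

#Read it back; the UTF8 BOM selects the decoder and is discarded.
f = openBOMFile('d:\\temp\\test.utf8')
print repr(f.read())    #u'\u041c\u043e\u0441\u043a\u0432\u0430\r'
f.close()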

Unix systems, according to some apparently-biased sources, don’t tend to use BOMs – they would break conventions such as the initial #! syntax for shells; thus on Unix systems you need to be sure of the encoding you’re dealing with (in theory, the current locale defines the format of all input and output files, but few of us are lucky enough to have all our data and processing within a single locale).  That doesn’t, of course, mean you can’t use BOMs in your own data files; they may have been a Microsoft suggestion, but they’re part of the Unicode standard and help address practical problems.  The back-end systems that support The Mobile Phone Project are all Linux-based and use UTF8 (with BOMs) throughout.
Having said that, according to Unicode.org, KDE from v1.89 and later versions of GTK (from 2) are Unicode-compliant, using UTF16 or UTF8 internally.

One Of These Things Is Not Like The Other

I’ve been playing with Leonard Richardson’s useful BeautifulSoup module; a lazy, doesn’t-care, do-it-anyway parser for HTML.  This is, in turn, because I’ve been trying to knock up a little lookup application that will do translations using the WordReference site, but without the ads,  popups, popunders or the hassle of clicking on forms.  I’m after something that I can drop a word on and have it translated.  Both HTMLParser and htmllib choked on the output from a page such as these definitions of ‘cara’, so I turned to BeautifulSoup.

Which did almost exactly what I wanted – it ate the HTML and built me an object tree that I could then walk, filtering out what I didn’t need.  Unfortunately, it and I suffered from a small mismatch of worldview.  I use Unicode.  A lot.

I grabbed the webpage using urllib, something like:

import urllib

uo = urllib.FancyURLopener()
uo.addheader('Accept-charset', 'utf-8,*')

f = uo.open("http://www.wordreference.com/es/en/translation.asp?spen=" + urllib.quote_plus('cara'))

#Decode the response so we have a unicode string; we always get
#iso-8859-1, no matter what we ask for.
response = f.read().decode('iso-8859-1')

#Finished with the request.
f.close()

(It’s actually a little more complex – you need to handle the character sets more flexibly, and override the user-agent so that WordReference doesn’t block you).
Anyway, that gets me a Unicode string in response.  I can then pass it to a BeautifulSoup object, with:

soup = BeautifulSoup.BeautifulSoup()
soup.feed(response)

But… calling soup.first() (or a number of other functions) can throw me the notorious UnicodeEncodeError.  Hmm.

It turns out that BeautifulSoup is, for want of a better term, Unicode-oblivious.  If you give it a Unicode data source all the internal strings get silently promoted, but there’s no specific Unicode handling in there.  This is not a bad approach, and would work very well, if it weren’t for the fact that the objects use str(), a lot.  Printing any BeautifulSoup instance invokes str() to return a string representation, which uses the default encoding, which is often ‘ascii’.
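
You can see the failure mode without BeautifulSoup at all; in Python 2, applying str() to a Unicode object implicitly encodes it with the default codec:

import sys

moscow = u'\u041c\u043e\u0441\u043a\u0432\u0430'   #'Москва' again
print sys.getdefaultencoding()   #usually 'ascii'
try:
    str(moscow)   #implicit encode to the default encoding
except UnicodeEncodeError, e:
    print "str() fails:", e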

Implicit in the design of BeautifulSoup is the assumption that str() is a good way to represent/return the “value” of an object.  For a Tag object, __repr__ calls __str__.  Given that the objects here are derived from a stream of characters, that’s not unreasonable, but it misses the point that __str__() is usually supposed to return a printable representation, in the default character encoding.  When the result is Unicode that can’t be converted to a string, that assumption breaks.

I think what would make more sense (from a Unicode point of view) would be to separate the value of the data from the representation of the data, so that one (for example) accessed the NavigableText.string data attribute (via a function wrapper) to get the value, but accepted that str() applied to an instance would do something like:

def __str__(self):
    """Return representation of self, omitting characters that can't be printed."""
    return self.string.encode(sys.getdefaultencoding(),'replace')

Value and representation.  Two things that can look the same, but aren’t.

Oh, and I still like BeautifulSoup very much; so much so that I’m using it, with a patch to avoid the problem, submitted to Leonard.

Swapping It All Back In

The blog is light today, and yesterday too.  This is not for lack of inspiration, or even of interesting Pythonic subjects… it’s because some “thieving little scrote” (to quote one of my colleagues) helped himself to two laptops from our office yesterday.  In broad daylight.  I think the British term is “brass neck”.  Anyway, I’ve spent most of the day installing everything I need onto a shiny new Vaio.  Then I started running Python scripts… it’s then that one discovers all those little extra packages that one installed and found ever so useful, and that one now needs to go and reinstall for the whole thing to work.

Thank the God of Sysadmins for backups…

A Hazelnut In Every Bite

For some reason I seem to be channeling British TV adverts of the past this morning… so far it’s been “Topic” (hence the, erm, topic of this entry) and earlier it was Kia-Ora (which was, as you may recall, too orangey for crows).

Bob/entrepum pointed me at the Enthought Python distribution a while ago, and it’s as Chock Full O’ Nuts as anything I’ve seen in a long while. You gets yer basic Python plus wxWindows, PIL, VTK, etc, etc. A complete suite of packages all together, aimed (it seems to me) at those using Python in a data-analysis kinda way (otherwise, why the Fortran-Python interface generator?). Being a loosely typed sort of person myself, I could live without Traits but in general it’s a nice bundle.

I got an email from a friend asking for a couple of little utilities I’d mentioned to him, so I duly sent him the Python source. Then I had to point him at a distribution (he’s a systems guy, not a programmer, so he doesn’t install a language on his machines as step 3 after opening the box). Then I had to find him links to a couple of packages I’d used which weren’t in the distro. And so on, and so forth. This happened right before I had the time to type Enthought into Google and find their site, which made me appreciate how rounded a collection they’ve put together. As soon as they get 2.3.4 in there, I’ll probably replace ActiveState with it entirely. It’s the package you can install between meals, without losing your appetite.

Is There Anybody Out There?

Broadband, on the development where I live, is a slightly touchy subject, not something that one raises with the neighbours unless in the mood for a shared rant at the nature of lossy communication channels (and telecoms companies). Here, most houses are outside the BT official limit for ADSL (59 dB downstream attenuation, more often quoted as “5.5km” by the press); thus not everyone who wants to be wired up is wired up. Pity, since it’s the sort of affluent work-at-home part of Cheshire which ought to be prime broadband territory. Some of us, though (if you’ll allow me a moment’s smug grin) are connected, due to a policy of treating visiting BT engineers like royalty and plying them with tea and biscuits at the slightest provocation. My router tells me that my line’s downstream loss is 61dB with an SNR of 9dB – pretty marginal, but it works fine for a 512kbps connection. It’s routed around the house on CAT5 and 802.11g wireless.

Being a security-minded sort of person (I used to run systems and networks for an ISP; it stays with you) I, naturally, VPN my way in to the house network from the wireless zone. Mostly I leave the wireless connection open, so that any stray passing person in need of bandwidth can hop on and use it. A little Python script sits on the Linux server and uses SNMP to watch for visitors.

It’s never seen anyone. Ever. In six months.

Now, occasionally the malevolent forces of fate force me to travel to places like London where (according to the tales) the streets are paved with WiFi access. Whenever I’m there I have my laptop, wireless card plugged in, and I keep an eye out for (a) other network cards and (b) access points. In the metropolitan heart of the Smoke they’re legion; I’m sure scarcely a packet is sent without a collision. Yet step away from the major streets or business parks, away from the International Hotels[1] and conference venues… and it all falls away, the airwaves bereft of SSID broadcasts, the NetStumbler scans revealing nothing. Sometimes I wonder if the commentators on the Wonderful Wireless World Of WiFi are like a native tribe, reporting from the middle of their forest. All around them are trees, trees in every direction they look, their houses are built from trees and the whole world must, they believe, be covered in trees. Yet beyond the borders of their land are deserts.

[1] And what the heck is an International Hotel anyway? One that straddles a major political boundary, with the bar in one nation and the restaurant in another? What an odd term.

Dementors At Every Entrance

I had a need for lists that could hold only items of a certain type. A nice little problem, and one that’s cropped up before, not least in the array module. Unfortunately, that only stores certain simple types, indicated by a string typecode, and these don’t map to Python’s types in a way useful to me. So I whipped up a subclass of list. Until quite recently, as a couple of previous bits of code in this blog show, I have mostly been working with Python versions before 2.0, so this was a useful way to play some more with new-style classes and their associated variations on a theme.

The list subclass is especially interesting, since there are so many different methods that do stuff to it. In order to make it “type safe” (that is, only allow it to contain items of a given type), I only needed to worry about those that changed the values in the list, specifically: __setitem__, __setslice__, append, extend and insert. __init__ of course also takes a sequence. So, we need to check all the ways that rogue values might get into a list, and there are a few gotchas.

Firstly, as I mentioned, there’s not very much reuse of methods in list; extend doesn’t call append, for example, nor do other methods call __setitem__. For good reasons of efficiency, of course; making every single modification of the list items go through the __setitem__ method would make list easy to subclass, but since lists are used so much in all areas of Pythoning compared to the number of times one needs to subclass them, that would sacrifice efficiency for the sake of the exceptional case. A Bad Idea, that. Thus they all need overriding.

Secondly, there’s the question of overloaded operators. If I have a regular list a and a type-safe list b, what should a+b yield? I elected to make life easy – if I really do want the result of adding a regular list to a type-safe one, there’s always extend, and I can always use the result of list arithmetic as the initializer for a new type-safe list, the equivalent of list() itself in its usage as a “cast”.

Finally, there’s the question of how to set the type to which the list items are to be constrained. A drop-in replacement for list needs to support initialization from an existing sequence, but there’s also a common usage where the type-safe list is created empty and then appended to. To support both styles, I allow __init__ to be given either a sequence (which is checked for type consistency, with the type of the first item then used as the constraint), a type (which is used directly as the constraint), or a class. The class option is useful for a class-safe variant on the idea (see below). I also needed to consider what to do when there was an empty initial sequence, and chose to treat that as an exception; I need to get the constraint type from somewhere, and Python’s rule of thumb is to be explicit rather than to make assumptions.

So, here it is. No doubt there are errors and inefficiencies galore…

import types

class TypeSafeList(list):
    """List that limits members to those of a given type (doesn't require typecodes)."""

    def _typeOf(self, x):
        """Return the type of x.  Abstracted into a method so that
        it can be overridden."""
        return type(x)

    def __init__(self, parameter=None):
        """Generate a TypeSafeList that will hold only objects of the given
        type, initialised from the given sequence (first member) or of the
        given type.  If parameter is iterable, we assume it's a sequence."""
        try:
            iter(parameter)
            #Take the type of the first member - an empty list will raise
            #an exception.
            self._type = self._typeOf(parameter[0])
            initlist = parameter
        except TypeError:
            #if it's not a sequence...
            #Verify that parameter is a type or a class.
            if isinstance(parameter, types.TypeType) or type(parameter) == types.ClassType:
                self._type = parameter
                initlist = []
            else:
                raise TypeError, "Parameter must be non-empty sequence, type or class."
        #Validate initlist for type consistency.
        map(self._checkType, initlist)
        #Do parent init.
        list.__init__(self, initlist)

    def _checkType(self, x):
        if type(x) != self._type:
            raise TypeError, "Value %s (%s) is the wrong type for this sequence; should be %s" % (str(x), str(type(x)), str(self._type))

    def __setitem__(self, i, y):
        """Set entry i to value y."""
        self._checkType(y)
        list.__setitem__(self, i, y)

    def __setslice__(self, i, j, seq):
        """Replace items i to j with the items from sequence seq."""
        map(self._checkType, seq)
        list.__setslice__(self, i, j, seq)

    def append(self, x):
        """Append x to the list."""
        self._checkType(x)
        list.append(self, x)

    def insert(self, i, x):
        """Same as s[i:i] = [x]."""
        self._checkType(x)
        list.insert(self, i, x)

    def extend(self, seq):
        """Append contents of sequence seq to the list."""
        map(self.append, seq)
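
A quick usage sketch (the values are, of course, hypothetical); note the last line, which uses plain list arithmetic as the initializer, the “cast” mentioned above:

ints = TypeSafeList([1, 2, 3])   #constraint inferred from the first member
ints.append(4)                   #fine
try:
    ints.append('five')          #wrong type...
except TypeError, e:
    print e                      #...so this reports the error

empty = TypeSafeList(int)        #start empty, constrained to int
empty.extend([5, 6])

#Plain list arithmetic yields a plain list; "cast" it back.
mixed = TypeSafeList(ints + [7, 8])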

An interesting and useful fact: all the types in the types module are of the same type (types.TypeType). That is, type(types.BooleanType) == type(types.IntType) and so forth. And here’s a subclass that constrains members to be an instance of a defined class or subclass; a class-safe list. This is actually what’s most useful to me:

class ObjectList(TypeSafeList):
    """Type-safe list of objects; only members of the fixed class or subclasses allowed."""

    def _typeOf(self, x):
        """Return the class of x.  Abstracted into a method so that
        it can be overridden."""
        return x.__class__

    def _checkType(self, x):
        """Verify that x is an instance of the right class."""
        if not isinstance(x, self._type):
            raise TypeError, "Object of type '%s' is the wrong type for this sequence; should be '%s'" % (str(x.__class__), str(self._type))
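
For example, with a couple of hypothetical classes (in honour of the guitar thread above):

class Guitar:
    pass

class Telecaster(Guitar):
    pass

guitars = ObjectList(Guitar)    #constrain to Guitar and its subclasses
guitars.append(Guitar())
guitars.append(Telecaster())    #a subclass instance is fine too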

As the saying goes, the journey is the reward, and this was interesting to do. What more can one ask for of a Monday morning?