Down Amongst The Ones And Zeroes

For anyone who ever programmed in assembler, especially back in the days of the Z80, 6502 and 8086.
For anyone who ever had their eyes go wide the first time they saw a Macintosh.
For anyone who’s ever lain awake at night pondering on ways to save bytes or CPU cycles in some complex little algorithm.
If you’ve never read the stuff on Andy Hertzfeld’s, go there now.  Be warned, you’ll be reading for hours.

Packing It All In

Packing with Python, unpacking with Java

Two of the projects I’m working on right now are heavily J2ME-based.  The memory constraints in this environment are downright scary at times; takes me back to the days of the Z80, when men were real men and a sixteen-bit register counted as wide-open spaces.  But enough of the nostalgia.  The point of this entry is to make Useful Notes on using the power and simplicity of Python to prepare data for unpacking in the restricted environment of a J2ME MIDlet.  In this context, I’m talking about packing data into some binary format that takes up minimal space in the MIDlet (to save download time and installation space) and can be unpacked on demand, requiring minimal memory at runtime.

Why mess around with two languages like this?  I see it as a ‘right tools for the job’ approach.  There’s no real alternative to J2ME for what I want to achieve on the handset, but Python easily beats Java when it comes to complex data processing, especially string manipulation.  Thus I use Python for the CPU-heavy task of building the data.

Let’s take the Java side first and see how we can read stuff from a binary asset:

//Load up the metadata file and open a DataInputStream on it
DataInputStream dis = new DataInputStream(this.getClass().getResourceAsStream("data.bin"));
//Now read some basic types from it
byte b = dis.readByte();
int i = dis.readUnsignedShort();
String s = dis.readUTF();

Obvious enough.  A DataInputStream expects the data in the file to conform to a more-or-less standard format.  Now let’s look at some Python code that can write the data in a compatible format so that Java can read it back.

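import struct
#Note that we open in binary mode - forget this on Windows and you'll get unexpected results.
f = open('data.bin','wb')
#Write a byte, an unsigned short and a string in UTF format.
#(42 and 1000 are placeholder values for illustration.)
f.write(struct.pack('!b', 42))
f.write(struct.pack('!H', 1000))
s = 'This is my string, print me yours'
utf = s.encode('utf-8')
#First write the string length in bytes (readUTF reads it as an unsigned short)
f.write(struct.pack('!H', len(utf)))
#Now write the string itself as UTF8, modified to meet the Java UTF8 requirements.
f.write(utf)
f.close()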

You don’t have to make all the calls to struct.pack separately; they can be concatenated thus (see the struct module documentation for more details):

#This is equivalent to the above lines that write a byte and a ushort
#(42 and 1000 are placeholder values for illustration.)
f.write(struct.pack('!bH', 42, 1000))

The leading ! on the format string tells struct to pack in network byte order, which is what’s expected by a DataInputStream (and may be different to the byte order of your machine).
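To see what the leading ! actually changes, here is a small sketch that packs the same values in network order and in little-endian order. The value 1000 and the variable names are just picked for illustration.

```python
import struct

# Network (big-endian) order: most significant byte first.
# This is the layout DataInputStream.readUnsignedShort() expects.
big = struct.pack('!H', 1000)

# Little-endian order, as used natively on x86 machines.
little = struct.pack('<H', 1000)

# With '!' there is no alignment padding, so a byte followed by
# an unsigned short takes exactly three bytes.
combined = struct.pack('!bH', 42, 1000)

print(repr(big))       # b'\x03\xe8'
print(repr(little))    # b'\xe8\x03'
print(len(combined))   # 3
```

Leaving the ! off means struct uses your machine's native order and alignment, which can quietly insert padding bytes that the Java side knows nothing about.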
Writing the strings is a little more complex.  They must be preceded by a two-byte length (the number of encoded bytes, not characters) and are supposed to be in a modified form of UTF8 that guarantees there will never be an embedded single NUL (zero) byte.  I’ve omitted any handling for this in the example above, since text strings don’t tend to contain u'\u0000'.  Depending on your application, you may want to check for them and do an appropriate replace before converting the string to UTF8.
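If you do need the full modified form, something along these lines should do it.  This is a sketch written for Python 3, and java_utf is a name I've made up for the example.  It covers the two places where Java's format differs from standard UTF8: NUL becomes the two bytes 0xC0 0x80, and characters above U+FFFF are written as a surrogate pair of three bytes each.

```python
import struct

def java_utf(text):
    """Encode text in the layout DataInputStream.readUTF() reads:
    a two-byte big-endian byte count, then modified UTF-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL is never written as a single zero byte
            out += b'\xc0\x80'
        elif cp >= 0x10000:
            # Supplementary characters: split into a surrogate pair,
            # then encode each half as a three-byte sequence
            cp -= 0x10000
            for unit in (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)):
                out.append(0xE0 | (unit >> 12))
                out.append(0x80 | ((unit >> 6) & 0x3F))
                out.append(0x80 | (unit & 0x3F))
        else:
            # Everything else matches standard UTF-8
            out += ch.encode('utf-8', 'surrogatepass')
    if len(out) > 0xFFFF:
        raise ValueError('string too long for a two-byte length')
    return struct.pack('!H', len(out)) + bytes(out)
```

Usage is then just f.write(java_utf(s)) in place of the separate length and string writes.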

Storing strings like this is fine, but may take up more room than is needed if all you’re dealing with is 7-bit ASCII.  For one project, I borrowed an old WordStar trick.  If you ever dumped a WordStar file, you’d notice that there were no spaces in the text.  Instead, the top bit of every character that preceded a space was set[0].  Here’s Python code that writes out a series of ASCII strings, separating them by setting the top bit of the last character of each string.

#Assume we have an array of strings called 'stringsToWrite', and an open file 'f'
#Write the number of strings first as an unsigned short
f.write(struct.pack('!H', len(stringsToWrite)))
for s in stringsToWrite:
    #Each string is stored as a sequence of ASCII bytes; the final one has
    #the highest bit set.
    #This code expanded for simplicity - there are tighter ways to write it.
    if not s:
        #Empty strings saved as a single NUL byte
        f.write(struct.pack('!B', 0))
    else:
        #bytearray gives integer bytes on both Python 2 and Python 3
        asc = bytearray(s.encode('ascii'))
        ln = len(asc)
        for i in range(ln):
            #Clear the top bit
            b = asc[i] & 0x7f
            if i == (ln-1):
                #last char, so force top bit to be set
                b = b | 0x80
            f.write(struct.pack('!B', b))

And here’s the Java that reads them back in.

//Read the number of strings and dimension the array.
int count = dis.readUnsignedShort();
String strings[] = new String[count];

//Choose an appropriate length for the StringBuffer
StringBuffer sb = new StringBuffer(8);
byte b;
//This code is explicit for simplicity - there are tighter ways to write this.
for(int i=0; i<count; i++) {
    do {
        b = dis.readByte();
        //Don't append NUL bytes - these mean empty strings.
        if (b!=0) {
            sb.append((char)(b & 0x7f));
        }
    //Stop at a NUL (empty string) or at a byte with the top bit set (last char)
    } while (b != 0 && (b & 0x80) == 0);
    strings[i] = sb.toString();
    sb.setLength(0); //clear buffer for next string
}
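Before shipping a packed file to the handset, it can be handy to check the format round-trips on the desktop.  Here's a sketch in Python 3 of one of the tighter ways to write the packer, plus a decoder that mirrors the Java loop.  The names pack_strings and unpack_strings are mine, and it assumes none of the strings contain a NUL character.

```python
import io
import struct

def pack_strings(strings):
    """Pack 7-bit ASCII strings: a count, then each string with the
    top bit set on its last byte (a lone NUL byte for an empty string)."""
    out = bytearray(struct.pack('!H', len(strings)))
    for s in strings:
        asc = bytearray(s.encode('ascii'))  # raises on non-ASCII input
        if not asc:
            out.append(0)
        else:
            asc[-1] |= 0x80
            out += asc
    return bytes(out)

def unpack_strings(data):
    """Reverse pack_strings, following the same logic as the Java reader."""
    stream = io.BytesIO(data)
    (count,) = struct.unpack('!H', stream.read(2))
    result = []
    for _ in range(count):
        chars = []
        while True:
            b = stream.read(1)[0]
            if b != 0:
                chars.append(chr(b & 0x7f))
            # A NUL or a byte with the top bit set ends the string
            if b == 0 or b & 0x80:
                break
        result.append(''.join(chars))
    return result

words = ['hi', '', 'yo']
packed = pack_strings(words)
print(repr(packed))            # b'\x00\x03h\xe9\x00y\xef'
print(unpack_strings(packed))  # ['hi', '', 'yo']
```

Each non-empty string costs exactly its own length in bytes, against length plus two for the readUTF format.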

[0] You can find the true programmers in a crowd by asking them all to say what the most common character is in English text.  Non-programmers will say “e”.  Programmers will say “space”. 🙂

Quoting Out Of Context

Transclusion For Fun, Profit & Failure

Via the always-entertaining Yoz Grahame’s feed comes news of the latest issuance from Ted Nelson’s Xanadu “project”; finally the wonders of transclusion are available to mere mortals.

Transclusion, in case you didn’t know, is:

Or any variation on any of those themes.  But, naturalmente, the word itself means more than the actual definition(s).  If you haven’t yet done so, go read the… educational[0] Wired article on Xanadu (and even my blog entry about meeting Ted, linked above from his name).

So in this context, transclusion means an essential part of transcopyright: “a copyright scheme where the copyright holder grants the public a permission to refer to any portion of a given document and publishes the document in a permanent location. This way, anyone can quote the document by referring to it and the reader’s browser will then go to the originator’s server for the original material so that a micropayment can be made to the original author of the material.”  And as such, I believe it suffers from two rather deep and fatal flaws.

Firstly, it’s utopian.  And just like Ted’s words mean more than the definitions, so does that one (hey, anyone can join in this semantic game).  By utopian, I mean that it’s part of a Huge And Grand Project that will fix Something That Is Deeply Wrong.  It’s a boil-the-ocean project; as long as the whole world changes to accommodate it, it will deliver.  But that’s the minor failing.  The worst part is that it’s based on an old-fashioned and outdated notion of what the Net and/or Web are.

In the beginning[1], there was this idea about interconnected content.  All sorts of ideas grew up surrounding it, such as “information wants to be free”, or “the Net interprets censorship as damage, and routes around it”.  One could publish anything, and link to anyone; all was simplicity and all was clear.

But it ain’t like that anymore, even if it once was.  Consider this; one of the recent updates to the Floofs website is to provide links to products on Handango for people who aren’t in the UK and who can’t buy via text message.  To do this, I’ve had to carefully pick apart the long, complex Handango URLs to find out what’s vital to pass and what isn’t[2].  And I know, for absolute sure, that those links are going to break when Handango roll out their shiny new look and feel to Handango Europe.  The key point here is that the source of my quotation/destination of my link is not under my control.

Let’s make an artificial example, because we can and it’s fun.  Suppose I were to put a quotable sentence here.  Let’s choose one of my favourites:

Against stupidity, the Gods themselves contend in vain

Now, I leave it a while, and it gets transcluded all over the shop.  And now I go and change the source, so that it says something else, such as “people who transclude other people’s comments are well known for having dirty underwear”.  All the documents that reference my quote change, magically, all over the world.  This goes on even now, with images.  Webmasters who find their images being relinked without their permission (and hence stealing their bandwidth) often modify the source image to show something either obscene or comical (occasionally both) so that the “thief’s” page is thereby defaced[5].

Of course, the above is a daft example; people aren’t inclined to edit their own quotes[3].  But you and I, as software techie people, know how precarious it is to build any sort of construct on assets controlled by other people.  I don’t want to have to maintain the URLs on my site indefinitely just for the convenience of anyone who’s transcluded my text[4].  What if I die?  Or lose my income and decide that webspace is a costly indulgence?  The web is not a single entity and does not maintain a stable state.  Bits of it are always vanishing.  The last link above (about image stealing) was to an eBay forum page that has now vanished, so I linked to the Google cache instead.  Content on the web is transient.

So, the Web has moved on from its beginnings, as have we all.  Sic transit gloria mundi and all that.  Nowadays it has been overrun and changed deeply by commerce; by content intended to sell you something right now, or give you the news of the moment, as it happens.  None of that will be around in six months; some of it will be gone by tomorrow.  And Xanadu is still based on a vision from the 1960s or 1970s, a vision that’s old and faded now, dimmed by the bright light of the flashing banner adverts that change every page view.  And no amount of proof-of-concept code can bring it back.  And micropayments? I didn’t even go there…

[0] I pause before choosing the adjective “educational” since feelings about Ted N run deep and strong and I wish to attempt to retain a precarious ambivalence on the subject.  Or rather, I’m scared of flames.
[1] Please feel free to choose your own definition of “beginning” when we’re talking about the Internet and/or World Wide Web.
[2] And Handango’s “Partner Support”… don’t get me started.
[3] The reader is invited to pause at this point and consider the nature of the politician and the spin doctor.
[4] It’s enough hassle making sure that old Google links to parts of don’t break when the site structure is changed.  At least I try…
[5] It’s worth reading 🙂

There Can Be Only One

Cheap Linux/Python Trick #101

Those who run lots of little Python scripts on their various servers know how it is; inevitably, your script will die at some obscure time (only whilst in development, naturally) due to some unforeseen exception.  You want to be able to restart it from something like cron, but you also want to be sure only one version is running at any one time.  In other, simpler terms; you want your script to check whether another instance is running and to shut down if so.

There are, of course, many ways to do this.  Here’s a Linux way I do it, which has proven to be more reliable in my environment than others.  It’s also very simple.  Your mileage may vary.  Void where prohibited by law.  No deposit or return.  All goods sold as seen.  Etc.

import os
import sys
#Get the process id of this process
pid = os.getpid()
#Get the name of this script, as passed to the Python interpreter
myname = sys.argv[0]
#Run a ps command that includes the command lines of processes
ps = os.popen('ps --no-headers -elf|fgrep %s' % myname)
#Search for any processes that are running python and this script.  We check for
#python so that we don't bomb when, for example, someone's editing this script
#in vi.
for p in ps:
        if p.find('python')>-1 and p.find(myname)>-1:
                #If the pid of the process we've found is not the same as our pid,
                #then another instance is running.  With 'ps -elf' the pid is the
                #fourth column.
                otherpid = int(p.split()[3])
                if otherpid != pid: sys.exit()

Feel free to contribute your own way to achieve the same thing; I’m sure there are a hundred different better ways to do this.
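To get the ball rolling, here's one alternative: a lock file held with fcntl.flock.  This is a different technique from the ps approach above, not a drop-in equivalent, and it's only a sketch.  The function name and the lock path are made up for the example, and it assumes a Unix-like system where the fcntl module is available.

```python
import fcntl

def acquire_single_instance_lock(path):
    """Try to take an exclusive, non-blocking lock on the file at 'path'.
    Returns the open file object on success (keep a reference to it for
    the life of the script), or None if the lock is already held."""
    handle = open(path, 'w')
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError):
        # The lock is held elsewhere (most likely by another instance)
        handle.close()
        return None
    return handle
```

At the top of the script you'd call something like lock = acquire_single_instance_lock('/tmp/myscript.lock') and exit if it returns None.  The kernel drops the lock when the process dies, however it dies, so there's no stale lock file to clean up.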