Packing It All In

Packing with Python, unpacking with Java

Two of the projects I’m working on right now are heavily J2ME-based.  The memory constraints in this environment are downright scary at times; takes me back to the days of the Z80, when men were real men and a sixteen-bit register counted as wide-open spaces.  But enough of the nostalgia.  The point of this entry is to make Useful Notes on using the power and simplicity of Python to prepare data for unpacking in the restricted environment of a J2ME MIDlet.  In this context, I’m talking about packing data into some binary format that takes up minimal space in the MIDlet (to save download time and installation space) and can be unpacked on demand, requiring minimal memory at runtime.

Why mess around with two languages like this?  I see it as a ‘right tools for the job’ approach.  There’s no real alternative to J2ME for what I want to achieve on the handset, but Python easily beats Java when it comes to complex data processing, especially string manipulation.  Thus I use Python for the CPU-heavy task of building the data.

Let’s take the Java side first and see how we can read stuff from a binary asset:

//Load up the metadata file and open a DataInputStream on it
DataInputStream dis = new DataInputStream(this.getClass().getResourceAsStream("data.bin"));
//Now read some basic types from it
//(primitive byte rather than Byte: J2ME's Java has no autoboxing)
byte b = dis.readByte();
int i = dis.readUnsignedShort();
String s = dis.readUTF();

Obvious enough.  A DataInputStream expects the data in the file to conform to a more-or-less standard format.  Now let’s look at some Python code that can write the data in a compatible format so that Java can read it back.

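import struct
#Note that we open in binary mode - forget this on Windows and you'll get unexpected results.
f = open('data.bin','wb')
#Write a byte, an unsigned short and a string in UTF format.
#(42 and 1000 are placeholder values, chosen only for illustration.)
f.write(struct.pack('!b', 42))
f.write(struct.pack('!H', 1000))
s = 'This is my string, print me yours'
utf = s.encode('utf-8')
#First write the encoded length in bytes (as an unsigned short, which is what readUTF expects)
f.write(struct.pack('!H', len(utf)))
#Now write the string itself as UTF8; nothing here adapts it to Java's modified UTF8.
f.write(utf)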

You don’t have to make all the calls to struct.pack separately; they can be combined into a single format string, thus (see the struct module documentation for more details):

#This is equivalent to the above lines that write a byte and a ushort (42 and 1000 are placeholder values)
f.write(struct.pack('!bH', 42, 1000))

The leading ! on the format string tells struct to pack in network byte order, which is what’s expected by a DataInputStream (and may be different to the byte order of your machine).
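To see what the ! actually buys you, here is a small sketch (Python 3 assumed; the values -2 and 1000 are arbitrary examples of mine) that packs the same byte and unsigned short in network order and in little-endian order.

```python
import struct

# Network (big-endian) order: the layout DataInputStream reads.
big = struct.pack('!bH', -2, 1000)

# Little-endian order, shown only for comparison.
little = struct.pack('<bH', -2, 1000)

# 1000 is 0x03E8, so only the order of the last two bytes differs.
print(big.hex())     # fe03e8
print(little.hex())  # fee803

# With an explicit byte-order prefix struct adds no alignment padding,
# so one byte plus one unsigned short is exactly three bytes.
print(struct.calcsize('!bH'))  # 3
```

Without a prefix, struct falls back to the machine's native order and alignment, which may insert padding between the two fields. That is one more reason to always spell the ! out.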
Writing the strings is a little more complex.  They must be preceded by a two-byte unsigned length (the number of encoded bytes, not characters) and are supposed to be in a modified form of UTF8 that guarantees there will never be an embedded NUL (zero) byte: u'\u0000' is written as the two bytes 0xC0 0x80, and characters outside the Basic Multilingual Plane are written as surrogate pairs.  I’ve omitted any handling for this in the example above, since text strings don’t tend to contain u'\u0000'.  Depending on your application, you may want to check for them and do an appropriate replace before converting the string to UTF8.
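If you do need the full modified encoding, something along these lines should do it. This is a minimal sketch, assuming Python 3; the function name java_utf is my own invention, not part of any library, and I have not run it against every corner case on a real handset.

```python
import struct

def java_utf(s):
    """Encode s as a length-prefixed modified UTF8 record for readUTF.

    Hypothetical helper: the name and structure are illustrative only.
    """
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp == 0:
            # NUL is written as two bytes so no zero byte appears in the data
            out += b'\xc0\x80'
        elif cp < 0x10000:
            # Everything else in the BMP is ordinary UTF8
            out += ch.encode('utf-8')
        else:
            # Supplementary characters become a surrogate pair,
            # each half written as a three-byte sequence
            cp -= 0x10000
            for half in (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)):
                out += bytes([0xE0 | (half >> 12),
                              0x80 | ((half >> 6) & 0x3F),
                              0x80 | (half & 0x3F)])
    if len(out) > 0xFFFF:
        raise ValueError('encoded string too long for a two-byte length')
    # Two-byte big-endian byte count, then the encoded bytes
    return struct.pack('!H', len(out)) + bytes(out)
```

Usage is then a one-liner per string, for example f.write(java_utf(s)) on a file opened in binary mode. For plain ASCII text the result is the same as the simple length-plus-UTF8 approach.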

Storing strings like this is fine, but may take up more room than is needed if all you’re dealing with is 7-bit ASCII.  For one project, I borrowed an old WordStar trick.  If you ever dumped a WordStar file, you’d notice that there were no spaces in the text.  Instead, the top bit of every character that preceded a space was set[0].  Here’s Python code that writes out a series of ASCII strings, separating them by setting the top bit of the last character of each string.

#Assume we have an array of strings called 'stringsToWrite', and an open file 'f'
#Write the number of strings first as an unsigned short
f.write(struct.pack('!H', len(stringsToWrite)))
for s in stringsToWrite:
    #Each string is stored as a sequence of ASCII bytes; the final one has
    #the highest bit set.
    #This code expanded for simplicity - there are tighter ways to write it.
    if not s:
        #Empty strings saved as a single NUL byte
        f.write(struct.pack('!B', 0))
    else:
        asc = s.encode('ascii')
        ln = len(asc)
        for i in range(ln):
            #Clear the top bit (the slice keeps ord() happy on Python 2 and 3)
            b = ord(asc[i:i+1]) & 0x7f
            if i == (ln-1):
                #last char, so force top bit to be set
                b = b | 0x80
            f.write(struct.pack('!B', b))

And here’s the Java that reads them back in.

//Read the number of strings and dimension the array.
int count = dis.readUnsignedShort();
String strings[] = new String[count];

//Choose an appropriate length for the StringBuffer
StringBuffer sb = new StringBuffer(8);
byte b;
//This code is explicit for simplicity - there are tighter ways to write this.
for(int i=0; i<count; i++) {
    do {
        b = dis.readByte();
        //Don't append NUL bytes - these mean empty strings.
        if (b!=0) {
            sb.append((char)(b & 0x7f));
        }
    //Stop at a NUL (empty string) or at a byte with the top bit set (last char)
    } while (b != 0 && (b & 0x80) == 0);
    strings[i] = sb.toString();
    sb.setLength(0); //clear buffer for next string
}
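
Before the file goes anywhere near a handset, it is handy to check it from Python. The sketch below (Python 3 assumed; pack_strings and unpack_strings are names I made up for illustration) packs a list in the top-bit format and reads it back with the same logic as the Java reader. It assumes that a NUL byte stands for one complete empty string and that no string contains an embedded NUL character.

```python
import io
import struct

def pack_strings(strings):
    """Pack ASCII strings: a count, then bytes with the top bit marking each end."""
    out = bytearray(struct.pack('!H', len(strings)))
    for s in strings:
        asc = bytearray(s.encode('ascii'))
        if not asc:
            out.append(0)       # empty string: a single NUL byte
        else:
            asc[-1] |= 0x80     # set the top bit on the last character
            out += asc
    return bytes(out)

def unpack_strings(data):
    """Mirror of the Java reader, for checking packed data."""
    stream = io.BytesIO(data)
    (count,) = struct.unpack('!H', stream.read(2))
    strings = []
    for _ in range(count):
        chars = []
        while True:
            b = stream.read(1)[0]
            if b == 0:
                break           # NUL: an empty string
            chars.append(chr(b & 0x7f))
            if b & 0x80:
                break           # top bit set: last character of this string
        strings.append(''.join(chars))
    return strings

words = ['alpha', '', 'be']
packed = pack_strings(words)
assert unpack_strings(packed) == words
```

Each non-empty string costs exactly its own length in bytes, against length plus two for the readUTF format, which adds up quickly when there are thousands of short strings.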

[0] You can find the true programmers in a crowd by asking them all to say what the most common character is in English text.  Non-programmers will say “e”.  Programmers will say “space”. 🙂

