More Grimacing

Following on from my previous post about grimace, a Python package that allows generation of regular expressions using fluent syntax, I’ve added some more tricks.

In the previous post, I showed how one can use grimace to write regular expressions in this form:


RE().any_number_of().digits().followed_by().dot().then().at_least_one().digit()

All good, but do we really need to invoke each method and have all those parentheses cluttering up the line? Because grimace is written using functional principles, each method returns a new RE object, based on the previous one but modified. Thus in theory we must call a method because we need to do some work, not just return an attribute.

Python descriptors will save us. A descriptor is a class that defines one or more of the __get__ , __set__ or __delete__ methods. Like properties in C#, they allow code to be executed when a descriptor is accessed as through it were an attribute.

Previously in grimace, we had methods like:

def start(self):
    """Return a new RE with the start anchor '^' appended to the element list"""
    return RE(self, '^')

Now we can define an Extender class and use it to rewrite start().

class Extender(object):
    """An Extender is a descriptor intended for use with RE objects, whose __get__ method returns a
    new RE based on the RE on which it was invoked, but with the elements extended with a given
    element, set when the Extender is initialized. It exists so that methods like start() and end()
    can be invoked as attributes or methods."""
    def __init__(self, element=None):
        self.element = element

    def __get__(self, instance, owner):
        if isinstance(instance, RE):
            if self.element is not None:
                return RE(instance, self.element)
            else:
                return RE(instance)
        return None

#... and here's how start() is now defined in the RE class...
class RE(object):
    start = Extender('^')

So far so good. We can now use grimace as in this example:

RE().any_number_of.digits.followed_by.dot.then.at_least_one.digit

But what happens if a poor programmer gets confused and invokes digits or dot as methods? We can fix this easily. The result of getting digits or dot or any other Extender is a new RE object, so executing digits() will get that new RE and then try to call it. All we need to do is make a RE instance callable, by adding this __call__ method:


def __call__(self): return self

That’s all we need. It means that the execution of digit() becomes get the digit attribute, which is a new RE instance, then call it, which returns itself.

There is one more nicely functional trick. We can use an Extender to append more than just strings; we can also use it to append special objects, like the grimace Repeater that defines how many repetitions of a character are required. Because all objects in grimace are functional, they’re immutable, and that means any instance may be shared between any number of threads or other bits of code. So I can write the not_a method (which adds a Not instance to the regular expression, to invert the meaning of the next match) as:


not_a = Extender(Not())

That instance of Not() is then shared between every RE() object that uses it. Functional programming is an excellent tool.

(The documentation for grimace is now in the github wiki page)

Expressing Oneself, Fluently

I use regular expressions[1] in Python often enough that I know many of the character classes and syntax tricks.

I use regular expressions in Python seldom enough that I forget many of the character classes and syntax tricks.

This is annoying, but I have lived with it. Then I came across a little article in Dr Dobb’s Journal (which used to be great, many years ago, but is now a mere ghost of its former glory), in which Al Williams writes about (ab)using the C/C++ compiler to allow regular expressions to be written as:

start + space + zero_or_more + any_of("ABC") + literal(":") + group(digit + one_or_more)

rather than

^\s*[ABC]:(\d+)

I quite like the idea of fluent syntax (though I appreciate that it’s not necessarily so appealing if your native language isn’t English), but I spend marginally more time writing Python than C++ these days. Also, I like the idea of trying to build a fluent interface in a functional way. So, I started out by writing some examples of what I would want to be able to do:

start().end() would give the minimal empty string regex "^$". Easy enough.

Or how about

any_number_of().digits().followed_by().dot().then().at_least_one().digit()

You get the general idea: the fluent syntax describes the expression and results in a string that matches what it describes. One really good thing is that it avoids the “backslash plague” that can confuse those new to writing regular expressions in Python.

This now exists, and is on github at https://github.com/benlast/grimace and it will do the above and more. Time for some more intense code examples:

The grimace.RE object is our starting point; any method we call on it returns a new RE object. Let’s get the regex to match an empty string.

>>> from grimace import RE
>>> print RE().start().end().as_string()
^$

The as_string() call turns the generated expression into a string that can then be used as the argument to the standard Python re module. There’s also as_re() which will compile the regular expression for you and return the resulting pattern-matching object.


>>> #Extract the extension of a short DOS/FAT32 filename, using a group to capture that part of the string
>>> regex = RE().start().up_to(8).alphanumerics().dot().group(name="ext").up_to(3).alphanumerics().end_group().end().as_string()
>>> print regex
^\w{0,8}\.(?P\w{0,3})$
>>> #Use the re module to compile the expression and try a match
>>> import re
>>> pattern = re.compile(regex)
>>> pattern.match("abcd.exe").group("ext")
'exe'
>>>
>>> #We can do that even more fluently...
>>> RE().start().group(name="filename").up_to(8).alphanumerics().end_group() \
... .dot().group(name="ext").up_to(3).alphanumerics().end_group() \
... .end().as_re().match("xyz123.doc").groups()
('xyz123', 'doc')('xyz123', 'doc')
>>>
>>> #The cool example that I wrote out as a use case
>>> print RE().any_number_of().digits().followed_by().dot().then().at_least_one().digit().as_string()
\d*\.\d+
>>>

You can also split a complex regex over several lines:

# My python module
from grimace import RE

def is_legal_number(number):
    #Match a US/Canadian phone number - we put the RE() stuff in parentheses so that we don't
    #have to escape the ends of lines
    north_american_number_re = (RE().start()
      .literal('(').followed_by().exactly(3).digits().then().literal(')')
      .then().one().literal("-").then().exactly(3).digits()
      .then().one().dash().followed_by().exactly(4).digits().then().end()
      .as_string())
    number_re = re.compile(north_american_number_re)
    return number_re.match("(123)-456-7890") is not None

There is more to do: control over greedy matching, and ways to express some of the more complex tricks like backreferences and multiple matching subexpressions. And I’ll also package it properly for installation via pip. But for now, it’s available and it works.

(The documentation for grimace is now in the github wiki page)

[1] I can’t write any article on regular expressions without quoting Jamie Zawinski: Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Memento Mori

A discussion at work took place about design patterns, and it prompted me to go back and look at something. One of the behavioural patterns is Memento, which (from Wikipedia) “provides the ability to restore an object to its previous state (undo)”. The implementation described uses three objects: a caretaker, an originator and a memento object that the caretaker can use to restore the originator to a previous state.

From a functional programming viewpoint, I find this overly complex. Consider a system in which an object follows the functional principle of immutability. To “change state”, a caller calls a method on that object, but the original object is unchanged: the method returns a new object with a different state. There’s then no need to have the complexity of undo – the previous state is available via the original instance.

Of course, this is already the case with immutable strings in many languages. Consider this C#:

    string s1="This String is Mixed Case";
    string s2=s1.ToLower();
    //Oops - we need to undo the ToLower() operation.
    s2=s1;  //This is effectively an "undo"

The only principle to keep in mind is to save a reference to a previous version of the object if there is a potential need to undo it.

Benefits: huge saving of complexity, and therefore of testing. I find that the testing load is often forgotten when developers assess how much work is involved in an implementation, unless TDD is in the blood of the team.
Downside: potential extra use of memory to keep hold of previous objects. With intelligent sharing of expensive resources between objects, this can be minimal.