At work at the moment I'm rewriting a parsing routine for a C codebase, which currently uses Lemon, a LALR(1) parser generator. Lemon's great for converting a grammar into a nice routine for your code to call, but I've found the grammar has quickly expanded to a near-unmaintainable state to cover all our use cases.

As part of an abandoned rewrite of the parser, I replaced the hand-coded tokenisation with the Flex lexical analyser, which I found I rather liked, even if its lack of Unicode support might leave me having to rewrite that part by hand at some point.

With Lemon out of the way, I realised that it might not be a bad idea to move the prototyping of my self-devised algorithms (later superseded by the rather nifty Shunting Yard algorithm) and the processing code from C into Python to speed up development. However, Flex doesn't provide any Python bindings.

Enter ctypes, a built-in Python module which allows direct access to DLLs and shared libraries. While I've used Cython in the past to allow Python to interact with C code, today I felt I could get going more quickly with ctypes' more direct interface.
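To get a feel for how direct that interface is, here's a two-line taster (assuming a Unix-ish system where ctypes.util.find_library can locate libc); nothing tokeniser-specific yet, just calling straight into the C library:

import ctypes, ctypes.util

# Load the system C library and call strlen() on it directly;
# ctypes converts the Python string to a char * for us.
libc = ctypes.CDLL(ctypes.util.find_library('c'))
print libc.strlen('hello tokeniser')   # prints 15
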

First, let's write the Flex file:

Caution: I'm using this code for prototyping only; I make no guarantees about thread safety, re-entrancy, memory management, or anything else.


tokeniser.l:
/* Stop at end of input rather than calling yywrap(), so we needn't link -lfl. */
%option noyywrap

%{
typedef int (*tokfunc)(char * data);

extern void * tokinit( tokfunc f, char * text ) {
    yyscan_t scanner;
    yylex_init_extra(f, &scanner);
    yy_scan_string(text, scanner);
    return scanner;
}
%}

%%

[0-9]+      { ((tokfunc)yyextra)(yytext); }
\*          { ((tokfunc)yyextra)("MULT"); }
\+          { ((tokfunc)yyextra)("ADD"); }

.           { }

%%


I'm using the standard calculator example here, not least because I intend to pass the output tokens to an implementation of the Shunting Yard algorithm for further processing.

Let's walk through what this file is up to:

%{
typedef int (*tokfunc)(char * data);

extern void * tokinit( tokfunc f, char * text ) {
    yyscan_t scanner;
    yylex_init_extra(f, &scanner);
    yy_scan_string(text, scanner);
    return scanner;
}
%}


This section is copied verbatim into the generated C file, and it defines a helper function for Python to use. As far as I can tell, ctypes isn't able to perform yylex_init_extra(f, &scanner); itself, since yyscan_t is an opaque type, so this function does it for us.

It initialises a scanner for the input text and gets it ready to go. The block also defines a tokfunc function type, and the f we hand to yylex_init_extra is stashed away by Flex and made available (as yyextra) in the tokenising rules' code blocks. This function will be our entry point back into Python.

%%

[0-9]+      { ((tokfunc)yyextra)(yytext); }
\*          { ((tokfunc)yyextra)("MULT"); }
\+          { ((tokfunc)yyextra)("ADD"); }

.           { }

%%


This is the main rules section of the tokeniser. The %% lines are its start and end markers; the items on the left are regular expressions (usually something I'm a bit icky about, but they make sense in this situation), while the code blocks on the right are what the scanner runs when it matches a regular expression.

You can see that for each token, we're making a function call to yyextra, which is where Flex hides the function that we handed it in yylex_init_extra. This is our entry point back into the Python code, as we'll see soon. It could be extended to any number of arguments depending on exactly what information you want to communicate back to the Python code; for our purposes, one string argument is sufficient.
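As a purely hypothetical illustration of that extension: if the C side widened the typedef to int (*tokfunc)(char * data, int len), the Python mirror (which we'll meet properly in tokeniser.py below) would just grow a matching parameter:

import ctypes

# Hypothetical two-argument callback type; c_char_p and c_int line up
# with the widened C typedef parameter-for-parameter.
TOKFUNC2 = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_char_p, ctypes.c_int)
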

The empty catch-all expression at the end prevents unmatched bits of the input from being printed to stdout, which is otherwise the default behaviour.

Okay, so far so good. Let's compile this beastie.

Makefile:
all: _tokeniser.so

_tokeniser.so: tokeniser.c
	gcc -fPIC -shared -o _tokeniser.so tokeniser.c

tokeniser.c:
	flex -R --outfile=tokeniser.c --header-file=tokeniser.h tokeniser.l


We use flex to convert tokeniser.l into C source code, then compile it with gcc using the -fPIC and -shared flags to produce a shared object that ctypes can work with. Since we already have a tokeniser.py, we can't name the shared object tokeniser.so, because Python would try to import it instead of the Python file, so we output to _tokeniser.so instead. Run make to do all the magic here.
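If you want a quick sanity check before going further: simply loading the shared object with ctypes will surface any missing symbols, and attribute access on the library handle does the dlsym lookup for tokinit:

import ctypes

# If the build is healthy, this loads ./_tokeniser.so and resolves tokinit.
lib = ctypes.CDLL('./_tokeniser.so')
print lib.tokinit   # something like <_FuncPtr object at 0x...>
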

Now we can write the real Python!

tokeniser.py:
import ctypes

import tokens # some basic classes for onward parsing

def tokenise(s):
    output = []

    TOKFUNC = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_char_p)
    def tokfunc(data):
        if data == 'MULT':
            output.append(tokens.Operator('*'))
        elif data == 'ADD':
            output.append(tokens.Operator('+'))
        else:
            output.append(tokens.Value(int(data)))
        return 0
    tokfunc_c = TOKFUNC(tokfunc)

    tokso = ctypes.CDLL('./_tokeniser.so')

    # tokinit() returns the opaque yyscan_t handle as a pointer; tell
    # ctypes, or it will assume int and truncate it on 64-bit platforms.
    tokso.tokinit.restype = ctypes.c_void_p
    tokso.yylex.argtypes = [ctypes.c_void_p]

    scanner = tokso.tokinit(tokfunc_c, s)
    tokso.yylex(scanner)

    return output

if __name__ == '__main__':
    print tokenise('1 * 2 + 3')


Now this one's a bit more complicated. First we import the ctypes module along with a separate module of mine which deals with the tokens the Shunting Yard algorithm will use.
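The tokens module itself isn't shown in this post; a minimal stand-in that reproduces the output below (this is a sketch, not the full module) needs nothing more than a pair of classes with a readable repr:

tokens.py:
class Token(object):
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return type(self) == type(other) and self.value == other.value

    def __repr__(self):
        # e.g. Value(1), Operator('*') -- matches the printed run below
        return '%s(%r)' % (type(self).__name__, self.value)

class Value(Token):
    pass

class Operator(Token):
    pass
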

At the bottom of the file is the code that will be invoked when we run python tokeniser.py -- in this case printing out the result of a simple test case.

Leaving the best until last, in the middle is the main tokenise() function, which creates and invokes a Flex scanner, then returns the tokens it's seen. It also creates the function that we'll be telling Flex to call. ctypes' syntax here is a bit messy, but essentially we define a function type whose return type (ctypes.c_int) and parameter (ctypes.c_char_p) match the tokfunc typedef in tokeniser.l, then wrap our own tokfunc in it so that ctypes knows how to convert the parameters and return value to and from C.
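If the callback dance looks unfamiliar, the classic standalone demonstration (it's the one in theller's ctypes tutorial, reference 2 below) is handing libc's qsort() a Python comparison function:

import ctypes, ctypes.util

# CFUNCTYPE builds a C function pointer type; wrapping a Python function
# in it lets C code call back into Python, exactly as our tokfunc does.
libc = ctypes.CDLL(ctypes.util.find_library('c'))
CMPFUNC = ctypes.CFUNCTYPE(ctypes.c_int,
                           ctypes.POINTER(ctypes.c_int),
                           ctypes.POINTER(ctypes.c_int))

def py_cmp(a, b):
    return a[0] - b[0]   # a and b are pointers; [0] dereferences

arr = (ctypes.c_int * 5)(5, 1, 4, 2, 3)
libc.qsort(arr, len(arr), ctypes.sizeof(ctypes.c_int), CMPFUNC(py_cmp))
print list(arr)   # [1, 2, 3, 4, 5]
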

We import our compiled Flex parser using ctypes.CDLL(), giving it a relative path so it knows exactly where to find it, and then call tokinit and yylex, the first of which we defined ourselves in tokeniser.l, and the second of which Flex gives us for free. We also tell ctypes that tokinit returns a pointer (ctypes.c_void_p) rather than the default int, so the yyscan_t handle isn't truncated on 64-bit platforms.

When yylex is run, Flex wibbles through the input and runs the code block for any tokens it finds. In our case, this means calling tokfunc. tokfunc itself has a look at what data our Flex rules gave it, and acts accordingly; in this case it creates an Operator or Value instance and appends it to an output array.

When yylex is finished, our output array will be fully populated, and we can return it!

$ rm -f tokeniser.c tokeniser.h && make
flex -R --outfile=tokeniser.c --header-file=tokeniser.h tokeniser.l
gcc -fPIC -shared -o _tokeniser.so tokeniser.c

$ python tokeniser.py
[Value(1), Operator('*'), Value(2), Operator('+'), Value(3)]
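
And since the whole point was to feed the Shunting Yard algorithm, here's a minimal sketch of the onward step (using the stand-in tokens classes from above; the filename and details here are illustrative, not the full implementation) that reorders the token list into postfix form:

shunting.py:
import tokens
from tokeniser import tokenise

PRECEDENCE = {'*': 2, '+': 1}

def to_rpn(toks):
    # Shunting Yard: values go straight to the output; an operator first
    # pops any stacked operators of equal or higher precedence.
    out, ops = [], []
    for tok in toks:
        if isinstance(tok, tokens.Value):
            out.append(tok)
        else:
            while ops and PRECEDENCE[ops[-1].value] >= PRECEDENCE[tok.value]:
                out.append(ops.pop())
            ops.append(tok)
    while ops:
        out.append(ops.pop())
    return out

print to_rpn(tokenise('1 * 2 + 3'))
# [Value(1), Value(2), Operator('*'), Value(3), Operator('+')]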



References:

1. Python library documentation for ctypes
2. ctypes tutorial by "theller"
3. Flex user guide

Aside: I don't think I managed to not accidentally type 'Flexo' (my IRC nickname) on any occasion that I tried to type 'Flex'. Including that time.