This post sat in my drafts folder for quite a while, and I suddenly felt the urge to publish it. A while back I wrote a DSL built on Python for some college-related work. The goals for this exercise included syntax similar to Python's and a runtime based on Python. So, here are a few notes on the experience.

Tokenizer

Arguably, one of the most prominent yet contentious points about Python's syntax is its whitespace sensitivity. What do you use? 1 tab, 2 spaces, 4 spaces, 2^16 spaces?! A sensible editor configuration usually shields you from these quirks, but the point stands: indentation styles are not standard – some prefer tabs, others spaces.

Python’s tokenize module is pretty handy for handling the indentation problem. It converts your source file into a stream of tokens, with two special tokens – INDENT and DEDENT – marking the starts and ends of indented blocks respectively. In my experience, these markers are equivalent to the block delimiters of other languages – usually braces ({, }).

Example

Let’s say we want to introduce a special language construct to swap the values held in two variables with a x <=> y infix operator.

Consider the following segment of code,

def swap(a, b):
    a <=> b

On tokenizing this piece of code – stored as a string in s,

import io
import tokenize

s = '\ndef swap(a, b):\n\ta <=> b\n'
for tok in tokenize.tokenize(io.BytesIO(s.encode('utf-8')).readline):
    print('%d,%d-%d,%d:\t%s\t%r' % (tok.start + tok.end + (tokenize.tok_name[tok.type], tok.string)))

We get (the initial ENCODING token and the NL for the blank leading line are trimmed for brevity),

2,0-2,3:    NAME    'def'
2,4-2,8:    NAME    'swap'
2,8-2,9:    OP  '('
2,9-2,10:   NAME    'a'
2,10-2,11:  OP  ','
2,12-2,13:  NAME    'b'
2,13-2,14:  OP  ')'
2,14-2,15:  OP  ':'
2,15-2,16:  NEWLINE '\n'
3,0-3,1:    INDENT  '\t'
3,1-3,2:    NAME    'a'
3,3-3,5:    OP  '<='
3,5-3,6:    OP  '>'
3,7-3,8:    NAME    'b'
3,8-3,9:    NEWLINE '\n'
4,0-4,0:    DEDENT  ''
4,0-4,0:    ENDMARKER   ''

As you can see, the module takes care of the whitespace problem, delimiting blocks clearly. Note that <=> comes out as the two tokens '<=' and '>', since the tokenizer greedily matches the longest operator Python already has. Additional benefits of using this module are that it:

  • Extracts string literals
  • Handles code comments
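With the token stream in hand, spotting the <=> construct is a matter of scanning for an OP '<=' immediately followed by an adjacent OP '>'. A minimal sketch – find_swap_ops is a helper name of my own, not from the original DSL:

```python
import io
import tokenize

def find_swap_ops(source):
    """Yield the (row, col) start of every `<=>` occurrence in source."""
    toks = list(tokenize.tokenize(io.BytesIO(source.encode('utf-8')).readline))
    for first, second in zip(toks, toks[1:]):
        # `<=` and `>` must be adjacent OP tokens, with no whitespace between
        if (first.type == tokenize.OP and first.string == '<='
                and second.type == tokenize.OP and second.string == '>'
                and first.end == second.start):
            yield first.start

positions = list(find_swap_ops('def swap(a, b):\n\ta <=> b\n'))
print(positions)  # [(2, 3)]
```

The adjacency check (first.end == second.start) matters: it rejects a stray `a <= > b`, which is two unrelated operators rather than our swap construct.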

Parsing

The parser simply generates the equivalent Python abstract syntax tree for each construct in my language. Though PEGs are powerful, I found that writing a custom recursive descent parser was far more convenient and understandable, and it helped in printing sensible error messages.
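For a flavour of the recursive descent approach, here is a toy parser that recognises just the swap statement. This is a sketch under names of my own choosing, not the original DSL's parser:

```python
import io
import tokenize

class Parser:
    """Toy recursive descent parser for `a <=> b` swap statements."""

    def __init__(self, source):
        toks = tokenize.tokenize(io.BytesIO(source.encode('utf-8')).readline)
        # Keep only the token types this sketch cares about
        self.toks = [t for t in toks if t.type in (tokenize.NAME, tokenize.OP)]
        self.pos = 0

    def expect(self, type_, string=None):
        """Consume the next token if it matches, else report a sensible error."""
        tok = self.toks[self.pos]
        if tok.type != type_ or (string is not None and tok.string != string):
            raise SyntaxError('expected {!r} at line {}, column {}'.format(
                string or tokenize.tok_name[type_], *tok.start))
        self.pos += 1
        return tok

    def parse_swap(self):
        left = self.expect(tokenize.NAME)
        self.expect(tokenize.OP, '<=')   # the tokenizer yields `<=>` as `<=`, `>`
        self.expect(tokenize.OP, '>')
        right = self.expect(tokenize.NAME)
        return ('swap', left.string, right.string)

stmt = Parser('a <=> b\n').parse_swap()
print(stmt)  # ('swap', 'a', 'b')
```

Each construct gets its own parse_* method, and the expect helper is where the parser's error messages come from – token positions are right there in the stream.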

The ast module is a handy tool for handling ASTs.

While it would be ideal to build the AST from the ground up, my strategy for generating it was a bit hack-ish: I used string interpolation to generate Python source code, which was then parsed with ast.parse.

For example, to generate code that swaps the values of two variables a and b:

import ast

src = '{0}, {1} = {1}, {0}'
module = ast.parse(src.format('a', 'b'))
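To sanity-check what ast.parse produced, ast.dump renders the tree in a readable form (stmt is my own name here):

```python
import ast

module = ast.parse('{0}, {1} = {1}, {0}'.format('a', 'b'))
stmt = module.body[0]

# The generated line parses to a single Assign of a tuple to a tuple
print(type(stmt).__name__)  # Assign
print(ast.dump(stmt))
```

The exact dump output varies between Python versions, but the shape – one Assign node with a Tuple target and a Tuple value – is what the swap relies on.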

eval is your friend

Now that you have your Python AST, what do you do with it? Well, you compile it into a code object that can be evaluated.

code = compile(module, '<generated-src>', 'exec')

Namespaces

By manipulating the appropriate namespace dictionaries, you can change the globals and locals of the executed code. This is useful for injecting required values and extracting execution results.

Example

Set up the dictionaries,

env_globals = {}
env_locals = {}

Inject parameters,

env_locals['a'] = 10
env_locals['b'] = 1

And execute.

eval(code, env_globals, env_locals)
print(env_locals)
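Putting it all together, the generate → parse → compile → evaluate pipeline fits in a few lines. run_swap is a name of my own for this sketch, not from the original DSL:

```python
import ast

def run_swap(var1, var2, env):
    """Generate, compile and execute Python that swaps two names in env."""
    src = '{0}, {1} = {1}, {0}'.format(var1, var2)
    module = ast.parse(src)
    code = compile(module, '<generated-src>', 'exec')
    eval(code, {}, env)  # top-level assignments land in the locals mapping

env = {'a': 10, 'b': 1}
run_swap('a', 'b', env)
print(env)  # {'a': 1, 'b': 10}
```

Because the compiled module code uses plain name stores, the swapped values end up back in the locals dictionary we passed in – which is exactly the extraction trick the Namespaces section relies on.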