DSL in Python
This post sat in my drafts folder for quite a while, and I suddenly felt the urge to publish it. A while back, I wrote a DSL built on Python for some college-related work. The goals for this exercise included syntax similarity to Python and a runtime based on Python. So, here are a few notes on the experience.
Tokenizer
Arguably, one of the most prominent yet contentious points about Python's syntax is its whitespace sensitivity. What do you use? 1 tab, 2 spaces, 4 spaces, 2^16 spaces!? A sensible editor configuration usually shields you from these quirks, but the point is that indentation styles are not standard: some prefer tabs, others spaces.
Python's tokenize module is pretty handy for dealing with the indentation problem. It converts your source file into a stream of tokens, with two special tokens, INDENT and DEDENT, marking the starts and ends of indented blocks respectively. In my experience, these markers are equivalent to the block delimiters in other languages, usually braces ({, }).
Example
Let's say we want to introduce a special language construct to swap the values held in two variables with an x <=> y infix operator. Consider the following segment of code,
def swap(a, b):
a <=> b
On tokenizing this piece of code, stored as a string in s,

import tokenize
import io

b = io.BytesIO(s)
tokenize.tokenize(b.readline)

(This is Python 2, where tokenize.tokenize prints the tokens directly; on Python 3 it wants a bytes stream and returns an iterator of TokenInfo tuples that you would loop over and print.)
We get,
2,0-2,2: NAME 'def'
2,3-2,7: NAME 'swap'
2,7-2,8: OP '('
2,8-2,9: NAME 'a'
2,9-2,10: OP ','
2,11-2,12: NAME 'b'
2,12-2,13: OP ')'
2,13-2,14: OP ':'
2,14-2,15: NEWLINE '\n'
3,0-3,1: INDENT '\t'
3,1-3,2: NAME 'a'
3,3-3,4: OP '<'
3,4-3,5: OP '='
3,5-3,6: OP '>'
3,7-3,8: NAME 'b'
3,8-3,9: NEWLINE '\n'
4,0-4,0: DEDENT ''
4,0-4,0: ENDMARKER ''
As you can see, the module takes care of the whitespace problem, delimiting blocks clearly. Additional benefits of using this module are that it:
- Extracts string literals
- Handles code comments
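To give a feel for how a DSL pass can consume this token stream, here is a minimal sketch (Python 3 syntax; the helper name find_swap_ops is mine, not from the original project). It spots the <=> operator by merging adjacent OP tokens, since depending on the Python version the three characters may arrive as '<', '=', '>' or greedily as '<=', '>':

```python
import io
import tokenize

SWAP_OP = "<=>"

def find_swap_ops(source):
    """Return (row, col) start positions of each `<=>` in the token stream."""
    toks = list(tokenize.tokenize(io.BytesIO(source.encode()).readline))
    hits = []
    i = 0
    while i < len(toks):
        if toks[i].type == tokenize.OP:
            # Collect a run of OP tokens with no whitespace gaps between them.
            j = i
            merged = toks[i].string
            while (j + 1 < len(toks) and toks[j + 1].type == tokenize.OP
                   and toks[j].end == toks[j + 1].start):
                j += 1
                merged += toks[j].string
            if merged == SWAP_OP:
                hits.append(toks[i].start)
            i = j + 1
        else:
            i += 1
    return hits

print(find_swap_ops("def swap(a, b):\n    a <=> b\n"))  # [(2, 6)]
```

A nice property of working at the token level is that string literals and comments containing "<=>" never trigger a false match, since they come through as STRING and COMMENT tokens, not OP.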
Parsing
The parser simply generates an equivalent Python Abstract Syntax Tree for each construct in my language. Though PEGs are powerful, I found writing a custom recursive descent parser way more convenient and understandable, and it helped in printing sensible error messages.
The ast module is a handy tool for handling ASTs. While it would be ideal to build an AST from the ground up, my strategy for generating one was a bit hackish: I used string interpolation to generate Python source code that was then run through ast.parse. For example, to generate code that swaps the values in two variables a and b,
import ast

src = '''
{0}, {1} = {1}, {0}
'''
module = ast.parse(src.format('a', 'b'))
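One way to convince yourself the interpolation produced the tree you wanted is to poke at it with the ast module itself (a quick sketch, not part of the original project):

```python
import ast

src = '''
{0}, {1} = {1}, {0}
'''
module = ast.parse(src.format('a', 'b'))

# The generated module holds a single Assign node: a tuple of Name
# targets unpacking a tuple of Name values.
assign = module.body[0]
print(type(assign).__name__)                   # Assign
print([t.id for t in assign.targets[0].elts])  # ['a', 'b']
print([v.id for v in assign.value.elts])       # ['b', 'a']
```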
eval is your friend
Now that you have your Python AST, what do you do with it? Well, you compile it to generate a code object that can be evaluated.
code = compile(module, '<generated-src>', 'exec')
Namespaces
By manipulating the appropriate namespace dictionaries, you can change the globals and the locals of the executed code. This is useful for injecting required values and extracting execution results.
Example
Set up the dictionaries,
env_globals = {}
env_locals = {}
Inject parameters,
env_locals['a'] = 10
env_locals['b'] = 1
And execute.
eval(code, env_globals, env_locals)
print(env_locals)
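Putting the pieces together, here is a hedged end-to-end sketch of the whole pipeline in Python 3 syntax. The translate_swap helper is an illustrative stand-in for the real parser, using a naive line-based rewrite rather than the token stream:

```python
import ast

def translate_swap(source):
    """Rewrite every `x <=> y` statement into `x, y = y, x` (sketch only)."""
    lines = []
    for line in source.splitlines():
        if "<=>" in line:
            lhs, rhs = (part.strip() for part in line.split("<=>"))
            indent = line[: len(line) - len(line.lstrip())]
            lines.append("{0}{1}, {2} = {2}, {1}".format(indent, lhs, rhs))
        else:
            lines.append(line)
    return "\n".join(lines)

# Translate the DSL source, parse it, and compile the resulting AST.
module = ast.parse(translate_swap("a <=> b"))
code = compile(module, "<generated-src>", "exec")

# Inject values through the locals namespace, execute, and read back.
env_globals = {}
env_locals = {"a": 10, "b": 1}
eval(code, env_globals, env_locals)
print(env_locals)  # {'a': 1, 'b': 10}
```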