1 The Subset of Python
2 Syntactic Analysis – Parsing
3 Python Code Trees
4 Scheme Code to Produce
5 The Compiler Rules
6 The Compiler (and a Compiler Framework)
7 Do It!
Version: 1.0

Code is Data – A Python to Scheme Compiler for a Subset of Python

The night before a 5 minute (!) talk on easily capturing the modern approach to program transformation in Scheme, I spent a half hour or so writing a couple dozen lines of Scheme to compile the subset of Python used in the first few Python programs of the introductory programming course CSC 108 to Scheme.

All the code for the compiler is here, in fact the entire semantic analysis and code generation program (i.e. everything after the syntactic parsing phase) is displayed in The Compiler (and a Compiler Framework) – take a quick look now! The reaction from one student taking the traditional Compilers and Interpreters course was:

“I loved how you took a Python program that converted Python into S-Expressions, then were able to run that same program in Scheme. That kind of stuff is never even considered in [any] course in this school.”

Now for the details read on.

1 The Subset of Python

The first CSC 108 example Python program to compile contained number literals, arithmetic, and variable initialization and access:
  students = 132
  seats = 196
  empty_seats = seats - students
  lab_capacity = 32
  labs_needed = students / lab_capacity

The second introduced function definition with a parameter, return, function call, and the print statement:
  def f(x):
  return x ** 2
  
  def g(x):
  return x + 5
  
  number = f(3)
  print number

The third introduced multi-branch conditional:
  my_average = 87.5 / 112 * 100
  
  print "My average on the BIO test was", my_average
  
  if my_average >= 90:
      print "Awesome!"
  elif my_average >= 70:
      print "Doing fine"
  elif my_average >= 50:
      print "So so"
  elif my_average < 50:
      print "Uh oh!!"

This was enough for a serious (but still 5 minute) demonstration.

2 Syntactic Analysis – Parsing

Parsing is a necessary but relatively uninteresting step in modern code analysis and transformation, because any reasonable parser produces a good enough representative syntax tree to work with. Modern computers are fast enough, with orders of magnitude more memory than source code size, that if different forms of the tree are useful one simply post-processes the initial tree.

So a simple Python program (py2sexp.py – about a page long) suffices, calling Python’s parser to read python source code text and produce a tree object, then traversing the tree writing it out in Scheme’s literal notation for tree-structured data (S-Expressions). The details of this are not important for the rest of the discussion, nor is diving into Python’s parser to understand the form of the trees it produces – instead it’s simpler to examine the output when run on the sample programs.

3 Python Code Trees

The parsing program turns the third sample program into the following data (in S-Expression notation, copied and pasted here and named p3 here for use in Do It!).
  (define p3
    '(Module
      (Stmt
       (Assign
        (line 1)
        (AssName (line 1) "my_average" "OP_ASSIGN")
        (Mul
         (line 1)
         (Div (line 1) (Const (line 1) 87.5) (Const (line 1) 112))
         (Const (line 1) 100)))
       (Printnl
        (line 2)
        (Const (line 2) "My average on the BIO test was")
        (Name (line 2) "my_average"))
       (If
        (line 3)
        (Compare (line 3) (Name (line 3) "my_average") ">=" (Const (line 3) 90))
        (Stmt (Printnl (line 4) (Const (line 4) "Awesome!")))
        (Compare (line 5) (Name (line 5) "my_average") ">=" (Const (line 5) 70))
        (Stmt (Printnl (line 6) (Const (line 6) "Doing fine")))
        (Compare (line 7) (Name (line 7) "my_average") ">=" (Const (line 7) 50))
        (Stmt (Printnl (line 8) (Const (line 8) "So so")))
        (Compare (line 9) (Name (line 9) "my_average") "<" (Const (line 9) 50))
        (Stmt (Printnl (line 10) (Const (line 10) "Uh oh!!")))))))

PLT Scheme forms and functions, e.g. define, are linked to the online PLT Scheme Reference. In many cases, there will be a link in the margin there referring to a discussion in the PLT Scheme Guide, which newcomers to Scheme should look at instead.

I studied this (and the results for the first and second program) to understand how Python’s native parser represents Python code. For example,

  (Assign
   (line 1)
   (AssName (line 1) "my_average" "OP_ASSIGN")
   (Mul (line 1)
        (Div (line 1) (Const (line 1) 87.5) (Const (line 1) 112))
        (Const (line 1) 100)))
shows that the initialization of a variable is represented as an Assign tree with three children:
  • a line number child

  • an AssName (their name for it – not mine) subtree

  • an expression subtree for the value to assign

Similarly, the Assign subtree has a structure, but rather than writing it out in words it can be summarized as
  (AssName (line <number>) <variable-name-string> "OP_ASSIGN")
where the “<...>” naming is a common convention from the Syntactic Theory of Languages (e.g. seen in the parsing component of CSC 324) indicating that “...” does not appear literally, i.e. is not a fixed name in the pattern – it varies.

In this notation, a Python initialization statement is
  (Assign
   (line <n1>)
   (AssName (line <n2>) <variable-name-string> "OP_ASSIGN")
   <value-expression>)

4 Scheme Code to Produce

The Scheme statement to produce for a Python initialization statement is
  (define <variable-name-symbol> <compiled-value-expression>)
where the <variable-name-symbol> symbol is the result of turning the string <variable-name-string> into a symbol by calling Scheme’s string->symbol:
  (string->symbol <variable-name-string>)
The <compiled-value-expression> is the result of compiling <value-expression>.

The compiler contains this translation rule as
  (`(Assign ,_
            (AssName ,_ ,<variable-name-string> "OP_ASSIGN")
            ,<expression-value>)
   `(define
      ,(string->symbol <variable-name-string>)
      ,<expression-value>))

The “<...>” convention could be made understandable to Scheme with a bit more Scheme code outside the scope of this discussion

where the back-tick (“`”) will mean quasiquote, indicating to Scheme a partially literal expression but where comma (upside-down back-tick) means unquote to return to a computed value. The underscore (“_”) indicates an unused value (that can be different for each occurrence of _).

5 The Compiler Rules

Most of the rules for the other types of Python statements used in the three sample programs are as easy or easier. The two more elaborate ones are not much harder, and the difficulty is in understanding the meaning of compiling the statement from Python to Scheme, not in writing it down. Scheme makes the coding part easy if you know exactly what the application is supposed to do (and if you don’t, then it is hard to code in any language).

Unfortunately, many programmers don’t have a proper specification (were not given one by the application designers, or did not make one if also in charge of design), and instead work backwards writing a program that ends up implicitly (but quite obscurely when read as such) being the only precise specification. And that “specification” contains any unintended behaviour of the code.

Here is the compiler specification (assuming the explanation of the notation, e.g. “,”, what to do with it – see the discussion of rewrite – but not why or that it compiles Python to Scheme):
  (`(Module ,<s>) <s>)
  (`(Const ,_ ,<c>)
   (if (or (number? <c>) (string? <c>))
       <c>
       (error <c>)))
  (`(Compare ,_ ,<ls> ,<op> ,<rs>)
   `(,(string->symbol <op>) ,<ls> ,<rs>))
  (`(Assign ,_ (AssName ,_ ,<var> "OP_ASSIGN") ,<val>)
   `(define ,(string->symbol <var>) ,<val>))
  (`(Name ,_ ,<n>) (string->symbol <n>))
  (`(Stmt . ,<s>) `(begin . ,<s>))
  (`(Sub ,_ ,<e1> ,<e2>) `(- ,<e1> ,<e2>))
  (`(Div ,_ ,<e1> ,<e2>) `(/ ,<e1> ,<e2>))
  (`(Mul ,_ ,<e1> ,<e2>) `(* ,<e1> ,<e2>))
  (`(Add ,_ ,<e1> ,<e2>) `(+ ,<e1> ,<e2>))
  (`(Power ,_ ,<e1> ,<e2>) `(expt ,<e1> ,<e2>))
  (`(Return ,_ ,<e>) <e>)
  (`(CallFunc ,_ ,<f> ,<a>) `(,<f> ,<a>))
  (`(Function ,_ ,<f> (,<v>) ,<s>)
   `(define (,(string->symbol <f>) ,(string->symbol <v>)) ,<s>))
  (`(Printnl ,_ . ,<es>)
   `(begin (for-each (λ (e) (display e) (display " "))
                     (list . ,<es>))
           (newline)))
  (`(If ,_ ,<c> ,<then> . ,<else>)
   `(if ,<c> ,<then> ,(if (not (empty? <else>))
                          `(If _ . ,<else>)
                          `(void))))

Can you write a better specification?

It has to be one that someone who doesn’t know what compiling Python to Scheme means (the computer doesn’t) can be shown to follow blindly (the computer does) and get the correct result.

6 The Compiler (and a Compiler Framework)

Making the rules run is straightforward in Scheme, building on a common pattern matching library and making a function rewrite that captures the idea of “do something everywhere in some structured data until nothing changes”.

Wrap the rules in match-lambda which turns them into a function: trying each of them in turn:
  (define python->scheme-rules
    (match-lambda
      (`(Module ,<s>) <s>)
      (`(Const ,_ ,<c>)
       (if (or (number? <c>) (string? <c>))
           <c>
           (error <c>)))
      (`(Compare ,_ ,<ls> ,<op> ,<rs>)
       `(,(string->symbol <op>) ,<ls> ,<rs>))
      (`(Assign ,_ (AssName ,_ ,<var> "OP_ASSIGN") ,<val>)
       `(define ,(string->symbol <var>) ,<val>))
      (`(Name ,_ ,<n>) (string->symbol <n>))
      (`(Stmt . ,<s>) `(begin . ,<s>))
      (`(Sub ,_ ,<e1> ,<e2>) `(- ,<e1> ,<e2>))
      (`(Div ,_ ,<e1> ,<e2>) `(/ ,<e1> ,<e2>))
      (`(Mul ,_ ,<e1> ,<e2>) `(* ,<e1> ,<e2>))
      (`(Add ,_ ,<e1> ,<e2>) `(+ ,<e1> ,<e2>))
      (`(Power ,_ ,<e1> ,<e2>) `(expt ,<e1> ,<e2>))
      (`(Return ,_ ,<e>) <e>)
      (`(CallFunc ,_ ,<f> ,<a>) `(,<f> ,<a>))
      (`(Function ,_ ,<f> (,<v>) ,<s>)
       `(define (,(string->symbol <f>) ,(string->symbol <v>)) ,<s>))
      (`(Printnl ,_ . ,<es>)
       `(begin (for-each (λ (e) (display e) (display " "))
                         (list . ,<es>))
               (newline)))
      (`(If ,_ ,<c> ,<then> . ,<else>)
       `(if ,<c> ,<then> ,(if (not (empty? <else>))
                              `(If _ . ,<else>)
                              `(void))))
      (<e> <e>)))
A rule to return the value if none of the rules match was also added, for rewrite to use the rules properly.

The rewrite function takes a "rule" function and some data and recurses through the data calling the rule to replace parts of the data, again and again until nothing more can change. It’s a standard bit of functional programming (made a bit simpler by my own library’s define-recursive and if-differ capturing two concepts common in my code) Just note how concise it is for such a useful concept.

Can you add define-recursive and if-differ to your library in your favourite language? In most languages, which only allow defining call-by-value operations, the answer is no. For example, the obvious implementation of if-differ as a call-by-value function usually behaves incorrectly when the argument expressions have side-effects.

  (define-recursive (rewrite rule s) (on s)
    (define with-subparts-rewritten
      (if (list? s) (map rewrite s) s))
    (define with-also-rule-self
      (rule with-subparts-rewritten))
    (if-differ with-subparts-rewritten with-also-rule-self
               (rewrite with-also-rule-self)))

The Stratego language was built around variations of this idea (called "Term Rewriting"). Not surprisingly, Stratego is not used much because it has few libraries for the breadth of tasks one wants to accomplish in most modern applications, especially when best programmed in other styles.

One should not make new languages: instead make a language expressive enough that more concepts can be written in the language and put into a library to be used exactly where desired.

7 Do It!

To compile the third program:
  > (define p3-as-scheme (rewrite python->scheme-rules p3))

To see it (remember it was written by a compiler, to be fed to a Scheme compiler or interpreter, not read by humans):
  > (pretty-print p3-as-scheme)
Rather than show you the code displayed as output, here it is copied and pasted back in as input and run:
  > (begin
      (define my_average (* (/ 87.5 112) 100))
      (begin
        (for-each
         (λ (e) (display e) (display " "))
         (list
          "My average on the BIO test was"
          my_average))
        (newline))
      (if (>= my_average 90)
        (begin
          (begin
            (for-each
             (λ (e) (display e) (display " "))
             (list "Awesome!"))
            (newline)))
        (if (>= my_average 70)
          (begin
            (begin
              (for-each
               (λ (e) (display e) (display " "))
               (list "Doing fine"))
              (newline)))
          (if (>= my_average 50)
            (begin
              (begin
                (for-each
                 (λ (e) (display e) (display " "))
                 (list "So so"))
                (newline)))
            (if (< my_average 50)
              (begin
                (begin
                  (for-each
                   (λ (e) (display e) (display " "))
                   (list "Uh oh!!"))
                  (newline)))
              (void))))))
The output is as expected:
  My average on the BIO test was 78.125
  Doing fine

The compiled-to-Scheme code could of course be saved to a file to run later, or run in-place without copy-paste by reflection with eval:
  > (eval p3-as-scheme
          ; run as if run on its own
          (make-base-namespace))
  My average on the BIO test was 78.125
  Doing fine