Some Thoughts On Machine Code Organization

Charles Eric LaForest, July 2010

A post on ProgrammingArt got me thinking about the fundamental organization of code. I don't expect to come up with anything new here, but I wanted to work it out. In the end, it clarified my understanding of the relationship between recursive, iterative, and threaded machine code a little.

For convenience, I'll just lift the problem statement (Ex. 1.3 from SICP) from the original post:

Define a procedure that takes three numbers as arguments and returns the sum of the squares of the two larger numbers.

Here is the first version of the code given, the iterative one:

(define (sum-square-largest x y z)
  (cond ((and (> y x) (> z x)) ;; y and z are largest
         (+ (* y y) (* z z)))
        ((and (> x y) (> z y)) ;; x and z are largest
         (+ (* x x) (* z z)))
        ((and (> x z) (> y z)) ;; x and y are largest
         (+ (* x x) (* y y)))))

And here is the recursive version:

(define (sum-square-largest x y z)
  (cond ((and (> y x) (> z x)) ;; y and z are largest
         (+ (* y y) (* z z)))
        (else (sum-square-largest y z x))))

Neither version is necessarily a good solution (both will fail if all the parameters are equal, for example), but they perform the same computation, which is what matters here.
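To make the failure mode concrete, here is a transliteration of the iterative version into Python (my substitution; the original is Scheme), along with a variant where one >= per clause is enough to handle the ties:

```python
def sum_square_largest(x, y, z):
    # Mirror of the Scheme cond: strict > comparisons everywhere.
    if y > x and z > x:          # y and z are largest
        return y * y + z * z
    if x > y and z > y:          # x and z are largest
        return x * x + z * z
    if x > z and y > z:          # x and y are largest
        return x * x + y * y
    return None                  # falls through, e.g. when all three are equal

# Weakening the comparisons to >= breaks the ties:
def sum_square_largest_fixed(x, y, z):
    if y >= x and z >= x:        # x is a (weak) minimum
        return y * y + z * z
    if x >= y and z >= y:        # y is a (weak) minimum
        return x * x + z * z
    return x * x + y * y         # otherwise z must be a minimum
```

Note that the fixed variant can drop the third test entirely: if neither of the first two clauses matches, z has to be the smallest argument.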

Let's first sketch what these look like when plainly translated to machine code. Let's assume a generic MIPS-like machine:

Iterative:

       sub $1,$2,$4  # x - y
       sub $1,$3,$5  # x - z
       and $4,$5,$5  # if x is larger than either, yields a positive number
       bgt $5,next1  # and thus y and z are not the largest
       mul $2,$2,$4  # y * y
       mul $3,$3,$5  # z * z
       add $4,$5,$5  # y^2 + z^2
       jr  $31       # Return to caller

next1: sub $2,$1,$4  # y - x
       sub $2,$3,$5  # y - z
       and $4,$5,$5  # if y is larger than either, yields a positive number
       bgt $5,next2  # and thus x and z are not the largest
       mul $1,$1,$4  # x * x
       mul $3,$3,$5  # z * z
       add $4,$5,$5  # x^2 + z^2
       jr  $31       # Return to caller

next2: sub $3,$1,$4  # z - x
       sub $3,$2,$5  # z - y
       and $4,$5,$5  # if z is larger than either, yields a positive number
       bgt $5,next3  # and thus x and y are not the largest
       mul $1,$1,$4  # x * x
       mul $2,$2,$5  # y * y
       add $4,$5,$5  # x^2 + y^2
       jr  $31       # Return to caller

next3: We should never get here

Recursive:

I've been kind (or lazy) in the recursive version and turned the tail call into a plain jump. This eliminates the whole register save/restore and parameter-passing overhead, which isn't relevant to the computation anyway. It alters the locations of x, y, and z, but that doesn't matter either.

begin: sub $1,$2,$4  # x - y
       sub $1,$3,$5  # x - z
       and $4,$5,$5  # if x is larger than either, yields a positive number
       bgt $5,next   # and thus y and z are not the largest
       mul $2,$2,$4  # y * y
       mul $3,$3,$5  # z * z
       add $4,$5,$5  # y^2 + z^2
       jr  $31       # Return to caller

next:  add $1,$0,$4  # x to temp
       add $2,$0,$1  # y to old x ($1 = y)
       add $3,$0,$2  # z to old y ($2 = z)
       add $4,$0,$3  # x to old z ($3 = x)
       j   begin     # Do it again with shuffled x, y, and z  

We can easily see that the recursive version is necessarily slower, since it has to shuffle the parameters after each unsuccessful comparison. In the iterative version this shuffling is implicit: it was determined at compile time and expressed as the appropriate register accesses.

However, to my eye, the recursive version conveys the meaning of the solution better: it makes obvious that the comparison is to be applied to each possible combination of parameters until a match is found (there *has* to be one, given an ordered set of possible values).
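That reading can be made concrete with a direct transliteration of the recursive version (Python here as a stand-in for the Scheme; I've used >= so that equal arguments can't rotate forever, which the original would):

```python
def sum_square_largest_rec(x, y, z):
    # Test one arrangement; on failure, rotate the arguments and retry --
    # the same shuffle the 'next:' block performs in the machine code.
    if y >= x and z >= x:            # x is the smallest, so y and z
        return y * y + z * z         # are the two larger numbers
    return sum_square_largest_rec(y, z, x)

print(sum_square_largest_rec(3, 1, 2))  # 13
```

At most three calls are ever made, since some argument must be less than or equal to the other two, and the rotation eventually brings it into the first position.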

We do know that the recursion will never need to repeat more than three times, so let's unroll the recursive code three times and see what we get:

Unrolled Recursive:

       sub $1,$2,$4  # x - y
       sub $1,$3,$5  # x - z
       and $4,$5,$5  # if x is larger than either, yields a positive number
       bgt $5,next1  # and thus y and z are not the largest
       mul $2,$2,$4
       mul $3,$3,$5
       add $4,$5,$5  # return value
       jr  $31       # Return to caller

next1: add $1,$0,$4  # x to temp
       add $2,$0,$1  # y to old x ($1 = y)
       add $3,$0,$2  # z to old y ($2 = z)
       add $4,$0,$3  # x to old z ($3 = x)

       sub $1,$2,$4  # y - z
       sub $1,$3,$5  # y - x
       and $4,$5,$5  # if y is larger than either, yields a positive number
       bgt $5,next2  # and thus z and x are not the largest
       mul $2,$2,$4
       mul $3,$3,$5
       add $4,$5,$5  # return value
       jr  $31       # Return to caller

next2: add $1,$0,$4  # y to temp
       add $2,$0,$1  # z to old y ($1 = z)
       add $3,$0,$2  # x to old z ($2 = x)
       add $4,$0,$3  # y to old x ($3 = y)

       sub $1,$2,$4  # z - x
       sub $1,$3,$5  # z - y
       and $4,$5,$5  # if z is larger than either, yields a positive number
       bgt $5,next3  # and thus x and y are not the largest
       mul $2,$2,$4
       mul $3,$3,$5
       add $4,$5,$5  # return value
       jr  $31       # Return to caller

next3: We should never get here

Notice something? There are a lot of repeated calculations operating on the same registers. We can factor these fragments out into subroutines and tie them back together with subroutine calls, resulting in threaded code that is both smaller and clearer:

Threaded Recursive:

permute: add $1,$0,$4  # a to temp
         add $2,$0,$1  # b to old a ($1 = b)
         add $3,$0,$2  # c to old b ($2 = c)
         add $4,$0,$3  # a to old c ($3 = a)
         jr $31

test:    sub $1,$2,$4  # a - b
         sub $1,$3,$5  # a - c
         and $4,$5,$5  # if a is larger than either, yields a positive number
         jr $31

calc:    mul $2,$2,$4
         mul $3,$3,$5
         add $4,$5,$5  # return value
         add $31,$0,$0 # pop stack 
         jr  $31       # Return to caller (of 'do-all')

do-one:  jal test
         bgt $5,next
         j   calc      # tail-call
next:    jr  $31

do-all:  jal do-one
         jal permute
         jal do-one
         jal permute
         j   do-one    # tail-call
         jr  $31

OK, yes, popping the stack in 'calc' is a hack, but it is much clearer than inserting another 'bgt $5' test after each call to 'do-one', and it keeps the correspondence to the original unrolled recursive code. A better approach would be to eliminate 'do-one', change the test to a 'blt $5' after each call to 'test', and branch to a single instance of 'calc' at the end. Alternatively, not reusing register $5 opens up other possibilities.

Imagine writing *all* your code like this, out of simple little subroutines tied together by implicit data locations, and you've got yourself the basis for a very nice programming system. If your implicit data is a stack, you've got Forth or Joy. If it's a list, you've got Lisp or Scheme. If it's a dictionary, then you're probably heading towards something like Lua or Python, and so on.
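As a sketch of the stack flavour of that idea (Python standing in for Forth; all the names here are my own invention), here are tiny routines that communicate only through a shared stack, strung together as lists of function references:

```python
stack = []  # the implicit data location shared by every routine

def push(n):
    def op():
        stack.append(n)
    return op

def dup():
    stack.append(stack[-1])

def mul():
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)

def add():
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

def run(thread):
    # The "inner interpreter": walk the thread, calling each routine in turn.
    for word in thread:
        word()

# y^2 + z^2 for y = 3, z = 4, written Forth-style:
run([push(3), dup, mul, push(4), dup, mul, add])
print(stack.pop())  # 25
```

No routine takes arguments or returns values; everything flows through the stack, exactly as registers $1-$5 carry everything between 'test', 'permute', and 'calc' above.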

But these are all dynamic, interpreted languages to a large extent. How could we write the simpler recursive or threaded code but get the faster iterative code under the hood?

Well, one big feature these languages give us is the ability to define code at runtime. So I'll assume a function called "compile" which takes its arguments, places them in the source code given to it, and creates whatever internal representation is used by the language (bytecode, for example).

This is Lisp-ish to match the original code, but I don't know Lisp. Don't expect it to be actually correct; it's only meant to convey the meaning.

First, we create a function which compiles particular instances of the test and sum-of-squares calculation:

(define (compile-sum-square-largest-test x y z)
    (compile ((and (> y x) (> z x)) ;; y and z are largest
          (+ (* y y) (* z z)))))

We can then map that function onto each permutation of the arguments and give the output as the guts of a conditional expression:

(define (compile-sum-square-largest x y z)
    (cond (map compile-sum-square-largest-test (permutations (list x y z)))))

And finally, we wrap it up under the function name and arguments we want and "execute" on-the-fly the result of the compilation just done:

(define (sum-square-largest x y z)
    (execute (compile-sum-square-largest x y z)))

Of course, we could also do all this ahead of time to avoid compiling every time we call the sum-square-largest function.
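For what it's worth, the whole scheme can be written out in working form in Python (my substitution for the Lisp-ish pseudocode; "ahead of time" here just means once, when the function is built): generate the source text of one clause per permutation of the argument names, then compile and execute it to get an ordinary function.

```python
from itertools import permutations

# One clause per permutation (a, b, c): if a is the (weak) smallest,
# then b and c are the two larger numbers.
CLAUSE = "    if {b} >= {a} and {c} >= {a}: return {b}*{b} + {c}*{c}\n"

def compile_sum_square_largest():
    src = "def sum_square_largest(x, y, z):\n"
    for a, b, c in permutations(("x", "y", "z")):
        src += CLAUSE.format(a=a, b=b, c=c)
    namespace = {}
    exec(compile(src, "<generated>", "exec"), namespace)
    return namespace["sum_square_largest"]

sum_square_largest = compile_sum_square_largest()
print(sum_square_largest(3, 1, 2))  # 13
```

Each unordered pair shows up in two of the six permutations, so two generated clauses per test are redundant, but the first match wins and the result is the same as the hand-written cond.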

Still, it would be a good thing if the recursive expression somehow automatically reduced to the iterative machine code. It might be possible for a compiler to do it by transforming the register permutations into register renaming, via Static Single Assignment (SSA), before register allocation.

Like I said earlier, it wouldn't surprise me if this problem has been solved already, but I thought it was at least a nice little exercise.


fpgacpu.ca