Blogging about our lives online.


px Language - Speed Tests

In order to more fully understand how I'm going to implement the translator from my imaginary programming language, I have written a few direct translations from px to assembly. My thinking is that if the language doesn't rely on the abstractions provided by the "compile to C" interpreted languages, I should be able to end up with a systems level language with the clean syntax of python, ruby, et al.

My first example is to calculate the sum of a large list. Here are two versions in python.

UPDATE!!! After trying a few more versions, I realized that my loop counters were off in every example. I had made the silly python mistake of forgetting the range of range. range(0, 123456789) = 0,1,2...123456788. And I had subsequently propagated that mistake into my C and Assembly versions, producing the wrong answer in every case, sometimes very fast! Lesson learned: Check your math!

# - calculate 1 + 2 + ... + 12,345,678 = listsum

listsum = 0
for f in range(0,12345678):
    listsum = listsum + f
print "The result is " + str(listsum)

listsum = sum(range(0,12345678))
print "The result is " + str(listsum)

And here's what it might look like in px.

print listsum buildlist 12345678

But, of course, no library functions exist yet, so I'll spell out how the functions listsum and buildlist would be defined.

n... buildlist size, next 0:
    next next + 1
    n next if next < size
n listsum list...:
    break if not list
    n n + list
    continue True

buildlist is defined as taking one argument, size and producing a list, one item at a time. Listsum consumes these items and produces a final value when the whole list has been read. All items are passed as two 32-bit values, [type/status] and [value/pointer]. Each operation checks the status flag to check if it is receiving a valid object and what that object is exactly. The buildlist function uses this fact to push a false object onto the stack when it reaches "size". This tells listsum that there are no more values to be read. In assembly this is what the code structure looks like in simplified form:

list_size:  .int 12345678
l_sum:     .int    0

push $1
push list_size

pop list_size
pop edi            #using edi as the status flag
mov 0, ecx    #and ecx for the counter
    compare list_size, ecx
    cmovae 0, edi   #if above, set edi to 0 (false)
    add 1, ecx
    push edi
    push ecx
        pop eax
        pop edi
        cmp 0, edi
        je end_listsum
        add eax, l_sum
        jmp buildlist
    print l_sum

In the actual implementation, I changed the add to support a quadword because the results get large quite fast, but this gives you the idea.

Is It Fast?

Well, I knew it would be fast, but this is a pretty large list, the code between buildlist and end_listsum has to run over 12 million times to calculate the sum, this is what I got when comparing the two python versions and my assembly version:

$ time python
The result is 76207876467003

real 0m15.379s
user 0m14.819s
sys 0m0.557s

$ time python
The result is 76207876467003

real 0m8.123s
user 0m7.476s
sys 0m0.647s

$ time ./listsum
The result is 76207876467003

real 0m0.275s
user 0m0.273s
sys 0m0.000s

That's pretty blazing fast if you ask me! But I was put in my place when I ran the C version which produced these results:

time ./f_sum3
The answer is 76207876467003.000000

real 0m0.123s
user 0m0.120s
sys 0m0.000s

I used a floating point double for the list sum, so maybe that's why C was faster. Still, if we use C as the baseline we have:

C       1
px      2.2
Py2     66.04
Py1     125.03

Not bad for a start. Now I'm off to dig through the compiled C code to see how it should be done...

UPDATE #2 - I've realized that there is a quite simple algorithm to solve this particular problem and that is: (n/2) * (n+1). This simple equation eliminates the loop completely making the calculation trivial. The updated python version runs in 0.038s and the C version runs in 0.002s producing the, now correct sum, 76207888812681.


  1. I figured out the integer type I should have been using "uint_64_t" and the proper format string for printf is %llu. Turns out this one runs even faster, computing the value in 0.064 s. Much of the speed gains seem to be related to accessing the stack directly, rather than pushing and popping the stack. The compiler also byte-aligns the stack before using it. After implementing these optimizations, my assembly version clocks in at 0.095 and that includes the listsum checking at every step if it's receiving valid data. Sweet!

  2. As another test, I pulled all the heavily used data off the stack and kept them in the registers. The result 0.047s! I've beat the C version, but only because there are very few variables in play. The 8 bytes of the sum stay in the EDI and ESI registers, The status stays in the ECX register and the loop counter stays in the EAX/EDX. Strangely, it performs faster when the total loop count is on the stack vs. in EBX, maybe it's a processor feature for static values?