Microbenchmarks comparing = and eq? in Racket

Comparing = and eq?

The other day I wanted to compare the difference in the execution time between = and eq? in Racket when both arguments are fixnum, i.e. small integers. I used two almost equal microbenchmarks. The strange thing is that in one case when I made the change the time difference was almost 3 seconds and in the other it was only 0.3 seconds. So I started to investigate more ...

The goal of the experiment was to see what happens if you have a program with something like

(lambda (vec)
  (when (= (vetor-length vec) 3)
    (something-interesting-with vec ...)))

As (vector-length v) it is a fixnum, it is the same to use eq? and =, but = is nicer, it is more idiomatic. Is it possible to improve Racket optimizer to make this change during compilation? How much time can we gain?

Using eq? is faster because it simply compares pointers, or in this case the value of the fixnum. (Pointers have even addresses, the finums are represented internally as an odd "pointer", the number n is represented by 2*n+1). Then the implementation of eq? is reduced to something like a = in C or its equivalent in assembler.

Instead = has to consider that the values to compare may been fixnum, or flomums or bignum or complex or ... and convert from one to another if necessary. So it seems that it will take a little longer to run.

To measure this, we will compare how long it takes to run:
(for ([i (in-range 1000000000)])
(= x 3))))
(for ([i (in-range 1000000000)])
(eq? x 3))))

using a value of x that is a fixnum. In this case, we will make x to be (vector-length vec), where vec is a vector of length 3.

The problem is that the value calculated inside the for is discarded. That always makes me nervous in microbenchmarks. So let's try also with for/and, that doesn't calculate anything interesting in this case, but at least it formally look's at the result of the expression that is inside the for.

The program

Let's go directly to the full program:

#lang racket (define-syntax-rule (gc/time label expr) (begin (collect-garbage) (collect-garbage) (collect-garbage) (display label) (time expr)))	A macro to measure the time. The recommendation is to call three times the garbage collector to clean everything, including finalizers, if any is present. I think it is not necessary in this case, but it doesn't hurt.
(define vec (if (zero? (random 2)) (vector 1 2 3) (vector 3 2 1))) (set! vec vec)	We define vec as a vector of length 3. We use an odd way to confuse the optimizer. Actually, I guess that using just set! is enough to confuse the optimizer. Because set! marks the variable as mutable and that disables almost all optimizations.
(for ([r (in-range 7)]) (gc/time "=: and: " (let ([x (vector-length vec)]) (for/and ([i (in-range 1000000000)]) (= x 3)))) (gc/time "eq: and: " (let ([x (vector-length vec)]) (for/and ([i (in-range 1000000000)]) (eq? x 3)))) (gc/time "=: plain: " (let ([x (vector-length vec)]) (for ([i (in-range 1000000000)]) (= x 3)))) (gc/time "eq: plain: " (let ([x (vector-length vec)]) (for ([i (in-range 1000000000)]) (eq? x 3)))) )	We repeat the measurements 7 times, to get lucky. With for/and and for, with = and eq?. Each copy defines its own x, so that the information discovered in one measurement does not affect the other measurements.

Results

The first and most important thing to remember measuring the time from the command line using
racket file.rkt
and not inside DrRacket. (DrRacket has many features enabled to make it easier to debug, and consumes a lot of memory, so it changes the times of the measurements.)

For this kind of measurements, I prefer that the number of repetitions is big enough to make each measured take between 5 and 10 seconds. Normally less than 5 seconds is very noisy (less than 1 second is vvveeerrryyy noisy (less than 100 ms is ridiculously noisy)). More than 10 seconds is boring. With 1000000000 repetitions I get a reasonable range on my computer. (Note: With a faster computer, you may have to use a larger number. But in Racket 32 bit the largest fixnum is 1073741823, and it's better to nest two fors than using a counter that is a bignum. Otherwise a large part of the time would be spent in bignum calculations and that would mask the time difference between eq? and =.)

These are the results, using Racket 6.3, sorted in each column from lowest to highest:

	= + for/and	eq? + for/and	= + for	eq? + for
time (ms)	10561	10421	7504	4665
(sorted)	10608	10421	7535	4695
	10640	10467	7566	4695
	10655	10483	7597	4696
	10858	10593	7613	4727
	11060	10764	7894	4789
	11607	10795	8065	4852
average (ms)	10855.6	10563.4	7682.0	4731.3
diff: = vs eq?		292.1		2950.7
		2.7%		38.7%
sigma (ms)	374.4	158.7	196.5	61.2

Time comparison, sorting the measurements

Starting the time axis from 0

When we compare the versions that use for, the difference between = and eq? is 2950.7ms on average (it´s a 38.7%). It is much larger than the sigmas, so clearly there is a improvement.

When we compare the versions that use for/and, the difference between = and eq? is 292,1ms on average (it's a 2.7%). It's almost the same size as the sigmas, so we should call a statistician for an analysis. I get p = 0.093 with two tails, and p = 0.047 with one tail. It's not enough for a Nobel prize, but it's ok for a blog article. I swore I didn't not repeat all this 10 times. (Note: To reduce the p we should increase the amount of repetitions, I guess it would be enought to get 30 aproximately, this is left as an exercise.)

As I said, I was very surprised that in one case the difference is 293.1ms and in the other case it's 2950.7ms. What's going on?

Decompiling

To understand what happens we need to compile and decompile the file, to see what is the real code that runs in each case. (Note: Actually, Racket has a JIT stage with some additional optimizations, so there could be more surprises. There is a package to look at the code in assembler, but it is more difficult to interpret the assembler code, except for very short functions..)

The idea is to run
    raco make file.rktkt
    raco decompile filerkt
Or to read everything easily:
    raco make file.rkt
    raco decompile file.rkt > file-dec.rktd

OK, the decompiled code is long and quite incomprehensible. The decompiler tries to transform back the internal structures to something that looks enough like a Racket so that you can understand it, but there are still some parts where the internal structures are visible. Furthermore, almost all variable names are lost, so the decompiler invents one for them using the internal variable type and number.

To understand better what is happening, I cleaned up the automatic version, I put cute names to variables, and I fixed the format and some details. My hand decompiled version has a lot of creativity, but I think it keeps the spirit of the automatic version enough to understand what is the difference between the program that uses for and for/and.

Header and main loop

The program structure after compiling, decompiling and creatively cleanup is:

#lang racket (require (only-in racket/unsafe/ops [unsafe-fx+ fx+] [unsafe-fx< fx<])) (define (print-values v) v) (define (#%checked x) x) (define (#%sfs-clear x) x) (define-syntax-rule (#%let/closed name vars body ...) (let name vars body...))	Totally fake header. It's doesn't look like the decompiled version at all, but it has all the necessary definitions to make possible to compile the rest of the program.
(define-syntax-rule (gc/time label expr) 'looong-expanded-version...)	The definition of the macro is expanded and occupies many lines in the decompiled program. We don't need it now, so let's replace it creatively by this.
(define vec (if (zero? (random 2)) (vector 1 2 3) (vector 3 2 1))) (print-values (set! vec (#%checked vec)))	The print-values is necessary here because low level Racket doesn't show the expressions calculated on the module, unlike the high-level behavior. The #%checked is there to check that vec has already been defined, and otherwise generate an error. The compiler does not put too much effort into the mutable variables, so instead of deciding whether or not it is necessary to check it, the compiler adds a check just in case.
(define (lift_eq_plain) ...) (define (lift_=_plain) ...) (define (lift_eq_and) ...) (define (lift_=_and) ...)	The expressions that we want to time are lifted and they appear as definitions. And each one is explicitly expressed as a lambda, but I hid them in a define. For some reason, they are in reverse order. This is the most interesting part, so I'll leave the details for later ...
(print-values (let for-loop ([ret (void)] [r 0]) (if (fx< r 7) (begin (#%sfs-clear ret) (begin (collect-garbage) (collect-garbage) (collect-garbage) (display "=: and: ") (time (lift_=_and)) (collect-garbage) (collect-garbage) (collect-garbage) (display "eq: and: ") (time (lift_eq_and)) (collect-garbage) (collect-garbage) (collect-garbage) (display "=: plain: ") (time (lift_=_plain)) (collect-garbage) (collect-garbage) (collect-garbage) (display "eq: plain: ") (time (lift_eq_plain)) (for-loop (void) (fx+ r 1)))) ret)))	This is the main loop, where we repeat everything 7 times. The macro that enclosed the code to measure the time is expanded and it calls the lifted functions that are defined above. Actually, time is also a macro that is expanded. But to get a code that is not so long, I unexpanded it creatively. The condition of the main loop appears here with a fx+ and fx<, but they really unsafe-fx+ and unsafe-fx<. The problem is that the decompiled code with so many unsafe is horrible. So I changed them creatively to fx+ and fx< and I put a require that renames them in the fake header. The #%sfs-clear is used to clean the variable. This cleanup is not visible from Racket and is only used when the compiler realizes it will never use that variable again. In this case the variable is (void) so it doesn't matter if it is cleaned or not. If it were a variable with a large value, for example (make-vector 1000), this clean up is important for recursion to not exhaust all the memory with variables that are no longer going to be used. The (let for-loop (...) ..) is also expanded in a letrec and a lambda. I unexpanded it because for me it is easier to understand. Where it says (void), it is actually a #<void>, because the code has the value instead of the function call.

Up to now, we explained the meaning of all functions defined in the fake header, and the difference between the true version and header's version. It remains to explain the macro #%let/closed.

I also hope it is clear that I made several changes that simplify the code, but they correspond to things that are expanded in the actual decompiled version.

Comparing the versions with for/and

Let's compare the versions that use for/and, changing = by eq?

Version with = and for/and	Version with eq? and for/and
(define (lift_=_and) (let ([x (vector-length (#%checked vec))]) (let ([t1 (= x 3)]) (if t1 (let ([t2 (= x 3)]) (if t2 (let ([t3 (= x 3)]) (if t3 (let ([t4 (= x 3)]) (if t4 (#%let/closed for-loop ([x_ x] [t4_ t4] [i4_ 4]) (if (fx< i4_ 1000000000) (let ([t5_ (= x_ 3)]) (if t5_ (let ([i5_ (fx+ i4_ 1)]) (if (fx< i5_ 1000000000) (let ([t6_ (= x_ 3)]) (if t6_ (let ([i6_ (fx+ i5_ 1)]) (if (fx< i6_ 1000000000) (let ([t7_ (= x_ 3)]) (if t7_ (let ([i7_ (fx+ i6_ 1)]) (if (fx< i7_ 1000000000) (let ([t8_ (= x_ 3)]) (if t8_ (for-loop x_ t8_ (fx+ i7_ 1)) #f)) t7_)) #f)) t6_)) #f)) t5_)) #f)) t4_)) #f)) #f)) #f)) #f))))	(define (lift_eq_and) (let ([x (vector-length (#%checked vec))]) (let ([t1 (eq? x 3)]) (if t1 (let ([t2 (eq? x 3)]) (if t2 (let ([t3 (eq? x 3)]) (if t3 (let ([t4 (eq? x 3)]) (if t4 (#%let/closed for-loop ([x_ x] [t4_ t4] [i4_ 4]) (if (fx< i4_ 1000000000) (let ([t5_ (eq? x_ 3)]) (if t5_ (let ([i5_ (fx+ i4_ 1)]) (if (fx< i5_ 1000000000) (let ([t6_ (eq? x_ 3)]) (if t6_ (let ([i6_ (fx+ i5_ 1)]) (if (fx< i6_ 1000000000) (let ([t7_ (eq? x_ 3)]) (if t7_ (let ([i7_ (fx+ i6_ 1)]) (if (fx< i7_ 1000000000) (let ([t8_ (eq? x_ 3)]) (if t8_ (for-loop x_ t8_ (fx+ i7_ 1)) #f)) t7_)) #f)) t6_)) #f)) t5_)) #f)) t4_)) #f)) #f)) #f)) #f))))

Version with = and for/and

Version with eq? and for/and

(define (lift_=_and)
  (let ([x (vector-length (#%checked vec))])
    (let ([t1 (= x 3)])
      (if t1
        (let ([t2 (= x 3)])
          (if t2
            (let ([t3 (= x 3)])
              (if t3
                (let ([t4 (= x 3)])
                  (if t4
                    (#%let/closed for-loop ([x_ x] [t4_ t4] [i4_ 4])
                      (if (fx< i4_ 1000000000)
                        (let ([t5_ (= x_ 3)])
                          (if t5_
                            (let ([i5_ (fx+ i4_ 1)])
                              (if (fx< i5_ 1000000000)
                                (let ([t6_ (= x_ 3)])
                                  (if t6_
                                    (let ([i6_ (fx+ i5_ 1)])
                                      (if (fx< i6_ 1000000000)
                                        (let ([t7_ (= x_ 3)])
                                          (if t7_
                                            (let ([i7_ (fx+ i6_ 1)])
                                              (if (fx< i7_ 1000000000)
                                                (let ([t8_ (= x_ 3)])
                                                  (if t8_
                                                    (for-loop x_ t8_ (fx+ i7_ 1))
                                                    #f))
                                                t7_))
                                            #f))
                                        t6_))
                                    #f))
                                t5_))
                            #f))
                        t4_))
                    #f))
                #f))
            #f))
        #f))))

(define (lift_eq_and)
  (let ([x (vector-length (#%checked vec))])
    (let ([t1 (eq? x 3)])
      (if t1
        (let ([t2 (eq? x 3)])
          (if t2
            (let ([t3 (eq? x 3)])
              (if t3
                (let ([t4 (eq? x 3)])
                  (if t4
                    (#%let/closed for-loop ([x_ x] [t4_ t4] [i4_ 4])
                      (if (fx< i4_ 1000000000)
                        (let ([t5_ (eq? x_ 3)])
                          (if t5_
                            (let ([i5_ (fx+ i4_ 1)])
                              (if (fx< i5_ 1000000000)
                                (let ([t6_ (eq? x_ 3)])
                                  (if t6_
                                    (let ([i6_ (fx+ i5_ 1)])
                                      (if (fx< i6_ 1000000000)
                                        (let ([t7_ (eq? x_ 3)])
                                          (if t7_
                                            (let ([i7_ (fx+ i6_ 1)])
                                              (if (fx< i7_ 1000000000)
                                                (let ([t8_ (eq? x_ 3)])
                                                  (if t8_
                                                    (for-loop x_ t8_ (fx+ i7_ 1))
                                                    #f))
                                                t7_))
                                            #f))
                                        t6_))
                                    #f))
                                t5_))
                            #f))
                        t4_))
                    #f))
                #f))
            #f))
        #f))))

The first thing we notice is that they are almost the equal. This is not very surprising. So the difference in time is only due to the change of = by eq?.

It is the equivalent to analyze either versions. The automatic decompiled version has two lambdas in each panel. I liked the MIT notation that hides the lambdas, so one of them is hidden inside the define. I hid the other lambda in a named let.

During the first expansions, for/and becomes a letrec and lambda (and some ifs obviously). The lambdas are inlined, so the final code looks similar to a loop unrolling.

The first 4 cycles are in the white part. It's visible how it checks 4 times if (= x 3). It isn't visible that it checks if (fx< i 1000000000) because the value of i is known, then it becomes something like (fx< 1 1000000000) that is clearly #t, so (if (fx< 1 1000000000) tb fb) disappears. So the counter i starts from 4 rather than starting from 0.

The cyan part makes the other 999999996 loop iterations. Again, due to inlining, each function for-loop tests the conditions 4 times before calling itself recursively. This reduces the number of actual function calls, that are expensive.

The for-loop function has 3 arguments, which is 1 more than expected:

The counter i_, which obviously must be there.
The value t4_ to return if we are just at the last iteration.
The value x4_ that is equals the value of x. When for-loop is called recursively the value of x_ is equal to the previous value of x_ that was x. (Got that?)

This is a very nice trick. Instead of using the value of x that is defined outside, the function uses the value of x_ that is an argument, but they actually have the same value. With this change the function for-loop doesn't have to use any external local variables, it just uses its arguments and internal variables. This makes it a closed closure (what normal people call function) and it can be implemented more efficiently.

In the handmade decompilation I defined the function for-loop using a #%let-loop/closed which is a macro that simply becomes a named let. Actually, as it is a closed closure, it is defined by a special structure %#closed. The idea is that as it doesn't dependent on the local variables, so it can be implemented as if it were a separate function defined with a define. With this it possible to create fewer closures and everything goes faster. There are some details that still do not understand ... But the decompiler inserts the definition here instead of putting it in a separate define, so I'm not lying too much in this part.

Comparing the versions with for

Let's compare the versions that use for, changing = by eq?

Version with = and for	Version with eq? and for
(define (lift_=_plain) (let ([x (vector-length (#%checked vec))]) (begin (= x 3) (= x 3) (= x 3) (= x 3) (= x 3) (#%let/closed for-loop ([x_ x] [t5_ (void)] [i5_ 5]) (if (fx< i5_ 1000000000) (begin (= x_ 3) (let ([i6_ (fx+ i5_ 1)]) (if (fx< i6_ 1000000000) (begin (= x_ 3) (let ([i7_ (fx+ i6_ 1)]) (if (fx< i7_ 1000000000) (begin (= x_ 3) (let ([i8_ (fx+ i7_ 1)]) (if (fx< i8_ 1000000000) (begin (= x_ 3) (let ([i9_ (fx+ i8_ 1)]) (if (fx< i9_ 1000000000) (begin (= x_ 3) (for-loop x_ (void) (fx+ i9_ 1))) (void)))) (void)))) (void)))) (void)))) t5_)))))	(define (lift_eq_plain) (let ([unused_x (vector-length (#%checked vec))]) (#%let/closed for-loop ([t10_ (void)] [i10_ 10]) (if (fx< i10_ 1000000000) (begin (#%sfs-clear t10_) (let ([i11_ (fx+ (#%sfs-clear i10_) 1)]) (if (fx< i11_ 1000000000) (let ([i12_ (fx+ i11_ 1)]) (if (fx< i12_ 1000000000) (let ([i13_ (fx+ i12_ 1)]) (if (fx< i13_ 1000000000) (let ([i14_ (fx+ i13_ 1)]) (if (fx< i14_ 1000000000) (for-loop (void) (fx+ i14_ 1)) (void))) (void))) (void))) (void)))) t10_))))

Version with = and for

Version with eq? and for

(define (lift_=_plain)
  (let ([x (vector-length (#%checked vec))])
    (begin
      (= x 3)
      (= x 3)
      (= x 3)
      (= x 3)
      (= x 3)
      (#%let/closed for-loop ([x_ x] [t5_ (void)] [i5_ 5])
        (if (fx< i5_ 1000000000)
          (begin
            (= x_ 3)
            (let ([i6_ (fx+ i5_ 1)])
              (if (fx< i6_ 1000000000)
                (begin
                  (= x_ 3)
                  (let ([i7_ (fx+ i6_ 1)])
                    (if (fx< i7_ 1000000000)
                      (begin
                        (= x_ 3)
                        (let ([i8_ (fx+ i7_ 1)])
                          (if (fx< i8_ 1000000000)
                            (begin
                              (= x_ 3)
                              (let ([i9_ (fx+ i8_ 1)])
                                (if (fx< i9_ 1000000000)
                                  (begin
                                    (= x_ 3)
                                    (for-loop x_ (void) (fx+ i9_ 1)))
                                  (void))))
                            (void))))
                      (void))))
                (void))))
          t5_)))))

(define (lift_eq_plain)
  (let ([unused_x (vector-length (#%checked vec))])
    (#%let/closed for-loop ([t10_ (void)] [i10_ 10])
      (if (fx< i10_ 1000000000)
        (begin
          (#%sfs-clear t10_)
          (let ([i11_ (fx+ (#%sfs-clear i10_) 1)])
            (if (fx< i11_ 1000000000)
              (let ([i12_ (fx+ i11_ 1)])
                (if (fx< i12_ 1000000000)
                  (let ([i13_ (fx+ i12_ 1)])
                    (if (fx< i13_ 1000000000)
                      (let ([i14_ (fx+ i13_ 1)])
                        (if (fx< i14_ 1000000000)
                          (for-loop (void) (fx+ i14_ 1))
                          (void)))
                      (void)))
                  (void)))
              (void))))
        t10_))))

The first thing we noticed is that they are shorter than the previous versions. Let me count ... The former versions had 39 lines and these have 34 and 20.

Let's consider first the version that use for and =.

In the for/and we have to use an if to see if the result (= x 3) is #f, in the version with for the value of (x = 3) is ignored. So the if disappears and is replaced by a begin. That change saves time and therefore the version with for and = is faster.

Now the first 5 cycles are outside the for-loop function so the counter i starts from 5. In addition, the for-loop includes 5 cycles in each recursion in the cyan part. I guess there is an additional inlining because it's smaller. I don't think this changes the runtime too much.

The for-loop function has again 3 arguments:

The counter i_, which obviously must be there.
The value t5_ to return if we are just at the last iteration, which is not very useful because it's always (void).
The value x_ that is equals the value of x, to use the same trick that we used in the other functions to get a closed closure.

Looking carefully, the function checks 1000000000 times that (= x 3) does not generate an error, and when it's completely sure, it returns (void). This is faster than previous versions that checked that (= x 3) doesn't generate an error and that the value was not #f.

Let's look now at the version that use for and eq?.

Now the firs 10 cycles are outside the function for-loop so i starts from 10. In addition, the for-loop includes 5 cycles in each recursion in the cyan part. I guess there is even more inlining because the funtion is even smaller. (10 = 2 * 5?)

As eq? never never never fails (when called with two arguments), then (eq? x_ 3) can be omitted, along with the begin that was needed to hold the (= x 3). Moreover, as x is never used, the compiler marks it with a special flag so that the result is not saved. This may not save much memory in this case, but it can be useful if x were a large vector or something.

Unlike the previous functions, for-loop has only 2 arguments:

The counter i_, which obviously must be there.
The value t10_ to return if we are just at the last iteration, which is not very useful because it's always (void).
The value of x is not used, so it's not necessary to add an extra argument x_.

Actually, this function just counts to 1000000000 and returns (void). Obviously faster.

Conclusions

Instead of changing the = for eq?, we could change it for unsafe-fx=. I suppose that eq? and unsafe-fx= do the same calculation, which is equivalent to a = in C or it's version in assembler. I like unsafe-fx= a little more. Generalizing the same idea, we could change the < for unsafe-fx< and in a similar way many other comparators.

In a realistic case, if we calculate (= x 3) the program probably will do something with the result, so it's much more realistic the improvement of the first comparison using for/and than the change in the second using for.

By changing = for eq? we obtained a 2.7% reduction in a microbenchmark. In a more realistic case I expect the remaining calculations to be more interesting and take more time, so I guess that the improvement will be reduced to 1% or 0.5%, after changing all the operations for its unsafe analog. This 1% improvement is similar to the improvement obtained in some handmade previous experiments, so it sounds good, but I have no systematic measurement.

The Implementation of = is more complicated than the implementation of eq? because it has to consider many types of numbers and their conversions. But the time difference is small. I guess the first comparison is related to the case were both arguments are fixnum. This makes sense because the way fixnum are represented, because otherwise the implementation has to look if the argument are actual pointer (even) or fake "pointers" (odd) that represent fixnum. I should look at the implementation ...

Dear future reader: These measurements were made using 6.3 Racket. In a future version, the optimizer may change automaticaly = to unsafe-fx= under the hood, when it's safe and the change makes sense. It's too late include this in version 6.4, but it will probably be in version 6.5. The idea is to be able to write nice code that runs as fast as the ugly code. So, if you repeat the measurements and the run time with = and eq? are equal, probably the optimizer is already making the change automatically.

To think about

Measure everything many more times and see if the statistics improve. (I vote for at least 30 times.)

Run the decompiled version, to see what happens in a "second round" of compilation. (It seems that the runtimes are more or less equal, I had no patience to wait for it to finish.)

Try some variants, for example put the vector-length inside the cycle. (The analysis is more complicated because the optimizer learns that v is a vector and begins to optimize vector-length.)

    (let ([v vec])
      (for/and ([i (in-range 1000000000)])
        (= (vector-length v) 3))))

Gustavo

7 de febrero de 2016