Dive into Assembly Optimisation for Multiplication

Posted on Sat 16 October 2021 in Post

This is a bit of research into how multiplying a variable by a constant integer is compiled into x86 assembly instructions. We will look at how the generated instructions change as the constant integer changes.

Setup and Scripts

We start with a mulk.c file which contains a function that takes a parameter and returns the parameter multiplied by a constant. This constant can be chosen at compile time using a Makefile. We can then script a loop to run this Makefile for a range of values.

Here are the scripts:

mulk.c

long mulk(long a) {
    return a * K;
}

Makefile

# Set up a default value for K.
# To use other values, run make with
# the preferred value of K:
# make k=5
# make k=7
# etc.
k?=32
# Allow the user to type K=5 or k=5
K?=$(k)

all:
        @gcc -O1 -S -DK=$(K) mulk.c -o mulk$(k).s
        @cat mulk$(k).s

This Makefile simply compiles the code into assembly, then prints out the assembly code.

Notes on Assembly

For the following snippets of assembly, I will include only the instructions that perform the multiplication. Remember that in the x86-64 calling convention, the %rdi register holds the first parameter passed to the function, which will be a from the C code, and the %rax register holds the return value, which will be the result of the multiplication. Using this Makefile, I can type the command make k=1 to compile with the constant integer 1, and I can replace 1 with any other integer.

Multiplying by Small Integers

// For k=1

movq    %rdi, %rax

Since multiplying by 1 returns the input value unchanged, this can be done simply by moving the input register to the output register.

// For k=2

leaq    (%rdi,%rdi), %rax

What leaq does is evaluate a simple arithmetic expression built from up to four values, written in the form Displacement(Base, Index, Scale). This expression evaluates to D + B + (I * S). leaq then stores the result in its destination register, %rax in this case. Contrast this with a movq using the same operand form: movq evaluates the expression, treats the result as a memory address, and moves the value located at that address into the destination register, whereas leaq keeps the computed address itself. This addressing form is normally used for array indexing, but it also works as a shorthand for simple arithmetic while being fast in hardware.

Evaluating this leaq gives %rax = %rdi + %rdi, or %rdi * 2.
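
To make the array-indexing connection concrete, here is a small C sketch (the function name is mine): taking the address of arr[i] for 4-byte ints is exactly a Base + (Index * Scale) computation, and gcc -O1 will typically compile it to a single leaq.

int *element_addr(int *arr, long i) {
    return &arr[i];   // likely compiles to: leaq (%rdi,%rsi,4), %rax
}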

// For k=3

leaq    (%rdi,%rdi,2), %rax

Another leaq. This one evaluates to %rax = %rdi + (%rdi * 2). An interesting question here is how leaq compares with addq: we have introduced a multiply by 2, and a point of this post is that multiplication is slow and can be optimised, so why is leaq multiplying? The answer is that the Scale factor is restricted to the powers of two 1, 2, 4 and 8, so the hardware can implement it as a cheap bit shift rather than a true multiply. You might think this would be no better than using addq twice, but leaq exists for calculating addresses, and modern CPUs have dedicated address-generation hardware that lets it execute as fast as a single addition. The details are out of scope for this blog post, but the point is that leaq is very fast in hardware.
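
As a rough sketch of the comparison (the function names are mine, and a real compiler would likely turn both versions into the same leaq anyway), computing a * 3 with additions alone needs two dependent adds, while leaq does it in one instruction:

long mul3_lea(long a) {
    return a + a * 2;    // one instruction: leaq (%rdi,%rdi,2), %rax
}

long mul3_adds(long a) {
    long t = a + a;      // first add
    return t + a;        // second add, must wait for the first
}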

// For k=4

leaq    0(,%rdi,4), %rax

Another leaq, which evaluates to 0 + 0 + (%rdi * 4); there is no base register this time, only a displacement of 0 and a scaled index.

// For k=5

leaq    (%rdi,%rdi,4), %rax

k=5 uses leaq as well, scaling by 4 and then adding the value itself: %rdi + (%rdi * 4) = %rdi * 5.

// For k=6

leaq    (%rdi,%rdi,2), %rax
addq    %rax, %rax

This is the first value to use more than one instruction. We see the leaq used for multiplying by 3, and then this result is added to itself to double it, making it 3 * 2 = 6. The reason the compiler does not simply write leaq 0(,%rdi,6) is that it cannot: the Scale field of the addressing form only allows the values 1, 2, 4 and 8. A single leaq can therefore only multiply by 1, 2, 3, 4, 5, 8 or 9, and any other constant has to be composed from several instructions, as here.
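
As a C sketch of the same decomposition (the function name is mine), the two instructions compute:

long mul6(long a) {
    long t = a + a * 2;   // leaq (%rdi,%rdi,2), %rax -> a*3
    return t + t;         // addq %rax, %rax          -> a*6
}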

Hypothesis

It seems that, first, the compiler favours leaq very highly. This is backed up by modern CPU hardware having dedicated address-generation units for leaq, which can make its processing cost appear almost 'free'. Second, a single leaq can only scale by 1, 2, 4 or 8, so larger multipliers need more instructions, leading to the process seen in k=6, where leaq computes half of the multiplication and adding the result to itself doubles it to complete the multiplication by 6.

We are doing multiplication here, and there is a specific one-line assembly instruction for it, known as imul (integer multiply), yet we haven't seen it. With a basic understanding of binary arithmetic, it makes sense that adding two binary numbers together is fast. Half adders and full adders can be created using a few logic gates per bit, and the sum bit of an addition is essentially the XOR of the two inputs. Multiplication of two registers is harder, because there is no similarly simple circuit for it.
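
A minimal illustration of that claim (a sketch of the logic, not how the ALU is literally wired): a one-bit half adder produces its sum bit with an XOR and its carry bit with an AND.

#include <stdio.h>

void half_add(unsigned a, unsigned b) {
    unsigned sum   = a ^ b;   // sum bit: XOR
    unsigned carry = a & b;   // carry bit: AND
    printf("%u + %u -> sum %u, carry %u\n", a, b, sum, carry);
}

int main(void) {
    half_add(0, 1);   // 0 + 1 -> sum 1, carry 0
    half_add(1, 1);   // 1 + 1 -> sum 0, carry 1
    return 0;
}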

My hypothesis as to why the compiler is not using imul is that it costs more time, or clock cycles, in the CPU hardware, and that addition, shifts and lea are faster. We can therefore assign a cost value to each instruction, where imul has a high cost and addq, leaq, sarq and salq have low costs. If we count the total number of these simple instructions used and compare it with the values of k at which the compiler starts to use imul, we can estimate the ratio of the costs: if each simple instruction costs 1 and the compiler gives up once a sequence would need more than n of them, then imul presumably costs about n.

Note: salq and sarq (shift arithmetic left/right) are used to bit-shift registers. A left shift multiplies by a power of 2, much as appending a zero to a decimal number multiplies it by 10.
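
For example (the function name is mine), shifting left by 3 doubles the value three times:

long times_eight(long a) {
    return a << 3;   // salq $3, %rax : a * 2 * 2 * 2 = a * 8
}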

Testing with Larger Integers

I created a bash script that compiles with values of k from 1 to 80. It prints out the number of times each instruction appears in the assembly code for each value of k.

#!/bin/bash

for number in {1..80}
do
        echo "===================$number"
        echo "imul ="
        make k=$number | grep -c imul
        echo "lea ="
        make k=$number | grep -c lea
        echo "add ="
        make k=$number | grep -c add
        echo "sub ="
        make k=$number | grep -c sub
        echo "sal ="
        make k=$number | grep -c sal
done

This reveals that the first time imul is used is for k=46.

// For k=46

imulq   $46, %rdi, %rax

Using the results for all values between 1 and 80, I counted the number of instructions used for each value of k and found the maximum was 3. So I investigated the code for values one instruction away from 46, such as 46/2 = 23, since addq can double a register in a single instruction.

// For k=23

leaq    (%rdi,%rdi,2), %rax
salq    $3, %rax
subq    %rdi, %rax

Assembly for k=23 uses 3 instructions, meaning that appending an addq %rax, %rax to double the result would compute k=46 with a total of 4 instructions.
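
In C terms (the function name is mine), the three instructions compute:

long mul23(long a) {
    long t = a + a * 2;   // leaq (%rdi,%rdi,2), %rax -> a*3
    t = t << 3;           // salq $3, %rax            -> a*24
    return t - a;         // subq %rdi, %rax          -> a*23
}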

Based on this finding, I believe the compiler tries to use up to 3 of these simple instructions to optimise the multiplication, and falls back to imul when the minimum number of simple instructions needed is greater than 3.

Explaining this Behaviour

So, based on our assumption that the compiler is assigning costs to these instructions, if addq, leaq, sarq and salq each have a cost of 1, then imul should have a cost of around 3-4, since this is the point at which the compiler chooses imul over a sequence of 4 simple instructions. I started to research the hardware costs of these instructions.

In old, simple CPUs, this cost is just the number of clock cycles each instruction takes to execute. In modern CPUs, counting clock cycles directly probably won't work, as they use parallel execution and other optimisations that allow out-of-order processing.

I found a document by Agner Fog of the Technical University of Denmark that assigns costs to assembly instructions, found at this link: https://www.agner.org/optimize/instruction_tables.pdf

For an old CPU like the Intel Pentium, this document lists the clock cycle count for each instruction.

This shows that mov takes 1 clock cycle. It also specifies that add, sub and lea take 1 cycle (it does not list sal or sar here). For imul it states 11 clock cycles, making imul cost 11 times as much as the simple instructions.

This supports the idea that imul costs more than the simple instructions, which is why the compiler prefers several simple instructions over one imul. However, the extra cost is much larger than we predicted: 11 times, rather than 3-4 times. To resolve this, we need to look at the tables for modern CPUs, which are conveniently provided in the same document.

For a modern CPU, I looked at Intel's 10th generation Ice Lake, released in 2019. Because of complex CPU optimisations that support out-of-order execution, the instructions are not measured in plain clock cycles but in latency. Latency here is still expressed in core clock cycles, making it effectively the same measure while accounting for the speed improvements from parallel execution and other optimisations: it measures how long the CPU must wait for the result of a calculation before it can be used, the same idea as the Pentium's cycle counts.


IceLake mov costs

IceLake lea costs

This shows that mov, lea, add, sub, sar and sal all cost around 1 in terms of latency. Compare this with the forms of imul.


IceLake imul costs

All the forms of imul have a latency of 3-4. This supports our estimate of the compiler valuing imul at 3-4 times the cost of the other simple instructions.

Conclusion and Findings

Looking at the assembly code created by the compiler, it chooses to use up to 3 simple instructions from mov, lea, add, sub, sar and sal. If more than 3 of these would be required, the compiler uses imul instead. Researching the speed at which a modern CPU processes these instructions shows that imul does in fact cost 3-4 times as much as the simple instructions. It is interesting to note that for an old CPU like the Pentium this ratio was 11, and modern CPU hardware has reduced it to only 3-4. We can also see that the cost of the simple instructions, in latency or clock cycles, has stayed at 1 cycle, while it is imul that has become cheaper to compute. This suggests the simple instructions are already at the floor of 1 cycle, the fastest anything can run, while innovations such as parallelism and out-of-order execution have made imul more efficient.