Dive into Assembly Optimisation for Multiplication
Posted on Sat 16 October 2021 in Post
This is a bit of research into how multiplying a variable by a constant integer is compiled into x86 assembly instructions. We will look at how the instructions the compiler emits change as the constant integer changes.
Setup and Scripts
We start with a mulk.c file which contains a function that takes a parameter, and returns the parameter multiplied by a constant. This constant can be chosen at compile time using a make file. We can then script a loop to run this make file for a range of values.
Here are the scripts:
mulk.c
// K is defined at compile time via -DK=<value> (see the Makefile below)
long mulk(long a) {
    return a * K;
}
Makefile
# Set up a default value for K.
# To use other values, run make with
# the preferred value of k:
#   make k=5
#   make k=7
#   etc.
k?=32
# Allow the user to type K=5 or k=5
K?=$(k)
all:
	@gcc -O1 -S -DK=$(K) mulk.c -o mulk$(k).s
	@cat mulk$(k).s
This make file simply compiles the code into assembly, then outputs the assembly code.
Notes on Assembly
For the following snippets of assembly, I will include only the instructions that perform the multiplication. Remember that in x86-64 assembly, the %rdi register holds the first parameter passed to the function, which will be a from the C code, and the %rax register holds the return value, which will be the result of the multiplication. Using this make file, I can type the command make k=1 to compile with the constant integer 1, and I can replace 1 with any other integer.
Multiplying by Small Integers
// For k=1
movq %rdi, %rax
Since multiplying by 1 returns the input value, this can be done simply by moving the input value into the return register.
// For k=2
leaq (%rdi,%rdi), %rax
What leaq does is evaluate a simple arithmetic expression containing up to 4 values, written in brackets in the form Displacement(Base, Index, Scale). This expression equates to D + B + (I * S). leaq then moves the result into the destination register (the second operand), which is %rax in this case. This contrasts with the movq command which, given the same bracketed operand, evaluates the expression, goes to that address in memory, and moves the value located there into the destination register. This kind of expression is normally used for array indexing, but it also works as shorthand for simple arithmetic while being fast in hardware.
Evaluating this leaq gives %rax = %rdi + %rdi, or %rdi * 2.
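To make the contrast concrete, here is a small sketch; it is not taken from the mulk output, and the registers and displacement are chosen purely for illustration:
// Both operands describe the expression 8 + %rdi + (%rsi * 4)
leaq 8(%rdi,%rsi,4), %rax   // %rax = 8 + %rdi + (%rsi * 4), no memory access
movq 8(%rdi,%rsi,4), %rax   // %rax = the 8-byte value stored at that address in memory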
// For k=3
leaq (%rdi,%rdi,2), %rax
Another leaq. This one evaluates to %rax = %rdi + (%rdi * 2). An interesting question here is how leaq compares with addq, because we have introduced a multiply by 2, and the point of this post is that multiplication is slow and can be optimised, so why is leaq multiplying? It is possible that leaq actually adds %rdi twice to compute (%rdi * 2), avoiding a real multiply. You may think that this would be the same as using addq two times, but it is actually different. Because leaq was designed for calculating addresses, modern CPUs execute it on simple, fast execution units, typically in a single cycle. The details are out of scope for this blog post, but the point is that leaq is very fast in hardware.
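To see why the single leaq wins, here is what a hypothetical addq-only version of k=3 might look like; this is not what GCC emits, it is just for comparison:
// Hypothetical addq-only version of k=3: three instructions instead of one
movq %rdi, %rax   // %rax = %rdi
addq %rdi, %rax   // %rax = %rdi * 2
addq %rdi, %rax   // %rax = %rdi * 3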
// For k=4
leaq 0(,%rdi,4), %rax
Another leaq, which evaluates to 0 + (%rdi * 4); the displacement is 0 and there is no base register, so only the scaled index contributes.
// For k=5
leaq (%rdi,%rdi,4), %rax
k=5 also uses leaq, multiplying %rdi by 4 via the scale and then adding %rdi once more to reach * 5.
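Following the pattern so far: in the base-plus-scaled-index form, the scale can only be 1, 2, 4 or 8, so the constants reachable with a single leaq of this shape are 2, 3, 5 and 9. For example, we would expect k=9 to compile to something like the following (a prediction from the pattern, not output shown above):
// Expected output for k=9, following the same pattern
leaq (%rdi,%rdi,8), %rax   // %rax = %rdi + (%rdi * 8) = %rdi * 9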
// For k=6
leaq (%rdi,%rdi,2), %rax
addq %rax, %rax
This is the first value of k to use more than one command. We see the leaq used for multiplying by 3, and then the result is added to itself to double it, making it 3 * 2 = 6. If we ask why the compiler did this instead of something like leaq 0(,%rdi,6), the reason is that the scale in this kind of operand can only be 1, 2, 4 or 8, so leaq cannot multiply by 6 directly. The compiler therefore builds the 6 out of the scales it does have: one leaq to multiply by 3, then one addq to double the result.
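There is more than one way to build the multiplication by 6 out of two of these cheap instructions. A couple of equivalent decompositions are sketched below; only the first is what GCC emitted above, the second is a hypothetical alternative with the same instruction count:
// (a * 3) * 2 -- the sequence GCC chose
leaq (%rdi,%rdi,2), %rax   // %rax = %rdi * 3
addq %rax, %rax            // %rax = %rdi * 6
// (a * 2) * 3 -- an equivalent alternative
leaq (%rdi,%rdi), %rax     // %rax = %rdi * 2
leaq (%rax,%rax,2), %rax   // %rax = %rax * 3 = %rdi * 6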
Hypothesis
It seems that, first, the compiler favours leaq very highly. This is backed up by modern CPU hardware executing leaq in a single cycle, which makes its cost essentially the same as a plain add. Second, leaq can only scale by 1, 2, 4 or 8, so larger constants have to be built out of combinations of these cheap instructions, leading to the process seen in k=6 where leaq computes the multiplication by 3 and adding the result to itself doubles it, completing the multiplication by 6.
We are doing multiplication here, yet although there is a specific single instruction in assembly for multiplication, known as imul (integer multiply), we haven't seen it. With a basic understanding of binary arithmetic, it makes sense that adding two binary numbers together is fast. Simple half adders and full adders can be created using a few logic gates per bit: the sum bit of an addition is just the XOR of the input bits, with an AND providing the carry. Multiplication of two registers is harder because there is no comparably simple way to do it.
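To illustrate the point about adders, here is a small sketch that is not part of the mulk experiment: it adds two registers using only XOR, AND and a shift, mirroring what adder circuitry does. The XOR produces the sum bits, the AND produces the carry bits, and the carries are shifted left and fed back in until none remain:
// Sketch: add %rdi and %rsi using only xor/and/shift, result in %rax
bitwise_add:
    movq %rdi, %rdx         // copy a
    andq %rsi, %rdx         // %rdx = carry bits = a & b
    xorq %rsi, %rdi         // a = a ^ b, the sum without carries
    leaq (%rdx,%rdx), %rsi  // b = carry << 1
    testq %rsi, %rsi        // any carries left to add in?
    jne bitwise_add
    movq %rdi, %rax         // final sum
    ret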
My hypothesis as to why the compiler is not using imul is that it costs more time or clock cycles in the CPU hardware, and that using addition, shifts and lea is faster. We can therefore assign a cost value to each command, where imul has a high cost and addq, leaq, sarq and salq have low costs. If we count the number of simple commands used and look at the values of k at which the compiler starts to use imul instead, we can estimate the ratio of the costs of these commands.
Note: salq and sarq (shift arithmetic left/right) are used to bit-shift registers. Shifting left multiplies by a power of 2, similarly to how adding a zero to the end of a decimal number multiplies it by 10.
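For example, the multiplication by 4 seen earlier could also have been written with a shift, though it would take two instructions instead of the single leaq the compiler chose. A hypothetical shift-based version, not the actual output, would be:
// Hypothetical shift-based version of k=4
movq %rdi, %rax
salq $2, %rax   // %rax = %rdi * 4, since shifting left by 2 multiplies by 2^2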
Testing with Larger Integers
I created a bash script that compiles with values of k from 1 to 80. It prints out the number of times each command appears in the assembly code for each value of k.
#!/bin/bash
for number in {1..80}
do
    echo "===================$number"
    echo "MUL ="
    make k=$number | grep -c imul
    echo "lea ="
    make k=$number | grep -c lea
    echo "add ="
    make k=$number | grep -c add
    echo "sub ="
    make k=$number | grep -c sub
    echo "sal ="
    make k=$number | grep -c sal
done
This reveals that the first time imul is used is for k=46.
// For k=46
imulq $46, %rdi, %rax   // %rax = %rdi * 46
Using the results for all values between 1 and 80, I looked at the maximum number of simple commands used for any single value of k, and found the maximum was 3 commands. So I investigated the code for values that are one command away from 46, which would be 46/2 = 23, since addq can be used to quickly double a register.
// For k=23
leaq (%rdi,%rdi,2), %rax   // %rax = %rdi * 3
salq $3, %rax              // %rax = %rdi * 24
subq %rdi, %rax            // %rax = %rdi * 23
The assembly for k=23 uses 3 commands, meaning that reaching k=46 by appending an addq %rax, %rax would use a total of 4 commands.
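Putting that together, a purely simple-instruction sequence for k=46 would look something like the following; this is a hypothetical sequence that the compiler avoids, since it chooses imulq instead:
// Hypothetical 4-instruction sequence for k=46
leaq (%rdi,%rdi,2), %rax   // %rax = %rdi * 3
salq $3, %rax              // %rax = %rdi * 24
subq %rdi, %rax            // %rax = %rdi * 23
addq %rax, %rax            // %rax = %rdi * 46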
Based on this finding, I believe the compiler is trying to use up to 3 of these simple commands to optimise the multiplication, and switches to imul when the minimum number of simple commands needed is greater than 3.
Explaining this Behaviour
So, based on our assumption that the compiler assigns costs to these commands, if addq, leaq, sarq and salq each have a cost of 1, then imul would have a cost of between 3 and 4, as this is the point where the compiler chooses imul instead of 4 simple commands. I started to research the hardware costs of these commands.
In simple, older CPUs, this cost is the number of clock cycles it takes to execute each command. In modern CPUs, counting clock cycles directly probably won't work, as modern CPUs have parallel execution units and other optimisations that allow out-of-order processing.
I found a document assigning costs to assembly commands, written by Agner Fog from the Technical University of Denmark, available at this link: https://www.agner.org/optimize/instruction_tables.pdf
For an old CPU like the Intel Pentium, this document shows the clock cycles for each command.

This shows that mov takes 1 clock cycle. It also specifies that add, sub and lea take 1 cycle. It does not include sal or sar. For imul it states 11 clock cycles, making imul cost 11 times as much as the other simple instructions.

This supports the idea that imul costs more than the simple instructions, resulting in the compiler using multiple simple instructions before using imul. However, the size of this extra cost is much more than we predicted, being 11 times rather than 3-4 times greater. To resolve this, we need to look at the tables for modern CPUs, which are conveniently provided in the same document.
For a modern CPU, I looked at Intel's 10th generation Ice Lake CPUs, released in 2019. Because of complex CPU optimisations that support out-of-order computation, the instructions are not measured in raw clock cycles but in latency. Latency here is still expressed in core clock cycles, making it effectively the same kind of measurement while accounting for speed improvements due to parallel processing and other optimisations: it measures how long the CPU must wait for the result of a calculation before it can use it, the same idea as the clock-cycle counts for the Pentium.

[Image: mov costs]
[Image: lea costs]
This shows that mov, lea, add, sub, sar and sal all cost around 1 in terms of latency. Compare this with the forms of imul.

[Image: imul costs]
All the forms of imul cost 3-4 in terms of latency. This supports our estimate of the compiler valuing imul as 3-4 times slower than the other simple commands.
Conclusion and Findings
Looking at the assembly code created by the compiler, it chooses to use up to 3 simple commands from mov, lea, add, sub, sar and sal. If more than 3 of these commands would be required, the compiler will instead use imul. Researching the speed at which a modern CPU processes these commands shows that imul does in fact cost 3-4 times as much as the simple commands. It is interesting to note that for an old CPU like the Pentium, this cost was 11 times as much, and modern CPU hardware has reduced it to only 3-4 times. We can also explain this by noting that the cost of the simple commands, in terms of time (latency) or clock cycles, has stayed at 1 cycle, while it is imul that has become cheaper to compute. This points to the simple commands already sitting at the 1-cycle floor, the fastest anything can run, while innovations such as parallelism and out-of-order calculation have made imul more efficient.