**Understanding Windows x64 Assembly**
Welcome.
For those of you who have decided to take this step closer to the
silicon that powers most of the modern world, I hope this document will help you
just that little bit more when things seem confusing or obtuse. If this document
_itself_ proves to be either of those things, please do contact me and I will
try to remedy such issues.
# Code Repository #
All the source code for this tutorial is available [here](https://github.com/sonictk/asm_tutorial).
# Introduction #
This is _not_ a tutorial for absolute beginners on how to get started with
programming assembly, or an "assembly for dummies" guide. Rather, it is a set of
notes and observations I have made while on my own journey into the Microsoft
x64 calling conventions that I hope will be useful to others who attempt the
same path, especially since there seems to be a dearth of useful information out
there regarding some of the stumbling blocks I've come across.
If you _are_ looking for a tutorial that will walk you through the basics of
assembly step-by-step, I highly recommend taking a look at the Additional Resources
section.
## Why should I care about assembly? ##
If you've never touched x64 assembly before, it's unlikely that I'll win you
over in a few paragraphs, but let me give my hot take:
* **Dropping to assembly has helped me figure out bugs in release code under
real production scenarios**. (A particularly nasty one ended up being triggered
by a driver bug! Thanks, "gaming" drivers.) I've also triaged issues in
production code from third-party vendors that were sent back to them to
resolve. External code aside, it's also proven useful to debug issues that
occurred only in non-debug builds of Maya plugins, for example. And this is
only with the limited knowledge that I have available! Imagine what you could do
if you were [Raymond Chen, for example](https://blogs.msdn.microsoft.com/oldnewthing/20180706-00/?p=99185).
* **Possible performance optimizations, or understanding how to modify your code
to take better advantage of compiler optimizations.** Some programmers resort to
switching algorithms or throwing more resources at the problem (threading,
GPGPU, etc.) in order to achieve better performance; most of the time, there's
actually a ton of compute being left on the table that could be better utilized
instead. There's also the fact that if you don't know what to look for in your
assembly, you won't know whether the compiler is actually doing what you wanted.
* **Ultimately, whatever code you write in whatever language you choose gets
executed on actual, real hardware**, through abstractions that we take for
granted. Without understanding assembly, which is a lower-level abstraction
closer to the hardware/metal, you'll never be able to understand the reasoning
behind some of your decisions at higher levels of abstraction. Once you
have a clear sense of what actually happens on the hardware, it becomes very
obvious what programming decisions are going to actually help improve
performance. You'll be able to understand which optimizations are actually
premature, and which should be done "by default".
* **A better understanding of unexpected behaviour.** Don't understand why your
code changes are triggering a performance hit or a crash? Profilers can
only tell you so much; if you understand how assembly works and how the
compiler generates it, you'll be better equipped to handle odd degenerate
cases, hardware quirks, or even compiler bugs. The only thing left between you
and solving the most perplexing of issues will be the hardware itself.
## Requirements ##
This tutorial will focus on writing x64 assembly in Windows. Linux and OSX,
which use the [System V ABI](https://wiki.osdev.org/System_V_ABI) calling
conventions, will not be covered here, as I feel that there is already a
plethora of information out there covering writing assembly on those
platforms. However, in the event that I find it useful to cover information
regarding those platforms, it will appear in the following format:
!!! tip Crossing the platforms
Platform-specific information goes here.
### What you will need ###
- A PC running Windows 10.
- A **C/C++ development environment** set up and ready to go. (If you want to see
what my Emacs setup looks like, it's
available [here](https://github.com/sonictk/lightweight-emacs).)
- You will need a relatively modern version of **Visual Studio**. I
will be using 2017 for this tutorial. This may also serve as your C/C++
development environment, if you so prefer.
- The [Netwide Assembler](https://www.nasm.us/) compiler, for compiling our
assembly code itself. This tutorial was written using `2.13.03`, but any
modern version should work.
!!! note Why not MASM?
On Windows, the officially supported assembly compiler is known as the
[Microsoft Assembler](https://docs.microsoft.com/en-us/cpp/assembler/masm/masm-for-x64-ml64-exe?view=vs-2017).
Unfortunately, its feature set is rather limited, especially in the area of
macros. It's also not really maintained by Microsoft as much these days, and
so I am using NASM instead, since it has a more feature-rich macro system,
which I make liberal use of. You are free to practise with MASM instead, but
I think apart from some basic syntax differences, the concepts are pretty
much identical and shouldn't be too difficult to port over to MASM.
To learn how to use MASM, please refer to the [MSDN documentation](https://docs.microsoft.com/en-us/cpp/assembler/masm/microsoft-macro-assembler-reference?view=vs-2017)
for it.
!!! warning Reliability issues
I've noticed that NASM's website seems to be rather poor in terms of
reliability. If you find that the website is down at the time of reading
this tutorial, please use your Google-fu and try to find mirrors to the
documentation or the assembler itself. If all else fails, version `2.14` of
the assembler is
available
[here](https://drive.google.com/open?id=1soqbXNtu5LZGUuYiaVEoKY4kMje331ZX)
and the manual is available [here](https://drive.google.com/open?id=1wEBz7A2pqpvPH0VG7MHhwBBHZrxF5lg4).
### What you should know ###
- **Basic knowledge of C/C++**. More importantly, you should be intimately
familiar with how to compile and run a program, understanding what is involved
in the compilation process and how it all comes together to generate that
EXE/DLL. You should also have basic competence with writing C code and how to
debug and inspect the execution of your programs.
- **Vague knowledge of assembly**. As previously mentioned, while this tutorial
is not aimed at experienced assembly programmers, it's not going to be an
assembly-from-scratch introduction either. Having a vague inkling of assembly
and how it relates to hardware, whether from MIPS, ARM or whatever other
architecture you might have previous experience with, would be incredibly
helpful in transitioning to the x64 ISA. If you have zero experience in the
topic, please do feel free to try to continue reading, and pay attention to the
refresher and resources sections of the tutorial.
In the next section, we'll go over setting up a basic toolchain for starting x64
assembly programming on Windows.
## Setting up ##
Now, we'll get started with compiling a simple "Hello, world" program, which
will verify the toolchain that you have installed on your computer works before
we go any further. I find it helps to always be able to actually run the code
that you're trying to understand.
### Netwide Assembler ###
NASM on Windows is distributed via [the website](https://www.nasm.us/index.php)
itself. Download the latest stable version (either the installer or the
standalone version is fine), and make sure that you can run `nasm -v` from the
command line. If all goes well, you should get something like the following:
```
nasm -v
NASM version 2.13.03 compiled on Feb 7 2018
```
### Hello, world ###
Once you have verified that the assembler is available on your `PATH`, you're
ready to go! To test if the assembler is working, create the following file and
name it `hello_world.asm`:
~~~~~~~~~~~~~~~~~~~~~~~~~~~
bits 64
default rel

segment .data
    msg db "Hello world!", 0xd, 0xa, 0

segment .text
global main
extern ExitProcess
extern printf

main:
    push rbp
    mov rbp, rsp
    sub rsp, 32         ; shadow space for the call below
    lea rcx, [msg]      ; 1st argument: pointer to the message
    call printf
    xor rcx, rcx        ; exit code 0 goes in rcx, the 1st argument register
    call ExitProcess
~~~~~~~~~~~~~~~~~~~~~~~~~~~
You might be asking, "what on earth am I typing here?" at this point. We'll go
over this later. For now, though, let's just make sure that your toolchain is
working.
Let's go ahead and try to assemble this text into an object file. Run the
following command in a command prompt.
```
nasm -f win64 -o hello_world.obj hello_world.asm
```
If it worked, you should see the `hello_world.obj` file appear in the same
directory as your `hello_world.asm` file. We can then use the linker to create
an executable out of this object file, just like when compiling C/C++
programs. Run the following command from a command prompt that has the Visual
Studio environment variables set (e.g. the Developer Command Prompt for VS2017):
```
link hello_world.obj /subsystem:console /entry:main /out:hello_world_basic.exe
```
You should now have a `hello_world_basic.exe` file in the directory. Run it, and
if you get the following output, congratulations, you've just written an
assembly program that runs on Windows!
```
hello_world_basic.exe
Hello world!
```
Now that we've verified that you're able to assemble code and create an
executable that runs it, let's take a step back from the keyboard, and go to the
whiteboard for a bit to get a brief refresher course on the basics of the
hardware we're dealing with here in this tutorial.
# CPUs: A hardware refresher #
Let's focus on the hardware that exists for the majority of the population when
they run programs on their computers: the Intel x86-64 CPU. For those of you who
already have a good understanding of assembly, you can skip this section. If you
only have an understanding of C/C++ and have never really understood the details
regarding what _actually_ lives on a CPU, you might want to read this section
before moving on.
!!! tip Put on your reading glasses
For those who want a more detailed explanation of the topics covered in this
section, I recommend reading
[What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf)
by Ulrich Drepper. It is a fantastic resource that goes into more details in
a far more comprehensive fashion than I could ever hope to achieve. It's a
long read, so make yourself comfy before getting started!
## Types of memory ##
This could be an entire article on its own, so we're just going to go over the
basics here. By now, you should be familiar with the fact that everything that
happens with computers happens by moving memory around; all we're doing is just
writing code at an abstraction level higher than that to help us manage that
movement of memory. As such, how fast the hardware can move memory from one
place to another is one of the most important considerations when we write code
that uses different types of memory in different ways.
The common types of memory available to us when programming x64 assembly are, in
order of speed of access (for the current generation of Intel Core i7 processors):
* **CPU registers** (mostly 64 bits per register, with exceptions for some
special registers that could go up to 512 bits wide)
* **CPU caches** (L1, L2 and L3, with L1 and L2 being ~64 KB and ~256 KB per core, and
L3 being shared across all cores with sizes ranging from 4 MB to 24 MB) An
illustration of how these caches are shared among the cores is shown below:
*************************************************************
* +---------------------+----------------------+---------+ *
* | | | | *
* | Core 1 | Core 2 | L | *
* | | | 3 | *
* +---------------------+----------------------+ | *
* | L1_1 | L1_2 | c | *
* +---------------------+----------------------+ a | *
* | L2_1 | L2_2 | c | *
* +---------------------+----------------------+ h | *
* | | | e | *
* | Core 3 | Core 4 | | *
* | | | | *
* +---------------------+----------------------+ | *
* | L1_3 | L1_4 | | *
* +---------------------+----------------------+ | *
* | L2_3 | L2_4 | | *
* +---------------------+----------------------+---------+ *
* *
*************************************************************
[Figure [diagram]: A representation of the L1/2/3 caches in relation to the CPU.]
As you can see, each physical core has its own L1 and L2 cache, with the L3
cache being shared among all the cores. Please note that the diagram is not
representative of how the CPU is actually laid out physically in the real world.
* **RAM** (The sticks that you put on your motherboard, generally measuring anywhere
from 2 GB all the way to 128 GB or more)
* **PCI-E NVMe** (If you have a solid state drive that uses this interface, which
can be anywhere from a few hundred gigabytes to 4 terabytes' worth)
* **Hard disk** (Through the SATA cables, regardless of whether your disk is a
solid-state or spinning platter drive, which have similar sizes to SSDs)
* **Network** (over a CAT5e cable or similar, which then must go through this entire
list of memory all over again from whichever PC it's retrieving the
information from)
It's no coincidence that speed, capacity and proximity go hand in hand here;
the faster the memory is, the smaller its capacity tends to be, and the closer
it sits to the CPU on the motherboard. Turns out that high-school physics comes
into play here as well, even in the real world!
Also worth noting is that the differences in speed of access between these
tiers are measured in orders of _magnitude_; even the difference between
accessing memory from a CPU register versus that of the L2 cache can be more
than 6 times [slower](https://gist.github.com/jboner/2841832)! While CPU
manufacturers devise new and ingenious solutions to help memory designs get
faster and faster (and also more [complex](https://meltdownattack.com/) in the
process), we must make sure that as programmers, we keep this in mind when we
write our code. **Memory access is not free, and abstractions have a cost, no
matter how insignificant they may seem.**
For now, we're going to talk a little bit more about the major type of memory
that we'll be dealing with when writing assembly, which also happens to be the
fastest type of memory that we have access to: the CPU registers.
## Registers ##
Central Processing Units (CPUs) are much more complicated these days than they
were in the early days of computing. Nevertheless, some things have persisted
through the decades, and one of those things is how CPU registers work.
What are they? Simply put, they're storage locations on the CPU die that are
incredibly fast. In fact, they're the fastest type of memory that you have
access to on all of your available hardware. The x64 registers evolved from the x86
registers, which in turn evolved from the 8086 architecture, which in turn
evolved from the 8080 microprocessor...anyway, you get the idea. CPU registers
on the x86-64 instruction set architecture have a _very long_ history, which
influences their naming conventions, their purposes (which have become more
general rather than their original specialized ones), and their sizes. This has
resulted in an (unfortunately) slightly complex way of referring to the
registers, in the name of preserving backwards compatibility across architecture
changes over the years.
Let's talk about the various kinds of registers available to us.
### General-purpose registers ###
There are 16 GPRs on the x64 instruction set; they are referred to as `rax`,
`rbx`, `rcx`, `rdx`, `rbp`, `rsi`, `rdi`, `rsp`, `r8`, `r9`, `r10`, `r11`, `r12`,
`r13`, `r14` and `r15`. The prefix "general-purpose" is a little misleading;
while they are technically general-purpose in the sense that the CPU itself doesn't
govern how they should be used, some of these registers have specific purposes
and need to be treated in specific ways according to what is known as the OS _calling
conventions_, which we will cover in a little bit. For now, let's try to
visualize how these registers are addressed a little better.
!!! tip Crossing the platforms
There are two major calling conventions in practice for the x64 instruction
set; the
[Microsoft x64 calling convention](https://docs.microsoft.com/en-us/cpp/build/x64-software-conventions?view=vs-2017#overview-of-x64-calling-conventions)
used in Windows, and the
[System V ABI](https://www.uclibc.org/docs/psABI-x86_64.pdf)
used in Linux and OSX. As the name of this tutorial implies, we will only
focus on the Microsoft one here, though if you work on Linux or OSX as a
target platform for your software, I recommend getting familiar with that
convention as well.
**********************************************************
* 64 bits total *
* | *
* .--------------------+---------------------. *
* | | *
* v v *
* +-----+-----+----------+---------------------+ *
* | 8 | 8 | 16 bits | 32 bits | *
* | bits| bits| | | *
* +-----+-----+----------+---------------------+ *
* | al | ah |//////////|/////////////////////| *
* +-----+-----+----------+---------------------+ *
* | ax |//////////|/////////////////////| *
* | |//////////|/////////////////////| *
* +-----------+----------+---------------------+ *
* | eax |/////////////////////| *
* | |/////////////////////| *
* +----------------------+---------------------+ *
* | rax | *
* | | *
* +--------------------------------------------+ *
* *
**********************************************************
[Figure [diagram]: Example of a single 64-bit GPR extended from the 8080 ISA.]
The diagram above shows how different parts of the register `rax` can be
accessed. Each of the GPRs on the x64 instruction set is 64 bits wide. To
access different parts of the register, the following names `al`, `ah`, `ax` and
`eax` are used, where `a` can be substituted with any of the other registers
(i.e. to access the lower 8 bits of `rbx`, you would use `bl` and to access the
lower 32 bits of the `rcx` register, you would use `ecx`). We'll see why
this is important when we start writing actual assembly code, but since you'll
be seeing these names a lot, it's best to get familiar with how they relate to
different parts of the register.
If you have trouble remembering the suffixes/prefixes, the way I like to think
about it is that `l` stands for "low", `h` for "high", `e` for extended and `r`
for "register", as in the full register (all 64 bits) itself.
There is an exception to how these prefixes/suffixes work, which applies to the
`r8` through `r15` registers. To access the lower 32 bits of these registers,
you add the `d` suffix to the end (i.e. `r8d`), for the lower 16 bits, the `w`
suffix is used (i.e. `r8w`) and finally to access the lower 8 bits, the `b`
suffix is used (i.e. `r8b`). You can think of `d` meaning "double word" (where a
word on the x64 ISA is 16 bits), `w` meaning "word", and `b`
meaning byte (where a byte is 8 bits). To make sure we're crystal-clear here,
the diagram below illustrates this.
**********************************************************
* 64 bits total *
* | *
* .--------------------+---------------------. *
* | | *
* v v *
* +-----+-----+----------+---------------------+ *
* | 8 | 8 | 16 bits | 32 bits | *
* | bits| bits| | | *
* +-----+-----+----------+---------------------+ *
* | r8b |/////|//////////|/////////////////////| *
* +-----+-----+----------+---------------------+ *
* | r8w |//////////|/////////////////////| *
* | |//////////|/////////////////////| *
* +-----------+----------+---------------------+ *
* | r8d |/////////////////////| *
* | |/////////////////////| *
* +----------------------+---------------------+ *
* | r8 | *
* | | *
* +--------------------------------------------+ *
* *
**********************************************************
[Figure [diagram]: Example of a 64-bit GPR introduced in the x64 ISA.]
We see a difference here compared to the `rax` register; there is no way to
address the higher 8 bits of the `r8` register. This is because the registers
`r8` through `r15` were introduced in the x64 ISA and were not part of the
original x86 ISA. As such, there was no need to worry about backwards
compatibility with these new registers being introduced.
### Floating point registers ###
Without getting into too much detail about how floating-point works in terms of
memory storage right now, suffice it to say that the GPRs are (usually) only used
for storing integers and memory addresses, while the floating point registers
are used for storing actual floating point numbers.
In the x64 instruction set, there are 16 floating point registers, known as the
`xmm` registers. These are 128 bits wide, and unlike the GPRs, their lower bits
cannot be addressed separately. They number `xmm0` through `xmm15`, and as you
might expect, are generally used for [SIMD](https://en.wikipedia.org/wiki/SIMD)
operations as well.
!!! note A long, long time ago...
You may have noticed the existence of several oddly named registers named
`st0` through `st7` when you open the registers window in Visual Studio while
debugging a program. These registers are 80 bits wide, and exist as an
artifact of history from the x87 coprocessor, which was also known as the
Floating Point Unit (FPU). These days, the registers still exist for legacy
reasons, but they are generally unused by modern calling conventions. The
[CDECL calling convention](https://docs.microsoft.com/en-us/cpp/cpp/floating-point-coprocessor-and-calling-conventions?view=vs-2017)
treated them the same way that the XMM registers are treated by modern OS
calling conventions: to be used for working with floating-point values.
It's worth noting that since the ratification of the
[IEEE 754](https://en.wikipedia.org/wiki/Single-precision_floating-point_format)
floating point standards, floating point numbers either use 32 bits (1 sign bit,
8 exponent bits and 23 bits for the mantissa) or 64 bits (1 sign bit, 11
exponent bits and 52 bits for the mantissa) each, so the `xmm` registers are
well-suited to operations involving more than one floating-point value at once,
since you can fit 4 floats or 2 doubles per XMM register.
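As a small sketch of what this looks like in practice, assuming hypothetical
labels `a` and `b` that each point to 16-byte-aligned arrays of 4 floats:
~~~~~~
movaps xmm0, [a]    ; load 4 packed single-precision floats from a (aligned)
movaps xmm1, [b]    ; load 4 packed floats from b
addps  xmm0, xmm1   ; packed add: all 4 float lanes are added at once
movss  xmm2, [a]    ; by contrast, movss/addss load and add only the
addss  xmm2, xmm1   ; low 32-bit lane (a single float)
~~~~~~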
!!! tip How floating point works in computing
If you're not familiar with how floating point works, I _strongly recommend_
getting a good grasp on the subject. There are
[plenty of resources](https://en.wikipedia.org/wiki/Single-precision_floating-point_format)
available regarding this topic that cover it in greater detail than I will be
expanding on here, and there are even
[visualizers](https://www.h-schmidt.net/FloatConverter/IEEE754.html)
available that can show you how a floating point number is represented in
binary format, which is also how they will get stored in the floating-point
registers.
We generally will not need to convert between binary/hexadecimal/floating point
representations by ourselves, since debuggers have facilities to help us do the
math, but it's good to know how they're stored so that you can recognize
patterns and regions of memory that might have data you're interested in.
### Extensions: SSE/AVX/AVX2/AVX512 ###
As previously mentioned, the XMM registers are 128 bits wide. They were
introduced as an _extension_ to the x86 instruction set by Intel in 1999, known
as the **Streaming SIMD Extensions (SSE)**. However, just like the GPRs have
been widened over the years, so have these registers. The **Advanced Vector
Extensions (AVX)** widened them to 256 bits, and AVX-512 bumped this up to a
whopping 512 bits.
In terms of addressing them, they follow a similar pattern to the GPRs, where
`xmm` addresses the lower 128 bits, `ymm` the lower 256 bits, and `zmm` the
entire width of the register (if you're lucky enough to have a CPU that
expensive), which totals 512 bits.
***********************************************************************
* *
* 512 bits *
* | *
* .-----------------------------+-------------------------. *
* | | *
* v v *
* +--------------+------------------------------------------+ *
* | xmm |//////////////////////////////////////////| *
* | 128 bits |//////////////////////////////////////////| *
* +--------------+------------+-----------------------------+ *
* | ymm |/////////////////////////////| *
* | 256 bits |/////////////////////////////| *
* +---------------------------+-----------------------------+ *
* | zmm | *
* | 512 bits | *
* +---------------------------------------------------------+ *
* *
***********************************************************************
[Figure [diagram]: Illustration of a SIMD register in the x64 ISA.]
!!! note Spot the difference?
You might have noticed that Intel has a distinction between the `AVX` and
`AVX2` terms when referring to their CPU feature sets. For all intents and
purposes, both instruction set extensions use the same registers in the same
fashion; `AVX2` adds new instructions designed for use with the `ymm`
registers that allow for more efficient math operations that weren't
available in `AVX`. We will go over some of these operations later when we
write some SIMD code in assembly.
Most consumer-grade CPUs these days (as of 2018, anyway) will support
AVX2. AVX-512 itself is still limited to only higher-grade CPUs that are
generally seen only in high-end PC workstations and servers. We will not be
covering usage of these registers in this tutorial, but the concepts you learn
from working with the narrower `xmm` and `ymm` registers will carry over to the
`zmm` registers as well.
It's important to note here that the diagram above does not mean that a SIMD
register has separate lanes for `xmm`, `ymm` and `zmm` instructions; they all
form part of the _same register_; the names just allow you to access different
parts of it. It's also worth noting here that while `xmm`/`ymm` registers number
0 through 15 (for a total of 16 SIMD registers), AVX-512 expands on this and
offers a whopping 32 SIMD registers, all 512-bit wide, numbering from 0 through 31.
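A tiny sketch to drive the aliasing point home, assuming an AVX-capable CPU:
~~~~~~
addps  xmm0, xmm1          ; SSE form: operates on the lower 128 bits
vaddps ymm0, ymm0, ymm1    ; AVX form: operates on all 256 bits of the very
                           ; same physical register that xmm0 is the low half of
~~~~~~
(In real code, mixing the legacy SSE and VEX-encoded forms like this can incur
a performance penalty on some microarchitectures, but it illustrates the
aliasing.)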
### Special purpose registers ###
In addition to the registers that are available for use, there are also other
registers on the CPU that technically can be used, but really shouldn't under
almost all circumstances. These are registers that are reserved for special
uses, or part of the evolution of the ISA and left behind for compatibility
reasons. We'll talk about some of the more significant ones that you might run
into on a more frequent basis when dealing with assembly programming.
Firstly, there's the instruction pointer, or `rip`. This 64-bit register holds
the address of the next instruction to be executed. While you can do some neat
tricks by moving `rip` around to simulate jumping around the program, this is
almost always a bad idea, and you should leave this register alone under most
circumstances. It is, however, useful when you're looking at the disassembly of
an application and want to figure out where the program's execution currently
is.
If this is a little hard to understand, here's a diagram that should clear
things up.
*****************************************************
* *
* foo: *
* push rbp *
* mov rbp, rsp *
* sub rsp, 32 *
* *
* o mov rax, 0 <----- Execution halted here *
* *
* leave <--- Address of this instruction is now *
* ret in the `rip` register *
* *
*****************************************************
[Figure [diagram]: Explanation of how the `rip` register contents work.]
The next instruction to be executed would be the `leave` instruction, which
would mean that the address pointing to its location in the program's memory
would be in the register `rip` at this point.
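While you shouldn't write to `rip` directly, you _can_ use it implicitly for
addressing; this is what the `default rel` directive in the earlier hello world
example enables. A minimal sketch:
~~~~~~
lea rcx, [rel msg]    ; address of msg computed as an offset from rip
lea rcx, [msg]        ; identical to the above when `default rel` is in effect
~~~~~~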
Secondly, there's the stack pointer, or `rsp`. While this is technically
considered a general-purpose register, it is meant to point to the top of the
stack: the most recently pushed item, which sits at the _lowest_ address, since
the stack grows downwards. Messing with it is allowed, but if you break the
calling conventions (e.g. by leaving it aligned to a non-16-byte boundary when
making a call), you might trigger a crash. We'll go over the calling
conventions a little bit later.
Third, there's the base pointer, or the `rbp` register. This is conventionally
used to hold the value that `rsp` had on entry to the current function. It thus
marks where the current stack frame starts, and allows us to unwind the stack
when we leave the current scope (i.e. when returning from a function call, for
example).
Finally, the last special-purpose register you should know about is the flags
register, also referred to as the `FLAGS`/`EFLAGS`/`RFLAGS` (or `efl`/`rfl`)
register. Unlike the GPRs, this register stores individual bit flags that are
set after certain assembly instructions have executed.
The commonly used bit flags are shown in the diagram below:
**************************************************************************************
* *
* +-----+------+------+------+------+------+-----+-----+-----+-----+-----+-----+ *
* | CF | PF | AF | ZF | SF | TF | IF | DF | OF | IOPL| NT | N/A | *
* +-----+------+------+------+------+------+-----+-----+-----+-----+-----+-----+ *
* Legend: *
* CF = Carry flag PF = Parity flag AF = Adjust flag ZF = Zero flag *
* SF = Sign flag TF = Trap flag IF = Interrupt flag DF = Direction flag*
* OF = Overflow flag IOPL = I/O privilege level flag (legacy) *
* NT = Nested task flag (legacy) N/A = Reserved *
* *
**************************************************************************************
[Figure [diagram]: Illustration of the bit flags in the `FLAGS` register.]
While this is a register that you normally would not be setting values in
yourself, it is useful to refer to in order to infer the result of a previous
operation. Instructions like `jl` (Jump short if Less) rely on checking the
bits set in this register (typically by a preceding `test` or `cmp`
instruction) in order to know which branch they should take.
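Here's a minimal sketch of that interaction, with hypothetical labels:
~~~~~~
compare_and_branch:
    cmp rax, rbx        ; computes rax - rbx, discards the result, sets flags
    jl  .was_less       ; taken when SF != OF, i.e. rax < rbx (signed compare)
    ; ... the rax >= rbx path runs here ...
    jmp .done
.was_less:
    ; ... the rax < rbx path runs here ...
.done:
    ret
~~~~~~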
It's worth noting that there are only 16 bits laid out here, with one of them
already unused (the last bit, which is reserved for future standards
extensions). While the register itself is technically a 64-bit register (`rfl`),
only the lower 16 bits are used in this case. The rest of the bits are reserved,
and should not be tampered with or relied upon.
We will mostly be referring to the sign, zero, carry, direction, parity and overflow
flags when working in this tutorial, so I would advise you to try and remember
what their semantic meanings are.
* DF: Determines the direction for string operations (i.e. comparing or moving
string data). `0` means addresses are processed from low to high
(left-to-right), and vice versa for `1`.
* SF: Shows the sign of the result of an arithmetic operation, copied from the
high-order (left-most) bit of the result. A negative result means this bit
will be set to `1`, and vice versa.
* ZF: A result of zero gives `1`, and a non-zero result gives `0`.
* OF: Used to indicate when an arithmetic operation has resulted in overflow,
which means that the result did not fit in the number of bits used for the
operation by the Arithmetic Logic Unit (ALU). This can happen when you add a
large enough integer to another, producing a value greater than can be
represented in a 32-bit operation (i.e. integer overflow). `1` means an
overflow occurred, and vice versa.
* PF: Indicates the parity of the number of bits set in the least-significant
byte of the result. For example, if the result of an operation was `01100`,
this flag would be set to `1`, indicating an even number of set bits (2), and
vice versa.
# Windows: the window to the hardware #
Now that we understand the basics of the physical hardware that we're going to
be working with, it's about time to take a look at the interface that we're
going to be using in order to access it, that is, the operating system itself.
Let's start by taking a more thorough look at how the `rsp` register is used
when working with the stack.
## The stack and heap, revisited ##
While you should be familiar with the concept of the heap and stack by now if
you've been programming in C/C++, let's take this opportunity to do a quick
refresher. The **stack** and **heap** are just different memory segments of a
program; while there are OS-provided functions (`HeapAlloc`, `VirtualAlloc`, etc.)
that are designed to work specifically with the heap segment of a program's
memory address space, both are essentially just regions of memory in the layout
of a program's address space.
********************************************************************************
* address space of (2^64) - 1 *
* +--------------------------+ <--higher memory addresses *
* | | *
* | stack | *
* | | *
* | | *
* +- - - - - - +- - - - - - -+ *
* | grows | subtract | *
* | downwards | rsp | *
* | v | *
* | | *
* | | *
* | ^ | *
* | call | grows | *
* | malloc() | upwards | *
* | | | *
* +- - - - - - + - - - - - - + *
* | | *
* | heap | *
* | | *
* +--------------------------+ *
* | (uninitialized data) | *
* | bss | *
* | | *
* +--------------------------+ *
* | (initialized data) | *
* | data | *
* | | *
* +--------------------------+ *
* | | *
* | (code) | *
* | text | *
* | | *
* | | *
* +--------------------------+ <--lower memory addresses *
* *
********************************************************************************
[Figure [diagram]: Illustration of a program executable's memory layout.]
From the illustration shown, we can see that in order to grow the stack, we
actually _subtract_ from the stack pointer; thus, as new objects are pushed onto
the stack, their memory address _decreases_. This may seem counter-intuitive,
but this is how both Windows and Linux/OS X have defined their memory layouts,
so it's important to keep this in mind. In contrast, the heap grows by expanding
_upwards_ towards the stack's memory addresses, which means that new objects
allocated on the heap will have increasingly higher memory addresses.
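In assembly terms, a minimal sketch of growing and shrinking the stack looks
like this:
~~~~~~
push rax          ; equivalent to: sub rsp, 8 followed by mov [rsp], rax
sub rsp, 32       ; reserve 32 bytes of scratch space; rsp moves *down*
mov [rsp], rcx    ; the freshly reserved space sits *below* the old rsp
add rsp, 32       ; release the scratch space again
pop rax           ; equivalent to: mov rax, [rsp] followed by add rsp, 8
~~~~~~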
Generally, you won't allocate objects on the heap by yourself directly; you'll
be using the OS kernel functions to do so. The main takeaway here is that while
there is historically a distinction between the two regions of memory, the fact
is that they are just that; two distinct regions of memory with specific
semantics that apply to either of them. Neither is inherently faster than the
other since they occupy the same virtual address space and map to the same
physical hardware. The only difference is up to us and how we choose to make
use of these separate regions of memory.
!!! note Direction of heap growth
[John Calsbeek](https://twitter.com/jcalsbeek) points out that this diagram
is slightly misleading; the heap isn't guaranteed to _always_ grow upwards as
a contiguous pool of memory. The implementation of how the OS heap functions
work is technically not mandated by the x64 ABI, and thus heap addresses
should not be given the same treatment of predictability as you would with
stack addresses.
## Other program segments ##
You may have noticed in the diagram above that in addition to the stack and
heap, there are separate, distinct regions of memory in the program's address
space that I drew, namely the `bss`, `data` and `text` **segments**. If you recall
the Hello, world section example, you might also have realized that there are
**labels** in that example that line up with the names of these program segments.
Well, that's no coincidence; it turns out that _is_ what is being defined in
those regions, and these regions exist as a result of the [PE executable file
format](https://docs.microsoft.com/en-us/windows/desktop/debug/pe-format),
which is a standardized format that Microsoft dictates for how an executable's
memory segments should be arranged in order for the operating system's
**loader** to be able to load it into memory.
!!! tip Crossing the platforms
On Linux/OS X, the executable file format is known as the [Executable and
Linkable Format, or ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format).
While there _are_ differences between it and the PE file format used by
Windows, the segments we'll be dealing with here exist on both, and function
the same way on both. [On a fun note, Fabian Giesen has an amusing tweet
regarding the name](https://twitter.com/rygorous/status/1019339546738081793?s=20).
Let's go over what each of the major segments is for; a short NASM sketch
follows the list.
* **Block Started by Symbol (BSS)**: This segment of memory is meant for
variables that are _uninitialized_ (e.g. `int a;`). The name, as is the case
with many things in assembly, has
a [very long history](https://en.wikipedia.org/wiki/.bss#Origin).
* **Data segment**: This segment of memory contains _constants_ or _initialized
data_ (e.g. `int a = 5;` or `const int a = 5;`). This is fairly straightforward.
* **Text segment**, also known as the **code segment**: This contains the actual
assembly instructions, hence the name.
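Here's a minimal NASM sketch of how these segments map to the directives we
used in the hello world example (the label names are hypothetical):
~~~~~~
segment .data            ; initialized data
    answer dd 42         ; roughly `int answer = 42;` in C
segment .bss             ; uninitialized data: only sizes are recorded here,
    scratch resb 4096    ; so these 4096 bytes add nothing to the size of
                         ; the executable on disk
segment .text            ; the code itself
global main
main:
    ret
~~~~~~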
!!! note Holy Segmentation Fragmentation Batman, what on earth is all that?
If you _did_ click on the MSDN link above, you might have been taken aback at
all the different types of possible segments that can exist in a x64
program's memory address space. For this tutorial, we will only be focusing
on the ones just described, but if you're going to be debugging code in
production that wasn't originally written in assembly (which is likely to be
the case), you might want to keep that reference page handy.
Now that we have a better understanding of how memory is mapped by the operating
system into the address space of a process, let's take a closer look at how that
mapping process works.
## The Virtual Memory Address System ##
If you've never heard of the term **virtual memory** before, this section is for
you. Basically, every time you've stepped into a debugger and seen a hexadecimal
address for your pointers, your stack variables, or even your program itself,
this _has all been a lie_; the addresses you see there are **not** the _actual_
addresses that the memory resides at.
What actually happens here is that every time a user-mode application asks the
operating system for an allocation of memory, whether from calling `HeapAlloc`,
reserving it on the stack through `int a[5]` or even doing it in assembly, the
operating system hands back a **virtual address** rather than a **physical
address** (a real, addressable location on the hardware), which you may then
use to perform your operations as you deem fit. Processes are mapped into a
**virtual address space**, and memory is reserved by the operating system for
that space, but _only_ in virtual memory; the physical hardware could still
have no memory mapped to that address space until the user-mode process
actually touches it.
If the memory is already reserved and backed by physical memory, the CPU
**translates** the virtual address into the corresponding physical address and
the access proceeds. If the backing physical memory is not available yet
(e.g. because the page has been swapped out to the page file on disk), the OS
will **page** the memory back in from the backing store before the access can
complete.
The translation of the virtual memory addresses that the operating system
distributes to applications into the physical memory addresses of the hardware
is done using a special piece of hardware known as the **Memory Management
Unit, or MMU**. As you might expect, if every single memory access required a
corresponding translation operation, this might prove costly. Hence, there is
another specialized piece of hardware known as the **Translation Look-aside
Buffer, or TLB cache**, that also sits on the CPU and caches the results of
previous address translations.
Both pieces of hardware are present on all modern CPUs. The exact way in which
both of them work to do the actual translation process is
implementation-dependent and can vary across different CPUs, so we won't worry
about them in this tutorial.
In addition to the TLB cache, there is also what's known as a **page table**
that the VMAS implements. This table is where the operating system itself
stores the mappings of virtual to physical addresses, with each individual
association referred to as a **page table entry (PTE)**. This essentially acts
as a larger version of the TLB cache, but due to its implementation it is
slower; the TLB cache is what's known as a _fully associative_ cache, as
opposed to the page table, which is usually implemented as a hierarchical
structure that has to be _walked_ in order to find an entry. If the TLB cache
and the PTEs are both _missed_, then a **page fault** occurs, and new physical
memory addresses are mapped into the PTEs and returned to the CPU.
**********************************************************************
* *
* +--------------------------------------------------------------+ *
* | | *
* | CPU (entire die) | *
* | +-----------+ | *
* | | MMU | | *
* | | (kicks in | | *
* | +--------+ | if TLB | | *
* | | | virtual address | cache | | *
* | | Core +------------------->| misses) | | *
* | | | | | | *
* | +--------+ | +------+ | | *
* | | | TLB | | | *
* | | +------+ | | *
* | +-----+-----+ | *
* | physical address | | *
* | +---------------------------------v-------------------+ | *
* | | CPU caches (L1/L2/L3) | | *
* | | | | *
* | +--------------------------+--------------------------+ | *
* +-------------------------------|------------------------------+ *
* physical address | *
* | *
* +-------------------------------v------------------------------+ *
* | Page Table Entries (PTE) ---. | *
* +------------------------------------------------|-------------+ *
* | | | *
* | <---' | *
* | Main memory | *
* | | *
* | | *
* +--------------------------------------------------------------+ *
* *
* *
**********************************************************************
[Figure [diagram]: How a virtual address is translated to a physical address.]
There are two important things to note here. The first is that **the TLB cache
is in practice a very small cache, and thus both temporal and spatial locality
are very important when we do programming** in order to avoid TLB **cache
misses** (i.e. we have to rely on the MMU to do a translation since the previous
translation is no longer in-cache). The second is that **while TLB cache misses
are costly, page faults are even more so**, and so we should write our code to
reduce the possibility of either scenario from taking place if we want our
programs to run quickly. So even though we are working at the level closest to
the bare metal possible, there are _still_ layers of abstraction between us and
the actual, physical hardware, both by the operating system and the CPU.
!!! note Trick or travesty?
If you're wondering what the purpose of all this CPU machinery is, there
_was_ a time when there was no such thing as a VMAS, and programs were
allowed to use actual physical memory addresses on the hardware. This was a
much simpler, but also much more dangerous time, since an application could
easily bring down the machine by utilizing reserved memory addresses (perhaps
for hardware interrupts, I/O, etc.) in a manner that was unsupported.
For example, the [Game Boy CPU](http://marc.rawer.de/Gameboy/Docs/GBCPUman.pdf)
(similar to an Intel 8080) did not have the concept of virtual memory.
The [Intel 286](https://en.wikipedia.org/wiki/Intel_80286#Protected_mode) was
the first processor to introduce the concept of a **protected mode**, which
introduced the virtual address system that has become ubiquitous among all
modern CPUs. This allowed for safer multi-tasking between concurrently
running applications (since the OS could intercept calls to invalid memory
addresses when translating the virtual addresses to physical ones and trigger
a [segmentation fault](https://en.wikipedia.org/wiki/Segmentation_fault) instead)
and concepts such as [privilege levels](https://en.wikipedia.org/wiki/Protected_mode#Privilege_levels)
to become possible.
So, while there's no doubt that this concept has certainly increased the
complexity of how we approach programming these days, there's clear benefits
to the paradigms introduced here that were deemed significant enough to
outweigh the drawbacks of added complexity and a slight performance hit.
!!! tip A black hole of knowledge...
This topic's rabbit hole goes _very_ deep, and I will not cover all the
implementation details regarding the TLB here.
However, I highly recommend perusing the Resources section, especially
the book by Ulrich Drepper to get more details regarding the TLB cache and
how it affects your code. Specifically, I would recommend looking at the way
the TLB maps **page-table entries** to the **segment-table entries** in order
to generate the final physical address from a virtual one. There's also good
information in there about how PTEs are implemented on the x64 ISA, which
should make it clear why they're slower than the TLB cache.
## Bits, bytes and sizes ##
At this point, it's also probably a good idea to give a quick refresher for the
non-Win32 API programmers out there who might not be as familiar with the
nomenclature of the Windows data types.
* **Bit**: 0 or 1. The smallest unit of information.
* **Nibble**: 4 bits.
* **Byte**: 8 bits. The smallest addressable unit of memory on x64. In C/C++,
you might be familiar with the term `char`.
* `WORD`: On the x64 architecture, the word size is 16 bits. In C/C++, you might
be more familiar with the term `short`.
* `DWORD`: Short for "double word", this means 2 x 16 bit words, which means 32
bits. In C/C++, you might be more familiar with the term `int`.
* `oword`: Short for "octa-word", this means 8 x 16-bit words, which totals 128
bits. This term is used in NASM syntax.
* `yword`: Also used only in NASM syntax, this refers to 256 bits in terms of
size (i.e. the size of a `ymm` register).
* `float`: This means 32 bits for a single-precision floating point under the
IEEE 754 standards.
* `double`: This means 64 bits for a double-precision floating point under the
IEEE 754 standard. This is also referred to as a **quad word**.
* **Pointers**: On the x64 ISA, pointers are all 64-bit addresses.
These terms will come up often in assembly programming, so make sure you have
them all memorized.
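As a quick sketch of how these sizes appear in NASM source (the labels are
hypothetical):
~~~~~~
segment .data
    a db 0x12                ; byte: 8 bits
    b dw 0x1234              ; word: 16 bits
    c dd 0x12345678          ; dword: 32 bits
    f dd 1.5                 ; dd also stores a 32-bit IEEE 754 float
    d dq 0x123456789abcdef0  ; qword: 64 bits
    g dq 2.5                 ; dq also stores a 64-bit IEEE 754 double
segment .bss
    h reso 1                 ; reserve one oword (128 bits, an xmm's worth)
    i resy 1                 ; reserve one yword (256 bits, a ymm's worth)
~~~~~~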
!!! tip Crossing the platforms
This nomenclature is specific to the Intel x86-64 ISA. For example, on the
ARM64 architecture, the word size is 32 bits, so don't always assume the
size specifiers mean the same thing on different ISAs!
## The Microsoft x64 Calling Convention ##
Finally, we get to one of the most confusing parts of assembly: the calling
conventions. At the end of the day, our assembly code must run on an operating
system, and in this case, we're focusing on Windows 10, so let's take a look at
what it will take for the operating system to run our assembly code.
The Microsoft compiler (MSVC) actually has a couple of calling convention
modifiers that developers can elect to use, which change the rules of the
conventions, but we'll be focusing on the most common one here, which is known
as the **Microsoft x64 Calling Convention.**
!!! tip
For more information on these other calling convention modifiers that MSVC has
available, you can refer to the
[MSDN article](https://docs.microsoft.com/en-us/cpp/cpp/microsoft-specific-modifiers?view=vs-2017).
These modifiers are usually only applicable in very unusual situations where
performance is of utmost priority, and should not be used under normal circumstances.
### What is a calling convention? ###
Simply put, it's a set of strict guidelines that our code must adhere to in
order for the operating system to be able to run our assembly code. This is
because assembly is, by nature, as close to the hardware as we can possibly get,
and so compilers standardize how our functions should be implemented and how
they should be called on the actual hardware. This allows code from multiple
sources to all run together harmoniously, instead of having every developer try
to develop their own conventions around how the hardware should be used and
potentially causing incompatibilities between each other's code. When the
operating system can make assumptions about how the code runs, it itself can
then use the hardware to perform certain operations behind the scenes (e.g. fix
up calls to `HeapAlloc()` to the right addresses in memory). If you break those
assumptions, you risk incorrect execution of the code.
The conventions themselves, much like the ISA, have evolved over the years into
a long, gruelling set of standards that will most likely take a bunch of
trial-and-error before you get the hang of them. Keep practising, and eventually
they'll become second nature to you once you've hit enough crashes.
We won't go over the entire list of requirements here, since that would become
more of an academic exercise, but we'll focus on the most practical ones that
impact the way you write your code, and talk about any others that come up as we
go along.
### Alignment requirements ###
* Most data structures will be aligned to their **natural alignment**; the
compiler inserts padding as needed so that each member of a structure sits on
a boundary matching its size. The compiler may also add extra padding to your
structures when it thinks doing so will improve performance, depending on the
architecture your code is being compiled for. Since we're working in assembly
and don't have the benefit of a compiler, we must fulfill any special
alignment requirements ourselves.
* Most importantly, the stack pointer `rsp` _must_ be aligned to a 16-byte
boundary at the point of any `call`. This means that if we move the stack
pointer by 8 bytes, we _must_ move it by another 8 bytes before calling
anything, or we will be violating the calling convention. A minimal sketch of
this follows below.
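Here is that sketch: a function prologue/epilogue that respects the alignment
rule, assuming a function that calls other functions.
~~~~~~
my_func:
    push rbp        ; on entry, rsp was at a 16-byte boundary minus 8 (the
                    ; return address); this push restores 16-byte alignment
    mov rbp, rsp
    sub rsp, 32     ; adjust rsp in multiples of 16 from here on; these 32
                    ; bytes also serve as shadow space for any callees
    ; ... safe to `call` other functions here ...
    leave           ; equivalent to: mov rsp, rbp followed by pop rbp
    ret
~~~~~~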
### Function parameters and return values ###
The calling convention also dictates how functions should be called, and how
they should return their results. In order to ensure that we are able to craft
assembly that works harmoniously with other code, we must adhere to these rules.
This is a common source of confusion for those new to the calling convention, so
I'm going to do my best to illustrate it here.
#### Integer arguments ####
From the [MSDN documentation](https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2017):
> The first four integer arguments are passed in registers. Integer values are
> passed in left-to-right order in RCX, RDX, R8, and R9, respectively. Arguments
> five and higher are passed on the stack.
This might seem a little confusing, so let's illustrate this a little better
with an example function that takes 5 integers.
~~~~~~~~~
void foo(int a, int b, int c, int d, int e)
{
/// Some stuff happens here with the inputs passed in...
return;
}
~~~~~~~~~
In order to follow the x64 calling convention, we must therefore pass `a`, `b`,
`c` and `d` in the registers `rcx`, `rdx`, `r8` and `r9` respectively, with `e`
being pushed onto the stack before calling the function `foo()`. Thus, if I
wanted to call `foo()` like so:
~~~~~~
foo(1, 3, 5, 7, 9);
~~~~~~
Then I would have to make sure that the registers held the appropriate values,
along with the first item on the stack as well:
*************************************************************************
* *
* +---------+---------+ +-------------+ *
* | rcx | 1 | | | *
* +---------+---------+ | stack | *
* | rdx | 3 | + - - - - - - + *
* +---------+---------+ | 9 (1st item)| *
* | r8 | 5 | +------+------+ <-- `rsp` points here *
* +---------+---------+ | | | *
* | r9 | 7 | | | | *
* +---------+---------+ | v | *
* | ... | ... | | (grows down)| *
* +---------+---------+ +-------------+ *
* *
*************************************************************************
[Figure [diagram]: How the memory should be laid out before calling `foo()`.]
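As a minimal sketch of what the caller's side might look like in NASM, assuming
`foo` is an external function: note that the convention also requires the
caller to reserve 32 bytes of "shadow space" for the four register arguments,
so the fifth argument actually sits just above it at `[rsp + 32]`.
~~~~~~
extern foo

call_foo:
    sub rsp, 40               ; 32 bytes of shadow space + 8 for the 5th
                              ; argument (also keeps rsp 16-byte aligned)
    mov rcx, 1                ; 1st argument
    mov rdx, 3                ; 2nd argument
    mov r8, 5                 ; 3rd argument
    mov r9, 7                 ; 4th argument
    mov qword [rsp + 32], 9   ; 5th argument goes on the stack
    call foo
    add rsp, 40               ; restore the stack pointer
    ret
~~~~~~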
#### Floating-point arguments ####
Again, from the [MSDN documentation](https://docs.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2017):
> Any floating-point and double-precision arguments in the first four parameters
> are passed in XMM0 - XMM3, depending on position.
For floating-point arguments, the `xmm` registers `0` through `3` are used, for
a total of 4 arguments being passed in the SIMD registers, with the rest being
pushed onto the stack.
Let's see how this works with another example:
~~~~~~
void foo_fp(float a, float b, float c, float d, float e)
{
// Do something with these floats...
return;
}
~~~~~~
Assuming I wanted to call `foo_fp()` like so:
~~~~~~
foo_fp(0.1f, 0.2f, 0.3f, 0.4f, 0.5f);
~~~~~~
I would need to have the following memory set up before making the function
call:
****************************************************
* +---------+---------+ +-------------+ *
* | xmm0 | 0.1f | | | *
* +---------+---------+ | stack | *
* | xmm1 | 0.2f | + - - - - - - + *
* +---------+---------+ | 0.5f (1st)| *
* | xmm2 | 0.3f | +------+------+ *
* +---------+---------+ | | | *
* | xmm3 | 0.4f | | | | *
* +---------+---------+ | v | *
* | ... | ... | | (grows down)| *
* +---------+---------+ +-------------+ *
****************************************************
[Figure [diagram]: How the memory should be laid out before calling `foo_fp()`.]
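A corresponding sketch for the floating-point case, assuming the five constants
live in the `.data` segment under the hypothetical labels shown:
~~~~~~
segment .data
    k1 dd 0.1
    k2 dd 0.2
    k3 dd 0.3
    k4 dd 0.4
    k5 dd 0.5

segment .text
extern foo_fp

call_foo_fp:
    sub rsp, 40          ; shadow space + a stack slot for the 5th argument
    movss xmm0, [k1]     ; 1st argument
    movss xmm1, [k2]     ; 2nd argument
    movss xmm2, [k3]     ; 3rd argument
    movss xmm3, [k4]     ; 4th argument
    mov eax, [k5]        ; copy the raw bits of the 5th float...
    mov [rsp + 32], eax  ; ...into its stack slot
    call foo_fp
    add rsp, 40
    ret
~~~~~~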
As we can see, the concept applies much like it does for integers. But what if we
have something like this?
~~~~~~
void foo_mixed(int a, int b, float c, int d, float e)
{
// Variable types of arguments...now what?
return;
}
~~~~~~
Which values go in which registers now?
#### Mixing parameter types ####
The answer is that the _position_ of the argument dictates which register it
goes in. Therefore, if we called `foo_mixed()` like so:
~~~~~~
foo_mixed(1, 2, 0.3f, 4, 0.5f);
~~~~~~
Then we would need to have the following memory layout:
***************************************************************************
* +---------+---------+ +---------+---------+ +-------------+ *
* | xmm0 |/////////| | rcx | 1 | | | *
* +---------+---------+ +---------+---------+ | stack | *
* | xmm1 |/////////| | rdx | 2 | + - - - - - - + *
* +---------+---------+ +---------+---------+ | 0.5f | *
* | xmm2 | 0.3f | | r8 |/////////| +------+------+ *
* +---------+---------+ +---------+---------+ | | | *
* | xmm3 |/////////| | r9 | 4 | | | | *
* +---------+---------+ +---------+---------+ | v | *
* | ... | ... | | ... | ... | | (grows down)| *
* +---------+---------+ +---------+---------+ +-------------+ *
***************************************************************************
[Figure [diagram]: How the memory should be laid out before calling `foo_mixed()`.]
Hopefully this makes some sense. There are other conventions that govern
parameter passing as well, which I'll try to summarize here:
* Intrinsic types, arrays and strings are never passed into a register
directly. A pointer to their memory locations is what goes into the registers
instead.
* Structs/unions 8/16/32/64 bits in size may be passed as if they were integers of
the same size. Those of other sizes are passed as a pointer as well.
* For variadic arguments (i.e. `foo_var(int a, ...)`), the aforementioned conventions
apply depending on the type of the arguments that are passed in. It is the
caller's responsibility that the memory layout is correct before calling the
function. Additionally, for floating-point values, both the integer and
floating-point registers _must_ have the same argument's value, in case the
callee expects the value in the integer registers.
* For unprototyped functions (i.e. functions declared without a parameter
list, such as `void foo_unprototyped();`), the caller passes integer values
as integers and floating-point values as double-precision. The
same rule about floating-point values needing to be in both the integer and
floating-point registers applies as well.
For example:
~~~~~~
void foo_unprototyped();
void foo2()
{
foo_unprototyped(1, 0.2f, 3);
return;
}
~~~~~~
Would mean the following memory layout is required to be set up by the caller:
******************************************************
* +---------+---------+ +---------+---------+ *
* | xmm0 |/////////| | rcx | 1 | *
* +---------+---------+ +---------+---------+ *
* | xmm1 | 0.2 | | rdx | 0.2 | *
* +---------+---------+ +---------+---------+ *
* | xmm2 |/////////| | r8 | 3 | *
* +---------+---------+ +---------+---------+ *
* | xmm3 |/////////| | r9 |/////////| *
* +---------+---------+   +---------+---------+ *
* | ... | ... | | ... | ... | *
* +---------+---------+ +---------+---------+ *
******************************************************
[Figure [diagram]: The memory layout for a variadic/unprotyped function.]
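The duplication requirement is easy to satisfy in practice with `movq`, which
copies the raw bits between the two register files. A minimal sketch (assuming
a hypothetical `c2` constant declared as `dq 0.2` in the `.data` segment, and
that the shadow space has already been reserved):

~~~~~~
mov ecx, 1              ; 1st argument -> rcx
movsd xmm1, [c2]        ; 2nd argument (0.2, promoted to double) -> xmm1...
movq rdx, xmm1          ; ...and duplicated bit-for-bit into rdx
mov r8d, 3              ; 3rd argument -> r8
call foo_unprototyped
~~~~~~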
#### Return values ####
Now that we've sorted our inputs, it's time to do the same for our outputs as
well. The Microsoft x64 calling convention, thankfully, has simpler rules when
it comes to return values from functions.
* Any scalar return value 64 bits or less in size is returned in `rax`.
* Any floating-point value is returned in `xmm0`.
Thus, a function `int foo()` would return its value in the `rax` register, and a
function `float foo()` or `double foo()` would return its value in the `xmm0`
register. If your function returns a user-defined type (such as a `struct`),
the same rules apply to it as if it were being passed in a register: it must be
1/2/4/8/16/32/64 bits in size. If it is, it will be returned in the `rax`
register; if not, a pointer to its memory is returned instead. The caller is
responsible for allocating this memory and passing a pointer to it as a hidden
first argument in `rcx` before making the call to the function (the remaining
arguments then shift one register slot over).
An example to illustrate this case is shown below:
~~~~~~
struct Foo
{
    int a, b, c; // Total of 96 bits: not one of the 1/2/4/8/16/32/64-bit sizes.
};

Foo foo_struct(int a, float b, int c)
{
    Foo result;
    // Stuff happens...
    return result; // This is a `Foo` struct.
}
Foo myStruct = foo_struct(1, 2.0f, 3);
~~~~~~
The caller must therefore set up the registers like so:
*****************************************************
*  +---------+---------+     +---------+---------+  *
*  |  xmm0   |/////////|     |   rcx   |*myStruct|  *
*  +---------+---------+     +---------+---------+  *
*  |  xmm1   |/////////|     |   rdx   |    1    |  *
*  +---------+---------+     +---------+---------+  *
*  |  xmm2   |  2.0f   |     |   r8    |/////////|  *
*  +---------+---------+     +---------+---------+  *
*  |  xmm3   |/////////|     |   r9    |    3    |  *
*  +---------+---------+     +---------+---------+  *
*  |   ...   |   ...   |     |   ...   |   ...   |  *
*  +---------+---------+     +---------+---------+  *
*****************************************************
[Figure [diagram]: The memory layout for a user-defined `struct`.]
The return value for the function will be a pointer to the `myStruct` result, in
the `rax` register.
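Here's a minimal sketch of what the call site for `myStruct = foo_struct(1,
2.0f, 3)` might look like, with the hidden return-buffer pointer occupying the
first argument slot (the `two` label is a hypothetical `dd 2.0` constant):

~~~~~~
sub rsp, 48             ; 32 bytes of shadow space + 16 bytes for the Foo result
lea rcx, [rsp + 32]     ; hidden 1st argument: where foo_struct writes its Foo
mov edx, 1              ; a -> rdx (bumped to the 2nd slot)
movss xmm2, [two]       ; b -> xmm2 (3rd slot)
mov r9d, 3              ; c -> r9 (4th slot)
call foo_struct         ; on return, rax also holds the address of the result
add rsp, 48
~~~~~~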
Hopefully, just from reading these conventions, you start to realize just how
important simple things like the layout of your data structures and function
prototypes can be in affecting the behaviour of the hardware! While it's not
something you should sweat over every time you write a new function, it's
definitely something to keep in mind as you design new APIs, especially if
performance is a requirement.
!!! tip Crossing the platforms
Remember, this is _only_ for Windows platforms! On Linux/OS X, the [System V
ABI](https://wiki.osdev.org/System_V_ABI) calling convention calls for the
first **6** integer arguments to be passed in `rdi, rsi, rdx, rcx, r8, r9`
    and the first **8** floating-point arguments to be passed in `xmm0` through
`xmm7`, with the rest of the respective arguments being pushed onto the
stack.
    There is also the requirement of a [red zone](https://en.wikipedia.org/wiki/Red_zone_(computing))
    of **128 bytes** below the stack pointer: a region that a function may
    freely use for temporary data _without_ adjusting `rsp`, and which the
    operating system promises not to clobber (e.g. when a signal handler
    runs). The Windows x64 ABI defines no such red zone.
For light reading, you might be interested in [Raymond Chen's musings](https://blogs.msdn.microsoft.com/oldnewthing/20190111-00/?p=100685)
regarding the subject.
### Volatile and non-volatile ###
The concept of _volatility_ differs from the `volatile` keyword in C/C++
slightly, though the general meaning doesn't change. In the Microsoft x64
calling convention, certain registers are treated as **volatile**, meaning that
they are subject to change and are not guaranteed to be preserved between
function calls or scope changes. However, some registers are what's known as
**non-volatile**, which means that we are potentially responsible for _preserving_
the state of the registers and _restoring_ their states after our function call
is complete. If we do not, then the caller might act upon invalid data in those
registers, since we failed to preserve their original values that the caller was
dependent on.
Under the Microsoft x64 calling convention, the `rax`, `rcx`, `rdx`, `r8`, `r9`,
`r10` and `r11` registers are considered volatile, while the `rbx`, `rbp`,
`rdi`, `rsi`, `rsp`, and `r12` through `r15` registers are considered
non-volatile and _must_ have their states saved and restored across function
calls. (The same split applies to the SIMD registers: `xmm0` through `xmm5` are
volatile, while `xmm6` through `xmm15` are non-volatile.) We'll see how this is
done in actual assembly code later.
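As a quick preview of what that looks like in practice, here's a minimal sketch
of a hypothetical function that wants to use `rbx` as scratch space, and so must
save and restore it:

```
my_function:
    push rbx            ; save the caller's rbx (non-volatile)
    ; ... rbx is now free to use as scratch space ...
    pop rbx             ; restore it before returning
    ret
```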
### The shadow space ###
Under the Microsoft x64 calling convention, there is a unique concept known as
the **shadow space**, also referred to as the **home space**. This is at least
32 bytes of stack space (enough to hold the 4 register arguments) that the
_caller_ must reserve, immediately above the return address, every time it
makes a function call. The callee can use this space to _spill_ its register
arguments onto the stack, which is what debug builds typically do so that
debuggers can inspect those values later on. While the calling convention does
not require the callee to actually use the shadow space, you must still reserve
it before making any calls, especially in a non-leaf function.
Also, as a reminder, no matter how much space you allocate for the shadow space
and your own function's variables, you still need to ensure that the stack
pointer is aligned on a 16-byte boundary after all is said and done.
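The arithmetic works out neatly in the common case; here's the typical prologue
again, annotated with the alignment at each step (a sketch, assuming a normal
`call` got us here):

```
; The call instruction pushed an 8-byte return address, so on entry
; rsp sits 8 bytes off a 16-byte boundary.
push rbp        ; another 8 bytes: rsp is 16-byte aligned again
mov rbp, rsp
sub rsp, 32     ; shadow space; any multiple of 16 preserves the alignment
```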
# Hello, world revisited #
Ok, that's enough theory for now. Let's take what we've learned so far, and go
back to the "Hello, world!" example again.
## Assembly directives: a refresher ##
Before we delve back into the code, here's a quick reminder on how assembly
instructions work in the Intel x86-64 ISA.
Assembly instructions generally follow a specific pattern that can be described
as follows:
***************************************
*                                     *
*   mov rax, rbx                      *
*    ^   ^    ^                       *
*    |   |    |                       *
*    |   |    |                       *
*    |   |    source operand          *
*    |   |                            *
*    |   dest. operand                *
*    mnemonic                         *
*                                     *
*       .-----------.                 *
*       |           |                 *
*  +----v----+  +---|----+            *
*  |   rax   |  |  rbx   |            *
*  +---------+  +--------+            *
*                                     *
*    value of rbx copied to rax       *
*                                     *
***************************************
[Figure [diagram]: Illustration of what happens after an assembly `mov` instruction.]
In this example, the `mov` instruction is issued with the `rbx` register
as the _source operand_, and the `rax` register as the _destination
operand_. Thus, when this instruction is executed by the CPU, the contents of
the `rbx` register are copied into the `rax` register.
Some instructions take more than two operands, especially the specialized
instructions designed for SIMD. One example is the scalar single-precision
multiply: the SSE `mulss` instruction takes two operands, while its AVX-encoded
form, `vmulss`, takes three. Let's take a look at the latter:
*************************************************
*                                               *
*   vmulss xmm0, xmm1, xmm2                     *
*     ^     ^     ^     ^                       *
*     |     |     |     |                       *
*  mnemonic |     |     |                       *
*    dest. op.    |     |                       *
*      1st source op.   |                       *
*            2nd source op.                     *
*                                               *
*      .----<-----.    .---<----.               *
*      |          |    |        |               *
*  +---v-----+  +-|----v--+  +--|------+        *
*  |  xmm0   |  |  xmm1   |  |  xmm2   |        *
*  +---------+  +---------+  +---------+        *
*        xmm2 multiplied by xmm1                *
*            result stored in xmm0              *
*                                               *
*************************************************
[Figure [diagram]: Illustration of what happens after an assembly `vmulss` instruction.]
As we can see, this allows us to operate on 3 registers with a single
instruction, which, depending on the number of CPU cycles it costs, might be
faster than using separate instructions to multiply the `xmm2` and `xmm1`
registers first before moving the result into the `xmm0` register. It's
important to be aware of the specialized instructions available to you when
writing bare assembly, as they can yield significant performance gains when
utilized correctly.
When it comes to operands, we're not limited to naming registers or constant
values explicitly. NASM supports _effective address_ expressions as operands,
which are built up from:
* a **base register**,
* an optional **index register**, optionally scaled by 2, 4 or 8, and
* a constant **displacement**.
Examples of such instructions are:
~~~~~~
mov rax, [rbx + 8]          ; rbx is the base register, 8 is the displacement.
mov rax, [rbx + rcx]        ; rbx is the base register, rcx is the index.
mov rax, [rbx + rcx*8 + 16] ; base + index*scale + displacement.
~~~~~~
There are hundreds of instructions in the x64 ISA, and there's no hope of
covering them comprehensively within the scope of a single tutorial, let alone a
single section, even if we limited ourselves to the most common ones. I advise
the reader to refer to the Intel instruction manuals, available in the
Additional Resources section of this tutorial, for the full reference. For now,
I will continue and explain new instructions as we encounter them in the
"Hello, world!" example and others on a case-by-case basis.
## Breaking the code down ##
Let's look at the code again, line-by-line.
### Architecture directives ###
~~~~~~~~~~~~~~~~~~~~~~~~~~~
bits 64
default rel
segment .data
msg db "Hello world!", 0xd, 0xa, 0
segment .text
global main
extern ExitProcess
extern _CRT_INIT
extern printf
main:
push rbp
mov rbp, rsp
sub rsp, 32
call _CRT_INIT
lea rcx, [msg]
call printf
xor rax, rax
call ExitProcess
~~~~~~~~~~~~~~~~~~~~~~~~~~~
The first line is the `bits` directive and tells the NASM assembler to generate
code designed to run on 64-bit processors, which is what our hardware is. (At
least, for the purposes of this tutorial.) This directive is not strictly
required, since you can specify the target object format to the NASM assembler
during compilation, but we're going to leave it here to make it very clear as to
what architecture our assembly code is meant to be run on.
The next directive is a little more complex and requires a bit more explanation:
it's basically telling the NASM assembler to use **`rip`-relative addressing**.
### Addressing modes ###
Before we talk about the specific type of addressing mode known as
**`rip`-relative addressing**, let's have a refresher course on what addressing
modes are in the first place. Thankfully, this is a fairly simple concept.
**Addressing modes**, put simply, are the different conventions available by
which an assembly instruction can access registers or other memory. For example,
the instruction:
```
mov rax, 0
```
Is what's known as **immediate addressing**, because the source operand is a
_constant value_ (an expression is also allowed in lieu of an explicit
constant, so long as it evaluates to a constant).
In contrast, the instruction:
```
mov rax, rbx
```
Is simply known as **register addressing**, for obvious reasons.
We can also do what is known as **indirect register addressing**:
```
mov rax, [rbx]
```
Which basically results in the following operation:
*****************************************************************
* +--------------+ *
* rax register |//////////////| *
* +---------+ |//////////////| *
* | 10010...| |//////////////| *
* +---^-----+ |//////////////| (more addresses above) *
* | |//////////////| ... *
* | +--------------+ 0xa53 *
* '-------------+10010... | 0xa54 *
* | (64 bits) | ... *
* +--------+ | | (more addresses below) *
* | 0xa54 | +--------------+ *
* +--------+ |//////////////| *
* rbx register |//////////////| *
* +--------------+ *
*****************************************************************
[Figure [diagram]: Illustration of what happens after an indirect address operation.]
As we can see, the address stored in `rbx` is _dereferenced_, and the value
found at that address is copied into `rax` as a result of the operation.
There are various types of addressing modes available in the Intel x64 ISA,
which are essentially the same as the x86 ISA (since it evolved from there
anyway), with the exception of the **`rip`-relative addressing** mode; that was
introduced with the x64 ISA and is only available on that architecture.
### `rip`-relative addressing ###
What is `rip`-relative addressing? To understand the answer to this question, we
need to understand a little bit of history.
#### Position-Independent Code ####
Under the older x86 ISA, instructions that loaded or stored memory referenced it
by absolute address; there was no way to reference data at a displacement from
the current _instruction pointer_. This made it very difficult to generate
**position-independent code (PIC)**, which is to say,
code that _does not depend on where exactly it is loaded into memory_ by the
operating system.
Obviously, programs generally _weren't_ written by hardcoding base addresses all
over the code, since you couldn't assume that your program would always be
loaded by the operating system at the same base address each time it was
run. Instead, the _loader_ had to _fix up_ the code at load time: it would find
every memory reference that was specified in an absolute manner and add the
displacement between the base address the binary was built for and the address
it was actually loaded at. What this meant in practice was that the loader could
have a tremendous number of fix-ups to perform, which slowed down loading and
prevented the patched code pages from being shared between processes.
Ok. So how does `rip`-relative addressing help?
Well, the x64 ISA allows instructions to reference memory at a displacement
(32-bit signed) from the current `rip` instruction pointer. Since the data is
now referenced relative to `rip` instead of via an absolute address, the same
machine code works no matter what base address the program is loaded at, and
the loader no longer has to perform any fix-up process on that code. Compared to
the flat addressing model (i.e. hardcoding the address references), this also
results in a much reduced binary size, since there is far less relocation
information for the loader to be stored in the PE file format (in the
[`.reloc` section](https://docs.microsoft.com/en-us/windows/desktop/debug/pe-format#the-reloc-section-image-only)).
To tell NASM to compile our program with `rip`-relative addressing, the
following directive is used:
```
default rel
```
Thus, we get what we want: the loader does less work, the program still
runs, and we have our position-independent code.
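If you want to be explicit about it, NASM also lets you select the addressing
mode per instruction with the `rel` keyword, regardless of the default:

```
lea rcx, [msg]          ; rip-relative here, because `default rel` is in effect
lea rcx, [rel msg]      ; explicitly rip-relative, with or without the default
```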
### Initializing data in assembly ###
The next line of "Hello, world!" is the **data segment**. (If you don't remember
what that is, go back to the Other program segments section and refresh your
memory.)
```
segment .data
msg db "Hello world!", 0xd, 0xa, 0
```
This is fairly straightforward, if a little terse. Basically, we define a
**variable** in NASM syntax called `msg` of type `byte` (`db` is a mnemonic for
"define byte"): the string "Hello world!", followed by the `CR` and `LF`
line terminator characters using
their [hexadecimal ASCII code equivalents](http://www.asciitable.com/),
and finally a null terminator to indicate the end of the string.
!!! tip Crossing the platforms
    As a reminder, the `CRLF` (carriage return, line feed) line ending is
    Windows-specific; Linux and OS X use just `LF` to indicate a newline.
There are several ways to initialize data in NASM, which you should read up on
[in the manual](https://nasm.us/doc/nasmdoc3.html#section-3.2.1).
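To give a flavour of a few of them, here's a small sketch; the label names are
hypothetical, but `db`/`dw`/`dd`/`dq` and `times` are standard NASM
pseudo-instructions:

```
segment .data
num_w   dw 0x1234               ; define word (16 bits)
num_d   dd 3.14159              ; define doubleword (32 bits); floats are allowed
num_q   dq 0x123456789abcdef0   ; define quadword (64 bits)
buffer  times 64 db 0           ; 64 zero-initialized bytes
```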
### Importing and Exporting symbols ###
Now let's look over the next few lines of the example.
```
segment .text
global main
extern _CRT_INIT
extern ExitProcess
extern printf
```
The first line, much like the `segment .data` directive, tells the NASM
assembler that this is the beginning of the `.text` section, or where all the
actual assembly instructions should go. No big surprise. But what do the next
few lines mean?
If you have worked with DLLs/`.so` files before (or possibly read
my [previous](https://sonictk.github.io/maya_hot_reload_example_public/) [tutorials](https://sonictk.github.io/maya_python_c_extension/)?)
then you'll most likely recognize the `extern` keyword: in C, it declares a
symbol that is _defined elsewhere_. But what does the `global` keyword mean,
anyway?
It turns out that the two keywords
[play complementary roles in NASM](https://nasm.us/doc/nasmdoc6.html#section-6.5):
`extern` is used to _import_ symbols, and `global` is used to _export_ them.
Ok, fine. But what's the odd `extern _CRT_INIT` symbol we're importing? What's
that for?
## The Microsoft C Runtime Library ##
If you're experienced enough with C/C++, you'll no doubt have guessed that the
CRT refers to the **C standard run-time library**; more specifically, the
**Microsoft Visual C++ Runtime Library (MSVCRT)**, which is Microsoft's
implementation of the ISO C standard library.
!!! tip Crossing the platforms
    For those of you who are still beginners in C/C++ and might not understand
    what this means: functions such as `printf`, `rand` or even the entry point
    `main` are defined as part of the C standard library specification, and
    every conforming compiler toolchain, whether it's MSVC, GCC or Clang, must
    supply an implementation of that specification. For Microsoft, the
    aforementioned MSVCRT is what we use when we need functions from the C
    standard; `libc` implementations (e.g. glibc) fill the same role on
    Linux/OS X. The implementations differ for both platform and historical
    reasons, as we shall see shortly.
Thus, by the powers of deduction, we can guess that `_CRT_INIT` means "hey,
initialize the C runtime library". But so what? Why do we need to call this
anyway?
## `WinMain` and `main` ##
Well, let's look further down the code, to the first **label** in the `.text`
section of our assembly program:
```
main:
```
All C/C++ programmers should be familiar by now with the concept of their
program's **entry point**, that is, the initial point where code starts to
execute when their program is loaded. The signature usually appears as `int
main(int argc, char *argv[])`.
However, if you're at all familiar with Win32 API programming, you'll know that
things are not that simple. Technically,
[the entry point for Windows GUI programs is defined as `WinMain`, not `main`](https://docs.microsoft.com/en-us/windows/desktop/learnwin32/winmain--the-application-entry-point);
specifying the
[subsystem](https://docs.microsoft.com/en-us/cpp/build/reference/subsystem-specify-subsystem?view=vs-2017)
to the MSVC linker determines which entry point symbol is chosen by
default. Additionally, neither `main` nor `WinMain` is where execution truly
begins: the MSVCRT supplies a startup function that _calls_ them, making them
essentially _wrapped_ functions. More than just forwarding control, however,
there's one other important thing that the MSVCRT's startup code does, and that
is to perform any _static initialization_ of variables required.
Thus, [the following advice from MSDN](https://docs.microsoft.com/en-us/cpp/error-messages/tool-errors/linker-tools-warning-lnk4210?view=vs-2017):
> If your project is built using /ENTRY, and if /ENTRY is passed a function
> other than _DllMainCRTStartup, the function must call _CRT_INIT to initialize
> the CRT.
If you don't do this, you'll get linker warnings when attempting to produce the
final executable, such as:
```
warning LNK4210: .CRT section exists; there may be unhandled static initializers or terminators
```
Which is the linker telling us, "hey, it looks like you've got some data you
wanted to statically initialize, but the appropriate actions for _actually_
initializing it haven't been taken." In other words, there are global variables
that need to be initialized before `main()` runs, but the VCRuntime startup
code (which is what runs the static initializers and terminators) will never
have been executed.
!!! note How deep _does_ this rabbit hole go?
    Savvy readers may remember that the MSVC linker has the
    [`/entry` flag](https://docs.microsoft.com/en-us/cpp/build/reference/entry-entry-point-symbol?view=vs-2017),
    which allows you to specify a specific entry point for a program (and which
    we will be making use of very shortly). The truth is, `WinMain` is _just_ a
    convention for the name
    [given to user-provided entry points](https://blogs.msdn.microsoft.com/oldnewthing/20061204-01/?p=28853).
    The _real_ entry point is the CRT startup function
    (`mainCRTStartup`/`WinMainCRTStartup` or `wmainCRTStartup`/`wWinMainCRTStartup`,
    depending on the subsystem and ASCII/Unicode options), which ultimately
    calls your `main` or `WinMain`.
So hold on a minute: other than our `msg` variable, we don't _actually_ have
anything that should rely on static initializers. What is this warning for,
then?
It turns out that `printf` from the VCRuntime library is the culprit here; it
relies on static initializers for its implementation, thus requiring us to
initialize the CRT before making use of it, which makes sense, seeing as
`printf` is part of the CRT after all!
## Making a stack ##
Let's continue looking at the example code.
```
push rbp
mov rbp, rsp
sub rsp, 32
```
The `push` instruction takes the operand passed to it, decrements the stack
pointer, and then stores the operand at the top of the stack. We push the base
pointer in order to preserve the caller's stack frame, so we can restore it when
we return. We then copy the current stack pointer into the base pointer, giving
our function a fixed _base address_ from which to refer to variables on the
stack. (The stack pointer alone would be insufficient for this, since we could
be adding/removing objects from the stack all the time.) Finally, we subtract
`32` from the stack pointer, giving us the **shadow space** necessary to follow
the Microsoft x64 calling convention.
You'll likely see this snippet of code (or some variation thereof) almost always
at the beginning of every function you write in assembly, so get used to
it. It's also very useful for identifying where the start of a function is if
you're reading the assembly of optimized code (i.e. sometimes it could
indicate the start of an inlined function call).
You then see we initialize the C runtime library as previously mentioned, with the
following line:
```
call _CRT_INIT
```
## Calling functions in assembly ##
The next few lines are the real business logic of the example assembly code:
```
lea rcx, [msg]
call printf
```
The first instruction here is a little confusing. Isn't `lea` the name of the
protagonist in [that one game](http://www.cross-code.com/en/home)?
Well, `lea` actually stands for **Load Effective Address**, which comes no
closer to explaining what it actually does. The way I like to think about it is
that `lea` looks a lot like `mov`: both write something derived from the second
operand into the first. The difference is in _what_ gets written: `mov` with a
memory operand loads the _contents_ of memory at that address, while `lea`
computes the _effective address_ of its second operand and stores the address
itself, without touching memory at all. So while the two can look nearly
identical in the syntax, their use cases are distinctly different.
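A concrete way to see the difference, assuming `msg` is a label in the `.data`
segment as in our example:

```
lea rcx, [msg]      ; rcx = the address of msg
mov rcx, [msg]      ; rcx = the first 8 bytes stored *at* msg
```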
!!! note All together now...!
There's also another reason why the `lea` instruction is important: on the
x64 ISA, where we are generally making use of `rip`-relative addressing, the
offset that we use for `rip`-relative addressing is limited to 32 bits. Thus,
we usually load the address of the data that we want using the `lea`
operation first, instead of loading 64-bit values directly from that address.
In this case, we use `lea` to load the address of the `msg` variable (which is a
pointer to our "Hello, world!" string) into the `rcx` register. If you recall
the calling conventions, this is the register that should contain the first
argument for a function call.
We then `call printf` to print the string that we passed in as the first
argument. It's as simple as that.
## Shutting down the program ##
Now that we've printed our line out to the console, we're good. It's time to
shut it down!
```
xor rax, rax
call ExitProcess
```
Remember that according to the calling convention, an integer return value goes
in `rax`. Well, `main` is no different, and so we **exclusive-or** the `rax`
register with itself, effectively zeroing it out (the idiomatic way to zero a
register, since it's a shorter encoding than `mov rax, 0`), before calling the
Win32 `ExitProcess` function, thus ending the application. Strictly speaking,
`ExitProcess` takes the process exit code as its first argument, which the
calling convention places in `rcx`, so a more careful version of this code
would zero `ecx` before the call as well.
With that, we've successfully gone over a very simple assembly program. Let's continue.
## Writing a build script ##
Even though we compiled our assembly code and linked it earlier in the tutorial
via a few simple commands, let's go ahead and make a little `build.bat` that we
can use to compile more programs easily, while allowing us to specify build
options as well. Plus, we also want to set it up so that we don't _have_ to keep
opening a special Visual Studio Developer Command Prompt every time we want to
build our program; the build script should take care of that automatically for us.
If you're not familiar with Batch files on Windows, [here's a
good resource](https://en.wikibooks.org/wiki/Windows_Batch_Scripting) for
getting caught up, though frankly, the syntax used here shouldn't be terribly
complicated to muddle through.
What we're going to do here is make a build script that we can call like so:
```
build.bat [debug|release|clean] [project_name] [exe|dll]
Examples:
build.bat debug hello_world will build hello_world.asm in debug mode
build.bat release goodbye_nothing will build goodbye_nothing.asm in release mode
build.bat release goodbye_nothing dll will build goodbye_nothing.asm in release mode as a DLL instead of an exe
```
Let's start with the basics of parsing the command-line arguments:
```
echo Build script started executing at %time% ...
REM Process command line arguments. Default is to build in release configuration.
set BuildType=%1
if "%BuildType%"=="" (set BuildType=release)
set ProjectName=%2
if "%ProjectName%"=="" (set ProjectName=hello_world)
set BuildExt=%3
if "%BuildExt%"=="" (set BuildExt=exe)
set AdditionalLinkerFlags=%4
echo Building %ProjectName% in %BuildType% configuration...
```
This essentially sets the first, second, third and fourth command-line arguments
that we pass to `build.bat` to be stored as the `BuildType`, `ProjectName`,
`BuildExt` and `AdditionalLinkerFlags` variables respectively. To retrieve the
value stored in a variable, you surround its name with percent signs
(e.g. `%BuildType%`).
The next thing to do, assuming that we're running this batch script from a
normal command prompt (since we don't want to have to launch the special
developer command prompt), is to set up the environment for the MSVC
compiler/linker ourselves:
```
if not defined DevEnvDir (
call "%vs2017installdir%\VC\Auxiliary\Build\vcvarsall.bat" x64
)
```
Checking if `DevEnvDir` has been defined before calling the helper batch script
ensures that we don't invoke the environment setup script twice.
!!! note Older versions of Visual Studio
If you're using Visual Studio 2015, replace the line above with the following
line:
```
call "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat" x64
```
Next, let's create a directory to store all build artifacts in.
```
set BuildDir=%~dp0msbuild
if "%BuildType%"=="clean" (
setlocal EnableDelayedExpansion
echo Cleaning build from directory: %BuildDir%. Files will be deleted^^!
echo Continue ^(Y/N^)^?
set /p ConfirmCleanBuild=
if "!ConfirmCleanBuild!"=="Y" (
echo Removing files in %BuildDir%...
del /s /q %BuildDir%\*.*
)
goto end
)
echo Building in directory: %BuildDir% ...
```
Because the focus of this tutorial isn't about batch scripting, I'm going to go
over this quickly.
* `%~dp0` is special syntax in batch scripts that expands to the drive letter
  and path of the batch script itself, including the trailing slash. Thus, if we
  ran `build.bat` from `C:\tmp\build.bat`, `%~dp0` would expand to `C:\tmp\`.
* The `if` statement is used for conditional execution in batch syntax, much
like other programming/scripting languages. In this case, we're checking if
the user entered `clean` as an argument, in which case we _prompt_ the user
for confirmation before removing the directory and all files in it to perform
a clean build.
* The `setlocal EnableDelayedExpansion` line causes variables written as
  `!var!` to be expanded at **execution time**, instead of at **parsing
  time**. This is an important distinction in batch files, because we want to
  read the user's input to the confirmation prompt _only_ when the script
  executes this block, _not_ at the time the block was parsed (otherwise
  `ConfirmCleanBuild` would still hold its old value, which is why it's
  referenced as `!ConfirmCleanBuild!` rather than `%ConfirmCleanBuild%`). We
  use `setlocal` here since we only want this to affect the local scope of the
  batch file, and not the user's global environment.
* The `set` command with the `/p` flag tells the command prompt to wait for
  user input and store it in the `ConfirmCleanBuild` variable. We then delete
  the build directory's contents if the user answers `Y`.
```
if not exist %BuildDir% mkdir %BuildDir%
pushd %BuildDir%
set EntryPoint="%~dp0%ProjectName%.asm"
set IntermediateObj=%BuildDir%\%ProjectName%.obj
set OutBin=%BuildDir%\%ProjectName%.%BuildExt%
set CommonCompilerFlags=-f win64 -I%~dp0 -l "%BuildDir%\%ProjectName%.lst"
set DebugCompilerFlags=-gcv8
```
We then create the build artifacts directory if it doesn't exist, and push that
directory onto the directory stack (yes, even the Command Prompt has a concept
of a stack!). This places us in the build artifacts directory, while still
allowing us to return to the original directory we were in later on (which is
nice when you're running this build script from somewhere else, since it saves
you the trouble of having to navigate back to where you were).
We set the _entry point_ for compilation to the project name that is intended to
be passed to the build script at compilation time. That way, we can keep using
the same build script to build multiple projects just by changing the name
passed to the script on the command line. As long as there's a `.asm` file that
has the same project name, we can assemble it.
We also set the output assembled intermediate object's path explicitly, along
with the final executable to be produced.
Then comes the [NASM flags](https://nasm.us/doc/nasmdoc2.html#section-2.1). We
set the binary type that we're producing as a 64-bit PE format file, set the
include path for the assembler to the current directory, and also specify that a
**listing file** should be generated by the assembler, using the `-l` argument.
!!! note What is a **Listing File**?
A listing file, also known as a _source-listing_ file, is basically a version
of the assembly source code that has the addresses of the generated code and
the actual source code associations clearly visible. It also expands all
macros and other shorthand that may have been used as part of special syntax
to the assembler to make it easier to debug when things go wrong.
You'll also notice that in the distinction between the common assembler flags
and debug flags, there's the odd [`-gcv8`](https://nasm.us/doc/nasmdoc2.html#section-2.1.13)
flag. Basically, the `-g` flag itself enables the assembler to generate
debugging information for the binary generated, and we explicitly specify the
debug format to be `cv8`, which is shorthand for [CodeView 8](https://en.wikipedia.org/wiki/CodeView),
the debugging format that Microsoft uses. This will allow us to debug the
assembly code more effectively in Visual Studio when the time comes.
```
if "%BuildExt%"=="exe" (
set BinLinkerFlagsMSVC=/subsystem:console /entry:main
) else (
set BinLinkerFlagsMSVC=/dll
)
```
This bit is pretty self-explanatory: if we're building a normal executable, we
default to using `main` as the symbol for our entry point; otherwise, we tell
the linker that we're building a dynamic link library instead.
```
set CommonLinkerFlagsMSVC=%BinLinkerFlagsMSVC% /defaultlib:ucrt.lib /defaultlib:msvcrt.lib /defaultlib:legacy_stdio_definitions.lib /defaultlib:Kernel32.lib /defaultlib:Shell32.lib /nologo /incremental:no
set DebugLinkerFlagsMSVC=/opt:noref /debug /pdb:"%BuildDir%\%ProjectName%.pdb"
set ReleaseLinkerFlagsMSVC=/opt:ref
```
These should be fairly recognizable to anyone who's worked with MSVC and C/C++
for any modest amount of time; we're just specifying the MSVC linker options
here, and making two separate configurations depending on whether we're making a
release or a debug build, with the debug configuration keeping unreferenced code
around (`/opt:noref`) and generating a `.pdb` file for the debugger to use
later.
!!! note I have a bad feeling about this...
There _is_ one thing of interest here, though, if you look carefully: we're
linking against a library called `legacy_stdio_definitions.lib`, which is not
something you'd normally see in C/C++ projects, even those that make extensive
use of the Win32 API. We're also linking against _both_ the `ucrt.lib` and the
    older `msvcrt.lib`, which seems odd, since the former is supposed to supersede
the latter. What gives?
Well, in our sample code, we make use of a function from the CRT known as
`printf`. Now, just because the C standard says that `printf` _has_ to exist in
your implementation, it doesn't actually govern _how_ it should be
implemented. As it turns out, [`printf` has become an inlined function since
Visual Studio 2015](https://docs.microsoft.com/en-us/cpp/porting/visual-cpp-change-history-2003-2015?view=vs-2017#BK_CRT)
and as such, we need to link to `legacy_stdio_definitions.lib` library in order
to maintain access to that CRT function. (This isn't normally a problem in
C/C++, since Microsoft fixed up their `stdio.h` header to automatically resolve
this).
As for linking against both the newer and older CRT libraries, it turns out that
some other CRT functions such as `malloc` and `rand` are only available in the
latter, older CRT library, while the rest of the CRT's functionality is in the
UCRT one instead. Yes, it _is_ indeed a mess.
```
if "%BuildType%"=="debug" (
set CompileCommand=nasm %CommonCompilerFlags% %DebugCompilerFlags% -o "%IntermediateObj%" %EntryPoint%
set LinkCommand=link "%IntermediateObj%" %CommonLinkerFlagsMSVC% %DebugLinkerFlagsMSVC% %AdditionalLinkerFlags% /out:"%OutBin%"
) else (
set CompileCommand=nasm %CommonCompilerFlags% -o "%IntermediateObj%" %EntryPoint%
set LinkCommand=link "%IntermediateObj%" %CommonLinkerFlagsMSVC% %ReleaseLinkerFlagsMSVC% %AdditionalLinkerFlags% /out:"%OutBin%"
)
```
Finally, we set the actual assembly and linking commands that are to be executed...
```
echo.
echo Compiling (command follows below)...
echo %CompileCommand%
%CompileCommand%
if %errorlevel% neq 0 goto error
echo.
echo Linking (command follows below)...
echo %LinkCommand%
%LinkCommand%
if %errorlevel% neq 0 goto error
if %errorlevel% == 0 goto success
:error
echo.
echo ***************************************
echo * !!! An error occurred!!! *
echo ***************************************
goto end
:success
echo.
echo ***************************************
echo * Build completed successfully! *
echo ***************************************
goto end
:end
echo.
echo Build script finished execution at %time%.
popd
exit /b %errorlevel%
```
...and execute them. The full build script is available in the Code Repository section
for reference.
Now that you have a build toolchain set up, and a good initial understanding of
how assembly programs work, I encourage you to take a look at the other examples
in the Code repository section. Build them, play with them and try to understand
how they work. The Additional Resources section also has links to material that
contains other samples for you.
# Using assembly in C/C++ programs #
The next part of this tutorial will focus on actually starting to write some
assembly. To get ourselves comfortable with writing assembly that we can
actually _use_, let's start off with something simple. Let's write a basic
mathematical function in assembly that we or others can then call from our C/C++
code: a factorial function.
## A factorial function ##
Let's refresh our memory a little bit on what a factorial is with a few
examples:
```
factorial(5) = 5! = 5 x 4 x 3 x 2 x 1
factorial(8) = 8! = 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1
factorial(10000) = 10000 x 9999 x 9998 x ... x 1
```
Ok, so by simple deduction:
```
factorial(n) = n! = n x (n - 1)!
```
Writing this as a simple C function gives us:
```
int factorial(int n)
{
if (n == 0) {
return 1;
}
int result = 1;
for (int c = 1; c <= n; ++c) {
result = result * c;
}
return result;
}
```
Fairly straightforward, then? Let's start writing the assembly version of
this. We'll start with a skeleton function, with just the bare minimum definition.
```
default rel
bits 64
segment .text
global factorial
factorial:
push rbp
mov rbp, rsp
sub rsp, 32
leave
ret
```
This code should be fairly straightforward. We define a label `factorial` that
will act as the anchor for where our function's text (i.e. instruction code)
actually lives in the DLL's memory. We also declare that `factorial` is to be
exported with the `global` directive.
After that, within `factorial` itself, we set up the stack for our function, and
reserve the basic amount of **shadow space** as required by the Microsoft x64
calling convention.
Finally, we call `leave` to undo the stack frame changes we've made (it copies
`rbp` into `rsp` and pops the saved base pointer, the opposite of what we did
when entering the function), and then `ret` to return to the site of the `call`
instruction (which would have placed the address to return to on the stack) and
exit the function.
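In other words, `leave` is just shorthand for tearing the frame down manually;
a sketch of its equivalent:

```
mov rsp, rbp        ; release everything allocated below the base pointer
pop rbp             ; restore the caller's saved base pointer
```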
Because our function takes a single argument, `n`, let's go ahead and implement
the first part of the logic, which returns `1` if `n` is equal to `0`. (In this
intermediate version, the non-zero case simply falls through into `.zero` as
well; we'll flesh the rest out in a moment.)
```
...
factorial:
push rbp
mov rbp, rsp
sub rsp, 32
test ecx, ecx
jz .zero
.zero:
mov eax, 1
leave
ret
```
The first argument to the function is of type `int`, which means that it will be
passed in `ecx` according to the Microsoft x64 calling convention. Since `int`
refers to a 32-bit signed integer on MSVC, we can just use `ecx` to address the
lower 32 bits of the register and save a tiny bit on performance (by not having
to address the entire register).
We then [`test`](https://www.felixcloutier.com/x86/test) the register with
itself to check if it's equal to `0`; if it were, it would set the zero flag in
the `EFLAGS` register, which is used by
the [`jz`](https://www.felixcloutier.com/x86/jcc) (jump if zero) jump
instruction immediately after it to perform the conditional branch. If `n` was
indeed zero, we jump to a new **label**, `.zero`, which holds a single `mov eax,
1` instruction. Recall that an `int` return value under the Microsoft x64
calling convention is returned in the `rax` register, and it becomes obvious why
this instruction is required: our function returns `1` when the input `n` is
zero.
Let's continue implementing the logic of the factorial function.
```
...
factorial:
push rbp
mov rbp, rsp
sub rsp, 32
test ecx, ecx ; n
jz .zero
mov ebx, 1 ; counter c
mov eax, 1 ; result
inc ecx
.for_loop:
cmp ebx, ecx
je .end_loop
    mul ebx               ; edx:eax = eax * ebx; the low 32 bits stay in eax
inc ebx ; ++c
jmp .for_loop
.zero:
mov eax, 1
.end_loop:
leave
ret
```
We add a new `.for_loop` label to indicate the start of the loop, and the
`.end_loop` label to indicate where execution should resume once the loop
condition has been met. Simple enough.
A few new instructions are also introduced here. We start with the [`cmp`](https://www.felixcloutier.com/x86/cmp)
instruction, which performs a comparison: depending on the result, it sets
various bit flags in the `EFLAGS` register.
We then use the [`je`](https://www.felixcloutier.com/x86/jcc)
instruction to check if the value in `ebx` was equal to `ecx + 1` (we added the
`1` using the [`inc`](https://www.felixcloutier.com/x86/inc) instruction, which
means "increment"). If the values are equal, we exit the loop by jumping to the
end of the function. If not, we use the [`mul`](https://www.felixcloutier.com/x86/mul)
instruction to perform the multiplication for the current iteration of the
loop, storing the result in `eax`, our return value.
One caveat worth noting: `ebx` is a _non-volatile_ register under the Microsoft
x64 calling convention, so a strictly conforming version of this function would
preserve it with a `push rbx`/`pop rbx` pair around its use; we only get away
without that here because our test harness never depends on `rbx` afterwards.
And that's it! Our factorial function is complete. But how do we test if it
works? Well, before wasting a lot of time exporting it into a DLL and calling it
from C/C++, let's just take an easier approach and implement a test right in
assembly first:
```
default rel
bits 64
segment .data
fmt db "factorial is: %d", 0xd, 0xa, 0
segment .text
global main
global factorial
extern _CRT_INIT
extern ExitProcess
extern printf
factorial:
...
main:
push rbp
mov rbp, rsp
sub rsp, 32
mov rcx, 5
call factorial
lea rcx, [fmt]
mov rdx, rax
call printf
xor rax, rax
call ExitProcess
```
We declare the `.data` segment of our assembly program, and initialize a string
in it that we can pass to `printf` for writing out the result of our factorial
function. We then make our `main` entry point, and basically do the equivalent
of:
```
char fmt[] = "factorial is: %d\n";
int result = factorial(5);
printf(fmt, result);
```
Now compile that using the build script as mentioned, or manually from the
command-line if you wish:
```
nasm -f win64 -o "factorial_test.obj" "factorial_test.asm"
link "factorial_test.obj" /subsystem:console /entry:main /defaultlib:ucrt.lib /defaultlib:msvcrt.lib /defaultlib:legacy_stdio_definitions.lib /defaultlib:Kernel32.lib /nologo /incremental:no /opt:ref /out:"factorial_test.exe"
```
If you run the resulting `factorial_test.exe`, you should get the following
output:
```
factorial is: 120
```
Which is indeed the factorial of `5`.
## Making a DLL from assembly ##
Now that we've verified that our factorial function seems to work, let's
undertake another exercise, and see how to call this function from a normal C
program. To do this, we'll generate a DLL that contains our factorial
function. Since it follows the Microsoft x64 calling convention, it will allow
us to link a normal C program with it and call functions from within it as you
would expect from a normal DLL.
If you've written the build script as described in the Writing a build script
section, you should be able to just build the DLL version by using the following
command:
```
build.bat release factorial_test dll
```
If you _haven't_ made the build script, you can try doing so on the command line
the exact same way you make DLLs when doing normal C/C++ programming by passing
the [`/dll`](https://docs.microsoft.com/en-us/cpp/build/reference/dll-build-a-dll?view=vs-2017)
flag to the linker:
```
nasm -f win64 -o "factorial_test.obj" "factorial_test.asm"
link "factorial_test.obj" /dll /defaultlib:ucrt.lib /defaultlib:msvcrt.lib
/defaultlib:legacy_stdio_definitions.lib /defaultlib:Kernel32.lib /nologo
/incremental:no /opt:ref /export:factorial /out:"factorial_test.dll"
```
You should get a `factorial_test.dll` file, along with its accompanying
`factorial_test.lib` and `factorial_test.exp` files generated by the linker. We
can then go ahead and make a simple C program to test this with:
```
#include <stdio.h>

extern int factorial(int n);

int main(int argc, char **argv)
{
    int result = factorial(5);
    printf("Factorial is: %d", result);
    return 0;
}
```
Nothing in there should be foreign to anyone who's worked with C/C++. We can
then go ahead and compile this program, linking against our `factorial_test.lib`
file with the following command:
```
cl factorial_test_main.c /Ox /link /subsystem:console /entry:main /incremental:no /machine:x64 /nologo /defaultlib:Kernel32.lib /defaultlib:User32.lib /defaultlib:ucrt.lib /defaultlib:msvcrt.lib /opt:ref factorial_test.lib
```
Finally, upon running the `factorial_test.exe` executable generated, you should get
the following output:
```
Factorial is: 120
```
As you can see, as long as you've followed the OS calling conventions, being
able to use assembly from your C programs is extraordinarily easy. In this way,
you can have high-performance versions of certain functions readily available to
be swapped in/out in your code as the situation dictates, while keeping the rest
of your codebase in C/C++ and continuing to reap the benefits of the optimizing
compilers.
## Comparing our work ##
So now that we know how to write some assembly, let's return to reality a bit:
while you're potentially a pretty smart person (unlike the author of this
tutorial), chances are you're not going to be functioning at 100% all the
time, whereas compilers like GCC, Clang and MSVC _are_. It's therefore
tremendously helpful to be able to compare the compiler's output against the
assembly you've written by hand. While tools
like [Godbolt](https://godbolt.org/) are amazing resources for looking at the
assembly of your functions, I'd like to also make sure that you're familiar with
how to do this using the compiler itself.
The
[`/FA`](https://docs.microsoft.com/en-us/cpp/build/reference/fa-fa-listing-file?view=vs-2017) flag
is used to generate a **listing file** by the MSVC compiler, which will list the
assembly for our C source files in MASM syntax. So let's go ahead and compare
our factorial function, with what MSVC thinks a good factorial function is.
We add an empty `main` function to our C program to allow for MSVC to compile
it, in `test.c`:
```
int factorial(int n)
{
if (n == 0) {
return 1;
}
int result = 1;
for (int c = 1; c <= n; ++c) {
result = result * c;
}
return result;
}
int main(int argc, char* argv[])
{
return 0;
}
```
and compile it using the following command:
```
cl test.c /FA /Ox
```
This generates a `test.asm` listing that contains our assembly:
```
; Listing generated by Microsoft (R) Optimizing Compiler Version 19.16.27025.1
include listing.inc
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC factorial
PUBLIC main
; Function compile flags: /Ogtpy
_TEXT SEGMENT
argc$ = 8
argv$ = 16
main PROC
; File c:\users\sonictk\tmp\test.c
; Line 17
xor eax, eax
; Line 18
ret 0
main ENDP
_TEXT ENDS
; Function compile flags: /Ogtpy
_TEXT SEGMENT
n$ = 8
factorial PROC
; File c:\users\sonictk\tmp\test.c
; Line 2
mov edx, ecx
; Line 3
mov eax, 1
test ecx, ecx
je SHORT $LN3@factorial
; Line 7
mov ecx, eax
; Line 8
cmp edx, eax
jl SHORT $LN3@factorial
$LL9@factorial:
; Line 9
imul eax, ecx
inc ecx
cmp ecx, edx
jle SHORT $LL9@factorial
$LN3@factorial:
; Line 13
ret 0
factorial ENDP
_TEXT ENDS
END
```
Reading MASM syntax might be a little confusing, but basically, the `PROC` and
`ENDP` directives indicate the start and the end of a _procedure_, or
function. Thus, the important part we're interested in is this, with other bits
and bobs removed:
```
factorial PROC
mov edx, ecx
mov eax, 1
test ecx, ecx
je SHORT $LN3@factorial
mov ecx, eax
cmp edx, eax
jl SHORT $LN3@factorial
$LL9@factorial:
imul eax, ecx
inc ecx
cmp ecx, edx
jle SHORT $LL9@factorial
$LN3@factorial:
ret 0
factorial ENDP
```
Comparing this to our assembly:
*********************************************************************
* factorial: factorial PROC *
* push rbp mov edx, ecx *
* mov rbp, rsp mov eax, 1 *
* sub rsp, 32 test ecx, ecx *
* je SHORT $LN3@factorial *
* test ecx, ecx mov ecx, eax *
* jz .zero cmp edx, eax *
* jl SHORT $LN3@factorial *
* mov ebx, 1 $LL9@factorial: *
* mov eax, 1 imul eax, ecx *
* inc ecx *
* inc ecx cmp ecx, edx *
* jle SHORT $LL9@factorial *
* .for_loop: $LN3@factorial: *
* cmp ebx, ecx ret 0 *
* je .end_loop factorial ENDP *
* *
* mul ebx *
* *
* inc ebx *
* jmp .for_loop *
* *
* .zero: *
* mov eax, 1 *
* *
* .end_loop: *
* leave *
* ret *
* *
*********************************************************************
[Figure [diagram]: Comparison of the hand-written assembly vs MSVC's output.]
So first of all, we can see that the compiler elected not to set up a stack
frame or shadow space at all (no `push rbp`/`sub rsp, 32` sequence). It could do
that because it determined that our function was a *leaf function*: it never
calls any other functions, and it makes no use of the stack in the first
place. (In fact, in that simple program, our function was never even called at
all!) You too can elect to skip this setup when you can make the same
assumptions about your own assembly functions.
Other than that, we see that the compiler also generated a more efficient
version of the loop that doesn't need my extra `inc` before it. (I originally
added it because the `je` instruction technically tests only the zero flag,
whereas `jge` also has to consider the sign and overflow flags.) Clearly, the
compiler felt differently, and in this case it's probably right: the separate
`inc` instruction eats up an entire additional **μop**, while the various `Jcc`
instructions all take essentially the same time anyway.
!!! tip What is a μop?
    A micro-operation, or **μop**, is one way of measuring how expensive an
    instruction is. μops are what the higher-level **macro-operations** (i.e.
    our `add`, `mul`, `cmp` instructions) are decoded into, and are what the
    CPU actually executes. Some macro-operations get broken down into more μops
    than others, and information on their timings is available thanks to
    [Agner Fog's fantastic work](https://www.agner.org/optimize/instruction_tables.pdf).
There's plenty of CPU implementation details that tie into this (For example,
the decoded micro-ops go into a **μop cache** that can be used to avoid
having to incur the cost of a decode operation for a macro-op if the cache
still has the decoded μop in it), but the most important thing to note here
is that _just because you have fewer instructions in your assembly, it
doesn't mean that you actually will have better performance_. It depends on a
number of factors, and micro-operation timings are just one of them.
We also see that the compiler elected to use the `imul` instruction instead of
the `mul` I used, which performs _signed_ multiplication. This is the correct
behaviour according to the C standard, since our function takes a _signed_
integer, even if the factorial of a negative integer doesn't make sense.
!!! note An interesting behaviour...
    For a kicker, try modifying the C function so that all inputs, outputs and
    operations are done unsigned, and re-generate the assembly from MSVC.
    You'll notice that the multiplication operation _still_ uses the `imul`
    instruction instead of `mul`. Why is this? When only the low 32 bits of the
    result are kept, signed and unsigned multiplication produce the same bits
    anyway, so the compiler is free to choose; `mul` is the more limited
    instruction (it always multiplies into `edx:eax`), whereas `imul` allows
    for 2, or even 3, operands to be used in one instruction. Both should take
    the same amount of CPU μops in terms of performance.
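As a reference for that last point, these are the forms `imul` can take
(standard x64, not specific to this example):

```
imul ebx                ; one-operand form, like mul: edx:eax = eax * ebx
imul eax, ecx           ; two-operand form: eax = eax * ecx (edx untouched)
imul eax, ecx, 10       ; three-operand form: eax = ecx * 10
```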
# Debugging assembly #
Assembly is verbose. There's no denying that. It's also incredibly dense and
easy to get lost in when trying to figure things out. So how do you navigate
around and find the memory you want when things go wrong?
As it turns out, it takes a combination of experience and tooling to help us in
this regard.
## The tools ##
On Windows, we're lucky enough to have (in my opinion) much better tooling for
working with assembly than is offered on Linux or OS X. We'll go over some of
the ones I personally use when I need to drop to assembly.
### AsmDude ###
[AsmDude](https://marketplace.visualstudio.com/items?itemName=Henk-JanLebbink.AsmDude) is
an extension for Visual Studio that provides some _significant_ quality of life
improvements when working with assembly in Visual Studio. Code completion,
syntax highlighting, nice little popups that remind you of what a mnemonic means
or does...honestly, I don't know why the features in this extension don't come
standard with Visual Studio.
Now if only it could provide a better registers window than the glorified edit
control that comes standard...
### WinDbg Preview ###
The venerable tool of choice for debugging crash dumps on
Windows,
[WinDbg](https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/debugger-download-tools)
now has an upgraded user interface. Unfortunately, for some infuriating reason,
Microsoft decided to limit its distribution to the Windows Store exclusively.
While you should (hopefully) be fine installing it from there, if you're running
into problems using it in corporate environments like I am, do what you can to
try and get it working. Trust me, the improvements made are worth the trouble
compared to using the old WinDbg.
It also has features that the Visual Studio debugger does not have, such as a
suite of useful
[commands](https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/commands)
for examining and disassembling memory, a proper disassembly window (that
doesn't require a 3rd-party extension to get syntax highlighting for!), a much
better registers window, among other things. You can take a look at how to use
WinDbg using the
[MSDN documentation](https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/standard-debugging-techniques),
which is fairly comprehensive.
### radare2/Cutter and IDA ###
For those of you who are more adventurous (or who work on Linux/OS X), there's
[radare2](https://rada.re/r/) and a front-end GUI for it
called [Cutter](https://github.com/radareorg/cutter). There's
also [IDA](https://www.hex-rays.com/products/ida/), which is favoured in the
Infosec community as a disassembler. I've dabbled with both, but I mostly use
WinDbg and Visual Studio on Windows, and radare2/gdb on Linux when I need to
drop to disassembly.
I've also heard good things about [Binary Ninja](https://binary.ninja/), but I
haven't had the chance to try it yet.
## Recognizing patterns ##
This is where the experience part comes in. By understanding both how assembly
code works, how the OS calling conventions work, and how the compilers and
optimizers work, we can make several guesses and inferences when looking at raw
assembly. This is not a comprehensive list, and you'll notice new cases over
time as you deal with more and more assembly code.
!!! note Halt, plagiarist! You violated the law!
Most of this information is available
from
[Jorge Rodriguez's video on the subject](https://www.youtube.com/watch?v=MUNRvqpske0),
but in the interests of summarizing the video, I'll be posting some key points
here, along with my own notes on the topic.
* If you're noticing a section such as:
```
push rbp
mov rbp, rsp
sub rsp, 32 ; or some multiple of 16
```
  You can be almost certain that it is the work of a function call creating a
  home space, which means you're looking at the start of a function boundary. If
  it also appears in an optimized build, that's a clue that this is most likely
  a non-leaf function: the compiler didn't omit the home space because the
  function may go on to call sub-routines that use it.
* Additionally, if you notice code that starts moving values out of `rcx`,
  `rdx`, `r8`, `r9`, or `xmm0` through `xmm3`, you can infer from the Microsoft
  x64 calling convention that you are most likely at the start of a function,
  since these are probably arguments that were passed in. Since passing the
  wrong arguments to a function is a common cause of crashes, inspecting these
  registers (and the memory at `rsp` as well, if there are additional parameters
  passed on the stack) can be crucial to gathering clues about what might have
  gone wrong.
* Even though compilers are fairly good at inlining, they might not be able to
  inline more complicated cases (e.g. recursive functions). This means that you
  might still see the `call` instructions for those cases when looking at the
  assembly, even if the first call was inlined, which you can use to identify
  where you are in the source code in terms of execution.
* In debug builds, switch statements will compile to **jump tables**; in optimized
builds, this doesn't always happen, so don't count on being able to identify
sections of your source code this way!
* In debug builds, values will in all likelihood be kept on the stack so that
the debugger is able to display the values to you interactively. In release
builds, values are more likely to be kept in the registers without copying
them to the stack since it's faster to perform operations on. One way to take
advantage of this fact is that if you're debugging assembly from a debug
build, you can check the memory around `rsp` to see if the memory there
provides any clues as to what might be happening in the current scope.
* For MSVC, uninitialized stack memory gets filled with `0xCC` bytes in debug builds. This is
not the case in release builds. So if you notice that the value in a register
or on the stack matches this pattern, and you're getting an odd crash thanks to
it, you know that your C/C++ code likely has a bug in it related to
uninitialized memory.
* If you're unfortunate enough to be debugging C++ code that was written in an
"object-oriented" manner, remember that code such as:
```
obj.method(param)
```
implicitly passes the `this` pointer to the `method` being executed. I.e. the
call above effectively becomes:
```
method(&obj, param)
```
This means that if you're looking at a function call where the `rcx` register
contains a pointer address that can be cast to an `obj *`, you know that the
function being called is one of the methods available on that object.
* Furthermore, if you see a `call` to a method like the following:
```
call qword ptr [rdi + 18h]
```
This is most likely a **virtual method call**, since the second operand is an
offset from a base address. To find out which exact method it is, we
can inspect the `__vfptr` (the **virtual function table
pointer**) and try to match the offset to the corresponding index in the
virtual function table. Since the virtual function table pointer exists at the
beginning of the `this` object, and since a pointer's size on the x64 ISA is
64 bits (or 8 bytes), for an offset of `18h` (or `24` in base 10), we know
that this call is for index 3 in the table, which you can then match up in a
debugger to find out which exact virtual method call the instruction is
referencing.
****************************************************************
* *
* OOP Object Virtual Function Table *
* +-------------------+ +-------------------+ *
* | Virtual Function |---. |8h (8 bytes) | *
* | Table Pointer | | +-------------------+ *
* +-------------------+ | |10h | *
* | | | +-------------------+ *
* | Class data | '------->|18h (obj->method) | *
* | | +-------------------+ *
* | | |20h | *
* | | +-------------------+ *
* +-------------------+ | ... | *
* | | +-------------------+ *
* | Subclass data | *
* | | *
* | | *
* | | *
* | | *
* +-------------------+ *
* *
****************************************************************
[Figure [diagram]: An illustration of the **Virtual Function Table** and how it works.]
This is an _incredibly_ useful tool to have in your arsenal when debugging
convoluted code, as "object-oriented" code is wont to be.
* When trying to find your code, don't always assume that inlining annotations
will behave the way you expect; while `__declspec(noinline)` is generally
honoured, hints like `inline` and even `__forceinline` leave the final
decision on whether to inline your function to the compiler. Additionally, other
optimizations such as
[tail-call optimization](https://medium.com/software-design/tail-call-optimization-in-c-829b4b257c9a)
can change how your function appears in the generated code as well.
* To add to that, remember that things
like [Return Value Optimization](https://en.wikipedia.org/wiki/Copy_elision),
where the compiler may elide the copy from a temporary object to the
returned value, _do_ affect how the resulting assembly is generated. Under
the Microsoft x64 convention, a return value that doesn't fit in `rax` is
written through a hidden pointer that the caller passes as if it were the
first argument, so what looks like a function parameter in a register may
actually be the return slot on the caller's stack frame (see the first
sketch after this list). Every compiler implements RVO and copy elision
differently, and there's no guarantee about their behaviour between versions
of the compiler, either. Common cases for RVO include things like `return
Object()`, where the temporary object created by calling `Object`'s
constructor gets elided.
* As we've noted before, the optimizer can re-order your code and change
the order of execution in order to improve performance. Don't assume
that your source code will map in the same execution order to the assembly
generated in release builds!
* If you have a hunch that a value in one of the GPRs looks like a pointer to an
object or structure, you can try casting it in the debugger (e.g. `(Object
*)rcx`) and seeing if the debugger displays the contents
correctly. You could also navigate to the memory directly and view the
contents to inspect the memory layout and see if it matches that of the
object/structure.
* If you see statements like:
```
mov ebx, dword ptr [rdx + rax + 3ch]
```
where one of the operands has a base register, an index register and an
immediate offset, what is most likely happening is that `rdx` contains
the base address of an array, `rax` is the offset into that
array to a specific object, and `3ch` is the offset of a member of that
object. Thus, you can do things like (see the second sketch after this list):
```
&((Object *)0)->member
```
to see what the offset is for that particular `member`. If it matches
`3ch`, you know that is the specific member that the instruction is
attempting to `mov` into `ebx`.
* If you see statements like:
```
mov rcx, qword ptr [rsp + 78h]
```
You know immediately that this is looking for a variable on the stack. Thus
you can check your source code and try to figure out what might have been
allocated on the stack, and keep comparing the memory at that address to what
you see in the source until the data layouts line up.
* If you suspect that you may have floating-point math errors (e.g. quiet
`NaN`s propagating through your calculations), you can try checking the `xmm`
registers for clues; perhaps the registers show `NaN` results in them. One
thing to try is that if you suspect that a value in an `xmm` register is a
`NaN`, you can try zeroing the portion of the register that contains the
`NaN` value and see if the exception still gets raised. If it doesn't, you've
found a clue as to what the problem might be.
* Usually, when you see a combination of `cmp` and `jcc` instructions where the
jump target sits _above_ the current instruction, you can
probably assume that you're in a loop. As we implemented a loop ourselves
earlier, you can use this to help determine where you are in
relation to the source code and determine the point of execution.
* On Windows, the
[stack size defaults to 1 MB](https://docs.microsoft.com/en-us/windows/desktop/procthread/thread-stack-size),
and is typically located below `0x400000`, so if you're looking at an address
region that `rsp` references beyond those limits, you can make a guess that
you might be about to hit a segmentation fault.
* Data/text segments for user processes are [limited to the first 2 GB of virtual
address space available](https://docs.microsoft.com/en-us/windows/desktop/memory/virtual-address-space)
if `/largeaddressaware` is set to `no`. Additionally, static data is only mapped
in when the process is loaded into memory.
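Two quick illustrations for the points above. First, the hidden-return-pointer
behaviour mentioned in the RVO note. This is a minimal C sketch (my own
example, not taken from any particular codebase) of a return type that's too
big to come back in `rax`:
```
typedef struct Big {
    double a, b, c, d;
} Big;

/* Under the Microsoft x64 convention, a 32-byte struct can't come back in
   rax; instead, the caller passes a hidden pointer to the return slot in
   rcx, and the "real" parameters shift over by one register (n arrives in
   edx here, not ecx). In the disassembly, that hidden pointer looks just
   like an ordinary first argument. */
Big make_big(int n)
{
    Big result = {(double)n, (double)n, (double)n, (double)n};
    return result; /* a prime candidate for RVO/copy elision */
}
```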
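Second, a runnable sketch of the member-offset trick. The `Object` layout here
is hypothetical, chosen purely so that the member lands at `3ch`; `offsetof`
is the standard, well-defined way of doing the same null-pointer-cast trick:
```
#include <stddef.h>
#include <stdio.h>

typedef struct Object {
    char otherData[60]; /* stand-in for whatever members precede ours */
    int member;         /* lands at offset 60, i.e. 3ch */
} Object;

int main(void)
{
    /* Both lines print 60 (3ch): the first uses the null-pointer cast from
       the notes above, the second uses the standard offsetof macro. */
    printf("%zu\n", (size_t)&((Object *)0)->member);
    printf("%zu\n", offsetof(Object, member));
    return 0;
}
```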
These are just some of the notes that I've picked up from various tutorials and
my own experiences. It is by no means an exhaustive list of techniques; the only
real thing to do here is to keep practising
and [reading](https://blogs.msdn.microsoft.com/oldnewthing/20041111-00/?p=37333)
other people's
[experiences](https://blogs.msdn.microsoft.com/oldnewthing/20180706-00/?p=99185)
on the subject.
## Debugging: A real life issue ##
Problems that occur only in release builds instead of debug builds are usually
tremendously annoying to solve, simply because it becomes much harder to just
attach a debugger and find what the problem is (especially if your release build
doesn't have debugging information or symbols!) However, sometimes even with the
power of a debugger, it's difficult to figure out the issue, since the
compiler/linker has done its round of optimizations and more than likely erased
whatever information you needed in order for the debugger to inspect the memory
you're looking for. Or, you might be dealing with interactive behaviour (i.e. a
bug that only happens when you move the mouse a certain speed while clicking
furiously with the left mouse button). In such situations, dropping to assembly
is one possible approach to triaging and solving such issues in a timely manner.
Now let's see a real-world example of using assembly knowledge to solve an
_actual_ problem that I encountered recently when working on an
application. I've chosen an example that, while relatively simple to diagnose
and fix, should illustrate the benefits of understanding the fundamentals when
trying to tackle larger, more obtuse problems in production.
### The symptoms ###
I was working on a 3D engine that was exhibiting an all-too-familiar sounding
problem: when the engine was built using debug compiler/linker settings (`/Od
/Zi /D_DEBUG /opt:noref /debug /pdb` for the exact flags), everything worked as
expected. Running the application would pop up the engine window instantly, I'd
see my 3D view, and there wouldn't be any problems.
However, when building in release mode (`/Ox /Zo /Zi /DNDEBUG /pdb /opt:ref`),
something odd happened: the engine window would start, but instead of displaying
my 3D view, the window would be blank, without the telltale pink default colour
that I had set my backbuffer render target clear colour to. I would hear audio,
indicating my audio engine had initialized and that I was in the main engine
loop, but whatever I did to try to gain focus on the window wouldn't work to
kickstart the draw of the 3D view _until_ I resized the window or moved it.
Because the code for handling [window messages](https://docs.microsoft.com/en-us/windows/desktop/learnwin32/window-messages)
in the engine was very straightforward (very similar to Raymond
Chen's [scratch program template](https://blogs.msdn.microsoft.com/oldnewthing/20030723-00/?p=43073)),
I was fairly confident the problem lay somewhere in the realm of how the
application was processing the messages that Windows was sending it. The problem
was, where was it? And why was it only happening in release, and not in debug builds?
### The suspect (code) ###
Here's a little challenge: I'm going to post the entirety of the window procedure
callback function, and you try to spot where the problem is. Take your best
guess, and read on to see if your instincts were right!
```
/**
* Callback for processing the messages that are sent to the window. This is executed
* each time the window receives a message from Windows.
*
* @param hwnd A handle to the window.
* @param uMsg The message.
* @param wParam Additional message information.
* @param lParam Additional message information.
*
* @return The result of the message processing.
*/
LRESULT CALLBACK WindowProc(HWND hwnd, UINT uMsg, WPARAM wParam, LPARAM lParam)
{
switch(uMsg) {
// NOTE: (sonictk) Sent whenever the window is activated/deactivated.
case WM_ACTIVATE:
{
WORD wParam = LOWORD(wParam);
switch (wParam) {
case WA_INACTIVE:
{
globalApplicationState = DX11_ENG_APPLICATION_PAUSED;
BOOL status = ClipCursor(NULL);
ShowCursor(TRUE);
if (status == 0) {
OutputDebugString("Unable to release the cursor!\n");
}
break;
}
case WA_CLICKACTIVE:
ShowCursor(FALSE);
case WA_ACTIVE:
globalApplicationState = DX11_ENG_APPLICATION_RUNNING;
break;
}
return 0;
}
case WM_SETCURSOR:
// NOTE: (sonictk) Checking against ``HTCLIENT`` ensures cursor is only
// applied when we are actually within the client area to avoid overriding built-in cursors.
if (LOWORD(lParam) == HTCLIENT) {
SetCursor(globalMouseCursorHandle);
return 0;
}
break;
// NOTE: (sonictk) Sent when the user is resizing the window.
case WM_SIZE:
// NOTE: (sonictk) Save the new client dimensions.
globalWindowWidth = (int)LOWORD(lParam);
globalWindowHeight = (int)HIWORD(lParam);
if (globalD3DDevice) {
switch (wParam) {
case SIZE_MINIMIZED:
globalApplicationState = DX11_ENG_APPLICATION_PAUSED;
globalWindowState |= DX11_ENG_WINDOW_STATE_MINIMIZED;
globalWindowState &= DX11_ENG_WINDOW_STATE_MAXIMIZED;
break;
case SIZE_MAXIMIZED:
globalApplicationState = DX11_ENG_APPLICATION_RUNNING;
globalWindowState |= DX11_ENG_WINDOW_STATE_MAXIMIZED;
globalWindowState &= DX11_ENG_WINDOW_STATE_MINIMIZED;
break;
case SIZE_RESTORED:
if (globalWindowState & DX11_ENG_WINDOW_STATE_MINIMIZED) {
globalApplicationState = DX11_ENG_APPLICATION_RUNNING;
globalWindowState |= DX11_ENG_WINDOW_STATE_MAXIMIZED;
globalWindowState &= DX11_ENG_WINDOW_STATE_MINIMIZED;
OnResize();
} else if (globalWindowState & DX11_ENG_WINDOW_STATE_MAXIMIZED) {
globalApplicationState = DX11_ENG_APPLICATION_PAUSED;
globalWindowState |= DX11_ENG_WINDOW_STATE_MINIMIZED;
globalWindowState &= DX11_ENG_WINDOW_STATE_MAXIMIZED;
OnResize();
} else if (globalWindowState & DX11_ENG_WINDOW_STATE_RESIZING) {
// NOTE: (sonictk) If a user is dragging the resize bars,
// do not resize the buffers since ``WM_SIZE`` messages are
// being sent continuously to the window. Instead, only resize
// the buffers once the user releases the bars, which sends a
// ``WM_EXITSIZEMOVE`` message.
break;
} else {
OnResize();
}
break;
default:
break;
}
}
return 0;
// NOTE: (sonictk) Sent when the user is grabbing the resize handles.
case WM_ENTERSIZEMOVE:
globalApplicationState = DX11_ENG_APPLICATION_PAUSED;
globalWindowState |= DX11_ENG_WINDOW_STATE_RESIZING;
return 0;
// NOTE: (sonictk) Sent when the user releases the resize handles.
case WM_EXITSIZEMOVE:
globalApplicationState = DX11_ENG_APPLICATION_RUNNING;
globalWindowState &= DX11_ENG_WINDOW_STATE_RESIZING;
return 0;
case WM_DESTROY:
PostQuitMessage(0);
return 0;
// NOTE: (sonictk) This is sent when a menu is active and the user presses a
// key that does not correspond to any mnemonic or accelerator.
case WM_MENUCHAR:
// NOTE: (sonictk) We do this to avoid the beep when hitting alt-enter.
return MAKELRESULT(0, MNC_CLOSE);
// NOTE: (sonictk) Gets sent when the size/pos of the window is about to change.
case WM_GETMINMAXINFO:
// NOTE: (sonictk) We intercept this to prevent the window from becoming too small.
((MINMAXINFO *)lParam)->ptMinTrackSize.x = globalMinimumWindowWidth;
((MINMAXINFO *)lParam)->ptMinTrackSize.y = globalMinimumWindowHeight;
return 0;
case WM_MOUSEMOVE:
OnMouseMove(wParam, GET_X_LPARAM(lParam), GET_Y_LPARAM(lParam));
return 0;
case WM_INPUT:
{
// NOTE: (sonictk) Only move the camera if the UI modifier key is not being
// held down actively.
if (globalModifiersDown & globalModifiersMask_Control) {
return 0;
}
UINT dwSize;
GetRawInputData((HRAWINPUT)lParam, RID_INPUT, NULL, &dwSize, sizeof(RAWINPUTHEADER));
globalVar LPBYTE lpb = (LPBYTE)HeapAlloc(globalProcessHeap, 0, (SIZE_T)dwSize);
if (lpb == NULL) {
return 0;
}
if (GetRawInputData((HRAWINPUT)lParam, RID_INPUT, lpb, &dwSize, sizeof(RAWINPUTHEADER)) != dwSize) {
OutputDebugString("GetRawInputData returns mismatched sizes!\n");
return 0;
};
RAWINPUT* raw = (RAWINPUT*)lpb;
if (raw->header.dwType == RIM_TYPEMOUSE) {
globalMouseDeltaX = raw->data.mouse.lLastX;
globalMouseDeltaY = raw->data.mouse.lLastY;
}
return 0;
}
case WM_LBUTTONDOWN:
case WM_MBUTTONDOWN:
case WM_RBUTTONDOWN:
OnMouseDown(wParam, GET_X_LPARAM(lParam), GET_Y_LPARAM(lParam));
return 0;
case WM_LBUTTONUP:
case WM_MBUTTONUP:
case WM_RBUTTONUP:
OnMouseUp(wParam, GET_X_LPARAM(lParam), GET_Y_LPARAM(lParam));
return 0;
case WM_MOUSEWHEEL:
{
short wheelRotation = GET_WHEEL_DELTA_WPARAM(wParam);
OnMouseWheelScroll(wheelRotation);
return 0;
}
case WM_KEYDOWN:
OnKeyPress(wParam);
return 0;
case WM_KEYUP:
OnKeyRelease(wParam);
return 0;
default:
break;
}
// NOTE: (sonictk) As per MSDN guidance, forward all other unhandled messages
// to the default window procedure to let Windows take care of it.
return DefWindowProc(hwnd, uMsg, wParam, lParam);
}
```
In case you're not familiar with Win32 programming, this is basically a callback
function that Windows will execute for you whenever it posts a message to your
window. This message can be posted on various types of events, input
(mouse/keyboard), window movement/resize/minimize/activation/etc... you get the
idea. Skimming the code should make it clear that there are a couple of global
variables that represent the application's state. The important one we're going
to focus on here is `globalApplicationState`, which is set to either the paused
or running enums depending on the message processing happening. If the
application is in the paused state, the engine stops rendering, as this snippet
from the main loop shows:
```
/**
* The main application loop.
*
* @return Additional information about the message that triggered the termination of the loop.
*/
#define globalVar static
/// Global variable that indicates the state(s) of the application.
globalVar DX11ApplicationState globalApplicationState = DX11_ENG_APPLICATION_RUNNING;
int Run()
{
bool bGotMsg;
MSG msg = {0};
msg.message = WM_NULL;
BOOL status = QueryPerformanceCounter(&globalHighResolutionTimer);
if (status == 0) {
DWORD winError = GetLastError();
MessageBox(0, "Failed to get high-resolution timing from the CPU! This hardware is not supported.", 0, 0);
return (int)HRESULT_FROM_WIN32(winError);
}
globalVar float globalMaxTimeStep = 1.0f / globalTargetFPS;
while (msg.message != WM_QUIT) {
// NOTE: (sonictk) Process window events.
// We use PeekMessage() so that we can use idle time to render the scene.
// If we used GetMessage() instead, it would put the thread to sleep while
// waiting for a message; we do not want this behavior. If there are no
// Windows messages to be processed, we want to run our own rendering code instead.
bGotMsg = (PeekMessage(&msg, NULL, 0U, 0U, PM_REMOVE) != 0);
if (bGotMsg) {
TranslateMessage(&msg); // NOTE: (sonictk) Translates virtual key messages into character messages.
DispatchMessage(&msg); // NOTE: (sonictk) Dispatch the translated message.
continue;
}
// NOTE: (sonictk) No Windows messages remaining to be processed in the queue;
// time to run our own rendering game code now!
switch (globalApplicationState) {
case DX11_ENG_APPLICATION_PAUSED:
Sleep(100);
break;
case DX11_ENG_APPLICATION_RUNNING:
default:
CalculateApplicationFrameStats(globalCurrentApplicationFPS);
// NOTE: (sonictk) If there are no messages to be processed, run render code.
LARGE_INTEGER currentTime;
QueryPerformanceCounter(&currentTime);
float deltaTime = (float)(currentTime.QuadPart - globalHighResolutionTimer.QuadPart) / (float)globalCPUTimerFreq.QuadPart;
globalHighResolutionTimer = currentTime;
// NOTE: (sonictk) Cap delta time to the max time step
deltaTime = deltaTime < globalMaxTimeStep ? deltaTime : globalMaxTimeStep;
D3DUpdate(deltaTime);
X3DAudioUpdate(deltaTime);
D3DDraw();
D3DDrawGUI();
D3DPresent(globalVSyncEnabled);
break;
}
}
ShowCursor(TRUE);
return (int)msg.wParam; // NOTE: (sonictk) As guidance from MSDN dictates.
}
```
### Conventional debugging ###
There were some things that could be inferred, based on the symptoms presented:
* This is probably an issue of either unexpected compiler optimization, or
unexpected initialization of memory since it presents itself as a problem in
release builds, but not in debug ones.
* It's not certain, but likely that it is related to window messages because the
symptoms in release mode look exactly like what would happen if the
application never processed any window messages (i.e. the screen remains
blank). The fact that the 3D view gets drawn as expected once the window is
resized further lends credence to this theory, since a resize would post a
`WM_SIZE` event to the window, thus indicating that specific message had some
machinery within it that affects this issue.
The first thing, obviously, after verifying that the problem only presented
itself in release mode, was to build the executable with debug information and
try to step through the code in the debugger. Unfortunately, as those of you
who've worked with gameplay/graphics code might already guess, this proved to be
an impossible task; since the problem would only present itself when the window
was in focus, setting a breakpoint and trying to go back to the application
again doesn't work, since you're going to trigger the wrong `WM_ACTIVATE`
message in the process, thus causing the application to go into the paused state
and stop rendering the backbuffer anyway (refer to the `Run()` engine loop
above).
The next thing to try was using [Spy++](https://docs.microsoft.com/en-us/visualstudio/debugger/introducing-spy-increment?view=vs-2017)
to try and take a look at the window messages that were being sent to the
application. Maybe by some freak accident, my application was being sent
messages with different data in debug versus release builds?
Using Spy++ revealed that the window was indeed being sent the correct messages
with the correct data at the correct times with the same given inputs. So while
that wasn't it, it did trigger a suspicion that perhaps the window message
handling code itself wasn't being resolved to the same instructions in both
builds. The easiest way to verify that, obviously, would be to place a bunch of
`OutputDebugString` calls to try and figure out which branches were being
executed in the giant `switch` statement, and then try to narrow down the scope
of things that had gone wrong.
A bunch of printing later, I had narrowed things down to this particular block
of code:
```
...
switch(uMsg) {
// NOTE: (sonictk) Sent whenever the window is activated/deactivated.
case WM_ACTIVATE:
{
WORD wParam = LOWORD(wParam);
switch (wParam) {
case WA_INACTIVE:
{
globalApplicationState = DX11_ENG_APPLICATION_PAUSED;
BOOL status = ClipCursor(NULL);
ShowCursor(TRUE);
if (status == 0) {
OutputDebugString("Unable to release the cursor!\n");
}
break;
}
case WA_CLICKACTIVE:
ShowCursor(FALSE);
case WA_ACTIVE:
globalApplicationState = DX11_ENG_APPLICATION_RUNNING;
break;
}
return 0;
}
...
```
For some reason, in the release build, the `WA_INACTIVE` branch would _always_
be chosen, no matter what events Spy++ said the window was getting. It made no
sense. There was nothing else that could affect the `wParam` parameter, so how
could it _not_ be getting set to the `WA_ACTIVE` flag, since Spy++ said it was?
Surely this meant that both the debug and release builds somehow used that
`wParam` parameter differently. Thus, I decided it was time to drop to the
assembly for both debug/release versions of the code and see what clues I could
gather.
### Dropping to assembly ###
In Visual Studio, entering the following function signature to the **Address window**:
```
WindowProc(HWND__ *, unsigned int, unsigned __int64, __int64)
```
got me the following disassembly output:
```
push rbx
sub rsp,50h
mov rax,qword ptr [__security_cookie (07FF764559A38h)]
xor rax,rsp
mov qword ptr [rsp+48h],rax
mov rbx,r9
mov r10d,edx
mov r11,rcx
cmp edx,200h
ja WindowProc+734h (07FF7644F8574h)
je WindowProc+67Ah (07FF7644F84BAh)
cmp edx,24h
ja WindowProc+265h (07FF7644F80A5h)
je WindowProc+240h (07FF7644F8080h)
mov eax,edx
sub eax,2
je WindowProc+223h (07FF7644F8063h)
sub eax,3
je WindowProc+114h (07FF7644F7F54h)
sub eax,1
je WindowProc+8Dh (07FF7644F7ECDh)
cmp eax,1Ah
jne $LN41+74h (07FF7644F865Eh)
cmp bx,1
jne $LN41+74h (07FF7644F865Eh)
mov rcx,qword ptr [globalMouseCursorHandle (07FF76455B198h)]
call qword ptr [__imp_SetCursor (07FF7645283A0h)]
xor eax,eax
mov rcx,qword ptr [rsp+48h]
xor rcx,rsp
call __security_check_cookie (07FF764504240h)
add rsp,50h
pop rbx
ret
movzx ecx,word ptr [rsp+30h]
test ecx,ecx
je WindowProc+0CBh (07FF7644F7F0Bh)
sub ecx,1
je WindowProc+0ACh (07FF7644F7EECh)
cmp ecx,1
jne $LN41+5Fh (07FF7644F8649h)
xor ecx,ecx
...
```
This was a bit much (as is expected, since `WindowProc` is a pretty big
function.) One nice thing about Visual Studio is that you can right-click on a
source file's line, and navigate directly to the disassembly for it. So I
right-clicked on the switch statement for `wParam`, and it pointed to the
following line:
```
push rbx
sub rsp,50h
mov rax,qword ptr [__security_cookie (07FF764559A38h)]
xor rax,rsp
mov qword ptr [rsp+48h],rax
mov rbx,r9
mov r10d,edx
mov r11,rcx
cmp edx,200h
ja WindowProc+734h (07FF7644F8574h)
je WindowProc+67Ah (07FF7644F84BAh)
cmp edx,24h
ja WindowProc+265h (07FF7644F80A5h)
je WindowProc+240h (07FF7644F8080h)
mov eax,edx
sub eax,2
je WindowProc+223h (07FF7644F8063h)
sub eax,3
je WindowProc+114h (07FF7644F7F54h)
sub eax,1
je WindowProc+8Dh (07FF7644F7ECDh)
cmp eax,1Ah
jne $LN41+74h (07FF7644F865Eh)
cmp bx,1
jne $LN41+74h (07FF7644F865Eh)
mov rcx,qword ptr [globalMouseCursorHandle (07FF76455B198h)]
call qword ptr [__imp_SetCursor (07FF7645283A0h)]
xor eax,eax
mov rcx,qword ptr [rsp+48h]
xor rcx,rsp
call __security_check_cookie (07FF764504240h)
add rsp,50h
pop rbx
ret
---> movzx ecx,word ptr [rsp+30h]
test ecx,ecx
je WindowProc+0CBh (07FF7644F7F0Bh)
sub ecx,1
je WindowProc+0ACh (07FF7644F7EECh)
cmp ecx,1
jne $LN41+5Fh (07FF7644F8649h)
xor ecx,ecx
...
```
!!! note Order, execution out of, the code is.
It may seem odd that the statement seems so far down in the assembly when it's
the first case in the switch statement, but remember, the optimizer has the
right to re-order the code in terms of what it thinks will give better
performance, not necessarily how the code is actually written!
Immediately, questions were raised. For one, why was there a `test ecx, ecx`
operation for a switch statement? The only possible reason is that it tests if
`ecx` is `0`, which could be possible (since I don't know what the enums for the
various window messages resolve to off the top of my head). The second question,
and more curious one, was: why is it doing that test against something that was
moved from `rsp + 30h` (`wParam`)? Since `wParam` is the 3rd parameter passed
into this function, shouldn't that have been available in the `r8` register? In
fact, looking back all the way up to the start of the function, what _did_
happen to the 3rd argument, anyway? It never got pushed onto the stack, never
got saved anywhere...it's almost as if the compiler thought that the 3rd
argument to this function wasn't needed for some reason.
Which would be _bad_. Since that _is_ `wParam`, which is what we're relying on.
Compiling again in debug mode and comparing the assembly revealed the following:
```
mov qword ptr [rsp+20h],r9
mov qword ptr [rsp+18h],r8
mov dword ptr [rsp+10h],edx
mov qword ptr [rsp+8],rcx
sub rsp,68h
mov eax,dword ptr [uMsg]
mov dword ptr [rsp+38h],eax
cmp dword ptr [rsp+38h],200h
ja WindowProc+0BBh (07FF7E090DECBh)
cmp dword ptr [rsp+38h],200h
je WindowProc+378h (07FF7E090E188h)
cmp dword ptr [rsp+38h],24h
ja WindowProc+7Eh (07FF7E090DE8Eh)
cmp dword ptr [rsp+38h],24h
je WindowProc+353h (07FF7E090E163h)
cmp dword ptr [rsp+38h],2
je WindowProc+33Ah (07FF7E090E14Ah)
cmp dword ptr [rsp+38h],5
je WindowProc+1A1h (07FF7E090DFB1h)
cmp dword ptr [rsp+38h],6
je WindowProc+0F2h (07FF7E090DF02h)
cmp dword ptr [rsp+38h],20h
je WindowProc+172h (07FF7E090DF82h)
jmp $LN41+47h (07FF7E090E3B2h)
cmp dword ptr [rsp+38h],0FFh
je WindowProc+3B6h (07FF7E090E1C6h)
cmp dword ptr [rsp+38h],100h
je $LN41+25h (07FF7E090E390h)
cmp dword ptr [rsp+38h],101h
je $LN41+36h (07FF7E090E3A1h)
cmp dword ptr [rsp+38h],120h
je WindowProc+349h (07FF7E090E159h)
jmp $LN41+47h (07FF7E090E3B2h)
mov eax,dword ptr [rsp+38h]
sub eax,201h
mov dword ptr [rsp+38h],eax
cmp dword ptr [rsp+38h],31h
ja $LN41+47h (07FF7E090E3B2h)
mov eax,dword ptr [rsp+38h]
lea rcx,[__ImageBase (07FF7E0870000h)]
movzx eax,byte ptr [rcx+rax+9E3F0h]
mov eax,dword ptr [rcx+rax*4+9E3D8h]
add rax,rcx
jmp rax
---> movzx eax,word ptr [rsp+30h]
and rax,0FFFFh
mov word ptr [rsp+30h],ax
movzx eax,word ptr [rsp+30h]
mov dword ptr [rsp+3Ch],eax
cmp dword ptr [rsp+3Ch],0
je WindowProc+122h (07FF7E090DF32h)
cmp dword ptr [rsp+3Ch],1
...
```
This started to make less sense. While the debug version now _did_ save the
contents of `r8` onto the stack (at `rsp + 18h`), it never got used anywhere in
the scope of what I was looking at. In fact, when inspecting the memory at
`rsp + 30h`, I got the following 64-bit value:
```
0000506575fd83c0
```
That _definitely_ didn't look like an enum for a window message...in which case
the code dropped out of all the possible cases and returned `0`. Which meant
that since the global variable was initialized to be in a running state, the
engine loop would continue to render!
Ok, so the debug version of the build was running correctly as a result of
luck. But why were both versions not making use of the 3rd parameter, `wParam`,
then? Something must obviously be wrong with the code that makes use of
`wParam`.
This allowed me to narrow in on the final few lines that contained the issue, since
`wParam` is referenced in only two places within the earlier scope:
```
LRESULT CALLBACK WindowProc(HWND hwnd, UINT uMsg, WPARAM wParam, LPARAM lParam)
{
switch(uMsg) {
// NOTE: (sonictk) Sent whenever the window is activated/deactivated.
case WM_ACTIVATE:
{
WORD wParam = LOWORD(wParam);
switch (wParam) {
...
```
Do you see it now?
```
WORD wParam = LOWORD(wParam);
```
Bingo! The local declaration shadows the function argument, so the `wParam`
inside `LOWORD()` refers to the freshly-declared (and still uninitialized)
local variable, not the parameter; the _actual_ function argument is never read
at all, which is why the optimizer decided it no longer needed to keep it
around! Simple fix: change `wParam` here to a different symbol name so that it
doesn't shadow the parameter, and the problem is solved!
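Here's a minimal sketch of the fix (the replacement name is just my own choice):
```
case WM_ACTIVATE:
{
    // NOTE: Use a name that does not shadow the function's wParam
    // parameter, so that LOWORD() reads the actual argument.
    WORD activationState = LOWORD(wParam);
    switch (activationState) {
    ...
```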
And sure enough, that was the issue. For completeness, this is the new assembly generated:
```
push rbx
sub rsp,50h
mov rax,qword ptr [__security_cookie (07FF6F7639A38h)]
xor rax,rsp
mov qword ptr [rsp+48h],rax
mov rbx,r9
mov r10d,edx
mov r11,rcx
cmp edx,200h
ja WindowProc+733h (07FF6F75D8573h)
je WindowProc+679h (07FF6F75D84B9h)
cmp edx,24h
ja WindowProc+264h (07FF6F75D80A4h)
je WindowProc+23Fh (07FF6F75D807Fh)
mov eax,edx
sub eax,2
je WindowProc+222h (07FF6F75D8062h)
sub eax,3
je WindowProc+113h (07FF6F75D7F53h)
sub eax,1
je WindowProc+8Dh (07FF6F75D7ECDh)
cmp eax,1Ah
jne $LN41+74h (07FF6F75D865Dh)
cmp bx,1
jne $LN41+74h (07FF6F75D865Dh)
mov rcx,qword ptr [globalMouseCursorHandle (07FF6F763B198h)]
call qword ptr [__imp_SetCursor (07FF6F76083A0h)]
xor eax,eax
mov rcx,qword ptr [rsp+48h]
xor rcx,rsp
call __security_check_cookie (07FF6F75E4240h)
add rsp,50h
pop rbx
ret
---> movzx ecx,r8w
test ecx,ecx
je WindowProc+0CAh (07FF6F75D7F0Ah)
sub ecx,1
```
We can see that the lower word of `r8` is moved to the `rcx` register to be
tested, which is exactly what we expect. And as it turns out, `WA_INACTIVE` is
defined as `0`, which is why the compiler generated the `test ecx, ecx`
instruction; it's testing if `r8w` (i.e. `wParam`) is `0` (`WA_INACTIVE`), which
matches the switch statement exactly. Mystery solved!
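For reference, these are the values from `winuser.h` that the generated
`test`/`sub`/`cmp` sequence is checking against:
```
/* From winuser.h */
#define WA_INACTIVE    0
#define WA_ACTIVE      1
#define WA_CLICKACTIVE 2
```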
So that's an example of a real-world scenario in which knowledge of assembly can
be helpful; this time, it was thankfully a smaller issue that didn't require too
much digging into. Threading issues, `NULL` parameters that shouldn't have been,
"object-oriented" constructors causing segmentation faults, third-party API
function calls doing things they shouldn't have been, hardware issues...these
are all examples of problems that I've triaged and identified by dropping to the
assembly in certain scenarios to figure out what exactly the generated code was
doing. It's not all just about theory here!
# Beating the compiler #
Let's take a break from debugging and take another look at an aspect of what
assembly can be used for: writing high-performance code. For this section, I
will be stealing Ray Seyfarth's [correlation example](http://rayseyfarth.com/asm/index.html),
and going a little more in-depth into explaining the assembly itself.
## A correlation function ##
Before we begin, for those of us who aren't statisticians (like me), let's recap
what
[correlation](https://www.countbayesie.com/blog/2015/2/21/variance-co-variance-and-correlation)
is. Since the article at that link does a great job of explaining the concepts
far better than any attempt I'm going to make, I'm just going to summarize it
here: basically, given two sets of values (e.g. you could plot x against y in a
graph with a single line and have two sets of values there for a given number of
samples), their **correlation** measures how strongly the two sets of values
vary together. Variance is mathematically defined as _the square of the
standard deviation_, which itself is defined as:
\begin{equation}
s = \sqrt{\frac{\sum_{i=1}^{n} (X_{i} - \bar{X})^2}{n - 1}}
\end{equation}
Where `s` is the standard deviation, `n` is the number of values in the set `X`,
and $$ \bar{X} $$ is the **mean** of the set of values in `X`.
!!! note The official definition
The **standard deviation** is defined as **the average distance from the mean
of a data set to a point**.
In plain English, then, this would read "compute the squares of the distance
from each data point to the mean of the set's values, add them all up, divide by
`n - 1`, and then take the square root of that" to get your standard deviation.
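For example, for the tiny data set $$ X = \{1, 3, 5\} $$, the mean is $$ \bar{X} = 3 $$, and:
\begin{equation}
s = \sqrt{\frac{(1-3)^2 + (3-3)^2 + (5-3)^2}{3 - 1}} = \sqrt{\frac{8}{2}} = 2
\end{equation}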
The **variance**, then, is just the standard deviation squared, while the
**correlation** is the normalized covariance. As such, we can arrive at the
following formula for correlation:
\begin{equation}
Correlation(X, Y) = \frac{Covariance(X, Y)}{\sqrt{Variance(X) \times Variance(Y)}}
\end{equation}
!!! tip Yes, I would like to geek out more over math.
If you're interested, since covariance and correlation are very closely
related to the concept of **principal components analysis**, there is a [great
paper written by Lindsay I. Smith](http://www.iro.umontreal.ca/~pift6080/H09/documents/papers/pca_tutorial.pdf)
on the topic that I would recommend reading. Again, this tutorial is not one
for statistics, so I am only summarizing the formulas here without going into
too in-depth of an explanation.
Written in C, this comes out to:
```
#include <math.h> // NOTE: required for sqrt()

double correlation_ref(const double x[], const double y[], int n)
{
double sumX = 0.0;
double sumY = 0.0;
double sumXX = 0.0;
double sumYY = 0.0;
double sumXY = 0.0;
for (int i=0; i < n; ++i) {
sumX += x[i];
sumY += y[i];
sumXX += x[i] * x[i];
sumYY += y[i] * y[i];
sumXY += x[i] * y[i];
}
double covXY = (n * sumXY) - (sumX * sumY);
double varX = (n * sumXX) - (sumX * sumX);
double varY = (n * sumYY) - (sumY * sumY);
return covXY / sqrt(varX * varY);
}
```
This doesn't seem too bad, does it?
Well...this is the assembly version that I'm going to be explaining next, in its
full entirety:
```
%include "macros.inc"
bits 64
default rel
segment .text
global correlation_avx
correlation_avx:
; Store the non-volatile registers to restore them later
multipush_ymm ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15
; rcx: x array
; rdx: y array
; r8: num of samples
; r9 : sample counter
; r10: loop counter
; ymm0: 4 parts of sumX (work on 4 values at a time)
; ymm1: 4 parts of sumY
; ymm2: 4 parts of sumXX
; ymm3: 4 parts of sumYY
; ymm4: 4 parts of sumXY
; ymm5: 4 x values - later squared
; ymm6: 4 y values - later squared
; ymm7: 4 xy values
xor r9d, r9d
mov r10, r8 ; r10 = num of samples (copy)
vzeroall ; zeros the contents of all the ymm registers.
.loop:
vmovupd ymm5, [rcx + r9] ; ymm5 = 4 doubles from x
vmovupd ymm6, [rdx + r9] ; ymm6 = 4 doubles from y
; AVX instructions ymm7 = ymm5 * ymm6: (can name same register twice)
; Having 3 operands reduces the register pressure and allows using 2 registers
; as sources in an instruction while preserving their values.
vmulpd ymm7, ymm5, ymm6 ; ymm7 = x * y (calc. this first so we can just act on ymm5/6 in-place later)
vaddpd ymm0, ymm0, ymm5 ; ymm0 = sumX (SIMD add the 4 doubles ymm0 + ymm5 and store in ymm0)
vaddpd ymm1, ymm1, ymm6 ; ymm1 = sumY
vmulpd ymm5, ymm5, ymm5 ; ymm5 = x * x
vmulpd ymm6, ymm6, ymm6 ; ymm6 = y * y
vaddpd ymm2, ymm2, ymm5 ; ymm2 = sumXX
vaddpd ymm3, ymm3, ymm6 ; ymm3 = sumYY
vaddpd ymm4, ymm4, ymm7 ; ymm4 = sumXY
; Load the next 256 bits into the registers and apply the same math
vmovupd ymm13, [rcx + r9 + 32] ; ymm13 = next 256 bits after current x
vmovupd ymm14, [rdx + r9 + 32] ; ymm14 = next 256 bits after current y
vmulpd ymm15, ymm13, ymm14 ; ymm15 = x * y
vaddpd ymm8, ymm8, ymm13 ; ymm8 = sumX
vaddpd ymm9, ymm9, ymm14 ; ymm9 = sumY
vmulpd ymm13, ymm13, ymm13 ; ymm13 = x * x
vmulpd ymm14, ymm14, ymm14 ; ymm14 = y * y
vaddpd ymm10, ymm10, ymm13 ; ymm10 = sumXX
vaddpd ymm11, ymm11, ymm14 ; ymm11 = sumYY
vaddpd ymm12, ymm12, ymm15 ; ymm12 = sumXY
add r9, 64 ; We're doing 32 + 32 bytes per loop iteration (i.e. 8 doubles)
sub r10, 8 ; Decrement the sample counter by the number of doubles processed
jnz .loop ; Continue looping until all samples have been processed
vaddpd ymm0, ymm0, ymm8 ; SIMD add up both sumX totals
vaddpd ymm1, ymm1, ymm9 ; Same for sumY
vaddpd ymm2, ymm2, ymm10 ; SIMD add up both sumXX totals
vaddpd ymm3, ymm3, ymm11 ; Same for sumYY
vaddpd ymm4, ymm4, ymm12 ; Same for sumXY
; vhaddpd differs from haddpd a little; this is the operation (4 x 64bit doubles)
;
; input x3 x2 x1 x0
; input2 y3 y2 y1 y0
; result y2 + y3 x2 + x3 y0 + y1 x0 + x1
vhaddpd ymm0, ymm0, ymm0 ; ymm0 = sumX (sum up the doubles)
vhaddpd ymm1, ymm1, ymm1 ; ymm1 = sumY
vhaddpd ymm2, ymm2, ymm2 ; ymm2 = sumXX
vhaddpd ymm3, ymm3, ymm3 ; ymm3 = sumYY
vhaddpd ymm4, ymm4, ymm4 ; ymm4 = sumXY
; vextractf128: Extracts 128 bits of packed floats from second operand and stores results in dest.
; xmm5 = upper 2 doubles of sumX
vextractf128 xmm5, ymm0, 1 ; 3rd operand selects which 128-bit lane to extract (1 = upper half)
; add scalar double
vaddsd xmm0, xmm0, xmm5 ; xmm0 = sumX (low half + high half)
vextractf128 xmm6, ymm1, 1 ; Do same thing for sumY
vaddsd xmm1, xmm1, xmm6
vmulsd xmm6, xmm0, xmm0 ; xmm6 = sumX * sumX
vmulsd xmm7, xmm1, xmm1 ; xmm7 = sumY * sumY
vextractf128 xmm8, ymm2, 1 ; xmm8 = upper 2 doubles of sumXX
vaddsd xmm2, xmm2, xmm8 ; xmm2 = sumXX (total)
vextractf128 xmm9, ymm3, 1 ; xmm9 = upper 2 doubles of sumYY
vaddsd xmm3, xmm3, xmm9 ; xmm3 = sumYY (total)
cvtsi2sd xmm8, r8 ; convert n from int to double and store in xmm8
vmulsd xmm2, xmm2, xmm8 ; xmm2 = n * sumXX
vmulsd xmm3, xmm3, xmm8 ; xmm3 = n * sumYY
vsubsd xmm2, xmm2, xmm6 ; xmm2 = (n * sumXX) - (sumX * sumX)
vsubsd xmm3, xmm3, xmm7 ; xmm3 = (n * sumYY) - (sumY * sumY)
vmulsd xmm2, xmm2, xmm3 ; xmm2 = varX * varY
vsqrtsd xmm2, xmm2, xmm2 ; xmm2 = sqrt(varX * varY)
vextractf128 xmm6, ymm4, 1 ; xmm6 = upper 2 doubles of sumXY
vaddsd xmm4, xmm4, xmm6 ; xmm4 = sumXY (low half + high half)
vmulsd xmm4, xmm4, xmm8 ; xmm4 = n * sumXY
vmulsd xmm0, xmm0, xmm1 ; xmm0 = sumX * sumY
vsubsd xmm4, xmm4, xmm0 ; xmm4 = (n * sumXY) - (sumX * sumY)
vdivsd xmm0, xmm4, xmm2 ; xmm0 = covXY / sqrt(varX * varY)
; Restore the original volatile registers
multipop_ymm ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15
ret
```
If you just scrolled past immediately without taking a second glance to get
here, I don't blame you. But don't worry, I'm going to walk through this
implementation bit-by-bit so that everything's clear, even with the added
comments. And I'm going to do it _right now_.
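Before that, though, here's a minimal sketch of how you might drive the routine
from C for testing (my own harness, not part of Seyfarth's example). Note that
the loop consumes 8 `double`s per iteration, so this version assumes `n` is a
positive multiple of 8:
```
#include <stdio.h>

/* Implemented in the assembly listing above; same signature as the C
   reference version. */
extern double correlation_avx(const double x[], const double y[], int n);

int main(void)
{
    double x[8] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
    double y[8] = {2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0}; // y = 2x
    // Perfectly linearly-related inputs should print 1.000000.
    printf("%f\n", correlation_avx(x, y, 8));
    return 0;
}
```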
### NASM macros ###
Let's start with the first line of the `correlation_avx` function:
```
correlation_avx:
; Store the non-volatile registers to restore them later
multipush_ymm ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15
```
Interesting. There doesn't seem to be a mnemonic called `multipush_ymm` in any
reference. What is this exactly? Well, it's one of my own personal macros that I
defined for the purpose of being able to push/pop multiple registers at once. If
you look at the top of the assembly, you'll see the line:
```
%include "macros.inc"
```
Just like in C/C++, NASM has support for
[includes](https://www.nasm.us/xdoc/2.14.02/html/nasmdoc4.html#section-4.6.1)
and
[macros](https://www.nasm.us/xdoc/2.14.02/html/nasmdoc4.html#section-4.3). While
the `macros.inc` file is included with the sample code available in the Code
Repository section, the relevant macro is reproduced below:
```
%macro multipush_ymm 1-*
%rep %0
sub rsp, 32
vmovdqu yword [rsp], %1
%rotate 1
%endrep
%endmacro
```
Some of this should be self-explanatory. For example, the `%macro` and
`%endmacro` directives indicate the start/end of a macro. `multipush_ymm`
acts as the identifier for the macro itself, and the `1-*` argument after that
indicates that this is a _variadic macro_ (i.e. a macro that can take any number
of arguments).
Moving on, the `%rep` and `%endrep` directives tell NASM to repeat the text
in between them a specified number of times; in this case, passing `%0` as the
argument means that the text will be repeated as many times as there are
arguments passed to the macro.
Once we've gone past those NASM-specific directives, the first real instruction
appears: `sub`, to expand the stack by 32 bytes (256 bits, which
is the size of a `ymm` register, if you recall from the Floating point registers
section). This is so that we can push the contents of the register onto the
stack, which is what we do next with the
[`vmovdqu`](https://www.felixcloutier.com/x86/movdqu:vmovdqu8:vmovdqu16:vmovdqu32:vmovdqu64)
instruction (which stands for VEX 256 move double quadword unaligned). The `%1`
token expands to whichever macro argument is currently in the 1st position,
which is what gets passed to the instruction.
After that, we see the `%rotate 1` directive, which basically tells NASM to
shift the arguments passed to the macro to the left by 1 place, and then the
macro body repeats. This essentially allows us to rotate through all the arguments
that were passed into the macro so that each of them gets pushed onto the stack
in turn, which is exactly what we want.
Just as we push these registers onto the stack frame, there is also the
symmetric version of popping them off the stack:
```
%macro multipop_ymm 1-*
%rep %0
%rotate -1
vmovdqu %1, yword [rsp]
add rsp, 32
%endrep
%endmacro
```
As you can see, the code is very similar, except that instead of rotating the
arguments to the left, we rotate them the other way round by passing `-1` to the
`%rotate` directive (this is so that we can pass the registers in the same order
to both macros and have them be pushed/popped in the correct order), and we add
to the `rsp` stack pointer to shrink the stack instead of expanding it.
Thus, it should be fairly obvious now what the first `multipush_ymm` statement
in the code does: it's responsible for pushing the non-volatile `ymm` registers
onto the stack.
For the purposes of understanding our correlation function, these are the only
two macros that we will be using.
### SIMD in assembly ###
Before delving into the assembly again, let's recap a little on
what [SIMD operations](https://en.wikipedia.org/wiki/SIMD) are first. For C/C++
programmers who are already familiar with SIMD programming, you may wish to skip
this section and head straight to the further breakdown of the assembly example
given above.
#### What is SIMD? ####
SIMD stands
for [single instruction, multiple data](https://en.wikipedia.org/wiki/SIMD),
which to explain simply would mean "to perform one operation across multiple
streams of data at once in order to reduce the amount of work done". Thus far,
the work we've done in this tutorial has mostly been to load a single object or
value in one register at a time, and perform mathematical operations much the
same way as well, where we act on the memory in a single register at a time.
However, when writing high-performance code, we generally want to focus on two
areas of optimization: one is to reduce the amount of work _required_, and the
other is to reduce the amount of work _done_. The former generally is achieved
by analysis of the work that's being done, and finding an algorithm to help
reduce the amount of computation that's required to achieve a similar
result. The latter can be achieved through re-structuring your memory layouts to
take advantage of SIMD.
Recall that from the Floating point registers section, we saw that a single
`xmm` register was 128 bits wide, and that most CPUs have support for these
registers (SSE). Recall, also, that a `float` according to the IEEE 754 standard
is 32 bits wide. Ergo, we should be able to fit 4 floats into one `xmm` register
(or two `double`s).
As it turns out, there are special instructions for doing exactly that. There
are also special instructions and **intrinsics** for performing SIMD math
operations across multiple portions of the same register at once (thus
performing a math operation on 4 floats at once), which can produce tremendous
savings in performance. For example, the operation:
```
vmulps xmm1, xmm2, xmm3
```
performs a **multiplication of packed single-precision floating point values**
using 3 `xmm` registers like so (the 3-operand form shown here is the
VEX-encoded AVX form of the instruction):
*************************************************************************
* *
* +-------------+---------------+--------------+---------------+ *
* | 3.0 | 40.0 | 80.0 | 5.0 | xmm3 *
* | | | | | *
* +-------------+---------------+--------------+---------------+ *
* x x x x *
* +-------------+---------------+--------------+---------------+ *
* | 3.0 | 2.0 | 3.0 | 3.0 | xmm2 *
* | | | | | *
* +-----+-------+-------+-------+------+-------+-------+-------+ *
* | | | | *
* v v v v *
* +-------------+---------------+--------------+---------------+ *
* | 9.0 | 80.0 | 240.0 | 15.0 | xmm1 *
* | | | | | *
* +-------------+---------------+--------------+---------------+ *
* *
*************************************************************************
[Figure [mulps]: An illustration of how the `vmulps` SIMD instruction works.]
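If you write C/C++, you may have already met this instruction through its
compiler intrinsic; the following sketch mirrors the figure above, lane for
lane:
```
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    // _mm_set_ps() takes lanes from high to low, matching the figure.
    __m128 a = _mm_set_ps(3.0f, 40.0f, 80.0f, 5.0f);
    __m128 b = _mm_set_ps(3.0f, 2.0f, 3.0f, 3.0f);
    __m128 r = _mm_mul_ps(a, b); // typically compiles to (v)mulps

    float out[4];
    _mm_storeu_ps(out, r);
    printf("%f %f %f %f\n", out[3], out[2], out[1], out[0]); // 9 80 240 15
    return 0;
}
```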
It's important to note that the multiplication happens in **one single
operation**, as opposed to 4 separate operations, which is where SIMD gets its
name: a single instruction being applied to multiple data points
simultaneously. Additionally, performance improves because the operation stays
in registers, avoiding **register spill** to main memory, and because the
3-operand form lets us use 2 registers as sources in an instruction while
preserving their values, thus reducing **register pressure**.
!!! note Text book-worm
**Register spill** refers to when memory contents don't fit into the
available CPU registers, and "spills" over to main memory. This is bad
because the cost of accessing memory from a register is _far_ lower than the
cost of accessing main memory, especially if it's not in the CPU caches (and
_especially_ if it has to go to RAM, or even page to disk).
**Register pressure** refers to the availability of CPU registers that are
free for operations to take place. When you have high register pressure, it
means that most or all of the CPU registers are currently in use, and
more register spill has to happen when working with memory since the
registers are already full.
Both concepts are within our control as programmers to minimize as much as
possible in order to improve performance. When working in assembly in
particular, you must be cognizant of your usage of the available hardware,
and do your best to reduce both the register pressure and register spill
when running your code.
Identifying areas in your code where SIMD becomes applicable is more of an art
than a science, but generally areas where you notice lots of floating-point math
operations taking place (especially when working on image processing, 3D
geometry, etc.) are good candidates for applying this optimization
technique. We'll talk more about how you can utilize the compiler to help you
identify these areas for improvement in the later Working with the compiler
section of this tutorial.
Now that we understand what SIMD is, let's continue taking a look at the
assembly for the correlation example.
### Packing values in SIMD ###
Normally, I skip the comments in my examples, but here we're going to take a
look at the first big block, since it's somewhat relevant:
```
; register usage:
; rcx: x array
; rdx: y array
; r8: num of samples (n)
; r9 : sample counter
; r10: loop counter
; ymm0: 4 parts of sumX (work on 4 values at a time)
; ymm1: 4 parts of sumY
; ymm2: 4 parts of sumXX
; ymm3: 4 parts of sumYY
; ymm4: 4 parts of sumXY
; ymm5: 4 x values - later squared
; ymm6: 4 y values - later squared
; ymm7: 4 xy values
```
One thing I learned from reading Ray Seyfarth's book is that it's OK to write
out a "plan of attack" of sorts in the code; not only does it give you a
reference to go back to when you're lost in a sea of registers and values, but
it also forces you to assess your usage of registers and memory and to see if
you have areas where you can help to reduce the register pressure (i.e. perhaps
re-use an earlier register for a counter, or avoid a copy to another register).
In this case, since the signature of our correlation function is:
```
double correlation_ref(const double x[], const double y[], int n);
```
We know therefore according to the calling convention that `rcx` will contain a
pointer to the samples of the first, or `x` array, `rdx` will contain a pointer
to the samples of the second, or `y` array, and `r8` will contain the dimension
of either array. (See the What is a calling convention? section for details.)
Since we also know that we're going to be using SIMD for these math operations,
we start preparing mentally as to which registers are going to be used to hold
which results, and how.
```
xor r9d, r9d
mov r10, r8 ; r10 = num of samples (copy)
vzeroall ; zeros the contents of all the ymm registers.
```
Because `r10` will be used to store the current loop iteration index (`i` from
the C version), we copy over the number of samples to it (as you would in a
normal `for` loop). We clear the `r9` register as well, since it's going to act
as our counter for the number of samples that we've currently processed from
both arrays. We then use the [`vzeroall`](https://www.felixcloutier.com/x86/vzeroall)
instruction, which zeroes out the contents of all the `ymm` registers before we
begin any of our math. Zeroing out the `ymm` registers is fine, since we've
already saved the values of the non-volatile registers on the stack.
At this point, in the C implementation, we're about to implement the following
section:
```
...
for (int i=0; i < n; ++i) {
sumX += x[i];
sumY += y[i];
sumXX += x[i] * x[i];
sumYY += y[i] * y[i];
sumXY += x[i] * y[i];
}
...
```
We begin by defining the start of the loop by using the `.loop` label, followed
by **packing** our samples into the `ymm` registers, as according to our plan:
```
.loop:
vmovupd ymm5, [rcx + r9] ; ymm5 = 4 doubles from x
vmovupd ymm6, [rdx + r9] ; ymm6 = 4 doubles from y
```
The [`vmovupd`](https://www.felixcloutier.com/x86/movupd) instruction, which
stands for **VEX-256 move unaligned packed double-precision floating point
values**, moves 256 bits at a time from the second operand to the
first. In this case, what we will have after the instruction above has run is:
**************************************************************************
* *
* memory +----+ *
* +----------+ +-------------------+ | | *
* | rcx +---------->| x[0] +------------>|x[0]| *
* +----------+ |-------------------| | | *
* | x[1] +-------. +----+ *
* |-------------------| | | | *
* | x[2] +-------. '-->|x[1]| *
* |-------------------| | | | *
* | . . . +--. | +----+ *
* +-------------------+ | | | | ymm5 *
* |///////////////////| | '-->|x[2]| *
* |///////////////////| | | | *
* |///////////////////| | +----+ *
* |///////////////////| | | | *
* +----------+ +-------------------+ '------->|x[3]| *
* | rdx +------->| y[0] +-------. | | *
* +----------+ |-------------------| | +----+ *
* | y[1] +------. | *
* +---------+ |-------------------| || *
* | r9 (0) | | y[2] +---. || +----+ *
* +---------+ |-------------------| | || | | *
* (sample counter) | . . . +-. | | '-->|y[0]| *
* +-------------------+ | | | | | *
* |///////////////////| | | | | | *
* |///////////////////| | | | +----+ *
* +-------------------+ | | | | | *
* | | '--->|y[1]| *
* | | | | *
* | | | | *
* | | +----+ ymm6 *
* | | | | *
* (this diagram is not drawn to scale) | '------>|y[2]| *
* (in fact, none of these diagrams are) | | | *
* (it's snowing on Mt. Fuji) | | | *
* | +----+ *
* | | | *
* '-------->|y[3]| *
* | | *
* +----+ *
* *
**************************************************************************
[Figure [diagram]: An illustration of how the `vmovupd` SIMD instruction works.]
As you can see, `ymm5` contains the first 4 64-bit `double` samples from the
`x` array passed in, while `ymm6` contains the first 4 samples from the `y`
array. `r9`, as mentioned earlier, holds the sample counter, which starts at `0`
since we haven't processed any of the samples yet.
### SIMD math operations in assembly ###
The next part of the code begins the math:
```
vmulpd ymm7, ymm5, ymm6 ; ymm7 = x * y
vaddpd ymm0, ymm0, ymm5 ; ymm0 = sumX
vaddpd ymm1, ymm1, ymm6 ; ymm1 = sumY
vmulpd ymm5, ymm5, ymm5 ; ymm5 = x * x
vmulpd ymm6, ymm6, ymm6 ; ymm6 = y * y
vaddpd ymm2, ymm2, ymm5 ; ymm2 = sumXX
vaddpd ymm3, ymm3, ymm6 ; ymm3 = sumYY
vaddpd ymm4, ymm4, ymm7 ; ymm4 = sumXY
```
The [`vmulpd`](https://www.felixcloutier.com/x86/mulpd)
and [`vaddpd`](https://www.felixcloutier.com/x86/addpd)
instructions, as their names imply, perform packed SIMD multiplication and
addition operations respectively in a similar fashion as shown in figure [mulps].
As the comments indicate, we're still on-track with our plan for the various
uses of the different `ymm` registers. The reason we perform the first
instruction to store the result of `x * y` in `ymm7` is so we can act on the
`ymm5` and `ymm6` registers in-place later on without having to incur expensive
operations to move/copy the data in the registers (remember, we have to move a
_lot_ more data than normal, since these `ymm` registers are 256 bits wide!)
The next part of the code does essentially the same math, except that we use
additional `ymm` registers that are still available:
```
; Load the next 256 bits into the registers and apply the same math
vmovupd ymm13, [rcx + r9 + 32] ; ymm13 = next 256 bits after current x
vmovupd ymm14, [rdx + r9 + 32] ; ymm14 = next 256 bits after current y
vmulpd ymm15, ymm13, ymm14 ; ymm15 = x * y
vaddpd ymm8, ymm8, ymm13 ; ymm8 = sumX
vaddpd ymm9, ymm9, ymm14 ; ymm9 = sumY
vmulpd ymm13, ymm13, ymm13 ; ymm13 = x * x
vmulpd ymm14, ymm14, ymm14 ; ymm14 = y * y
vaddpd ymm10, ymm10, ymm13 ; ymm10 = sumXX
vaddpd ymm11, ymm11, ymm14 ; ymm11 = sumYY
vaddpd ymm12, ymm12, ymm15 ; ymm12 = sumXY
```
Even though we didn't plan on using these registers originally, since they're
available, why not? Since our correlation function is a leaf function and
doesn't call any subroutines, we don't have to worry about leaving registers
open for other functions to use. This is the nice thing about working at the
assembly level: you get to decide just how much register pressure you want to
create on the hardware. Any time you see an opportunity for running multiple
independent chains of instructions in parallel, especially in SIMD, you should
seize it; otherwise, you're just leaving available compute on the table.
The next part of the code increments all counter variables and concludes the
loop:
```
add r9, 64
sub r10, 8
jnz .loop
```
We add 64 bytes to the `r9` sample counter, since we're processing 8 `double`s
at once (remember, a `double` is 64 bits = 8 bytes, so 8 of them is 64
bytes). We subtract `8` from `r10`, which, if you will recall, is our `n`
counter that we copied from `r8` (which remains unaffected and holds the
original value of `n`). We then check the loop condition and continue the loop
if necessary. If you've been following along so far in the earlier examples,
this shouldn't seem like anything new.
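If it helps to see the register plan expressed in a higher-level form, here's
a rough C intrinsics rendition of the loop (my own sketch: it uses a single
accumulator set for clarity, where the assembly above unrolls into two, and it
likewise assumes `n` is a multiple of the unroll width):
```
#include <immintrin.h>

/* Accumulates packed partial sums for the correlation terms; each __m256d
   holds 4 doubles, just like a ymm register. Assumes n is a multiple of 4. */
static void correlation_loop(const double x[], const double y[], int n,
                             __m256d sums[5])
{
    __m256d sumX = _mm256_setzero_pd();  // ymm0
    __m256d sumY = _mm256_setzero_pd();  // ymm1
    __m256d sumXX = _mm256_setzero_pd(); // ymm2
    __m256d sumYY = _mm256_setzero_pd(); // ymm3
    __m256d sumXY = _mm256_setzero_pd(); // ymm4

    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]); // vmovupd ymm5, [rcx + r9]
        __m256d vy = _mm256_loadu_pd(&y[i]); // vmovupd ymm6, [rdx + r9]
        sumXY = _mm256_add_pd(sumXY, _mm256_mul_pd(vx, vy)); // vmulpd + vaddpd
        sumX = _mm256_add_pd(sumX, vx);
        sumY = _mm256_add_pd(sumY, vy);
        sumXX = _mm256_add_pd(sumXX, _mm256_mul_pd(vx, vx));
        sumYY = _mm256_add_pd(sumYY, _mm256_mul_pd(vy, vy));
    }

    sums[0] = sumX;
    sums[1] = sumY;
    sums[2] = sumXX;
    sums[3] = sumYY;
    sums[4] = sumXY;
}
```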
Continuing on:
```
vaddpd ymm0, ymm0, ymm8 ; SIMD add up both sumX totals
vaddpd ymm1, ymm1, ymm9 ; Same for sumY
vaddpd ymm2, ymm2, ymm10 ; SIMD add up both sumXX totals
vaddpd ymm3, ymm3, ymm11 ; Same for sumYY
vaddpd ymm4, ymm4, ymm12 ; Same for sumXY
```
These instructions add up the totals from both sets of registers that we
were operating on. Since the two chains of instructions were completely independent of
each other, it's likely that the CPU could have run them using **out-of-order
execution (OoOE)** and executed more than one AVX2 instruction per CPU cycle. At
this point, however, since we _do_ need the summed total from both sets of
registers we were using, this marks the point where the two instruction chains
need to reconcile.
The next part of the code does something that might seem a little confusing at
first glance:
```
vhaddpd ymm0, ymm0, ymm0 ; ymm0 = sumX (sum up the doubles)
vhaddpd ymm1, ymm1, ymm1 ; ymm1 = sumY
vhaddpd ymm2, ymm2, ymm2 ; ymm2 = sumXX
vhaddpd ymm3, ymm3, ymm3 ; ymm3 = sumYY
vhaddpd ymm4, ymm4, ymm4 ; ymm4 = sumXY
```
What is going on here? Well,
the [`vhaddpd`](https://www.felixcloutier.com/x86/haddpd) instruction is
essentially performing the following operation:
******************************************************************************
* *
* vhaddpd ymm1, ymm2, ymm3 *
* *
* +----------+-----------+-----------+--------+ *
* | x3 | x2 | x1 | x0 | ymm3 *
* | | | | | *
* +---+------+-----+-----+----+------+----+---+ *
* | | | | *
* '-----+----' '-----+---' *
* | | *
* .------------' '-------------------------------. *
* | | *
* | | *
* | +----------+-----------+-----------+--------+ | *
* | ymm2 | y3 | y2 | y1 | y0 | | *
* | | | | | | | *
* | +----+-----+-----+-----+-----+-----+---+----+ | *
* | | | | | | *
* | '----+----' '---+---' | *
* | | | | *
* | .------------------------' | | *
* | | .-----------------------' | *
* | | | | *
* | v v | *
* | +----------+-----------+-----------+--------+ | *
* | | y2 + y3 | x2 + x3 | y0 + y1 | x0 + x1| ymm1 | *
* | | | | | | | *
* | +----------+-----------+-----------+--------+ | *
* | ^ ^ | *
* | | | | *
* '-----------------' '---------------------------' *
* *
******************************************************************************
[Figure [diagram]: An illustration of how the `vhaddpd` SIMD instruction works.]
Now things should be clear: we're making use of a **horizontal-add** operation
in order to sum up the `double`s across the `ymm` registers in fewer operations
than it would take to extract each of the values manually and sum them up that
way. Again, this is an example of SIMD at work.
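If the diagram still feels abstract, here is a tiny standalone program that
demonstrates the lane behaviour using `_mm256_hadd_pd`, the C intrinsic form of
`vhaddpd` (this assumes an AVX2-capable CPU and a toolchain flag like
`/arch:AVX2` on MSVC or `-mavx2` on GCC/Clang):
```
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    // a = [x3=4.0, x2=3.0, x1=2.0, x0=1.0] (_mm256_set_pd lists high-to-low).
    __m256d a = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
    __m256d r = _mm256_hadd_pd(a, a); // vhaddpd ymm, ymm, ymm
    double out[4];
    _mm256_storeu_pd(out, r);
    // Prints "3.0 3.0 7.0 7.0", i.e. [x0+x1, x0+x1, x2+x3, x2+x3].
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```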
The next part of the code starts the process of unpacking the registers and
performing the final math operations in the C implementation:
```
double covXY = (n * sumXY) - (sumX * sumY);
double varX = (n * sumXX) - (sumX * sumX);
double varY = (n * sumYY) - (sumY * sumY);
return covXY / sqrt(varX * varY);
```
The first part of this code follows:
```
vextractf128 xmm5, ymm0, 1 ; xmm5 = upper 2 doubles of ymm0
vaddsd xmm0, xmm0, xmm5 ; xmm0 = sumX (the total is now in the low double)
vextractf128 xmm6, ymm1, 1 ; Do the same thing for sumY
vaddsd xmm1, xmm1, xmm6 ; xmm1 = sumY
vmulsd xmm6, xmm0, xmm0 ; xmm6 = sumX * sumX
vmulsd xmm7, xmm1, xmm1 ; xmm7 = sumY * sumY
vextractf128 xmm8, ymm2, 1 ; xmm8 = upper 2 doubles of ymm2
vaddsd xmm2, xmm2, xmm8 ; xmm2 = sumXX
vextractf128 xmm9, ymm3, 1 ; xmm9 = upper 2 doubles of ymm3
vaddsd xmm3, xmm3, xmm9 ; xmm3 = sumYY
```
The [`vextractf128`](https://www.felixcloutier.com/x86/vextractf128:vextractf32x4:vextractf64x2:vextractf32x8:vextractf64x4)
instruction extracts 128 bits of packed floating-point values from the second
operand passed to it, and stores the result in the first operand. The third
operand is an immediate that selects which 128-bit lane to extract; in this
case, using the first two lines as an example, we specify a value of `1` in
order to grab the _upper_ 128 bits (i.e. the upper 2 `double`s) of `ymm0` into
the `xmm5` register.
Basically, because we stored `sumX`, `sumY`, `sumXX`, `sumYY` and `sumXY` in the
`ymm` registers in a packed fashion (4 `double`s per register), we fold the
upper 2 `double`s of each register onto the lower 2 in an `xmm` register
(which, if you'll recall, is 128 bits wide), leaving the final total in the low
`double` after each `vaddsd`.
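In intrinsics form, this extract-and-fold reduction could look something like
the following sketch (the helper name `hsum256_pd` is invented here):
```
#include <immintrin.h>

// Reduce a ymm register holding 4 doubles down to a single total:
// horizontal-add within each 128-bit lane, then fold the upper lane
// onto the lower one, mirroring the vhaddpd/vextractf128/vaddsd sequence.
static double hsum256_pd(__m256d v)
{
    __m256d h  = _mm256_hadd_pd(v, v);        // vhaddpd
    __m128d hi = _mm256_extractf128_pd(h, 1); // vextractf128 (upper lane)
    __m128d lo = _mm256_castpd256_pd128(h);   // lower lane (free; no instruction)
    return _mm_cvtsd_f64(_mm_add_sd(lo, hi)); // vaddsd
}
```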
Next, we perform the math required:
```
cvtsi2sd xmm8, r8 ; convert n from int to double and store in xmm8
vmulsd xmm2, xmm2, xmm8 ; xmm2 = n * sumXX
vmulsd xmm3, xmm3, xmm8 ; xmm3 = n * sumYY
vsubsd xmm2, xmm2, xmm6 ; xmm2 = (n * sumXX) - (sumX * sumX)
vsubsd xmm3, xmm3, xmm7 ; xmm3 = (n * sumYY) - (sumY * sumY)
vmulsd xmm2, xmm2, xmm3 ; xmm2 = varX * varY
vsqrtsd xmm2, xmm2, xmm2 ; xmm2 = sqrt(varX * varY)
vextractf128 xmm6, ymm4, 1 ; xmm6 = upper 2 doubles of sumXY
vaddsd xmm4, xmm4, xmm6 ; xmm4 = sumXY (upper and lower halves combined)
vmulsd xmm4, xmm4, xmm8 ; xmm4 = n * sumXY
vmulsd xmm0, xmm0, xmm1 ; xmm0 = sumX * sumY
vsubsd xmm4, xmm4, xmm0 ; xmm4 = (n * sumXY) - (sumX * sumY)
vdivsd xmm0, xmm4, xmm2 ; xmm0 = covXY / sqrt(varX * varY)
```
The [`cvtsi2sd`](https://www.felixcloutier.com/x86/cvtsi2sd) instruction
converts our integer value of `n` to a `double` stored in an `xmm` register, so
that we can use it for multiplication with the other `double` values that are
now in the `xmm` registers (see why we preserved `r8` earlier by decrementing
the copy in `r10` instead of using it outright?).
The other math operations used here,
[`vsubsd`](https://www.felixcloutier.com/x86/subsd),
[`vsqrtsd`](https://www.felixcloutier.com/x86/sqrtsd)
and [`vdivsd`](https://www.felixcloutier.com/x86/divsd), are fairly
self-explanatory at this point; they perform scalar double-precision
subtraction, square root and division respectively, and they follow the math
from the C implementation.
!!! tip SIMD in C/C++
In C/C++ on the x64 ISA, Intel provides a set of
[compiler intrinsics](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#)
that let you perform the same kinds of SIMD operations we're writing here
without leaving C/C++ code; they should generate assembly similar to the
instructions in this example. It's worth taking a few hours to study them if
you're completely unfamiliar with the topic, given the potential for
performance improvements.
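To make that concrete, here is roughly what our whole kernel could look like
with intrinsics. This is an untested sketch under the same assumptions as the
assembly version (AVX2 available; `n` here a positive multiple of 4), and it
reuses the hypothetical `hsum256_pd` helper sketched earlier; the compiler
handles register allocation and the prologue/epilogue for us:
```
#include <immintrin.h>
#include <math.h>

double correlation_avx_intrinsics(const double x[], const double y[], int n)
{
    __m256d sumX  = _mm256_setzero_pd();
    __m256d sumY  = _mm256_setzero_pd();
    __m256d sumXX = _mm256_setzero_pd();
    __m256d sumYY = _mm256_setzero_pd();
    __m256d sumXY = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(&x[i]); // vmovupd
        __m256d vy = _mm256_loadu_pd(&y[i]);
        sumX  = _mm256_add_pd(sumX, vx);     // vaddpd
        sumY  = _mm256_add_pd(sumY, vy);
        sumXX = _mm256_add_pd(sumXX, _mm256_mul_pd(vx, vx)); // vmulpd + vaddpd
        sumYY = _mm256_add_pd(sumYY, _mm256_mul_pd(vy, vy));
        sumXY = _mm256_add_pd(sumXY, _mm256_mul_pd(vx, vy));
    }
    double sX  = hsum256_pd(sumX);
    double sY  = hsum256_pd(sumY);
    double sXX = hsum256_pd(sumXX);
    double sYY = hsum256_pd(sumYY);
    double sXY = hsum256_pd(sumXY);
    double covXY = ((double)n * sXY) - (sX * sY);
    double varX  = ((double)n * sXX) - (sX * sX);
    double varY  = ((double)n * sYY) - (sY * sY);
    return covXY / sqrt(varX * varY);
}
```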
The final few lines of the assembly version follow; as described earlier, they
restore the non-volatile `ymm` registers to their original values before
returning from the function.
```
; Restore the original non-volatile registers
multipop_ymm ymm6, ymm7, ymm8, ymm9, ymm10, ymm11, ymm12, ymm13, ymm14, ymm15
ret
```
To help us test this, I've provided a simple C program that generates random
values for the data, computes the correlation, and measures the time
taken. Comparing the reference implementation against the assembly version is
as simple as swapping the `correlation_avx` call for `correlation_ref`.
```
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <windows.h>

#define NUM_OF_SAMPLES 500000000

// NOTE: (sonictk) x64 asm implementation
extern double correlation_avx(const double x[], const double y[], int n);

double correlation_ref(const double x[], const double y[], int n);

// NOTE: (sonictk) Reference implementation
double correlation_ref(const double x[], const double y[], int n)
{
    double sumX = 0.0;
    double sumY = 0.0;
    double sumXX = 0.0;
    double sumYY = 0.0;
    double sumXY = 0.0;
    for (int i=0; i < n; ++i) {
        sumX += x[i];
        sumY += y[i];
        sumXX += x[i] * x[i];
        sumYY += y[i] * y[i];
        sumXY += x[i] * y[i];
    }
    double covXY = (n * sumXY) - (sumX * sumY);
    double varX = (n * sumXX) - (sumX * sumX);
    double varY = (n * sumYY) - (sumY * sumY);
    return covXY / sqrt(varX * varY);
}

int main(int argc, char *argv[])
{
    srand(1);
    // NOTE: The data sets are ~4 GB each, far too large for the stack,
    // so they are allocated on the heap instead.
    double *xDataSet = (double *)malloc(sizeof(double) * NUM_OF_SAMPLES);
    double *yDataSet = (double *)malloc(sizeof(double) * NUM_OF_SAMPLES);
    if (!xDataSet || !yDataSet) {
        return 1;
    }
    for (int i=0; i < NUM_OF_SAMPLES; ++i) {
        int xBaseVal = rand();
        int yBaseVal = rand();
        xDataSet[i] = (double)xBaseVal;
        yDataSet[i] = (double)yBaseVal;
    }
    LARGE_INTEGER startTimerValue;
    LARGE_INTEGER endTimerValue;
    LARGE_INTEGER timerFreq;
    QueryPerformanceFrequency(&timerFreq);
    QueryPerformanceCounter(&startTimerValue);
    double result = correlation_avx(xDataSet, yDataSet, NUM_OF_SAMPLES);
    QueryPerformanceCounter(&endTimerValue);
    double deltaTime = (double)(endTimerValue.QuadPart - startTimerValue.QuadPart) / (double)timerFreq.QuadPart;
    printf("Time taken is: %f\nThe correlation is: %f\n", deltaTime, result);
    free(xDataSet);
    free(yDataSet);
    return 0;
}
```
Testing this on my hardware at work (Core i7-4790K @ 4.00 GHz with 8 MB of L3
cache) gave me an average of **660ms** for the C implementation (compiled with
`/Ox /arch:AVX2`), versus **390ms** on average for the assembly version. That's
a pretty significant speedup (roughly 1.7x), especially if you're working in
finance or high-frequency trading!
If you'd like to compile and test this for yourself, this program and its build
script are included in the Code Repository section.
### Known limitations ###
If you're paying attention, you might have noticed that there are some obvious
limitations of the assembly implementation that the C version does not have:
* It requires AVX2 support. Not all CPUs are guaranteed to have this.
* What happens if you use a sample count that isn't a multiple of 8 (including
anything fewer than 8 samples)? The loop only terminates when the counter hits
exactly zero, so at best you'd end up moving garbage data into the registers
and acting upon it, and at worst you'd trigger an access violation by reading
beyond the bounds of the arrays.
While these are certainly valid concerns, the decision to drop to assembly
should be influenced by your target platform constraints and the kind of data
you expect to transform. If you can guarantee that your targeted platform(s)
all support the AVX2 instruction set, and that your sample counts are always
multiples of 8, there's no reason _not_ to favour the higher-performance
version of the implementation; you can always fall back to the C implementation
if needed, as sketched below.
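Guarding against both limitations at run-time is straightforward. Here is a
minimal sketch of a dispatching wrapper using MSVC's `__cpuidex` intrinsic (the
wrapper name is made up, and a production-quality check should also confirm OS
support for saving the `ymm` state via `_xgetbv`):
```
#include <intrin.h>

// Returns non-zero if the CPU reports AVX2 support:
// CPUID leaf 7, subleaf 0, EBX bit 5.
static int has_avx2(void)
{
    int regs[4];
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 5)) != 0;
}

extern double correlation_avx(const double x[], const double y[], int n);
double correlation_ref(const double x[], const double y[], int n);

// Dispatch to the assembly version only when it's safe to do so.
double correlation(const double x[], const double y[], int n)
{
    if (has_avx2() && n > 0 && (n % 8) == 0) {
        return correlation_avx(x, y, n);
    }
    return correlation_ref(x, y, n);
}
```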
# Working with the compiler #
So we're done, right? We wrote the assembly version of a function that's used
very widely across multiple disciplines and showed that at the end of the day,
we're still 10x programmers that know better than the compiler teams at
Microsoft! Go us!
...[Is that right](https://www.youtube.com/watch?v=5Kek3GqbsTk)?
While compilers have their share of warts, they're worked on by some of the
smartest people in the world. Also, MSDN
[talks specifically about auto-vectorization](https://docs.microsoft.com/en-us/cpp/parallel/auto-parallelization-and-auto-vectorization?view=vs-2017) as
a feature in the MSVC compiler (GCC and Clang have it, too). Surely we can't be
the first people in the world to have discovered a faster way to beat the
compiler at its own game?
As it turns out, the answer (as always) is a little more complicated. To start
understanding why our code beat the compiler, let's drop to the assembly that
the compiler generated for the reference C implementation, again using the `/FA`
compilation flag that MSVC has:
```
correlation_ref PROC
$LN17:
sub rsp, 104 ; 00000068H
vmovaps XMMWORD PTR [rsp+80], xmm6
xor r10d, r10d
vmovaps XMMWORD PTR [rsp+64], xmm7
mov r9, rcx
vmovaps XMMWORD PTR [rsp+48], xmm8
vmovaps XMMWORD PTR [rsp+32], xmm9
vxorpd xmm9, xmm9, xmm9
vxorpd xmm5, xmm5, xmm5
vxorpd xmm7, xmm7, xmm7
vxorpd xmm8, xmm8, xmm8
vxorpd xmm4, xmm4, xmm4
cmp r8d, 4
jl $LC12@correlatio
mov r11, r9
lea eax, DWORD PTR [r8-4]
sub r11, rdx
shr eax, 2
inc eax
lea rcx, QWORD PTR [rdx+8]
lea r10d, DWORD PTR [rax*4]
npad 3
$LL13@correlatio:
vmovsd xmm3, QWORD PTR [rcx+r11-8]
vmovsd xmm2, QWORD PTR [rcx-8]
lea rcx, QWORD PTR [rcx+32]
vaddsd xmm5, xmm5, xmm3
vmulsd xmm0, xmm3, xmm3
vaddsd xmm7, xmm7, xmm0
vmulsd xmm0, xmm2, xmm3
vaddsd xmm3, xmm4, xmm0
vmovsd xmm4, QWORD PTR [rcx+r11-32]
vmulsd xmm0, xmm4, xmm4
vaddsd xmm7, xmm7, xmm0
vaddsd xmm6, xmm9, xmm2
vmulsd xmm1, xmm2, xmm2
vmovsd xmm2, QWORD PTR [rcx-32]
vmulsd xmm0, xmm2, xmm4
vaddsd xmm8, xmm8, xmm1
vaddsd xmm5, xmm5, xmm4
vaddsd xmm4, xmm3, xmm0
vmovsd xmm3, QWORD PTR [rcx+r11-24]
vmulsd xmm1, xmm2, xmm2
vaddsd xmm8, xmm8, xmm1
vmulsd xmm0, xmm3, xmm3
vaddsd xmm7, xmm7, xmm0
vaddsd xmm6, xmm6, xmm2
vmovsd xmm2, QWORD PTR [rcx-24]
vmulsd xmm0, xmm2, xmm3
vmulsd xmm1, xmm2, xmm2
vaddsd xmm5, xmm5, xmm3
vaddsd xmm3, xmm4, xmm0
vmovsd xmm4, QWORD PTR [rcx+r11-16]
vmulsd xmm0, xmm4, xmm4
vaddsd xmm6, xmm6, xmm2
vmovsd xmm2, QWORD PTR [rcx-16]
vaddsd xmm8, xmm8, xmm1
vaddsd xmm7, xmm7, xmm0
vmulsd xmm0, xmm2, xmm4
vmulsd xmm1, xmm2, xmm2
vaddsd xmm5, xmm5, xmm4
vaddsd xmm4, xmm3, xmm0
vaddsd xmm9, xmm6, xmm2
vaddsd xmm8, xmm8, xmm1
sub rax, 1
jne $LL13@correlatio
$LC12@correlatio:
cmp r10d, r8d
jge SHORT $LN11@correlatio
sub r9, rdx
movsxd rax, r10d
lea rcx, QWORD PTR [rdx+rax*8]
mov edx, r8d
sub edx, r10d
$LC8@correlatio:
vmovsd xmm3, QWORD PTR [rcx+r9]
vmovsd xmm2, QWORD PTR [rcx]
lea rcx, QWORD PTR [rcx+8]
vmulsd xmm0, xmm3, xmm3
vaddsd xmm7, xmm7, xmm0
vmulsd xmm0, xmm3, xmm2
vmulsd xmm1, xmm2, xmm2
vaddsd xmm4, xmm4, xmm0
vaddsd xmm5, xmm5, xmm3
vaddsd xmm9, xmm9, xmm2
vaddsd xmm8, xmm8, xmm1
sub rdx, 1
jne SHORT $LC8@correlatio
$LN11@correlatio:
vxorps xmm3, xmm3, xmm3
vcvtsi2sd xmm3, xmm3, r8d
vmulsd xmm1, xmm3, xmm4
vmulsd xmm0, xmm9, xmm5
vsubsd xmm6, xmm1, xmm0
vmulsd xmm1, xmm9, xmm9
vmulsd xmm2, xmm3, xmm8
vmulsd xmm0, xmm5, xmm5
vsubsd xmm4, xmm2, xmm1
vmulsd xmm3, xmm3, xmm7
vsubsd xmm1, xmm3, xmm0
vmulsd xmm0, xmm4, xmm1
call sqrt
vmovaps xmm7, XMMWORD PTR [rsp+64]
vmovaps xmm8, XMMWORD PTR [rsp+48]
vmovaps xmm9, XMMWORD PTR [rsp+32]
vdivsd xmm0, xmm6, xmm0
vmovaps xmm6, XMMWORD PTR [rsp+80]
add rsp, 104 ; 00000068H
ret 0
correlation_ref ENDP
```
Ok. That is a lot of code, and quite frankly, it's a lot to parse. But when
looking at assembly code, as mentioned in the Dropping to assembly section, it's
better to look at the code as a whole and try to spot patterns, rather than
going line-by-line through the code.
If you've been paying attention to the code we've written so far, something
should have jumped out at you by now. If you need a hint, remember, we compiled
the C implementation using the `/arch:AVX2` flag explicitly, telling the
compiler that it has full rein to generate instructions that support the AVX2
instruction set. Go ahead and look at the code above one more time.
Do you see it yet?
_There are no references to `ymm` registers anywhere at all!_
This is most odd, because we explicitly _told_ the compiler that it was OK to go
ahead and generate instructions that make use of the `ymm` registers. Why would
it be disobeying our orders?
## Auto-vectorization ##
We know that this is most likely a problem of the compiler failing to generate
the appropriate vectorization instructions for our C implementation of the code,
since our assembly version is fairly similar in nature to the code generated by
the compiler, apart from our usage of the `ymm` registers.
Furthermore, since [auto-vectorization](https://en.wikipedia.org/wiki/Automatic_vectorization)
is supported in every major compiler in use today, each compiler provides
facilities to help debug this feature when things don't go quite as expected. In
MSVC's case, this is done through the use of the
[`/Qvec-report`](https://docs.microsoft.com/en-us/cpp/build/reference/qvec-report-auto-vectorizer-reporting-level?view=vs-2017)
flag, which we'll re-compile our reference implementation with:
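For reference, the invocation might look something like this (the file name is
the test program from the repository; adjust for your own build setup):
```
cl /Ox /arch:AVX2 /Qvec-report:2 covariance_test_main.c
```
Doing so produces a report like the following: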
```
--- Analyzing function: correlation_ref
covariance_test\covariance_test_main.c(28) : info C5002: loop not vectorized due to reason '1105'
```
In that endearing way that the MSVC compiler does, it generates an error code for
the reason instead of using the actual string, which we have to [look up](https://docs.microsoft.com/en-us/cpp/error-messages/tool-errors/vectorizer-and-parallelizer-messages?view=vs-2017):
> 1105 Loop includes an unrecognized reduction operation.
Interesting. Well, this doesn't mean much by itself, but if we look at the
comments in the code example that MSDN provides:
```
int code_1105(int *A)
{
// The compiler performs an optimization that's known as "reduction"
// when it operates on each element of an array and computes
// a resulting scalar value - for example, in this piece of code, which
// computes the sum of each element in the array:
int s = 0;
for (int i=0; i<1000; ++i)
{
s += A[i]; // vectorizable
}
// The reduction pattern must resemble the loop in the example. The
// compiler emits code 1105 if it cannot deduce the reduction
// pattern, as shown in this example:
for (int i=0; i<1000; ++i)
{
s += A[i] + s; // code 1105
}
// Similarly, reductions of "float" or "double" types require
// that the /fp:fast switch is thrown. Strictly speaking,
// the reduction optimization that the compiler performs uses
// "floating point reassociation". Reassociation is only
// allowed when /fp:fast is thrown.
return s;
}
```
Suddenly things make sense! [Floating-point math is _not associative_](http://www.walkingrandomly.com/?p=5380),
and a SIMD reduction re-orders the additions, so the compiler refuses to
vectorize the loop by default.
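Here is a quick standalone demonstration of why the compiler is so cautious
about re-ordering these sums:
```
#include <stdio.h>

int main(void)
{
    double a = 1e16, b = -1e16, c = 1.0;
    // Mathematically these are the same sum, but in double precision
    // the grouping changes the answer.
    printf("%f\n", (a + b) + c); // prints 1.000000
    printf("%f\n", a + (b + c)); // prints 0.000000
    return 0;
}
```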
Like the comments say, we can use the `/fp:fast` switch to tell the MSVC
compiler that we will allow reassociation for floating-point math. Re-compiling
with it generates the new assembly:
```
correlation_ref PROC
$LN21:
mov rax, rsp
sub rsp, 136 ; 00000088H
vmovaps XMMWORD PTR [rax-24], xmm6
xor r9d, r9d
vmovaps XMMWORD PTR [rax-40], xmm7
mov r11, rdx
vmovaps XMMWORD PTR [rax-56], xmm8
mov r10, rcx
vxorpd xmm8, xmm8, xmm8
vxorpd xmm5, xmm5, xmm5
vxorpd xmm6, xmm6, xmm6
vxorpd xmm7, xmm7, xmm7
vxorpd xmm4, xmm4, xmm4
test r8d, r8d
jle $LN13@correlatio
cmp r8d, 8
jb $LN9@correlatio
vmovaps XMMWORD PTR [rax-72], xmm9
vmovaps XMMWORD PTR [rax-88], xmm10
vmovaps XMMWORD PTR [rax-104], xmm11
vmovaps XMMWORD PTR [rax-120], xmm12
mov eax, r8d
vmovaps XMMWORD PTR [rsp], xmm13
and eax, -2147483641 ; ffffffff80000007H
jge SHORT $LN19@correlatio
dec eax
or eax, -8
inc eax
$LN19@correlatio:
mov edx, r8d
sub edx, eax
lea rax, QWORD PTR [r11+32]
sub rcx, r11
vxorpd xmm9, xmm9, xmm9
vxorpd xmm10, xmm10, xmm10
vxorpd xmm11, xmm11, xmm11
vxorpd xmm12, xmm12, xmm12
vxorpd xmm13, xmm13, xmm13
npad 7
$LL4@correlatio:
vmovupd ymm3, YMMWORD PTR [rcx+rax-32] ; Notice how the compiler has realized that this can now be vectorized
vmovupd ymm2, YMMWORD PTR [rax-32]
vfmadd231pd ymm13, ymm2, ymm3
vaddpd ymm5, ymm5, ymm3
vaddpd ymm10, ymm10, ymm2
vfmadd231pd ymm11, ymm3, ymm3
vmovupd ymm3, YMMWORD PTR [rcx+rax]
vfmadd231pd ymm12, ymm2, ymm2
vmovupd ymm2, YMMWORD PTR [rax]
add r9d, 8
lea rax, QWORD PTR [rax+64]
vaddpd ymm9, ymm9, ymm3
vaddpd ymm6, ymm6, ymm2
vfmadd231pd ymm7, ymm3, ymm3
vfmadd231pd ymm8, ymm2, ymm2
vfmadd231pd ymm4, ymm3, ymm2
cmp r9d, edx
jl SHORT $LL4@correlatio
vaddpd ymm0, ymm4, ymm13
vmovaps xmm13, XMMWORD PTR [rsp]
vhaddpd ymm2, ymm0, ymm0
vextractf128 xmm1, ymm2, 1
vaddpd xmm4, xmm1, xmm2
vaddpd ymm0, ymm8, ymm12
vmovaps xmm12, XMMWORD PTR [rsp+16]
vhaddpd ymm2, ymm0, ymm0
vextractf128 xmm1, ymm2, 1
vaddpd xmm8, xmm1, xmm2
vaddpd ymm0, ymm7, ymm11
vmovaps xmm11, XMMWORD PTR [rsp+32]
vhaddpd ymm2, ymm0, ymm0
vextractf128 xmm1, ymm2, 1
vaddpd ymm0, ymm6, ymm10
vmovaps xmm10, XMMWORD PTR [rsp+48]
vaddpd xmm7, xmm1, xmm2
vhaddpd ymm2, ymm0, ymm0
vextractf128 xmm1, ymm2, 1
vaddpd ymm0, ymm9, ymm5
vmovaps xmm9, XMMWORD PTR [rsp+64]
vaddpd xmm6, xmm1, xmm2
vhaddpd ymm2, ymm0, ymm0
vextractf128 xmm1, ymm2, 1
vaddpd xmm5, xmm1, xmm2
$LN9@correlatio:
cmp r9d, r8d
jge $LN13@correlatio
mov eax, r8d
sub eax, r9d
cmp eax, 4
jl $LC14@correlatio
movsxd rcx, r9d
mov eax, r8d
sub eax, r9d
inc rcx
sub eax, 4
mov rdx, r10
shr eax, 2
sub rdx, r11
inc eax
lea rcx, QWORD PTR [r11+rcx*8]
lea r9d, DWORD PTR [r9+rax*4]
npad 12
$LL15@correlatio:
vmovsd xmm3, QWORD PTR [rdx+rcx-8]
vmovsd xmm2, QWORD PTR [rcx-8]
lea rcx, QWORD PTR [rcx+32]
vmulsd xmm0, xmm2, xmm3
vaddsd xmm5, xmm5, xmm3
vfmadd231sd xmm7, xmm3, xmm3
vaddsd xmm3, xmm4, xmm0
vmovsd xmm4, QWORD PTR [rcx+rdx-32]
vaddsd xmm6, xmm6, xmm2
vfmadd231sd xmm8, xmm2, xmm2
vmovsd xmm2, QWORD PTR [rcx-32]
vmulsd xmm0, xmm4, xmm2
vaddsd xmm5, xmm5, xmm4
vfmadd231sd xmm7, xmm4, xmm4
vaddsd xmm4, xmm3, xmm0
vmovsd xmm3, QWORD PTR [rdx+rcx-24]
vaddsd xmm6, xmm6, xmm2
vfmadd231sd xmm8, xmm2, xmm2
vmovsd xmm2, QWORD PTR [rcx-24]
vmulsd xmm0, xmm2, xmm3
vaddsd xmm5, xmm5, xmm3
vfmadd231sd xmm7, xmm3, xmm3
vaddsd xmm3, xmm4, xmm0
vmovsd xmm4, QWORD PTR [rdx+rcx-16]
vaddsd xmm6, xmm6, xmm2
vfmadd231sd xmm8, xmm2, xmm2
vmovsd xmm2, QWORD PTR [rcx-16]
vmulsd xmm0, xmm2, xmm4
vaddsd xmm5, xmm5, xmm4
vfmadd231sd xmm7, xmm4, xmm4
vaddsd xmm4, xmm3, xmm0
vaddsd xmm6, xmm6, xmm2
vfmadd231sd xmm8, xmm2, xmm2
sub rax, 1
jne $LL15@correlatio
$LC14@correlatio:
cmp r9d, r8d
jge SHORT $LN13@correlatio
movsxd rax, r9d
sub r10, r11
lea rcx, QWORD PTR [r11+rax*8]
mov eax, r8d
sub eax, r9d
movsxd rdx, eax
$LC8@correlatio:
vmovsd xmm3, QWORD PTR [rcx+r10]
vmovsd xmm2, QWORD PTR [rcx]
lea rcx, QWORD PTR [rcx+8]
vmovaps xmm0, xmm3
vfmadd213sd xmm0, xmm3, xmm7
vmovapd xmm7, xmm0
vmovaps xmm0, xmm2
vmovaps xmm1, xmm2
vfmadd213sd xmm0, xmm3, xmm4
vfmadd213sd xmm1, xmm2, xmm8
vmovapd xmm4, xmm0
vaddsd xmm5, xmm5, xmm3
vaddsd xmm6, xmm6, xmm2
vmovapd xmm8, xmm1
sub rdx, 1
jne SHORT $LC8@correlatio
$LN13@correlatio:
vxorps xmm2, xmm2, xmm2
vcvtsi2sd xmm2, xmm2, r8d
vmulsd xmm0, xmm6, xmm5
vfmsub213sd xmm4, xmm2, xmm0
vmulsd xmm1, xmm6, xmm6
vmovaps xmm3, xmm8
vfmsub213sd xmm3, xmm2, xmm1
vmulsd xmm2, xmm2, xmm7
vmulsd xmm0, xmm5, xmm5
vsubsd xmm1, xmm2, xmm0
vmulsd xmm2, xmm3, xmm1
vsqrtsd xmm3, xmm2, xmm2
vdivsd xmm0, xmm4, xmm3
vzeroupper
vmovaps xmm6, XMMWORD PTR [rsp+112]
vmovaps xmm7, XMMWORD PTR [rsp+96]
vmovaps xmm8, XMMWORD PTR [rsp+80]
add rsp, 136 ; 00000088H
ret 0
correlation_ref ENDP
```
Now we see usage of the `ymm` registers. Excellent!
Benchmarking this now gives me an average of **358ms**, which is even faster
than the handwritten assembly version!
## Correlation, even faster? ##
So if the compiler could generate a faster version of the code than we wrote by
hand, is that it, then? What was the point of all of this?
Well, recall that we have an _even_ larger set of registers available on the
hardware: the `zmm` registers, which are 512 bits wide. At present, the MSVC
compiler does not support generating AVX-512 instructions when doing
auto-vectorization. So if you were attempting to make use of those registers,
you'd _need_ to write your assembly by hand.
!!! tip Crossing the platforms
[Clang](https://llvm.org/docs/Vectorizers.html) has support for generating
AVX-512 instructions during vectorization, as does the [ISPC](https://ispc.github.io/)
compiler. While we won't be discussing these compilers in the scope of this
tutorial, I highly recommend taking a look at them if you have the chance.
Additionally, while MSVC could always get support for these instruction sets in
the future once such register widths become more mainstream, remember, our
objective _isn't_ to beat the compiler into submission and prove our superiority
as assembly programmers. The whole point of understanding how assembly works in
this context is so that we can identify areas where the compiler failed to apply
a possible optimization, investigate the root cause, and then either modify our
code to help the compiler, or modify the compiler to generate better code. The
_real_ takeaway you should get from this is that while compilers provide us with
a great abstraction from the assembly code, they're not infallible.
I will leave writing the AVX-512 version of the correlation function as an
exercise for the reader (mainly because I don't have a Xeon at home that
supports it to test on). And while you're at it, I will leave you with this
final lingering question: why is the compiler's version of the AVX2 code above
faster than the hand-written assembly version?
Have fun!
# Closing thoughts #
This tutorial has already become the longest one I've ever written (at the time
of publishing), and I've only begun scratching the surface of just _how_ much there is
to learn about this subject. I've barely even learned the basics myself.
One thing that has become very clear to me through this entire process, though,
is the renewed importance of _keeping your code
simple_. C/C++/Python/Rust/C#/whatever have plenty of abstractions that can help
to improve the efficiency at which we churn out features, but it's critically
important that programmers understand the full cost of each and every one of
these abstractions in order to be more effective at crafting their
products. Run-time, compile-time and debug-time costs are all important in their
own ways, and they should never be underestimated. A plus point is that simple
code is usually simpler to debug, as we've demonstrated over the course of this
tutorial. The day a user-mode dump drops across your desk from a release build
with no other reproducible information available, you'll realize just _how_ much
of a force-multiplier it is to understand even a little bit of assembly when it
comes to solving problems.
Always remember that **you are not shipping code; you are shipping a product**. If
making a better product for your end-users means not implementing your
whatever-oriented framework design due to its added overhead, if making one of
your features faster means writing a C binding to your Python code to improve
run-time performance, if implementing a threading feature means debugging the
code later on in release builds becomes an utter nightmare, you should be
constantly weighing these factors in your mind, and making decisions based on
what will make your _product_ better, _not_ what makes your code "cleaner". I
cannot emphasize this point enough.
# Additional Resources #
The following resources were invaluable to me throughout my own studies into
this topic. I would highly recommend perusing through them as well; this
tutorial would not have been possible to write without the prior work that these
people put into writing about these topics.
* [Introduction to 64 Bit Assembly Language Programming for Linux and OS X](http://rayseyfarth.com/)
by Ray Seyfarth
* [What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf)
by Ulrich Drepper
* [Modern x64 Assembly](https://www.youtube.com/watch?v=rxsBghsrvpI)
by What's a Creel?
* [Performance Programming: x64 Caches](https://www.youtube.com/watch?v=bHzrhH7yySA)
by What's a Creel?
* [A Comprehensive Guide To Debugging Optimized x64 Code](https://www.youtube.com/watch?v=MUNRvqpske0)
by Jorge Rodriguez
* [Introduction to x64 Assembly](https://software.intel.com/en-us/articles/introduction-to-x64-assembly)
by Chris Lomont
* [Challenges of Debugging Optimized x64 code](https://blogs.msdn.microsoft.com/ntdebugging/2009/01/09/challenges-of-debugging-optimized-x64-code/)
by Microsoft
Additionally, you will probably find yourself needing the reference
documentation from the following locations as you work your way through this
tutorial, and likely long afterwards:
* [Microsoft x64 Software Conventions](https://docs.microsoft.com/en-us/cpp/build/x64-software-conventions?view=vs-2017)
This will be your eternal companion and enforcer as you work on Windows.
* [The Netwide Assembler manual](https://www.nasm.us/doc/)
Contains all the information needed about programming with NASM syntax.
* [Intel 64 and IA-32 Architectures Software Developer Manuals](https://software.intel.com/en-us/articles/intel-sdm)
Contains all the technical information regarding the CPU architecture,
instructions, and timings.
* [x86 and amd64 instruction reference](https://www.felixcloutier.com/x86/)
A nice reference listing of all the instructions available on the x86-64
instruction set. Be warned: not everything maps 1:1 to either NASM or MASM syntax!
* [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)
A nice guide to the intrinsic functions available for Intel CPUs.
* [Instruction Tables](https://www.agner.org/optimize/instruction_tables.pdf)
Reference instruction timings for various CPU generations.
# Credits #
* [Marco Giordano](https://twitter.com/MGDev91), for getting me into this mess
in the first place.
* [`xephyris`](https://twitter.com/xephyris), for putting up with all my
questions, rants and nonsense, while returning none of those; just solid
knowledge and advice.
* [Raffaele Fragapane](https://twitter.com/ThE_JacO), for the same thing.
* [John Calsbeek](https://twitter.com/jcalsbeek), for his uncanny ability to
spot the gaps in my knowledge, and more importantly, to fill them up.
* [Morgan McGuire](https://twitter.com/CasualEffects), for the Markdeep syntax
used in this tutorial.
* [Aras Pranckevičius](https://twitter.com/aras_p), for the Markdeep template
used in this tutorial.
* Michael H. Su, for helping point out an error in one of the diagrams.