x86 assembly programming (floating-point)
1 Introduction
The aim of this lab is to introduce the floating-point instructions available on the x86 architecture. You are given 6 examples showcasing the use of floating-point instructions in scalar and vector mode for 32-bit and 64-bit floating-point values.
For this lab, you have to convert a given set of C functions within the nbody0.c simulation code into their respective assembly versions.
2 SSE, AVX2, and AVX512
SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions) are SIMD extensions added to the x86 instruction set to speed up certain categories of code patterns. They introduce new instructions that operate not only on scalars but also on vectors (packets of elements). Both instruction sets provide scalar and vector instructions covering the single and double precision floating-point formats.
For this lab, only SSE instructions are needed.
The SSE and AVX instructions follow a naming convention that depends on the scalar/vector nature of the operation as well as on the data type. Scalar operations are suffixed with SS (Scalar Single precision) for single precision and SD (Scalar Double precision) for double precision. Packed, or vector, operations are suffixed with PS (Packed Single precision) for single precision and PD (Packed Double precision) for double precision. AVX instruction mnemonics additionally start with a V (for the VEX instruction encoding), for example ADDPD becomes VADDPD.
The examples provided showcase multiple arithmetic and memory instructions using the previously described naming convention.
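To make the four suffixes concrete, here is a minimal C sketch using the compiler intrinsics from emmintrin.h, which correspond closely to the ADDSS, ADDSD, ADDPS, and ADDPD instructions. The function names are hypothetical and are not taken from the lab's examples; the exact instructions emitted depend on the compiler and its flags.

```c
#include <emmintrin.h>  /* SSE and SSE2 intrinsics (GCC, Clang, MSVC) */

/* SS: scalar single precision, only the lowest float lane is added. */
float add_ss_low(float a, float b)
{
    __m128 r = _mm_add_ss(_mm_set_ss(a), _mm_set_ss(b));
    return _mm_cvtss_f32(r);
}

/* SD: scalar double precision, only the lowest double lane is added. */
double add_sd_low(double a, double b)
{
    __m128d r = _mm_add_sd(_mm_set_sd(a), _mm_set_sd(b));
    return _mm_cvtsd_f64(r);
}

/* PS: packed single precision, four floats are added at once. */
void add_ps4(const float a[4], const float b[4], float out[4])
{
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* PD: packed double precision, two doubles are added at once. */
void add_pd2(const double a[2], const double b[2], double out[2])
{
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}
```

The unaligned load and store intrinsics are used here so that the sketch works with any array; aligned variants exist but require 16-byte aligned addresses.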
2.1 SSE registers
The SSE instruction set extends the x86 instruction set not only with new operations but also with additional registers. Eight 128-bit (16 bytes) registers (XMM0 to XMM7) are available in 32-bit mode; 64-bit mode extends this to sixteen registers (XMM0 to XMM15). Each register can hold 4 single precision floating-point values or 2 double precision floating-point values.
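As a small illustration of the register capacity, the following C sketch treats one 128-bit value first as 4 floats and then as 2 doubles. The helper names are hypothetical; the __m128 and __m128d types come from the compiler's SSE headers and are assumed to map onto XMM registers.

```c
#include <emmintrin.h>  /* __m128 (4 floats) and __m128d (2 doubles) */

/* Both vector types are exactly 16 bytes wide, like an XMM register. */
_Static_assert(sizeof(__m128) == 16, "an XMM-sized value holds 4 floats");
_Static_assert(sizeof(__m128d) == 16, "an XMM-sized value holds 2 doubles");

/* Sum the four single precision lanes of a 128-bit value. */
float sum4_floats(__m128 v)
{
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

/* Sum the two double precision lanes of a 128-bit value. */
double sum2_doubles(__m128d v)
{
    double lanes[2];
    _mm_storeu_pd(lanes, v);
    return lanes[0] + lanes[1];
}
```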
2.2 AVX2
The AVX and AVX2 instruction sets operate on sixteen 256-bit (32 bytes) registers: YMM0 to YMM15 (only YMM0 to YMM7 are accessible in 32-bit mode). The lower 128 bits of each YMM register overlap with the XMM register of the same index.
2.3 AVX512
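The AVX512 instruction set provides 32 512-bit (64 bytes) registers: ZMM0 to ZMM31. The lower 256 bits of ZMM0 to ZMM15 overlap with the registers YMM0 to YMM15, and the lower halves of ZMM16 to ZMM31 are accessible as YMM16 to YMM31 and XMM16 to XMM31. The table below covers register overlapping over all instruction sets: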
Instruction set | AVX512 | AVX2 | SSE |
---|---|---|---|
Bits | 511..256 | 255..128 | 127..0 |
Register 0 | ZMM0 | YMM0 | XMM0 |
Register 1 | ZMM1 | YMM1 | XMM1 |
Register 2 | ZMM2 | YMM2 | XMM2 |
Register 3 | ZMM3 | YMM3 | XMM3 |
Register 4 | ZMM4 | YMM4 | XMM4 |
Register 5 | ZMM5 | YMM5 | XMM5 |
Register 6 | ZMM6 | YMM6 | XMM6 |
Register 7 | ZMM7 | YMM7 | XMM7 |
Register 8 | ZMM8 | YMM8 | XMM8 |
Register 9 | ZMM9 | YMM9 | XMM9 |
Register 10 | ZMM10 | YMM10 | XMM10 |
Register 11 | ZMM11 | YMM11 | XMM11 |
Register 12 | ZMM12 | YMM12 | XMM12 |
Register 13 | ZMM13 | YMM13 | XMM13 |
Register 14 | ZMM14 | YMM14 | XMM14 |
Register 15 | ZMM15 | YMM15 | XMM15 |
Register 16 | ZMM16 | YMM16 | XMM16 |
Register 17 | ZMM17 | YMM17 | XMM17 |
Register 18 | ZMM18 | YMM18 | XMM18 |
Register 19 | ZMM19 | YMM19 | XMM19 |
Register 20 | ZMM20 | YMM20 | XMM20 |
Register 21 | ZMM21 | YMM21 | XMM21 |
Register 22 | ZMM22 | YMM22 | XMM22 |
Register 23 | ZMM23 | YMM23 | XMM23 |
Register 24 | ZMM24 | YMM24 | XMM24 |
Register 25 | ZMM25 | YMM25 | XMM25 |
Register 26 | ZMM26 | YMM26 | XMM26 |
Register 27 | ZMM27 | YMM27 | XMM27 |
Register 28 | ZMM28 | YMM28 | XMM28 |
Register 29 | ZMM29 | YMM29 | XMM29 |
Register 30 | ZMM30 | YMM30 | XMM30 |
Register 31 | ZMM31 | YMM31 | XMM31 |
3 Deliverable
For this lab, you have to convert the following C functions from the N-Body interaction simulation provided in todo/nbody0.c into multiple assembly versions using scalar and vector operations.
    //
    vector add_vectors(vector a, vector b)
    {
      vector c = { a.x + b.x, a.y + b.y };

      return c;
    }

    //
    vector scale_vector(double b, vector a)
    {
      vector c = { b * a.x, b * a.y };

      return c;
    }

    //
    vector sub_vectors(vector a, vector b)
    {
      vector c = { a.x - b.x, a.y - b.y };

      return c;
    }

    //
    double mod(vector a)
    {
      return sqrt(a.x * a.x + a.y * a.y);
    }
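As a hint at the difference between the scalar and packed approaches, the sketch below rewrites add_vectors in two ways using GCC/Clang inline assembly (AT&T operand order, which is the default for these compilers). It is only one possible shape, not the expected solution: the vector struct layout is an assumption inferred from the functions above, the _sd and _pd function names are hypothetical, and the lab may expect standalone assembly files instead of inline assembly.

```c
#include <emmintrin.h>  /* __m128d, _mm_loadu_pd, _mm_storeu_pd */

/* Assumed layout: two contiguous doubles. The real type is in nbody0.c. */
typedef struct { double x, y; } vector;

/* Scalar version: one ADDSD per component. */
vector add_vectors_sd(vector a, vector b)
{
    vector c = a;
    __asm__("addsd %1, %0" : "+x"(c.x) : "x"(b.x));  /* c.x += b.x */
    __asm__("addsd %1, %0" : "+x"(c.y) : "x"(b.y));  /* c.y += b.y */
    return c;
}

/* Packed version: a single ADDPD adds x and y together. */
vector add_vectors_pd(vector a, vector b)
{
    __m128d va = _mm_loadu_pd(&a.x);  /* va = { a.x, a.y } */
    __m128d vb = _mm_loadu_pd(&b.x);  /* vb = { b.x, b.y } */
    vector c;
    __asm__("addpd %1, %0" : "+x"(va) : "x"(vb));    /* va += vb */
    _mm_storeu_pd(&c.x, va);
    return c;
}
```

The "x" constraint asks the compiler to place the operand in an SSE register, so the same pattern should carry over to the subtraction and multiplication functions by swapping the mnemonic.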
The provided simulation code uses the RDTSC instruction to measure the performance of the simulation routine at every iteration. The RDTSC instruction returns the number of cycles elapsed since the CPU was started. In this case, it is used to evaluate the number of cycles elapsed during the execution of the simulation function. The measurement is VERY dependent on CPU frequency and is only precise when the measured code takes at least 500 cycles.
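The measurement pattern can be sketched in C as follows, assuming GCC or Clang, where the __rdtsc() intrinsic from x86intrin.h reads the counter. The cycles_for and spin helpers are hypothetical and are not the functions used in nbody0.c.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc() on GCC and Clang */

/* Hypothetical helper: counter ticks elapsed while fn(arg) runs once. */
uint64_t cycles_for(void (*fn)(void *), void *arg)
{
    uint64_t start = __rdtsc();
    fn(arg);
    uint64_t stop = __rdtsc();
    return stop - start;
}

/* Hypothetical workload: a floating-point loop of *arg iterations. */
void spin(void *arg)
{
    volatile double acc = 0.0;  /* volatile keeps the loop from being removed */
    int n = *(int *)arg;
    for (int i = 0; i < n; i++)
        acc += i * 0.5;
}
```

Calling cycles_for(spin, &n) with a sufficiently large n keeps the measured region well above the 500-cycle threshold mentioned above.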
In order for the measurements to be valid, you have to follow these steps:
0 - If you are using a laptop, plug it into a wall socket
1 - CPU governor and frequency
The CPU governor is the part of the OS that handles the dynamic frequency management of CPU cores. There are multiple governors available under the two most common CPU drivers:
- The intel_pstate driver provides the following governors: performance, powersave
- The acpi-cpufreq driver provides the following governors: conservative, ondemand, userspace, powersave, performance, schedutil
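On Linux, the active governor of a core is also exposed as a text file under sysfs. The following C sketch reads it, assuming the usual cpufreq path /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor; the read_governor helper is hypothetical and simply reports failure when the file is absent (for example in a virtual machine or a container).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the governor name of the given CPU into buf.
 * Returns 0 on success, -1 if the sysfs file cannot be read. */
int read_governor(int cpu, char *buf, size_t len)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (!fgets(buf, (int)len, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    buf[strcspn(buf, "\n")] = '\0';  /* strip the trailing newline */
    return 0;
}
```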
In order to check the CPU driver and governor configurations, you can use the following command:
$ sudo cpupower frequency-info
This command will return, depending on your CPU driver, the following:
1.1 - The Intel Pstate driver
    analyzing CPU 0:
      driver: intel_pstate
      CPUs which run at the same hardware frequency: 0
      CPUs which need to have their frequency coordinated by software: 0
      maximum transition latency: Cannot determine or is not supported.
      hardware limits: 800 MHz - 3.60 GHz
      available cpufreq governors: performance powersave
      current policy: frequency should be within 800 MHz and 3.60 GHz.
                      The governor "powersave" may decide which speed to use
                      within this range.
      current CPU frequency: Unable to call hardware
      current CPU frequency: 955 MHz (asserted by call to kernel)
      boost state support:
        Supported: no
        Active: no
In this case, you should use the following command to set the CPU governor for all CPU cores:
$ sudo cpupower -c all frequency-set -g performance
1.2 - The ACPI driver
    analyzing CPU 0:
      driver: acpi-cpufreq
      CPUs which run at the same hardware frequency: 0
      CPUs which need to have their frequency coordinated by software: 0
      maximum transition latency: Cannot determine or is not supported.
      hardware limits: 2.20 GHz - 3.70 GHz
      available frequency steps: 3.70 GHz, 3.20 GHz, 2.20 GHz
      available cpufreq governors: conservative ondemand userspace powersave performance schedutil
      current policy: frequency should be within 2.20 GHz and 3.70 GHz.
                      The governor "schedutil" may decide which speed to use
                      within this range.
      current CPU frequency: 2.20 GHz (asserted by call to hardware)
      boost state support:
        Supported: yes
        Active: yes
        Boost States: 0
        Total States: 3
        Pstate-P0: 3700MHz
        Pstate-P1: 3200MHz
        Pstate-P2: 2200MHz
In this case, you should set the frequency of the target core to the maximum frequency available on your CPU using the following commands:
$ sudo cpupower -c all frequency-set -g userspace
$ sudo cpupower -c TARGET_CORE frequency-set -f MAX_FREQ
2 - Run the program using the taskset command to pin the process on the target core and redirect the output containing the performance measurement into a file:
$ sudo taskset -c TARGET_CORE ./nbody0 > out0.dat
Once you have produced the multiple assembly versions (scalar and vector) of the specified C functions in the N-Body simulation, you can draw comparison plots of the performance of each version using GNUPlot.
An example of a GNUPlot script to compare the C, SSE scalar, and SSE packed versions:
    set term png size 1900,1000
    set grid
    set ylabel "Latency in cycles"
    set xlabel "Simulation iteration"
    plot "out0.dat" w lp t "C version", "out0_sd.dat" w lp t "SSE scalar", "out0_pd.dat" w lp t "SSE packed"
4 Important note
If you are using a virtual machine, the performance measurements will most likely be wrong/invalid.