Index of /iatic4/aob/TP2

[ICO]NameLast modifiedSizeDescription

[PARENTDIR]Parent Directory   -  
[DIR]todo/ 2024-11-14 10:28 -  

x86 assembly programming (floating-point)

x86 assembly programming (floating-point)

Table of Contents

1 Introduction

The aim of this lab is to introduce the floating-point instructions available on the x86 architecture. You are given 6 examples showcasing the use of floating-point instructions in scalar and vector mode for 32-bit and 64-bit floating-point values.

For this lab, you have to convert a given set of C functions within the nbody0.c simulation code into their respective assembly versions.

2 SSE, AVX2, and AVX512

SSE (Streaming SIMD Extension) and AVX (Advanced Vector eXtension) are SIMD extensions added to the x86 instruction set in order to speed up certain categories of code patterns by introducing new instructions operating not only on scalars but on vectors (packets of elements). These two instruction sets provide both scalar and vector instructions covering the single and double precision floating-point formats.

For this lab, only SSE instructions are needed.

The SSE and AVX instructions have a predefined nomenclature depending on the scalar/vector nature of the operation as well as the data types. Scalar single precision operations are suffixed with SS (Scalar Single precision) and double precision operations with SD (Scalar Double precision). For packed, or vector, operations the suffix can be either PS (Packed Single precision) for single precision, or PD (Packed Double precision) for double precision. AVX instructions must start with a V (VEX instruction extension).

The examples provided showcase multiple arithmetic and memory instructions using the previously described naming convention.

2.1 SSE registers

The SSE instruction set extends the x86 instruction set not only with new operations but also additional registers. Eight 128-bit (16 bytes) registers (from XMM0 to XMM7) are available for the SSE instructions to operate on. These registers can hold 4 single precision floating-point values, or 2 double precision floating-point values.

2.2 AVX2

The AVX2 instruction set adds 16 256-bit (32 bytes) new registers to the mix: YMM0 to YMM15. The first 8 YMM registers overlap with the first 8 XMM registers.

2.3 AVX512

The AVX512 instruction set adds 32 512-bit (64 bytes) registers. The first 16 registers overlap with the AVX2 registers. The table below covers register overlapping over all instruction sets:

Instruction set AVX512 AVX2 SSE
Bits 511..256 255..28 127..0
  ZMM0 YMM0 XMM0
  ZMM1 YMM1 XMM1
  ZMM2 YMM2 XMM2
  ZMM3 YMM3 XMM3
  ZMM4 YMM4 XMM4
  ZMM5 YMM5 XMM5
  ZMM6 YMM6 XMM6
  ZMM7 YMM7 XMM7
  ZMM8 YMM8 XMM8
  ZMM9 YMM9 XMM9
  ZMM10 YMM10 XMM10
  ZMM11 YMM11 XMM11
  ZMM12 YMM12 XMM12
  ZMM13 YMM13 XMM13
  ZMM14 YMM14 XMM14
  ZMM15 YMM15 XMM15
  ZMM16 YMM16 XMM16
  ZMM17 YMM17 XMM17
  ZMM18 YMM18 XMM18
  ZMM19 YMM19 XMM19
  ZMM20 YMM20 XMM20
  ZMM21 YMM21 XMM21
  ZMM22 YMM22 XMM22
  ZMM23 YMM23 XMM23
  ZMM24 YMM24 XMM24
  ZMM25 YMM25 XMM25
  ZMM26 YMM26 XMM26
  ZMM27 YMM27 XMM27
  ZMM28 YMM28 XMM28
  ZMM29 YMM29 XMM29
  ZMM30 YMM30 XMM30
  ZMM31 YMM31 XMM31

3 Deliverable

For this lab, you have to convert the following C functions in the N-Body interaction simulation provided in the todo/nbody0.c directory into multiple assembly versions using scalar and vector operations.

//
vector add_vectors(vector a, vector b)
{
  vector c = { a.x + b.x, a.y + b.y };

  return c;
}

//
vector scale_vector(double b, vector a)
{
  vector c = { b * a.x, b * a.y };

  return c;
}

//
vector sub_vectors(vector a, vector b)
{
  vector c = { a.x - b.x, a.y - b.y };

  return c;
}

//
double mod(vector a)
{
  return sqrt(a.x * a.x + a.y * a.y);
}

The provided simulation code uses the RDTSC instruction to measure the performance of the simulation routine for every iteration. The RDTSC instruction returns the number of cycles elapsed starting from when the CPU was started. I nthis case, it used to evaluate the number of cycles elapsed during the execution of the simulation function. This instruction is VERY dependent on CPU frequency and can only be precise when measured target takes at least 500 cycles.

In order for the measurements to be valid, you have to follow to following steps:

0 - If you are using a laptop, plug it to the wall socket

1 - CPU governor and frequency

The CPU governor is the part of the OS that handles the dynamic frequency management of CPU cores. There are multiple governors available under the two most common CPU drivers:

  • The intelpstate driver provides the following governors: performance, powersave
  • The acpi-cpufreq driver provides the following governors: conservative, ondemand, userspace, powersave, performance, schedutil

In order to check the CPU driver and governor configurations, you can use the following command:

$ sudo cpupower frequency-info

This command will return, depending on your CPU driver, the following:

1.1 - The Intel Pstate driver


analyzing CPU 0:
driver: intel_pstate
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency:  Cannot determine or is not supported.
hardware limits: 800 MHz - 3.60 GHz
available cpufreq governors: performance powersave
current policy: frequency should be within 800 MHz and 3.60 GHz.
                The governor "powersave" may decide which speed to use
                within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 955 MHz (asserted by call to kernel)
boost state support:
  Supported: no
  Active: no

If this case, you should use the following command to set the CPU governor for all CPU cores:

$ sudo cpupower -c all -g performance

1.2 - The ACPI driver


analyzing CPU 0:
driver: acpi-cpufreq
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency:  Cannot determine or is not supported.
hardware limits: 2.20 GHz - 3.70 GHz
available frequency steps:  3.70 GHz, 3.20 GHz, 2.20 GHz
available cpufreq governors: conservative ondemand userspace powersave performance schedutil
current policy: frequency should be within 2.20 GHz and 3.70 GHz.
                The governor "schedutil" may decide which speed to use
                within this range.
current CPU frequency: 2.20 GHz (asserted by call to hardware)
boost state support:
  Supported: yes
  Active: yes
  Boost States: 0
  Total States: 3
  Pstate-P0:  3700MHz
  Pstate-P1:  3200MHz
  Pstate-P2:  2200MHz

In this case, you should set the frequency of the target code to the maximum frequency available in your CPU using the following command:

$ sudo cpupower -c all -g userspace
$ sudo cpupower -c TARGET_CORE -f MAX_FREQ

2 - Run the program using the taskset command to pin the process on the target core and redirect the output containing the performance measurement into a file:

$ sudo taskset -c TARGET_CORE ./nbody0 > out0.dat

Once you have produced the multiple assembly versions (scalar and vector)of the specified C functions in the N-Body simulation, you can draw comparison plots of the performance of each version using GNUPlot.

An example of a GNUPlot script to compare the C, SSE scalar, and SSE packed versions:


set term png size 1900,1000

set grid

set ylabel "Latency in cycles"

set xlabel "Simulation iteration"

plot "out0.dat" w lp "C version", "out0_sd.dat" w lp "SSE scalar", "out0_pd.dat" w lp "SSE packed"

4 Important note

If you are using a virtual machine, the performance measurements will most likely be wrong/invalid.

Date: 2021

Author: yaspr

Created: 2021-11-02 Tue 15:10

Validate