# Taking advantage of Arm Advanced SIMD instructions

## Background

The Arm-v8 architecture includes Advanced-SIMD instructions (NEON) that help boost performance for many applications that can take advantage of the wide registers.

Many applications and libraries already take advantage of Arm's Advanced-SIMD, but this guide is written for developers writing new code or libraries. It covers various ways to take advantage of these instructions, whether through compiler auto-vectorization or by writing intrinsics.

It then explains how to build portable code that detects at runtime which instructions are available on the specific cores, so developers can build one binary that supports cores with different capabilities. For example, one binary could run on Graviton1, Graviton2, and an arbitrary set of Android devices with Arm v8.x support.

## Compiler-driven auto-vectorization

Compilers keep improving at taking advantage of the SIMD instructions without explicit guidance from developers or a specific coding style.

In general, GCC 9 has good support for auto-vectorization, while GCC 10 has shown impressive improvement over GCC 9 in most cases.

Compiling with *-fopt-info-vec-missed* is good practice to check which loops were not vectorized.

### How minor code changes improve auto-vectorization

The following example was run on Graviton2, with Ubuntu 20.04 and gcc 9.3. Different combinations of server and compiler version may show different results.

The starting code looked like:

```
// test.c
...
float a[1024*1024];
float b[1024*1024];
float c[1024*1024];
.....
for (j=0; j<128;j++) { // outer loop, not expected to be vectorized
  for (i=0; i #include #endif
```

## Runtime detection of supported SIMD instructions

While each Arm architecture version mandates support for specific instructions, certain instructions are optional for a specific version of the architecture.
For example, a CPU core compliant with the Arm-v8.4 architecture must support dot-product, but dot-product instructions are optional in Arm-v8.2 and Arm-v8.3. Graviton2 is Arm-v8.2 compliant, but supports both CRC and dot-product instructions.

A developer who wants to build an application or library that detects the supported instructions at runtime can follow this example:

```
#include <sys/auxv.h>
......
  // Read the hardware capability bits reported by the kernel
  uint64_t hwcaps = getauxval(AT_HWCAP);
  has_crc_feature = hwcaps & HWCAP_CRC32 ? true : false;
  has_lse_feature = hwcaps & HWCAP_ATOMICS ? true : false;
  has_fp16_feature = hwcaps & HWCAP_FPHP ? true : false;
  has_dotprod_feature = hwcaps & HWCAP_ASIMDDP ? true : false;
  has_sve_feature = hwcaps & HWCAP_SVE ? true : false;
```

The full list of arm64 hardware capabilities is defined in the [glibc header file](https://github.com/bminor/glibc/blob/master/sysdeps/unix/sysv/linux/aarch64/bits/hwcap.h) and in the [Linux kernel](https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/hwcap.h).

## Porting code with SSE/AVX intrinsics to NEON

### Detecting arm64 systems

Projects may fail to build on arm64 with `error: unrecognized command-line option '-msse2'`, or `-mavx`, `-mssse3`, etc. These compiler flags enable x86 vector instructions. The presence of this error means that the build system may be missing detection of the target system, and continues to use the x86 target-feature compiler flags when compiling for arm64.
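The same missing detection often shows up in the source itself, where x86 intrinsic headers are included unconditionally. The following is a minimal sketch, not taken from any particular project, of guarding vector code on the compiler's predefined architecture macros so that one source file builds on both x86 and arm64. The helper name `add4` is hypothetical and only serves as an illustration.

```c
#include <stddef.h>

#if defined(__aarch64__)
#include <arm_neon.h>   /* NEON intrinsics on arm64 */
#elif defined(__SSE2__)
#include <emmintrin.h>  /* SSE/SSE2 intrinsics, x86 only */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for exactly four floats. */
void add4(const float *a, const float *b, float *out)
{
#if defined(__aarch64__)
    /* arm64 path: one 128-bit load per input, one add, one store */
    vst1q_f32(out, vaddq_f32(vld1q_f32(a), vld1q_f32(b)));
#elif defined(__SSE2__)
    /* x86 path: unaligned loads and store, so no alignment is assumed */
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    /* Portable scalar fallback for any other target */
    for (size_t i = 0; i < 4; i++)
        out[i] = a[i] + b[i];
#endif
}
```

With a guard like this in the source, the build system only needs to stop passing `-msse2` and similar flags when the target is arm64.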
To detect an arm64 system, the build system can use:

```
# (test $(uname -m) = "aarch64" && echo "arm64 system") || echo "other system"
```

Another way to detect an arm64 system is to compile, run, and check the return value of a C program:

```
# cat << EOF > check-arm64.c
int main () {
#ifdef __aarch64__
  return 0;
#else
  return 1;
#endif
}
EOF

# gcc check-arm64.c -o check-arm64
# (./check-arm64 && echo "arm64 system") || echo "other system"
```

### Translating x86 intrinsics to NEON

When programs contain code with x86 intrinsics, the following procedure can help to quickly obtain a working program on Arm, assess the performance of the program running on Graviton processors, profile hot paths, and improve the quality of code on the hot paths.

To quickly get a prototype running on Arm, one can use [SIMDe (SIMD everywhere)](https://github.com/simd-everywhere/simde), a translator of x86 intrinsics to NEON. For example, to port code using AVX2 intrinsics to Graviton, a developer could add the following code:

```
#define SIMDE_ENABLE_NATIVE_ALIASES
#include "simde/x86/avx2.h"
```

SIMDe provides a quick starting point for porting performance-critical code to Arm. It shortens the time needed to get a working program on Arm, which can then be used to extract profiles and to identify hot paths in the code. Once a profile is established, the hot paths can be rewritten directly with NEON intrinsics to avoid the overhead of the generic translation.

## Additional resources

* [Neon Intrinsics](https://developer.arm.com/architectures/instruction-sets/intrinsics/)
* [Coding for Neon](https://developer.arm.com/documentation/102159/latest/)
* [Neon Programmer's Guide for Armv8-A](https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/neon-programmers-guide-for-armv8-a)