ARM NEON optimization



淡泊的猪 · Published 2018-09-13 21:00:00

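NEON is an implementation of SIMD.

There are several ways to use NEON:

1. NEON libraries

Use a third-party open-source library and call its functions directly.

2. Auto-vectorization

Let the compiler vectorize the code automatically.

The relevant GCC options are: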

-mcpu=cpu-name, where cpu-name is the name of the processor in lower case. If you do not specify the processor to use, GCC will use its built-in default. The default can vary depending on how the compiler was originally built, and the generated code might not execute or might execute slowly on the CPU that you have.

-mfpu=neon. GCC does not try to determine the FPU from the -mcpu option. The -mfpu option controls which floating-point and SIMD instructions are available.

-mfloat-abi

The option -mfloat-abi takes three possible values:

-mfloat-abi=soft Does not use any FPU and NEON instructions. Uses only the core register set. Emulates all floating-point operations using library calls.

-mfloat-abi=softfp Uses the same calling conventions as -mfloat-abi=soft, but uses floating-point and NEON instructions as appropriate. Applications compiled with this option can be linked with a soft float library. If the relevant hardware instructions are available, then you can use this option to improve the performance of code and still have the code conform to a soft-float environment.

-mfloat-abi=hard Uses the floating-point and NEON instructions as appropriate and also changes the ABI calling conventions in order to generate more efficient function calls. Floating-point and vector types can be passed between functions in the NEON registers, which significantly reduces the amount of copying. This also means that fewer calls need to pass arguments on the stack.
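Which of the three modes a file was actually compiled with can be read back from the compiler's predefined macros. The sketch below is an illustration, not part of the quoted documentation: it assumes the ACLE macros `__ARM_NEON` and `__ARM_PCS_VFP` and GCC's `__SOFTFP__`, and the two function names are hypothetical.

```c
/* Returns "hard", "softfp", "soft", or "not-arm32", based on predefined macros.
   Assumption: __ARM_PCS_VFP (ACLE) marks the hard-float calling convention and
   __SOFTFP__ (GCC) marks builds that emit no FPU instructions at all. */
static const char *float_abi_name(void)
{
#if !defined(__arm__)
    return "not-arm32";          /* not a 32-bit ARM target */
#elif defined(__ARM_PCS_VFP)
    return "hard";               /* FP arguments travel in VFP/NEON registers */
#elif defined(__SOFTFP__)
    return "soft";               /* FP done by library calls */
#else
    return "softfp";             /* FPU used, soft-float calling convention */
#endif
}

/* Returns 1 when the compiler is allowed to emit NEON instructions. */
static int neon_enabled_at_compile_time(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}
```

Printing `float_abi_name()` from a small test binary is a quick way to confirm that the flags passed on the command line took effect.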

-ftree-vectorize

Compiling at optimization level -O3 implies -ftree-vectorize

IEEE compliance

For armv7 ISA (and variants)

Floating-point operations in the NEON unit use the IEEE single-precision format for holding normal operating range values. To minimize the amount of power needed by the NEON unit and to maximize performance, the NEON unit does not fully comply with the IEEE standard if the inputs or the results are denormal or NaN values, which are outside the normal operating range. GCC's default configuration is to generate code that strictly conforms to the rules for IEEE floating-point arithmetic. Hence, even with the vectorizer enabled, GCC does not use NEON instructions to vectorize floating-point code by default. GCC provides a number of command-line options to precisely control which level of adherence to the IEEE standard is required. In most cases, it is safe to use the option -ffast-math to relax the rules and enable vectorization. Alternatively, you can use the option -Ofast on GCC 4.6 or later to achieve the same effect. It switches on -O3 and a number of other optimizations to get the best performance from your code. To understand all the effects of using the -ffast-math option, see the GCC documentation.
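The denormal case can be made concrete with a small model. The sketch below is only an illustration in portable C: `flush_to_zero_model` is a hypothetical helper that imitates flush-to-zero behaviour, it is not how the hardware is implemented, and it assumes the code is built without -ffast-math so that the scalar division follows IEEE rules.

```c
#include <float.h>

/* Models flush-to-zero: a denormal (subnormal) value is replaced by a zero
   of the same sign; every other value, including NaN, passes through. */
static float flush_to_zero_model(float x)
{
    float mag = (x < 0.0f) ? -x : x;
    if (mag != 0.0f && mag < FLT_MIN)
        return (x < 0.0f) ? -0.0f : 0.0f;
    return x;
}

/* Under IEEE rules, halving the smallest normal float yields a nonzero
   denormal. volatile stops the compiler from folding the division away. */
static float half_of_smallest_normal(void)
{
    volatile float m = FLT_MIN;
    return m / 2.0f;
}
```

With strict IEEE arithmetic `half_of_smallest_normal()` is nonzero, while the flush-to-zero model turns it into 0. A difference of this kind is why GCC will not silently move float code onto the ARMv7 NEON unit.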

For armv8+ ISA (and variants) [Update]

NEON is now fully IEEE 754 compliant, and from a programmer's (and compiler's) point of view, there is actually not too much difference. Double precision has been vectorized. From a micro-architecture point of view I kind of doubt they are even different hardware units. ARM does document scalar and vector instructions separately but both are part of "Advanced SIMD."

If you want more information about what GCC is doing, use the options -fdump-tree-vect and -ftree-vectorizer-verbose=level, where level is a number in the range of 1 to 9. Lower values output less information. Newer GCC releases deprecate -ftree-vectorizer-verbose in favor of -fopt-info-vec and -fopt-info-vec-missed. While most of the information this generates is only of interest to compiler developers, you might find hints in the output that explain why your code is not being vectorized as expected.

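Pros: no source changes are required, so the code stays consistent across targets.

Cons: the gains are limited unless you write the code in a way that gives the compiler more information. See neon_programmers_guide_V1.0.pdf for the details of what to watch for.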
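A rough idea of what "giving the compiler more information" looks like, as an illustrative sketch rather than something taken from the guide: `scale_plain` and `scale_hinted` are hypothetical names, and the two hints shown (`restrict`, and a main loop whose trip count is a multiple of four) are the same ones the test program in this post relies on.

```c
#include <stddef.h>

/* Baseline: the compiler must assume dst and src may overlap,
   and it knows nothing about n. */
void scale_plain(int *dst, const int *src, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Hinted: restrict promises that dst and src do not overlap, and the main
   loop runs a multiple of 4 times, which fits four 32-bit lanes of a
   128-bit register. A short scalar loop handles the leftover elements. */
void scale_hinted(int *restrict dst, const int *restrict src, size_t n, int k)
{
    size_t vec_n = n & ~(size_t)3;   /* n rounded down to a multiple of 4 */
    size_t i;

    for (i = 0; i < vec_n; i++)
        dst[i] = src[i] * k;
    for (; i < n; i++)               /* scalar tail */
        dst[i] = src[i] * k;
}
```

Both functions compute the same result; whether the hinted version is actually vectorized better depends on the compiler version and flags, so it is worth checking the vectorizer's report or the disassembly.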

3. Compiler intrinsics

NEON intrinsics are function calls that the compiler replaces with an appropriate NEON instruction or sequence of NEON instructions. Intrinsics provide almost as much control as writing assembly language, but leave the allocation of registers to the compiler, so that developers can focus on the algorithms. The compiler can also perform instruction scheduling to remove pipeline stalls for the specified target processor. This leads to more maintainable source code than using assembly language. NEON intrinsics are supported by Arm Compilers, GCC and LLVM.
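As a minimal sketch of what intrinsics look like in practice, here is an "add a constant to every element" loop written with `arm_neon.h`. The function name `add_int_intrinsics` is hypothetical, and the `#if` guard is an assumption added so that the same file still compiles, with a plain scalar loop, on targets where NEON is not enabled.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* pa[i] = pb[i] + x for i in [0, n) */
void add_int_intrinsics(int32_t *restrict pa, const int32_t *restrict pb,
                        size_t n, int32_t x)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    int32x4_t vx = vdupq_n_s32(x);            /* x copied into all 4 lanes */
    for (; i + 4 <= n; i += 4) {
        int32x4_t vb = vld1q_s32(pb + i);     /* load 4 ints */
        vst1q_s32(pa + i, vaddq_s32(vb, vx)); /* add and store 4 ints */
    }
#endif
    for (; i < n; i++)                        /* tail, or whole loop without NEON */
        pa[i] = pb[i] + x;
}
```

The compiler picks the Q registers and schedules the loads and stores; the source only names the operations.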

4. Assembler code

Write NEON instructions directly.

Enabling NEON in hardware

ARMv7

From reset, the VFP extension is disabled. Any attempt to execute a VFP instruction results in an Undefined Instruction exception being taken. To enable software to access VFP features ensure that:

Access to CP10 and CP11 is enabled for the appropriate privilege level. See VFP register access.
If Non-secure access to the VFP features is required, the access flags for CP10 and CP11 in the NSACR must be set to 1. See VFP register access.

In addition, software must set the FPEXC.EN bit to 1 to enable most VFP operations. See Floating-Point Exception Register.

When VFP operation is disabled because FPEXC.EN is 0, all VFP instructions are treated as undefined instructions except for execution of the following in privileged modes:

a VMSR to the FPEXC or FPSID register
a VMRS from the FPEXC, FPSID, MVFR0, or MVFR1 registers.

To use the FPU in Secure state only

To use the FPU in Secure state only, define the CPACR and Floating-Point Exception (FPEXC) registers to enable the FPU:

Set the CPACR for access to CP10 and CP11 (the FPU coprocessors):
```
LDR r0, =(0xF << 20)
MCR p15, 0, r0, c1, c0, 2
```
Set the FPEXC EN bit to enable the FPU:
```
MOV r3, #0x40000000
VMSR FPEXC, r3
```
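The two constants used above follow from the register layouts. A minimal sketch in C, assuming the ARMv7-A field positions (CP10 in CPACR bits [21:20], CP11 in CPACR bits [23:22], FPEXC.EN in bit 30). The function names are hypothetical, and the code only computes values; it does not access any system register.

```c
#include <stdint.h>

#define CPACR_CP10_SHIFT   20u   /* CPACR[21:20]: CP10 access field */
#define CPACR_CP11_SHIFT   22u   /* CPACR[23:22]: CP11 access field */
#define CPACR_FULL_ACCESS  0x3u  /* 0b11 = full access */
#define FPEXC_EN_SHIFT     30u   /* FPEXC.EN */

/* Returns cpacr with full access granted to both CP10 and CP11. */
static uint32_t cpacr_enable_cp10_cp11(uint32_t cpacr)
{
    return cpacr
         | (CPACR_FULL_ACCESS << CPACR_CP10_SHIFT)
         | (CPACR_FULL_ACCESS << CPACR_CP11_SHIFT);
}

/* Returns fpexc with the EN bit set. */
static uint32_t fpexc_enable(uint32_t fpexc)
{
    return fpexc | (1u << FPEXC_EN_SHIFT);
}
```

Starting from zero, these give `0x00F00000` (the `0xF << 20` loaded into r0) and `0x40000000` (the value moved into r3).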

To use the FPU in Secure state and Non-secure state

To use the FPU in Secure state and Non-secure state, first define the NSACR and then define the CPACR and FPEXC registers to enable the FPU.

Set bits [11:10] of the NSACR for access to CP10 and CP11 from both Secure and Non-secure states:
```
MRC p15, 0, r0, c1, c1, 2
ORR r0, r0, #2_11<<10 ; enable fpu
MCR p15, 0, r0, c1, c1, 2
```

Set the CPACR for access to CP10 and CP11:

LDR r0, =(0xF << 20)

MCR p15, 0, r0, c1, c0, 2

Set the FPEXC EN bit to enable the FPU:
```
MOV r3, #0x40000000
VMSR FPEXC, r3
```

At this point the Cortex-A5 processor can execute VFP instructions.

Note

Operation is UNPREDICTABLE if you configure the Coprocessor Access Control Register (CPACR) such that CP10 and CP11 do not have identical access permissions.
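A small check for the rule in this note, together with the NSACR mask used in the sequence above, can be sketched as follows. This assumes the same ARMv7-A field positions (CP10 in CPACR[21:20], CP11 in CPACR[23:22], NSACR bits [11:10]); the helper names are hypothetical and nothing here touches real hardware registers.

```c
#include <stdint.h>

/* Returns 1 when the CP10 and CP11 access fields of a CPACR value are equal. */
static int cpacr_cp10_cp11_match(uint32_t cpacr)
{
    uint32_t cp10 = (cpacr >> 20) & 0x3u;
    uint32_t cp11 = (cpacr >> 22) & 0x3u;
    return cp10 == cp11;
}

/* Returns nsacr with bits [11:10] set, allowing Non-secure access
   to CP10 and CP11 (the read-modify-write done by the ORR above). */
static uint32_t nsacr_allow_cp10_cp11(uint32_t nsacr)
{
    return nsacr | (0x3u << 10);
}
```

A value such as `0x3 << 20`, which grants access to CP10 only, fails the check and would fall under the UNPREDICTABLE case.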

Enabling the NEON unit in a Linux custom kernel

If you use a Linux custom kernel to run your application, you must enable the NEON unit. To enable the NEON unit, you must use the kernel configuration settings to select:

• Floating point emulation → VFP-format floating point maths

• Floating point emulation → Advanced SIMD (NEON) Extension support

If the NEON unit is disabled and the application tries to execute a NEON instruction, it throws an Undefined Instruction exception. The kernel uses this exception to enable the NEON unit and then executes the NEON instruction. The NEON unit remains enabled until there is a context switch. When a context switch is required, the kernel might disable the NEON unit to save time and power.

ARMv8

NEON is a mandatory part of the architecture, so it is enabled by default and no extra instructions are needed to enable it.

To ensure that the processor supports the NEON extension, you can issue the command:

cat /proc/cpuinfo | grep neon

If it supports the NEON extension, the output shows neon, for example:

Features : swp half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
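The same check can be done from a program. The sketch below is a hypothetical helper, `features_has_flag`, that looks for a flag as a whole word in a `Features` line; unlike `grep neon`, it does not match substrings. How the line is obtained (for example by reading /proc/cpuinfo with `fgets`) is left out. On 64-bit ARM kernels the corresponding flag is reported as `asimd` rather than `neon`, so a caller may want to test for both.

```c
#include <string.h>

/* Returns 1 if `flag` appears in `line` as a complete word, delimited by
   spaces, tabs, a colon, a newline, or the ends of the string. */
static int features_has_flag(const char *line, const char *flag)
{
    size_t flen = strlen(flag);
    const char *p = line;

    if (flen == 0)
        return 0;

    while ((p = strstr(p, flag)) != NULL) {
        int starts_word = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        char after = p[flen];
        int ends_word = after == '\0' || after == ' ' || after == '\t' || after == '\n';

        if (starts_word && ends_word)
            return 1;
        p += flen;   /* keep searching after this partial match */
    }
    return 0;
}
```

With the example line above, `features_has_flag(line, "neon")` returns 1, while a partial name such as `"vfpv"` returns 0.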

Test program:

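#include <stdio.h>

/* pa[i] = pb[i] + x; restrict and the (n & ~3) trip count help the vectorizer */
void add_int(int *restrict pa, int *restrict pb, unsigned int n, int x)
{
    unsigned int i;
    for (i = 0; i < (n & ~3); i++)
        pa[i] = pb[i] + x;
}

/* returns the sum of the first (n & ~3) elements of pa */
int sum_int(int *restrict pa, unsigned int n)
{
    unsigned int i;
    int sum = 0;
    for (i = 0; i < (n & ~3); i++)
        sum += pa[i];
    return sum;
}

int main(void)
{
    int pa[10000], pb[10000];
    /* unsigned, so that wrapping past 2^32 is well defined */
    unsigned int sum = 0;

    for (int i = 0; i < 10000; i++)
        pb[i] = i;

    for (int j = 0; j < 10000; j++) {
        add_int(pa, pb, 10000, j);
        /* accumulate over all passes; this is what produces the printed sum */
        sum += (unsigned int)sum_int(pa, 10000);
    }
    printf("sum=%d\n", (int)sum);
    return 0;
}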

arm-linux-gnueabihf-gcc neon_test.c -o neon_test

arm-linux-gnueabihf-gcc -O3 -mfpu=neon-vfpv4 -mfloat-abi=hard -mcpu=cortex-a53 neon_test.c -o neon_test_neon

Run objdump -d neon_test_neon to check whether NEON instructions were generated.

NEON instructions (and VFP instructions) all begin with the letter V.

Instructions are generally able to operate on different data types. The type is specified in the instruction encoding. The size is indicated with a suffix to the instruction. The number of elements is indicated by the specified register size.

For example, for the instruction VADD.I8 D0, D1, D2:

• VADD indicates a NEON ADD operation.

• The I8 suffix indicates that 8-bit integers are to be added.

• D0, D1 and D2 specify the 64-bit registers used (D0 for the result destination, D1 and D2 for the operands).

This instruction therefore performs eight additions in parallel, as there are eight 8-bit lanes in a 64-bit register.
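What "eight additions in parallel" means for the bits of the register can be shown with a scalar model. This is only a sketch of the lane semantics in portable C, with a hypothetical function name; it is not how the instruction is executed.

```c
#include <stdint.h>

/* Scalar model of VADD.I8 Dd, Dn, Dm: eight independent 8-bit additions.
   Each lane wraps modulo 256 and no carry crosses into the next lane. */
static uint64_t vadd_i8_model(uint64_t dn, uint64_t dm)
{
    uint64_t dd = 0;

    for (int lane = 0; lane < 8; lane++) {
        uint8_t a = (uint8_t)(dn >> (8 * lane));
        uint8_t b = (uint8_t)(dm >> (8 * lane));
        uint8_t sum = (uint8_t)(a + b);          /* wraps within the lane */
        dd |= (uint64_t)sum << (8 * lane);
    }
    return dd;
}
```

For example, adding `0x0101010101010101` to `0x01020304050607FF` gives `0x0203040506070800`: the lowest lane wraps from 0xFF to 0x00 without disturbing its neighbour, whereas an ordinary 64-bit addition would carry into it and give `0x0203040506070900`.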

VFP and NEON share some instructions and sit in the same power domain.

Neon and VFP share the same large register file. These registers are separate from the ARM core registers. The Neon/VFP register file is 256 bytes, as shown in the diagram.

Figure: Neon/VFP register file

The Neon Register file has a dual view:

32 × 64-bit registers (the Dx registers)
16 × 128-bit registers (the Qx registers)

The VFP Register file also has a dual view:

32 × 64-bit registers (the Dx registers)
32 × 32-bit registers (the Sx registers; only half of the register file can be viewed as 32-bit registers)

From the Neon point of view: register Q0 may be accessed as either Q0 or D0:D1

From the VFP point of view: register D0 may be accessed as either D0 or S0:S1
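The overlapping views can be modelled with a C union. This is a sketch for illustration only: the type and function names are hypothetical, and it models the aliasing described above (Qn overlaps D(2n) and D(2n+1), Dn overlaps S(2n) and S(2n+1) for the first 16 D registers), not any real access to the hardware registers.

```c
#include <stdint.h>

/* One 256-byte block seen through the different register views. */
typedef union {
    uint8_t  bytes[256];
    uint64_t d[32];                      /* D0..D31 */
    uint32_t s[32];                      /* S0..S31: first 128 bytes only (D0..D15) */
    struct { uint64_t lo, hi; } q[16];   /* Qn = D(2n) (lo) and D(2n+1) (hi) */
} neon_vfp_regfile_model;

/* Writes S3 and checks that the change is visible in D1 and in the upper
   half of Q0, while D0 is untouched. Returns 1 if all of that holds. */
static int s3_write_lands_in_d1_and_q0(void)
{
    neon_vfp_regfile_model rf = { {0} };   /* all 256 bytes zero */

    rf.s[3] = 1u;
    return rf.d[0] == 0 && rf.d[1] != 0 && rf.q[0].hi == rf.d[1];
}
```

`sizeof(neon_vfp_regfile_model)` is 256, matching the size of the register file given above.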

There are 2 paths or pipelines through Neon:

Integer and fixed point (supports 8 bit, 16 bit, 32 bit integers)
Single precision floating point (supports 32 bit floating point)

VFP has a single path:

Single or double precision floating point (supports 32 bit and 64 bit floating point)

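If an instruction uses a Q register, it is definitely a NEON instruction.

Instructions that exist only in NEON are another indicator. For a summary of the NEON and VFP instructions, see the official ARM documentation: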

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489i/CJABFHEJ.html

Test results:

[root@buildroot data]# time ./neon_test
sum=-827379968

real 0m3.253s
user 0m3.236s
sys 0m0.020s
[root@buildroot data]# time ./neon_test_neon
sum=-827379968

real 0m0.270s
user 0m0.252s
sys 0m0.016s
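Both builds print the same value, and the elapsed time drops from 3.253 s to 0.270 s, roughly a 12× speedup. The printed `sum=-827379968` is what results when the `sum_int` result of every pass is accumulated over all 10000 passes and the total is truncated to 32 bits. The sketch below reproduces that arithmetic in 64 bits; the function names are hypothetical.

```c
#include <stdint.h>

/* Total of (i + j) over all i and j in [0, n), computed in 64 bits.
   Each of the two indices contributes n * (0 + 1 + ... + (n - 1)). */
static int64_t accumulated_total(int64_t n)
{
    int64_t sum_0_to_n_minus_1 = n * (n - 1) / 2;
    return 2 * n * sum_0_to_n_minus_1;
}

/* Keeps the low 32 bits of v and reinterprets them as a signed value. */
static int32_t wrap_to_int32(int64_t v)
{
    uint32_t low = (uint32_t)v;   /* reduction modulo 2^32 */

    if (low >= 0x80000000u)
        return (int32_t)((int64_t)low - 4294967296LL);
    return (int32_t)low;
}
```

For n = 10000 the full total is 999900000000, and `wrap_to_int32` of that value is -827379968, matching the program output.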

References:

1. neon_programmers_guide_V1.0.pdf

2. https://developer.arm.com/technologies/neon

3. http://processors.wiki.ti.com/index.php/Cortex-A8