Friday, 16 February 2018

Exploring single instruction/multiple data (SIMD) vectorization, and the auto-vectorization capabilities of the GCC compiler

What is SIMD Vectorization?

A vector is an instruction operand containing a set of data elements packed into a one-dimensional array. The elements can be integer or floating-point values. Most Vector/SIMD Multimedia Extension and SPU instructions operate on vector operands. Vectors are also called SIMD operands or packed operands.
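To make the idea of a packed operand concrete, here is a minimal sketch using GCC's vector extensions. It assumes a compiler that supports `__attribute__((vector_size))` (GCC and Clang do) and a 32-bit `int`; the type name `v4si` and the function `add_four` are names chosen for this illustration only.

```c
#include <assert.h>
#include <string.h>

/* A 16-byte vector holding four 32-bit ints (GCC vector extension).
   The name v4si is our own choice for this sketch. */
typedef int v4si __attribute__((vector_size(16)));

/* Add four pairs of ints with a single vector addition. */
void add_four(const int a[4], const int b[4], int out[4]) {
    assert(a && b && out);
    v4si va, vb, vsum;
    memcpy(&va, a, sizeof va);        /* pack four scalars into one vector operand */
    memcpy(&vb, b, sizeof vb);
    vsum = va + vb;                   /* one operation, four element-wise additions */
    memcpy(out, &vsum, sizeof vsum);  /* unpack the four results */
}
```

Calling `add_four` with {1, 2, 3, 4} and {10, 20, 30, 40} fills `out` with {11, 22, 33, 44}. On a target with 128-bit SIMD registers the compiler can typically map `va + vb` to a single vector instruction.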

What is Auto Vectorization?

Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once.
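The conversion can be pictured in plain C. The sketch below is only an illustration of the shape of the transformation, not what GCC literally emits: `add_scalar` handles one pair of operands per iteration, while `add_by_fours` handles four pairs per iteration (the group a 128-bit vector instruction could cover) and finishes with a scalar tail loop. Both function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar form: one pair of operands per iteration. */
void add_scalar(const int *a, const int *b, int *out, size_t n) {
    assert(n == 0 || (a && b && out));
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Vector-shaped form: four pairs per iteration, followed by a scalar
   tail for the leftover elements when n is not a multiple of 4. */
void add_by_fours(const int *a, const int *b, int *out, size_t n) {
    assert(n == 0 || (a && b && out));
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* These four independent additions are what a single
           vector add instruction would perform at once. */
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)   /* scalar tail */
        out[i] = a[i] + b[i];
}
```

Both functions produce the same results for any `n`; the auto-vectorizer's job is to prove that this kind of regrouping is safe and then do it for us.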

The purpose of this post is to show how SIMD vectorization and auto-vectorization work in C code, to understand the result by breaking down the generated assembly, and to compile with GCC to see what its auto-vectorizer is capable of. We will write a short program that creates two 1000-element integer arrays and fills them with random numbers in the range -1000 to +1000, then sums those two arrays element-by-element into a third array, and finally sums the third array and prints the result. So our first step is to write such a program.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
 
int array1[1000];
int array2[1000];
long array3[1000];
long arraySum = 0;

srand(time(NULL));

for (int i = 0; i < 1000; i++) {
        array1[i] = (rand()% 2001) - 1000;
        array2[i] = (rand()% 2001) - 1000;
        array3[i] = array1[i] + array2[i];
        arraySum += array3[i];
}
        printf("The total array sum is: %li\n", arraySum);

return 0;
}

Compiling this code with GCC using the command

gcc -O3 -fopt-info-vec-missed=vect_v0.miss vect_v0.c -o vect_v0

gives us a report of missed vectorization opportunities that looks something like this:

vector.c:14:1: note: not vectorized: loop contains function calls or data references that cannot be analyzed
vector.c:12:1: note: not vectorized: not enough data-refs in basic block.
vector.c:16:22: note: not vectorized: not enough data-refs in basic block.
vector.c:14:1: note: not vectorized: not enough data-refs in basic block.
vector.c:20:9: note: not vectorized: not enough data-refs in basic block.

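The loop on line 14 was rejected because it calls rand(), and a loop containing a function call like that cannot be vectorized. But if we make a few changes to our code, moving the additions into a second loop of their own: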

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
 
int array1[1000];
int array2[1000];
long array3[1000];
long arraySum = 0;

srand(time(NULL));

for (int i = 0; i < 1000; i++) {
        array1[i] = (rand()% 2001) - 1000;
        array2[i] = (rand()% 2001) - 1000;
}

for (int i = 0; i < 1000; i++) {
        array3[i] = array1[i] + array2[i];
        arraySum += array3[i];
}
        printf("The total array sum is: %li\n", arraySum);

return 0;
}
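Compile again with gcc -O3 -fopt-info-vec vect_v0.c -o vect_v0 (-fopt-info-vec reports the loops that were vectorized, whereas -fopt-info-vec-missed lists only the ones that were not), and the output now includes:

note: loop vectorized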

Here we can see that our loop got vectorized.

Now we disassemble the auto-vectorized program with objdump -d name_of_your_file:

0000000000400560 <main>:
// Here we reserve space on the stack for local variables
  400560:       d283f010        mov     x16, #0x1f80                    // #8064
  400564:       cb3063ff        sub     sp, sp, x16
  400568:       d2800000        mov     x0, #0x0                        // #0
  40056c:       a9007bfd        stp     x29, x30, [sp]
  400570:       910003fd        mov     x29, sp
  400574:       a90153f3        stp     x19, x20, [sp, #16]
  400578:       529a9c74        mov     w20, #0xd4e3                    // #54499
  40057c:       a9025bf5        stp     x21, x22, [sp, #32]
  400580:       72a83014        movk    w20, #0x4180, lsl #16
  400584:       f9001bf7        str     x23, [sp, #48]
  400588:       910103b5        add     x21, x29, #0x40
  40058c:       913f83b6        add     x22, x29, #0xfe0
  400590:       5280fa33        mov     w19, #0x7d1                     // #2001
  400594:       d2800017        mov     x23, #0x0                       // #0
  400598:       97ffffd6        bl      4004f0 <time@plt>
  40059c:       97ffffe9        bl      400540 <srand@plt>
  4005a0:       97ffffdc        bl      400510 <rand@plt>
  4005a4:       9b347c01        smull   x1, w0, w20
  4005a8:       9369fc21        asr     x1, x1, #41
  4005ac:       4b807c21        sub     w1, w1, w0, asr #31
  4005b0:       1b138020        msub    w0, w1, w19, w0
  4005b4:       510fa000        sub     w0, w0, #0x3e8
  4005b8:       b8376aa0        str     w0, [x21, x23]
  4005bc:       97ffffd5        bl      400510 <rand@plt>
  4005c0:       9b347c01        smull   x1, w0, w20
  4005c4:       9369fc21        asr     x1, x1, #41
  4005c8:       4b807c21        sub     w1, w1, w0, asr #31
  4005cc:       1b138020        msub    w0, w1, w19, w0
  4005d0:       510fa000        sub     w0, w0, #0x3e8
  4005d4:       b8376ac0        str     w0, [x22, x23]
  4005d8:       910012f7        add     x23, x23, #0x4
  4005dc:       f13e82ff        cmp     x23, #0xfa0
  4005e0:       54fffe01        b.ne    4005a0 <main+0x40>  // b.any
  4005e4:       4f000401        movi    v1.4s, #0x0
  4005e8:       d2800000        mov     x0, #0x0                        // #0
  4005ec:       3ce06ac0        ldr     q0, [x22, x0]
  4005f0:       3ce06aa2        ldr     q2, [x21, x0]
  4005f4:       91004000        add     x0, x0, #0x10
  4005f8:       f13e801f        cmp     x0, #0xfa0
// This is what it's all for: vector addition
  4005fc:       4ea28400        add     v0.4s, v0.4s, v2.4s
  400600:       0ea01021        saddw   v1.2d, v1.2d, v0.2s
  400604:       4ea01021        saddw2  v1.2d, v1.2d, v0.4s
  400608:       54ffff21        b.ne    4005ec <main+0x8c>  // b.any
  40060c:       5ef1b821        addp    d1, v1.2d
  400610:       90000000        adrp    x0, 400000 <_init-0x4b8>
  400614:       91200000        add     x0, x0, #0x800
// addp has already added the two 64-bit elements of vector 1 together.
// Move that final sum into x1, the second argument register for printf (x0 holds the format string)
  400618:       4e083c21        mov     x1, v1.d[0]
  40061c:       97ffffcd        bl      400550 <printf@plt>
  400620:       f9401bf7        ldr     x23, [sp, #48]
  400624:       a94153f3        ldp     x19, x20, [sp, #16]
  400628:       52800000        mov     w0, #0x0                        // #0
  40062c:       a9425bf5        ldp     x21, x22, [sp, #32]
  400630:       d283f010        mov     x16, #0x1f80                    // #8064
  400634:       a9407bfd        ldp     x29, x30, [sp]
  400638:       8b3063ff        add     sp, sp, x16
  40063c:       d65f03c0        ret
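To follow the vector loop between 4005ec and 400608 more easily, here is a plain C model of what those instructions compute. It is an interpretation of the listing, not code produced by GCC: the function name `sum_pairs_model` and the two-element `acc` array standing in for register v1 are assumptions made for this sketch. Note that the listing contains no store to array3; the compiler only needs the running sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the vectorized loop. n must be a multiple of 4,
   as it is for the 1000-element arrays in the post. */
int64_t sum_pairs_model(const int32_t *a, const int32_t *b, size_t n) {
    assert(n % 4 == 0);
    int64_t acc[2] = {0, 0};                 /* v1.2d, zeroed by movi */
    for (size_t i = 0; i < n; i += 4) {      /* x0 advances 16 bytes per pass */
        int32_t v0[4];
        for (int lane = 0; lane < 4; lane++) /* add v0.4s, v0.4s, v2.4s */
            v0[lane] = a[i + lane] + b[i + lane];
        acc[0] += v0[0];                     /* saddw: widen the low two lanes */
        acc[1] += v0[1];                     /* to 64 bits and accumulate      */
        acc[0] += v0[2];                     /* saddw2: same for the high two  */
        acc[1] += v0[3];
    }
    return acc[0] + acc[1];                  /* addp d1, v1.2d */
}
```

The 32-bit addition happens first and the widening to 64 bits happens afterwards, which matches the C source, where array1[i] + array2[i] is an int addition whose result is then stored into a long.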


Although GCC's auto-vectorization can improve performance, it may not be useful for every application, and it can't be relied on blindly. There are a number of limitations to keep in mind when thinking about auto-vectorization. GCC needs assurance that the arrays and the data they hold are aligned. Also, the code will most likely have to be rewritten to simplify what each loop does.
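One way to hand the compiler these assurances is sketched below, assuming GCC or Clang. The `restrict` qualifier (C99) promises that the two arrays do not overlap, and the GCC builtin `__builtin_assume_aligned` promises a 16-byte alignment. The function name `sum_of_sums` is made up for this example, and whether these hints change what the vectorizer does depends on the compiler version and target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum a[i] + b[i] over n elements.
   The caller must pass 16-byte-aligned, non-overlapping arrays. */
long sum_of_sums(const int *restrict a, const int *restrict b, size_t n) {
    /* Check the promise in debug builds; breaking it is undefined behavior. */
    assert((uintptr_t)a % 16 == 0 && (uintptr_t)b % 16 == 0);

    const int *pa = __builtin_assume_aligned(a, 16);
    const int *pb = __builtin_assume_aligned(b, 16);

    long total = 0;
    /* A simple loop with no function calls: the easy case for the vectorizer. */
    for (size_t i = 0; i < n; i++)
        total += pa[i] + pb[i];
    return total;
}
```

The arrays passed in can be declared with an alignment attribute, for example `int array1[1000] __attribute__((aligned(16)));`, so that the promise made to the compiler actually holds.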

