Monday 26 February 2018

Inline Assembler

This blog will guide you through inline assembler in AArch64, but first let me introduce what inline assembler is. In computer programming, an inline assembler is a feature of some compilers that allows low-level code written in assembly language to be embedded within a program, among code that has otherwise been compiled from a higher-level language such as C or Ada. In simple words, inline assembler lets us place assembly language code inside our high-level code (such as C), which can make it run more efficiently because assembly gives direct control over the instructions the processor executes.

First, we will build, run, and time the program below:
//this file is vol_simd.c
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"

int main() {

        int16_t*                in;             // input array
        int16_t*                limit;          // end of input array
        int16_t*                out;            // output array

        // these variables will be used in our assembler code, so we're going
        // to hand-allocate which register they are placed in
        // Q: what is an alternate approach?
        register int16_t*       in_cursor       asm("r20");     // input cursor
        register int16_t*       out_cursor      asm("r21");     // output cursor
        register int16_t        vol_int         asm("r22");     // volume as int16_t

        int                     x;              // array iterator
        int                     ttl = 0;        // array total (initialized so the sum is well defined)

        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

        srand(-1);
        printf("Generating sample data.\n");
        for (x = 0; x < SAMPLES; x++) {
                in[x] = (rand()%65536)-32768;
        }

// --------------------------------------------------------------------

        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES ;

        // set vol_int to fixed-point representation of 0.75
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t) (0.75 * 32767.0);

        printf("Scaling samples.\n");

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( in_cursor < limit ) {
                __asm__ (
                        "ldr q0, [%[in]],#16            \n\t"
                        // load eight samples into q0 (v0.8h)
                        // from in_cursor, and post-increment
                        // in_cursor by 16 bytes

                        "sqdmulh v0.8h, v0.8h, v1.8h    \n\t"
                        // multiply each lane in v0 by v1*2
                        // saturate results
                        // store upper 16 bits of results into v0

                        "str q0, [%[out]],#16           \n\t"
                        // store eight samples to out_cursor
                        // post-increment out_cursor by 16 bytes

                        // Q: what happens if we remove the following
                        // two lines? Why?
                        : [in]"+r"(in_cursor)
                        : "0"(in_cursor),[out]"r"(out_cursor)
                        );
        }

// --------------------------------------------------------------------

        printf("Summing samples.\n");
        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;

}

The sample size comes from vol.h, which contains #define SAMPLES 500000. This is the size we are currently using to test the code above.
Compiling it with gcc vol_simd.c and running it with time ./a.out, we get this output:

Generating sample data.
Scaling samples.
Summing samples.
Result: -894

real    0m0.033s
user    0m0.033s
sys     0m0.000s

Now, changing the sample size in vol.h from 500000 to 2500000, we get:

Generating sample data.
Scaling samples.
Summing samples.
Result: 249

real    0m0.145s
user    0m0.125s
sys     0m0.019s

Built without any optimization flag, the plain C code appears to run faster than the inline assembler code. Compiling with the -O3 flag produces roughly comparable run times for both programs. I think this is because with -O3 the run time is no longer purely "code-dependent", whereas with no flags it depends entirely on the quality of the code (and of the hand-written asm) in the program.

Q: what is an alternate approach?
An alternate approach is not to assign specific registers to the variables in_cursor, out_cursor and vol_int. Instead, the variables are passed to the asm statement through its operand constraints, and the compiler allocates the registers on its own.
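As a rough sketch of that alternate approach, the loop could look something like the code below. The function name scale_samples and its parameters are hypothetical and are not part of the original program. It assumes the sample count is a multiple of 8, and the non-AArch64 branch is a plain C stand-in so the example still builds on other machines.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not in the original program): scale `count`
 * samples by the fixed-point volume `vol_int`. On AArch64, `count`
 * is assumed to be a multiple of 8. No variable is pinned to a
 * register; the constraints let the compiler choose. */
void scale_samples(const int16_t *in, int16_t *out, size_t count, int16_t vol_int)
{
#if defined(__aarch64__)
    const int16_t *limit = in + count;
    while (in < limit) {
        __asm__ volatile (
            /* dup is repeated each iteration so the statement does not
             * rely on v1 surviving between separate asm statements */
            "dup v1.8h, %w[vol]            \n\t"
            "ldr q0, [%[in]], #16          \n\t"
            "sqdmulh v0.8h, v0.8h, v1.8h   \n\t"
            "str q0, [%[out]], #16         \n\t"
            : [in] "+r"(in), [out] "+r"(out)   /* both pointers are read and written */
            : [vol] "r"(vol_int)
            : "v0", "v1", "memory");           /* registers and memory we touch */
    }
#else
    /* Portable stand-in: doubled product, saturated, upper 16 bits.
     * Assumes >> on a negative value is an arithmetic shift. */
    for (size_t i = 0; i < count; i++) {
        int64_t doubled = 2 * (int64_t)in[i] * (int64_t)vol_int;
        if (doubled > INT32_MAX) doubled = INT32_MAX;
        out[i] = (int16_t)(doubled >> 16);
    }
#endif
}
```

Listing v0, v1 and "memory" as clobbers tells the compiler what the assembler changes, which the hand-allocated version leaves implicit.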

Q: should we use 32767 or 32768 in next line? why?
32767, because it is the maximum value the int16_t data type can hold. 32768 does not fit in an int16_t.
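A small sketch of the arithmetic behind that answer is shown below. The helper names fits_int16 and to_fixed_point are made up for illustration and do not appear in vol_simd.c.

```c
#include <stdint.h>

/* Returns 1 if `value` can be stored in an int16_t without overflow. */
int fits_int16(long value)
{
    return value >= INT16_MIN && value <= INT16_MAX;
}

/* Hypothetical helper: convert a volume factor in [0.0, 1.0] to the
 * fixed-point form used for vol_int by scaling with 32767. */
int16_t to_fixed_point(double factor)
{
    return (int16_t)(factor * 32767.0);
}
```

With these, to_fixed_point(0.75) gives 24575, and fits_int16(32768) reports that a factor of 1.0 scaled by 32768 would not fit.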

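Q: what does it mean to "duplicate" values here?
It means copying the value of vol_int (held in register 22) into every one of the eight 16-bit lanes of vector register v1 (v1.8h), so that each sample loaded into v0 is multiplied by the same volume factor.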
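In plain C terms, the effect of the dup instruction can be pictured roughly like this. It is only an illustration: dup_8h is a hypothetical name, and the real instruction fills all lanes at once rather than in a loop.

```c
#include <stdint.h>

#define LANES 8  /* v1.8h holds eight 16-bit "halfword" lanes */

/* Hypothetical stand-in for "dup v1.8h, w0":
 * copy one scalar value into every lane of the vector. */
void dup_8h(int16_t lanes[LANES], int16_t value)
{
    for (int i = 0; i < LANES; i++) {
        lanes[i] = value;
    }
}
```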

Q: what happens if we remove the following lines? Why?
Those two lines bind the named operands used in the assembler template, %[in] and %[out], to the C variables in_cursor and out_cursor that the scaling code works on. Without them the compiler cannot resolve those names, and the build fails with:
vol_simd.c: In function ‘int main()’:
vol_simd.c:69:5: error: undefined named operand ‘in’
    );
     ^
vol_simd.c:69:5: error: undefined named operand ‘out’

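Q: are the results usable? are they correct?
Yes, they appear to be correct: when we change the scaling value, the final result changes as well. That shows the scaling takes effect, although a stricter test would compare each output sample against a scalar calculation.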
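One way to check the output more strictly would be to model a single sqdmulh lane in plain C and count how many output samples differ from it. This is a sketch under assumptions: sqdmulh_lane and count_mismatches are hypothetical names, and the shift assumes the compiler uses an arithmetic right shift for negative values.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one sqdmulh lane: double the product,
 * saturate it to 32 bits, and keep the upper 16 bits. */
int16_t sqdmulh_lane(int16_t a, int16_t b)
{
    int64_t doubled = 2 * (int64_t)a * (int64_t)b;
    if (doubled > INT32_MAX) {
        doubled = INT32_MAX;   /* only -32768 * -32768 reaches this */
    }
    return (int16_t)(doubled >> 16);
}

/* Count output samples that differ from the scalar model. */
size_t count_mismatches(const int16_t *in, const int16_t *out,
                        size_t count, int16_t vol_int)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < count; i++) {
        if (out[i] != sqdmulh_lane(in[i], vol_int)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

Calling count_mismatches(in, out, SAMPLES, vol_int) after the scaling loop should return 0 if the SIMD code behaves as described.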

Part 2
I have picked BusyBox for this section.
A bit of background on BusyBox from the official website:

"BusyBox combines tiny versions of many common UNIX utilities into a single small executable. It provides replacements for most of the utilities you usually find in GNU fileutils, shellutils, etc. The utilities in BusyBox generally have fewer options than their full-featured GNU cousins; however, the options that are included provide the expected functionality and behave very much like their GNU counterparts. BusyBox provides a fairly complete environment for any small or embedded system."

How much assembly-language code is present?
To find out, I used the egrep command and found some assembly language code in the package. The code I found was in the networking and include directories.

Which platform is it used on?
The inline assembler is supported when building with gcc, but it does not appear to support arm or x86_64.

Why is it there (what does it do)?
The functions can be found in "networking/tls_pstm_montgomery_reduce.c". The file contains a handful of inline assembler functions whose purpose is to speed up operations on different platforms/architectures.

What happens on other platforms?
The networking directory uses its own checks, built around the inline assembler code, to make sure the software still runs properly on all platforms.
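A common way for a project to stay portable while still using inline assembler is to guard each assembler version with preprocessor checks and fall back to plain C everywhere else. The sketch below is a generic illustration of that pattern and is not taken from BusyBox; the function mul32x32 is a hypothetical example.

```c
#include <stdint.h>

/* Hypothetical example (not BusyBox code): multiply two 32-bit
 * values into a 64-bit result, using inline assembler where an
 * architecture-specific version exists and plain C otherwise. */
uint64_t mul32x32(uint32_t a, uint32_t b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    uint32_t lo, hi;
    /* mull multiplies eax by the operand; the result lands in edx:eax */
    __asm__ ("mull %3" : "=a"(lo), "=d"(hi) : "0"(a), "rm"(b) : "cc");
    return ((uint64_t)hi << 32) | lo;
#elif defined(__GNUC__) && defined(__aarch64__)
    uint64_t result;
    /* umull multiplies two 32-bit registers into one 64-bit register */
    __asm__ ("umull %0, %w1, %w2" : "=r"(result) : "r"(a), "r"(b));
    return result;
#else
    /* Portable fallback for every other compiler or platform */
    return (uint64_t)a * b;
#endif
}
```

The cost of this pattern is visible even in a function this small: every supported architecture needs its own branch, and each branch has to be tested separately.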

Your opinion of the value of the assembler code VS the loss of portability/increase in complexity of the code.

As far as I can see from my own observation, assembler code can be very effective for a project, but it can also be very time-consuming and draining. The complexity of assembler code, and the need to adapt it for each architecture, make it a hindrance for large-scale projects and make routine tasks such as debugging or maintenance slow and taxing. However, the value of assembler code could well make it necessary for projects such as BusyBox, which strive to make a wide range of operations faster, more efficient and generally better.
