This blog will guide you through inline assembler on AArch64, but first let me introduce what inline assembler is. In computer programming, inline assembler is a feature of some compilers that allows low-level code written in assembly language to be embedded within a program, among code that is otherwise compiled from a higher-level language such as C or Ada. In simple words, inline assembler lets us place assembly language code inside our high-level code (such as C) so that performance-critical sections can run more efficiently, thanks to the nature of assembly language.
Firstly, we will test the performance (build, compile, and run) of the program below.
// this file is vol_simd.c
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"

int main() {
        int16_t* in;     // input array
        int16_t* limit;  // end of input array
        int16_t* out;    // output array

        // these variables will be used in our assembler code, so we're going
        // to hand-allocate which register they are placed in
        // Q: what is an alternate approach?
        register int16_t* in_cursor asm("r20");   // input cursor
        register int16_t* out_cursor asm("r21");  // output cursor
        register int16_t vol_int asm("r22");      // volume as int16_t

        int x;        // array iterator
        int ttl = 0;  // array total

        in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

        srand(-1);
        printf("Generating sample data.\n");
        for (x = 0; x < SAMPLES; x++) {
                in[x] = (rand() % 65536) - 32768;
        }

        // --------------------------------------------------------------------
        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

        // set vol_int to fixed-point representation of 0.75
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t) (0.75 * 32767.0);

        printf("Scaling samples.\n");

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h, %w0" : : "r"(vol_int)); // duplicate vol_int into v1.8h

        while (in_cursor < limit) {
                __asm__ (
                        "ldr q0, [%[in]], #16        \n\t"
                        // load eight samples into q0 (v0.8h)
                        // from in_cursor, and post-increment
                        // in_cursor by 16 bytes

                        "sqdmulh v0.8h, v0.8h, v1.8h \n\t"
                        // multiply each lane in v0 by v1*2,
                        // saturate the results, and
                        // store the upper 16 bits of each result into v0

                        "str q0, [%[out]], #16       \n\t"
                        // store eight samples to out_cursor, and
                        // post-increment out_cursor by 16 bytes

                        // Q: what happens if we remove the following
                        // two lines? Why?
                        : [in]"+r"(in_cursor)
                        : "0"(in_cursor), [out]"r"(out_cursor)
                );
        }

        // --------------------------------------------------------------------
        printf("Summing samples.\n");
        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl + out[x]) % 1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;
}
In vol.h, #define SAMPLES 500000 sets the sample size we are currently using to test the code above.
Compiling it with gcc vol_simd.c and running it with time ./a.out, we get this output:
Generating sample data.
Scaling samples.
Summing samples.
Result: -894
real 0m0.033s
user 0m0.033s
sys 0m0.000s
Now, changing the sample size in vol.h from 500000 to 2500000, we get:
Generating sample data.
Scaling samples.
Summing samples.
Result: 249
real 0m0.145s
user 0m0.125s
sys 0m0.019s
Without any optimization flags, the plain C version appears to run quicker than the inline assembler version. Compiling with the -O3 flag produces roughly comparable run-times for both programs. I believe this is because -O3 produces a run-time that is not solely "code-dependent", whereas with no flags the run-time depends entirely on the quality of the code (and the inline asm) as written.
Q: what is an alternate approach?
An alternate approach would be to not hand-assign registers to the variables in_cursor and out_cursor, and instead let the compiler allocate registers on its own.
Q: should we use 32767 or 32768 in next line? why?
32767, because it is the maximum value representable by the int16_t data type.
Q: what does it mean to "duplicate" values here?
"Duplicating" means copying the value in register r22 (vol_int) into every 16-bit lane of the vector register v1 (v1.8h), so all eight lanes hold the volume factor.
Q: what happens if we remove the following lines? Why?
Those lines are the operand constraints that connect the C variables to the named operands used in the assembler code for scaling. Removing them results in:
vol_simd.c: In function ‘int main()’:
vol_simd.c:69:5: error: undefined named operand ‘in’
);
^
vol_simd.c:69:5: error: undefined named operand ‘out’
Q: are the results usable? are they correct?
Yes, the results are usable and appear correct: as we change the scaling value, the main output changes accordingly.
Part 2
I have picked BusyBox for this particular section.
A bit of background for BusyBox from the official website:
"BusyBox combines tiny versions of many common UNIX utilities into a single small executable. It provides replacements for most of the utilities you usually find in GNU fileutils, shellutils, etc. The utilities in BusyBox generally have fewer options than their full-featured GNU cousins; however, the options that are included provide the expected functionality and behave very much like their GNU counterparts. BusyBox provides a fairly complete environment for any small or embedded system."
How much assembly-language code is present?
To find out, I used the egrep command and found some assembly-language code in the package. The code I found was in the networking and include directories.
Which platform it is used on?
The code builds with gcc; architecture-specific inline-assembler variants are selected by preprocessor checks (for example for x86, x86_64, and ARM).
Why it is there (what it does)?
The functions can be found in "networking/tls_pstm_montgomery_reduce.c". The file holds a handful of inline assembler functions whose purpose is to speed up Montgomery-reduction operations on different platforms/architectures.
What happens on other platforms?
On platforms without a matching inline-assembler variant, the preprocessor checks in the networking code fall back to a generic C implementation, so the software still runs correctly.
Your opinion of the value of the assembler code VS the loss of portability/increase in complexity of the code.
From my observation, assembler code can be extremely effective for a project, but it can also be very time-consuming and draining. The complexity of assembler code, and the need to adapt it for each architecture, makes it a liability for large-scale projects and makes routine tasks such as debugging and maintenance very tedious. However, the performance of assembler code can certainly be necessary for projects such as BusyBox that strive to make common operations faster, more efficient, and generally better.
Monday, 26 February 2018
Wednesday, 21 February 2018
Algorithm Selection using Assembly Language
This blog is about selecting between algorithms for adjusting the volume of PCM audio samples, based on benchmarking the possible approaches. Digital sound is typically represented, uncompressed, as signed 16-bit integer signal samples.
There is one stream of samples for the left and right stereo channels, at typical sample rates of 44.1 or 48 thousand samples per second, for a total of 88.2 or 96 thousand samples per second. Since there are 16 bits (2 bytes) per sample, the data rate is 88.2 * 1000 * 2 = 176,400 bytes/second (~172 KiB/sec) or 96 * 1000 * 2 = 192,000 bytes/second (~187.5 KiB/sec).
To change the volume of sound, each sample can be scaled by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
On a mobile device, the amount of processing required to scale sound will affect battery life.
Our first step is to start on the ARMv8 AArch64 SPO600 server. Here we have two files, vol1.c and vol.h. vol.h defines SAMPLES as 500000, while vol1.c creates 500,000 random "sound samples" in an input array (the number of samples is set in the vol.h file), scales those samples by the volume factor 0.75 and stores them in an output array, then sums the output array and prints the sum.
Method 1 is simply to multiply each sample by the floating-point volume factor 0.75 to get the scaled sample value, as in the program below:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
        return (int16_t) (volume_factor * (float) sample);
}

int main() {
        // Allocate memory for large in and out arrays
        int16_t* in;
        int16_t* out;
        clock_t startTime, endTime;
        double totalTime;
        in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
        int x;
        int ttl = 0;

        // Seed the pseudo-random number generator
        srand(-1);

        // Fill the array with random data
        for (x = 0; x < SAMPLES; x++) {
                in[x] = (rand() % 65536) - 32768;
        }

        startTime = clock();
        // ######################################
        // This is the interesting part!
        // Scale the volume of all of the samples
        for (x = 0; x < SAMPLES; x++) {
                out[x] = scale_sample(in[x], 0.75);
        }
        // ######################################
        endTime = clock();

        // Sum up the data
        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl + out[x]) % 1000;
        }

        totalTime = (double)(endTime - startTime) / CLOCKS_PER_SEC;

        // Print the sum
        printf("Result: %d\n", ttl);
        printf("CPU time used to scale samples: %f seconds\n", totalTime);
        return 0;
}
The clock function is used to measure the amount of CPU processing time used to calculate the scaled sample values and store them in another array. I will first run all my tests on Aarchie. I compile my program with no optimization, and use the time command to determine how long it takes to run. The time command shows the real time, the user CPU time, and the system CPU time. Here is the result:
Compiling vol1.c with g++ -o vol1 vol1.c and running it with time ./vol1, we get:
Result: -86
CPU time used to scale samples: -0.032502 seconds
real 0m0.038s
user 0m0.037s
sys 0m0.000s
An alternate approach is to pre-compute a lookup table of all possible sample values multiplied by the volume factor, and look up each sample in that table to get the scaled value:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"

int main() {
        // Allocate memory for large in and out arrays
        int16_t* in;
        int16_t* out;
        int16_t lookupTable[65536];
        clock_t startTime, endTime;
        double totalTime;
        in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
        int x;
        int ttl = 0;

        // Pre-compute the scaled value of every possible sample
        for (x = 0; x < 65536; x++) {
                lookupTable[x] = (x - 32768) * .75;
        }

        // Seed the pseudo-random number generator
        srand(-1);

        // Fill the array with random data
        for (x = 0; x < SAMPLES; x++) {
                in[x] = (rand() % 65536) - 32768;
        }

        startTime = clock();
        // ######################################
        // This is the interesting part!
        // Scale the volume of all of the samples via table lookup
        for (x = 0; x < SAMPLES; x++) {
                out[x] = lookupTable[in[x] + 32768];
        }
        // ######################################
        endTime = clock();

        // Sum up the data
        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl + out[x]) % 1000;
        }

        totalTime = (double)(endTime - startTime) / CLOCKS_PER_SEC;

        // Print the sum
        printf("Result: %d\n", ttl);
        printf("CPU time used to scale samples: %f seconds\n", totalTime);
        return 0;
}
Compiling and running this code with the same commands as before, we get:
Result: -86
CPU time used to scale samples: -0.041233 seconds
real 0m0.046s
user 0m0.046s
sys 0m0.000s
The result of summing the samples is the same, but the time to calculate those samples has increased by roughly 0.008731 seconds compared to the previous result.
The last and final approach is to convert the volume factor 0.75 to a fixed-point integer by multiplying by a binary number representing the fixed-point value "1". For example, you could use 0b100000000 (= 256 in decimal) to represent 1.00, then shift the result to the right by the required number of bits after the multiplication (>>8 if you're using 256 as the multiplier).
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"

int main() {
        // Allocate memory for large in and out arrays
        int16_t* in;
        int16_t* out;
        clock_t startTime, endTime;
        double totalTime;
        in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
        short volumeFactor = 0b100000000 * .75;
        int x;
        int ttl = 0;

        // Seed the pseudo-random number generator
        srand(-1);

        // Fill the array with random data
        for (x = 0; x < SAMPLES; x++) {
                in[x] = (rand() % 65536) - 32768;
        }

        startTime = clock();
        // ######################################
        // This is the interesting part!
        // Scale the volume of all of the samples with
        // a fixed-point multiply and shift
        for (x = 0; x < SAMPLES; x++) {
                out[x] = in[x] * volumeFactor >> 8;
        }
        // ######################################
        endTime = clock();

        // Sum up the data
        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl + out[x]) % 1000;
        }

        totalTime = (double)(endTime - startTime) / CLOCKS_PER_SEC;

        // Print the sum
        printf("Result: %d\n", ttl);
        printf("CPU time used to scale samples: %f seconds\n", totalTime);
        return 0;
}
Result: -386
CPU time used to scale samples: 4.167165 seconds
real 0m0.035s
user 0m0.035s
sys 0m0.000s
Here the result of the scaled values differs from the previous values because of the fixed-point conversion, and the reported time has increased too.
After enabling -O3 optimization, the results and timings for the three methods are:
1) Result: -126
CPU time used to scale samples: -0.024013 seconds
real 0m0.027s
user 0m0.027s
sys 0m0.000s
2) Result: -942
CPU time used to scale samples: -0.024040 seconds
real 0m0.027s
user 0m0.027s
sys 0m0.000s
3) Result: -506
CPU time used to scale samples: -0.023942 seconds
real 0m0.027s
user 0m0.027s
sys 0m0.000s
The sum of all scaled samples does change after using the -O3 option. The second method took somewhat more time than the first method to calculate the scaled sample values and store them in another array. The third method is nearly 2 times faster than the first method and nearly 3 times faster than the second method. The third method has the shortest real and user time, and the second method the longest, which are the same results as before. After enabling optimization with -O3, the processing time drops drastically for all three methods: compared to no optimization, the first method is about 3.5 times faster, the second method about 2.4 times faster, and the third method about 5.4 times faster.
I used "/usr/bin/time -v" to find out how much memory is used to run my program, for all three methods compiled with the -O3 option:
Result: -126
CPU time used to scale samples: -0.023778 seconds
Command being timed: "./lab5"
User time (seconds): 0.01
System time (seconds): 0.00
Percent of CPU this job got: 96%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3240
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 551
Voluntary context switches: 1
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Result: -942
CPU time used to scale samples: -0.023968 seconds
Command being timed: "./lab5_2"
User time (seconds): 0.02
System time (seconds): 0.00
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3364
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 583
Voluntary context switches: 1
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Result: -506
CPU time used to scale samples: -0.023344 seconds
Command being timed: "./lab5_3"
User time (seconds): 0.02
System time (seconds): 0.00
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3244
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 551
Voluntary context switches: 1
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Conclusion:
From the results above, we see that the third method produces the shortest CPU processing time and the second method produces the longest, regardless of whether optimization is enabled when compiling the program. We get the same results when we compare the three methods using the real time and user time from the time command. This means the approach to adjusting volume that gives the best performance is the third method: convert the volume factor to a fixed-point integer by multiplying it by a binary number representing fixed-point 1.00, multiply each sample value by that factor, and shift the result to the right by the correct number of bits to get the scaled sample value.
Friday, 16 February 2018
Exploring single instruction/multiple data (SIMD) vectorization, and the auto-vectorization capabilities of the GCC compiler
What is SIMD Vectorization?
A vector is an instruction operand containing a set of data elements packed into a one-dimensional array. The elements can be integer or floating-point values. Most Vector/SIMD Multimedia Extension and SPU instructions operate on vector operands. Vectors are also known as SIMD operands or packed operands.
What is Auto-Vectorization?
Automatic vectorization, in parallel computing, is a special case of automatic parallelization, in which a program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which performs one operation on multiple pairs of operands at once. The general purpose of this post is to show how to use SIMD vectorization and auto-vectorization in C code, and to understand it by breaking the compiled code down into assembly language; we will compile with GCC to explore its auto-vectorization capabilities. We will create a short program that fills two 1000-element integer arrays with random numbers in the range -1000 to +1000, sums those two arrays element-by-element into a third array, and finally sums the third array and prints the result. So, our first step is to create such a program.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
        int array1[1000];
        int array2[1000];
        long array3[1000];
        long arraySum = 0;
        srand(time(NULL));
        for (int i = 0; i < 1000; i++) {
                array1[i] = (rand() % 2001) - 1000;
                array2[i] = (rand() % 2001) - 1000;
                array3[i] = array1[i] + array2[i];
                arraySum += array3[i];
        }
        printf("The total array sum is: %li\n", arraySum);
        return 0;
}
Compiling this code with gcc using the command gcc -O3 -fopt-info-vec-missed=vect_v0.miss vect_v0.c -o vect_v0 gives us something like this:
vector.c:14:1: note: not vectorized: loop contains function calls or data references that cannot be analyzed
vector.c:12:1: note: not vectorized: not enough data-refs in basic block.
vector.c:16:22: note: not vectorized: not enough data-refs in basic block.
vector.c:14:1: note: not vectorized: not enough data-refs in basic block.
vector.c:20:9: note: not vectorized: not enough data-refs in basic block.
But if we make a few changes to our code, splitting the loop so that the rand() calls (which the vectorizer cannot analyze) are separated from the pure array arithmetic:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main() {
    int array1[1000];
    int array2[1000];
    long array3[1000];
    long arraySum = 0;
    srand(time(NULL));
    for (int i = 0; i < 1000; i++) {
        array1[i] = (rand() % 2001) - 1000;
        array2[i] = (rand() % 2001) - 1000;
    }
    for (int i = 0; i < 1000; i++) {
        array3[i] = array1[i] + array2[i];
        arraySum += array3[i];
    }
    printf("The total array sum is: %li\n", arraySum);
    return 0;
}
After compilation with the same command
gcc -O3 -fopt-info-vec-missed=vect_v0.miss vect_v0.c -o vect_v0
we get: note: loop vectorized
Here we can see that our loop got vectorized.
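Another way to help the vectorizer, instead of splitting the loop, is to give gcc the assurance it needs directly. Below is a minimal sketch (the function name sum_arrays and its signature are my own, not from the lab): restrict-qualified pointers promise gcc that the arrays don't overlap, which is one of the guarantees it otherwise cannot prove.

```c
#include <stddef.h>

/* sum_arrays: hypothetical helper illustrating restrict.
   Because a, b, c, and total are promised not to alias,
   gcc -O3 can vectorize this loop without further restructuring. */
void sum_arrays(const int *restrict a, const int *restrict b,
                long *restrict c, long *restrict total, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        c[i] = (long)a[i] + b[i];   /* element-by-element sum */
        sum += c[i];                /* running total */
    }
    *total = sum;
}
```

Whether gcc actually vectorizes this still depends on the target and optimization flags, so check the -fopt-info-vec output as above.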
Now let's disassemble the auto-vectorized code using objdump -d <yourfile>:
0000000000400560 <main>:
// Here we reserve space on the stack for local variables
400560: d283f010 mov x16, #0x1f80 // #8064
400564: cb3063ff sub sp, sp, x16
400568: d2800000 mov x0, #0x0 // #0
40056c: a9007bfd stp x29, x30, [sp]
400570: 910003fd mov x29, sp
400574: a90153f3 stp x19, x20, [sp, #16]
400578: 529a9c74 mov w20, #0xd4e3 // #54499
40057c: a9025bf5 stp x21, x22, [sp, #32]
400580: 72a83014 movk w20, #0x4180, lsl #16
400584: f9001bf7 str x23, [sp, #48]
400588: 910103b5 add x21, x29, #0x40
40058c: 913f83b6 add x22, x29, #0xfe0
400590: 5280fa33 mov w19, #0x7d1 // #2001
400594: d2800017 mov x23, #0x0 // #0
400598: 97ffffd6 bl 4004f0 <time@plt>
40059c: 97ffffe9 bl 400540 <srand@plt>
4005a0: 97ffffdc bl 400510 <rand@plt>
4005a4: 9b347c01 smull x1, w0, w20
4005a8: 9369fc21 asr x1, x1, #41
4005ac: 4b807c21 sub w1, w1, w0, asr #31
4005b0: 1b138020 msub w0, w1, w19, w0
4005b4: 510fa000 sub w0, w0, #0x3e8
4005b8: b8376aa0 str w0, [x21, x23]
4005bc: 97ffffd5 bl 400510 <rand@plt>
4005c0: 9b347c01 smull x1, w0, w20
4005c4: 9369fc21 asr x1, x1, #41
4005c8: 4b807c21 sub w1, w1, w0, asr #31
4005cc: 1b138020 msub w0, w1, w19, w0
4005d0: 510fa000 sub w0, w0, #0x3e8
4005d4: b8376ac0 str w0, [x22, x23]
4005d8: 910012f7 add x23, x23, #0x4
4005dc: f13e82ff cmp x23, #0xfa0
4005e0: 54fffe01 b.ne 4005a0 <main+0x40> // b.any
4005e4: 4f000401 movi v1.4s, #0x0
4005e8: d2800000 mov x0, #0x0 // #0
4005ec: 3ce06ac0 ldr q0, [x22, x0]
4005f0: 3ce06aa2 ldr q2, [x21, x0]
4005f4: 91004000 add x0, x0, #0x10
4005f8: f13e801f cmp x0, #0xfa0
// This is what it's all for: vector addition
4005fc: 4ea28400 add v0.4s, v0.4s, v2.4s
400600: 0ea01021 saddw v1.2d, v1.2d, v0.2s
400604: 4ea01021 saddw2 v1.2d, v1.2d, v0.4s
400608: 54ffff21 b.ne 4005ec <main+0x8c> // b.any
40060c: 5ef1b821 addp d1, v1.2d
400610: 90000000 adrp x0, 400000 <_init-0x4b8>
400614: 91200000 add x0, x0, #0x800
// Move the first and second 64-bit elements from vector 1 to two separate registers
// This might be so that they can be used as arguments for printf?
400618: 4e083c21 mov x1, v1.d[0]
40061c: 97ffffcd bl 400550 <printf@plt>
400620: f9401bf7 ldr x23, [sp, #48]
400624: a94153f3 ldp x19, x20, [sp, #16]
400628: 52800000 mov w0, #0x0 // #0
40062c: a9425bf5 ldp x21, x22, [sp, #32]
400630: d283f010 mov x16, #0x1f80 // #8064
400634: a9407bfd ldp x29, x30, [sp]
400638: 8b3063ff add sp, sp, x16
40063c: d65f03c0 ret
Although gcc's auto-vectorization can improve performance, it may not be useful for certain applications, and it can't always be relied upon. There are many limitations and conditions to consider with auto-vectorization: gcc needs assurance that arrays are aligned and that data accesses don't overlap, and code may well have to be rewritten to simplify the loop logic.
Friday, 9 February 2018
Experiment with Assembler on the x86_64 and aarch64 platforms.
In assembly language we are directly programming the "bare metal" hardware. This means that many of the compile-time and run-time checks, error messages, and diagnostics that are present in other languages are not available. The machine will follow your instructions exactly, even if they are completely wrong (like executing data), and when something goes wrong, your program won't halt until it tries to do something that's not permitted, such as executing an invalid opcode or attempting to access a protected or unmapped region of memory. When that happens, the CPU will signal an exception, and generally the operating system will shut down the offending process.
x86 is a family of backward-compatible instruction set architectures based on the Intel 8086 CPU and its Intel 8088 variant. It began as a fully 16-bit extension of Intel's 8-bit microprocessors, with memory segmentation as a solution for addressing more memory than can be covered by a plain 16-bit address.
AArch64 brings a new instruction set, A64. It has 31 general-purpose 64-bit registers, and a dedicated zero or stack pointer register depending on the instruction. The program counter is no longer directly accessible as a register. Instructions are still 32 bits long and mostly the same as A32. It has paired loads/stores, no predication for most instructions, and most instructions take 32-bit or 64-bit arguments.
Addresses are assumed to be 64-bit.
(1) Our first step is to build 3 C versions of the Hello world program for x86_64
Portable C versions: a write() version, a syscall() wrapper version, and a printf() version.
For example, compile the code with gcc -o hello hello.c and run it with ./hello
Use the objdump -d <file> command to dump (print) the object code (machine code).
In the write() version we print the string using write(); in the syscall() wrapper version, using syscall(); and in the printf() version, using printf(). After building and running all three versions we notice a small difference in the amount of code. The write() version is 185 lines in total, and its main is slightly larger than the printf() version's; the printf() version is 183 lines and has the least code in main. The syscall() version has the most code in main and is the longest at 187 lines. Output should be similar to this on the aarch64 architecture:
printf() version: 183 lines in total
0000000000400594 <main>:
400594: a9bf7bfd stp x29, x30, [sp, #-16]!
400598: 910003fd mov x29, sp
40059c: 90000000 adrp x0, 400000 <_init-0x418>
4005a0: 9119c000 add x0, x0, #0x670
4005a4: 97ffffb7 bl 400480 <printf@plt>////printf()
4005a8: 52800000 mov w0, #0x0 // #0
4005ac: a8c17bfd ldp x29, x30, [sp], #16
4005b0: d65f03c0 ret
4005b4: 00000000 .inst 0x00000000 ; undefined
write() version: 185 lines in total
0000000000400594 <main>:
400594: a9bf7bfd stp x29, x30, [sp, #-16]!
400598: 910003fd mov x29, sp
40059c: 90000000 adrp x0, 400000 <_init-0x418>
4005a0: 9119e000 add x0, x0, #0x678
4005a4: d28001a2 mov x2, #0xd // #13
4005a8: aa0003e1 mov x1, x0
4005ac: 52800020 mov w0, #0x1 // #1
4005b0: 97ffffb0 bl 400470 <write@plt>////Write()
4005b4: 52800000 mov w0, #0x0 // #0
4005b8: a8c17bfd ldp x29, x30, [sp], #16
4005bc: d65f03c0 ret
syscall() version: 187 lines in total
0000000000400594 <main>:
400594: a9bf7bfd stp x29, x30, [sp, #-16]!
400598: 910003fd mov x29, sp
40059c: 90000000 adrp x0, 400000 <_init-0x418>
4005a0: 911a0000 add x0, x0, #0x680
4005a4: 528001a3 mov w3, #0xd // #13
4005a8: aa0003e2 mov x2, x0
4005ac: 52800021 mov w1, #0x1 // #1
4005b0: d2800800 mov x0, #0x40 // #64
4005b4: 97ffffb3 bl 400480 <syscall@plt>////syscall()
4005b8: 52800000 mov w0, #0x0 // #0
4005bc: a8c17bfd ldp x29, x30, [sp], #16
4005c0: d65f03c0 ret
4005c4: 00000000 .inst 0x00000000 ; undefined
(2) Our second step is to review, build, and run the aarch64 assembly language programs. Take a look at the code using
objdump -d objectfile
and compare it to the source code. For example, compile the code with gcc -o hello hello.c and run it with ./hello
.text
.globl _start
_start:
mov x0, 1 /* file descriptor: 1 is stdout */
adr x1, msg /* message location (memory address) */
mov x2, len /* message length (bytes) */
mov x8, 64 /* write is syscall #64 */
svc 0 /* invoke syscall */
mov x0, 0 /* status -> 0 */
mov x8, 93 /* exit is syscall #93 */
svc 0 /* invoke syscall */
.data
msg: .ascii "Hello, world!\n"
len= . - msg
Use the objdump -d <file> command to dump the object code:
hello.o: file format elf64-littleaarch64
Disassembly of section .text:
0000000000000000 <_start>:
0: d2800020 mov x0, #0x1 // #1
4: 10000001 adr x1, 0 <_start>
8: d28001c2 mov x2, #0xe // #14
c: d2800808 mov x8, #0x40 // #64
10: d4000001 svc #0x0
14: d2800000 mov x0, #0x0 // #0
18: d2800ba8 mov x8, #0x5d // #93
1c: d4000001 svc #0x0
The goal of this lab is to implement a basic C loop, written in both x86_64 and aarch64 assembler language. We were given an empty x86_64 block which loops from 0 to 9, using r15 as the index (loop control) counter.
The code below is the empty block:
.text
.globl _start
start = 0 /* starting value for the loop index; note that this is a constant, not a variable */
max = 10 /* loop exits when the index hits this number (loop condition is i<max) */
_start:
mov $start,%r15 /* loop index */
loop:
/* ... body of the loop ... do something useful here ... */
inc %r15 /* increment index */
cmp $max,%r15 /* see if we're done */
jne loop /* loop if we're not */
mov $0,%rdi /* exit status */
mov $60,%rax /* syscall sys_exit */
syscall
x86_64 & aarch64 Loops
In order to disassemble a binary file, we can use the command-line tool objdump -d. When we disassemble a binary, we can examine that specific machine's assembler instruction set.
x86_64 9 Iteration Loop Solution
.text
.globl _start
start = 0
max = 10
_start:
mov $start,%r15
loop:
mov $len,%rdx
mov $msg,%rsi
mov $1,%rdi
mov $1,%rax
syscall
inc %r15
mov %r15,%r14
add $'0',%r14
mov $num,%r13
mov %r14b,(%r13)
cmp $max,%r15
jne loop
mov $0,%rdi
mov $60,%rax
syscall
.section .data
msg: .ascii "Loop: 0\n"
len = . - msg
num = msg + 6
x86_9_loop: file format elf64-x86-64
Disassembly of section .text:
00000000004000b0 <_start>:
4000b0: 49 c7 c7 00 00 00 00 mov $0x0,%r15
00000000004000b7 <loop>:
4000b7: 48 c7 c2 08 00 00 00 mov $0x8,%rdx
4000be: 48 c7 c6 00 01 60 00 mov $0x600100,%rsi
4000c5: 48 c7 c7 01 00 00 00 mov $0x1,%rdi
4000cc: 48 c7 c0 01 00 00 00 mov $0x1,%rax
4000d3: 0f 05 syscall
4000d5: 49 ff c7 inc %r15
4000d8: 4d 89 fe mov %r15,%r14
4000db: 49 83 c6 30 add $0x30,%r14
4000df: 49 c7 c5 06 01 60 00 mov $0x600106,%r13
4000e6: 45 88 75 00 mov %r14b,0x0(%r13)
4000ea: 49 83 ff 0a cmp $0xa,%r15
4000ee: 75 c7 jne 4000b7 <loop>
4000f0: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
4000f7: 48 c7 c0 3c 00 00 00 mov $0x3c,%rax
4000fe: 0f 05 syscall
Although aarch64 assembly has a different syntax from its x86_64 counterpart, the two still share similarities in logic.
aarch64 9 Iteration Loop Solution
.text
.globl _start
start = 0 /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10 /* loop exits when the index hits this number (loop condition is i<max) */
ten = 10
_start:
mov x19, start
mov x22, ten
loop:
adr x1, msg
mov x2, len
mov x0, 1
mov x8, 64
svc 0
adr x23, msg
add x19, x19, 1
udiv x20, x19, x22
msub x21, x22, x20, x19
cmp x20, 0
beq skip
add x20, x20, '0'
strb w20, [x23,6]
skip:
add x21, x21, '0'
strb w21, [x23,7]
cmp x19, max
bne loop
mov x0, 0
mov x8, 93
svc 0
.data
msg: .ascii "Loop: 0\n"
len = . - msg
aarch_9_loop: file format elf64-littleaarch64
Disassembly of section .text:
00000000004000b0 <_start>:
4000b0: d2800013 mov x19, #0x0 // #0
4000b4: d2800156 mov x22, #0xa // #10
00000000004000b8 <loop>:
4000b8: 10080281 adr x1, 410108 <msg>
4000bc: d2800122 mov x2, #0x9 // #9
4000c0: d2800020 mov x0, #0x1 // #1
4000c4: d2800808 mov x8, #0x40 // #64
4000c8: d4000001 svc #0x0
4000cc: 100801f7 adr x23, 410108 <msg>
4000d0: 91000673 add x19, x19, #0x1
4000d4: 9ad60a74 udiv x20, x19, x22
4000d8: 9b14ced5 msub x21, x22, x20, x19
4000dc: f100029f cmp x20, #0x0
4000e0: 54000060 b.eq 4000ec <skip>
4000e4: 9100c294 add x20, x20, #0x30
4000e8: 39001af4 strb w20, [x23,#6]
00000000004000ec <skip>:
4000ec: 9100c2b5 add x21, x21, #0x30
4000f0: 39001ef5 strb w21, [x23,#7]
4000f4: f1002a7f cmp x19, #0xa
4000f8: 54fffe01 b.ne 4000b8 <loop>
4000fc: d2800000 mov x0, #0x0 // #0
400100: d2800ba8 mov x8, #0x5d // #93
400104: d4000001 svc #0x0
To expand on our previous 9-iteration loop, we were tasked with extending the count to 30. Sounds simple enough: just increase the max quantity. With the loop going to a double-digit number, you need to convert the loop index number by dividing it by 10.
To print the loop index value, you will have to convert it from an integer to digit characters. In ASCII/ISO-8859-1/Unicode UTF-8, the digit characters are in the range 48-57 (0x30-0x39). You will also need to assemble the message to be printed on each line. You can do this by writing the digits into the message buffer before outputting it to stdout, which is probably the best approach, or you could perform a sequence of writes for the three parts of the message ('Loop: ', number, '\n'). You may want to refer to the man page for ascii.
x86_64 30 Iteration Loop Solution
.text
.globl _start
start = 0
max = 31
ten = 10
_start:
mov $start,%r15
mov $ten,%r14
loop:
mov $len,%rdx
mov $msg,%rsi
mov $1,%rdi
mov $1,%rax
syscall
inc %r15
mov $0,%rdx
mov %r15,%rax
div %r14
mov $num,%r12
mov %rax,%r13
cmp $0,%r13
je skip
add $'0',%r13
mov %r13b,(%r12)
skip:
inc %r12
mov %rdx,%r13
add $'0',%r13
mov %r13b,(%r12)
cmp $max,%r15
jne loop
mov $0,%rdi
mov $60,%rax
syscall
.section .data
msg: .ascii "Loop: 0\n"
len = . - msg
num = msg + 6
86_30_loop: file format elf64-x86-64
Disassembly of section .text:
00000000004000b0 <_start>:
4000b0: 49 c7 c7 00 00 00 00 mov $0x0,%r15
4000b7: 49 c7 c6 0a 00 00 00 mov $0xa,%r14
00000000004000be <loop>:
4000be: 48 c7 c2 09 00 00 00 mov $0x9,%rdx
4000c5: 48 c7 c6 28 01 60 00 mov $0x600128,%rsi
4000cc: 48 c7 c7 01 00 00 00 mov $0x1,%rdi
4000d3: 48 c7 c0 01 00 00 00 mov $0x1,%rax
4000da: 0f 05 syscall
4000dc: 49 ff c7 inc %r15
4000df: 48 c7 c2 00 00 00 00 mov $0x0,%rdx
4000e6: 4c 89 f8 mov %r15,%rax
4000e9: 49 f7 f6 div %r14
4000ec: 49 c7 c4 2e 01 60 00 mov $0x60012e,%r12
4000f3: 49 89 c5 mov %rax,%r13
4000f6: 49 83 fd 00 cmp $0x0,%r13
4000fa: 74 08 je 400104 <skip>
4000fc: 49 83 c5 30 add $0x30,%r13
400100: 45 88 2c 24 mov %r13b,(%r12)
0000000000400104 <skip>:
400104: 49 ff c4 inc %r12
400107: 49 89 d5 mov %rdx,%r13
40010a: 49 83 c5 30 add $0x30,%r13
40010e: 45 88 2c 24 mov %r13b,(%r12)
400112: 49 83 ff 1f cmp $0x1f,%r15
400116: 75 a6 jne 4000be <loop>
400118: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
40011f: 48 c7 c0 3c 00 00 00 mov $0x3c,%rax
400126: 0f 05 syscall
Reflection.
Personally, I found working with assembly language to be challenging yet, at the same time, rewarding. The syntax can be extremely confusing and tedious to work with. I enjoyed learning about the full life of a process, everything that happens between A->B. Gaining exposure to low-level assembly programming has given me some insight into how complex and intricate working with the hardware can be. I've also gained a whole new appreciation for compilers and IDEs.