Wednesday 21 February 2018

Algorithm Selection using Assembly Language

This blog is based on selection of two algorithms for adjusting the volume of PCM audio samples based on bench-marking of two possible approaches. Digital sound is typically represented, uncompressed, as signed 16-bit integer signal samples.

There is one stream of samples for the left and right stereo channels, at typical sample rates of 44.1 or 48 thousand samples per second, for a total of 88.2 or 96 thousand samples per second. Since there are 16 bits (2 bytes) per sample, the data rate is 88.2 * 1000 * 2 = 176,400 bytes/second (~172 KiB/sec) or 96 * 1000 * 2 = 192,000 bytes/second (~187.5 KiB/sec).
To change the volume of sound, each sample can be scaled by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
On a mobile device, the amount of processing required to scale sound will affect battery life.

Our first step would be to start with ARMv8 AARch64 SPO600 Server. Here we have a file vol1.c and vol.h. vol.h has define 500000 Samples in it while vol1.c creates 500,000 random "sound samples" in an input array (the number of samples is set in the vol.h file). Scales those samples by the volume factor 0.75 and stores them in an output array.Sums the output array and prints the sum.

Our method 1 would be simple to test the script below simply by multiplying each sample by the floating point volume factor 0.75 to get the scaled sample value

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
        return (int16_t) (volume_factor * (float) sample);
}

int main() {

        // Allocate memory for large in and out arrays
        int16_t*        in;
        int16_t*        out;
        clock_t startTime, endTime;
        double totalTime;
        in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

        int             x;
        int             ttl;

        // Seed the pseudo-random number generator
        srand(-1);

        // Fill the array with random data
        for (x = 0; x < SAMPLES; x++) {
                in[x] = (rand()%65536)-32768;
        }
        startTime = clock();

        // ######################################
        // This is the interesting part!
        // Scale the volume of all of the samples
        for (x = 0; x < SAMPLES; x++) {
                out[x] = scale_sample(in[x], 0.75);
        }
        // ######################################
        startTime = clock();

        // Sum up the data
        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl+out[x])%1000;
        }
        totalTime= (double)(endTime-startTime)/CLOCKS_PER_SEC;
        // Print the sum
        printf("Result: %d\n", ttl);
        printf("CPU time used to scale samples: %1f seconds\n",totalTime);
        return 0;

}

The clock function is used to determine the quantity of CPU processing time this is used to calculate the scaled sample values and shop them in any other array. i will first run all my tests on Aarchie. I collect my program with no optimization. I take advantage of the time command to determine how long it takes to run my software. The time command shows the real time, the person CPU time and the system CPU time. here is the end result:

By compiling the code of vol1.c using command g++ -o vol1 vol1.c 
running it with:-  time ./vol1  we get
Result: -86
CPU time used to scale samples: -0.032502 seconds

real    0m0.038s
user    0m0.037s
sys     0m0.000s

An alternate approach to this approach would be to pre-create a lookup table of all possible sample values multiplies by  the volume factor, and look up each sample in that table to get the scaled values

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
        return (int16_t) (volume_factor * (float) sample);
}

int main() {

        // Allocate memory for large in and out arrays
        int16_t*        in;
        int16_t*        out;
        int16_t lookupTable[65536];
        clock_t startTime, endTime;
        double totalTime;
        in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));

        int             x;
        int             ttl;

        for(x=0;x<65536;x++){
        lookupTable[x]=(x-32768)*.75;
        }
        // Seed the pseudo-random number generator
        srand(-1);

        // Fill the array with random data
        for (x = 0; x < SAMPLES; x++) {
                in[x] = (rand()%65536)-32768;
        }

        for(x=0;x<SAMPLES;x++){
        out[1]=lookupTable[in[1]+32768];
        }
        startTime = clock();

        // ######################################
        // This is the interesting part!
        // Scale the volume of all of the samples
        for (x = 0; x < SAMPLES; x++) {
                out[x] = scale_sample(in[x], 0.75);
        }
        // ######################################
        startTime = clock();

        // Sum up the data
        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl+out[x])%1000;
        }
        totalTime= (double)(endTime-startTime)/CLOCKS_PER_SEC;
        // Print the sum
        printf("Result: %d\n", ttl);
        printf("CPU time used to scale samples: %1f seconds\n",totalTime);
        return 0;

}
Compiling and running this code with the same command as previous one we get

Result: -86
CPU time used to scale samples: -0.041233 seconds

real    0m0.046s
user    0m0.046s
sys     0m0.000s

The result summing of the samples is the same but time to calculate those samples has pretty much increased by 0.008731 seconds compared to previous result

The last and final approach is to convert the volume factor 0.75 to a fix-point integer by multiplying by a binary number representing a fixed-point value "1". For example, you could use 0b100000000 (= 256 in decimal) to represent 1.00. Shift the result to the right the required number of bits after the multiplication (>>8 if you're using 256 as the multiplier).

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"

// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
        return (int16_t) (volume_factor * (float) sample);
}

int main() {

        // Allocate memory for large in and out arrays
        int16_t*        in;
        int16_t*        out;
        clock_t startTime, endTime;
        double totalTime;
        in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
        short volumeFactor = 0b100000000 * .75;
        int             x;
        int             ttl;

        // Seed the pseudo-random number generator
        srand(-1);

        // Fill the array with random data
        for (x = 0; x < SAMPLES; x++) {
                in[x] = (rand()%65536)-32768;
        }
        startTime = clock();

        // ######################################
        // This is the interesting part!
        // Scale the volume of all of the samples
        for (x = 0; x < SAMPLES; x++) {
                //out[x] = scale_sample(in[x], 0.75);
                out[x]=in[x]*volumeFactor>>8;
        }
        // ######################################
        startTime = clock();

        // Sum up the data
        for (x = 0; x < SAMPLES; x++) {
                ttl = (ttl+out[x])%1000;
        }
        totalTime= (double)(endTime-startTime)/CLOCKS_PER_SEC;
        // Print the sum
        printf("Result: %d\n", ttl);
        printf("CPU time used to scale samples: %1f seconds\n",totalTime);
        return 0;

}

Result: -386
CPU time used to scale samples: 4.167165 seconds

real    0m0.035s
user    0m0.035s
sys     0m0.000s

Here the result of scaled value is different from previous values because of conversion. Time has increased too.

By implementing -O3 optimization our result and timing quite varies each time.

1) Result: -126
CPU time used to scale samples: -0.024013 seconds

real    0m0.027s
user    0m0.027s
sys     0m0.000s

2) Result: -942
CPU time used to scale samples: -0.024040 seconds

real    0m0.027s
user    0m0.027s
sys     0m0.000s

3) Result: -506
CPU time used to scale samples: -0.023942 seconds

real    0m0.027s
user    0m0.027s
sys     0m0.000s

The entirety of all scaled tests does alter after utilizing -O3 choice. For the moment strategy, it took approximately  more time than the to begin with strategy to calculate the scaled test values and store them in another cluster. For the third strategy, it is approximately nearly 2 times speedier than the to begin with strategy and  nearly 3 times quicker than the moment strategy. The third strategy has the most brief genuine time and client time and the moment strategy has the longest genuine time and client time, which are the same comes about as some time recently. After optimization is empowered utilizing -O3 choice, the preparing time has been definitely diminished for all three strategies. When I compare to no optimization, the to begin with strategy is approximately 3.5 times speedier, the moment strategy is almost 2.4 times speedier and the third strategy is approximately 5.4 times quicker.

“/usr/bin/time -v” to find out how much memory is used to run my program. for all three methods using -O3 option during compilation:

Result: -126
CPU time used to scale samples: -0.023778 seconds
        Command being timed: "./lab5"
        User time (seconds): 0.01
        System time (seconds): 0.00
        Percent of CPU this job got: 96%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3240
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 551
        Voluntary context switches: 1
        Involuntary context switches: 0
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Result: -942
CPU time used to scale samples: -0.023968 seconds
        Command being timed: "./lab5_2"
        User time (seconds): 0.02
        System time (seconds): 0.00
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3364
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 583
        Voluntary context switches: 1
        Involuntary context switches: 0
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0


Result: -506
CPU time used to scale samples: -0.023344 seconds
        Command being timed: "./lab5_3"
        User time (seconds): 0.02
        System time (seconds): 0.00
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 3244
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 551
        Voluntary context switches: 1
        Involuntary context switches: 1
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0


Conclusion:
From the comes about over, we see that the third strategy creates the most limited CPU handling time and the moment strategy creates the longest CPU handling time in any case of whether or not optimization is empowered when compiling my program. We get the same comes about when we compare the three strategies utilizing the genuine time and client time from the time command. This implies that the approach to alter volume that comes about in the leading execution is the third strategy. The third strategy changes over the volume figure to a fix-point numbers by duplicating it by a parallel number speaking to a fixed-point , increases this result by each test esteem and shifts the result to the right by the redress number of bits to get the scaled test esteem.

No comments:

Post a Comment