Finally we move on to the last step, in which I will be discussing two questions: is my optimization worth upstreaming, and if so, how should I try to get it upstream? The optimizations I made are not extreme: I only changed compiler flags from -O0 to -O2 and -O3, where -O2 gave me fairly poor performance while -O3 improved it somewhat.
The speed-up I achieved from -O3 wasn't too impressive; it gave the program a modest boost by enabling a few extra optimization passes, but it may also have produced more code, or there could be some other reason it fell short of the result I expected. I am still trying to figure out whether I could make any changes in the Makefiles, since I know that ./configure uses -O2 when generating them, but going through the Makefiles there was nothing I could change to make the program run faster. As I mentioned in one of my earlier blogs, this program was last updated in 2014; since then there have been no updates or modifications of any kind to it.
Next, I went through all the configure files to see if I could find anything relevant to the flags. I ended up finding something valuable in configure.ac: it has PTHREAD support, and the pthread settings get appended to the built-in flag variables used to generate the Makefiles, such as CFLAGS or CXXFLAGS. As you can see in the code below, it is fairly involved. I was wondering whether there is a way to find out which flags give better performance with PTHREAD support, since they seem to be hidden somewhere in the generated build files.
# PTHREAD support
# With special nods to compiling under mingw
if test x"$mingw" = x"yes"; then
AC_MSG_NOTICE([Checking for pthreads under mingw])
AC_DEFINE([HAVE_STRUCT_TIMESPEC],1,[Required for mingw])
CFLAGS="$CFLAGS -mthreads "
CPPFLAGS="-DPTW32_STATIC_LIB $CPPFLAGS"
CXXFLAGS="$CXXFLAGS -mthreads "
AC_DEFINE(HAVE_PTHREAD,1,[Defined to POSIX threads for mingw])
else
m4_include([m4/ax_pthread.m4])
AX_PTHREAD([
echo Using settings from [AX_PTHREAD] macro
LIBS="$PTHREAD_LIBS $LIBS"
CFLAGS=" $PTHREAD_CFLAGS $CFLAGS"
CXXFLAGS="$PTHREAD_CFLAGS $CXXFLAGS "
CPPFLAGS="$PTHREAD_CFLAGS $CPPFLAGS "
CC="$PTHREAD_CC"
AC_DEFINE(HAVE_PTHREAD,1,[Defined to POSIX threads for mingw])
],[AC_MSG_NOTICE([pthreads not found by ax_pthread macro])])
fi
AC_CHECK_HEADERS([pthread.h])
AC_CHECK_LIB([pthreadGC2],[pthread_create])
# On mingw, be sure to use the static version and be sure we are using mthread option
# (which should be a no-op on later version of G++ anyway)
AC_CHECK_FUNCS([pthread_win32_process_attach_np pthread_win32_process_detach_np pthread_win32_thread_attach_np pthread_win32_thread_detach_np])
# end PTHREAD SUPPORT
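One way to see which flags the AX_PTHREAD macro actually settled on is to read them back out of config.log after ./configure runs. A small hedged helper (the config.log path and variable names are assumptions based on how ax_pthread normally records its results):

```shell
# Hedged sketch: pull the pthread-related variables that ./configure
# recorded into its log, so we can see what ax_pthread decided.
pthread_flags() {
    # print each unique PTHREAD_CFLAGS / PTHREAD_LIBS line from the log
    grep -E 'PTHREAD_(CFLAGS|LIBS)' "$1" | sort -u
}
# usage (path is an assumption): pthread_flags ./config.log
```

If the output shows something like `PTHREAD_CFLAGS='-pthread'`, that is what got prepended to CFLAGS/CXXFLAGS by the macro above.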
Finally, I came up with no solution for upstreaming my code, nor could I contact the open source community for help, as I have no solid evidence that the changes I made are appropriate.
Software Portability And Optimization
Thursday 19 April 2018
Thursday 29 March 2018
SPO600 - Phase 2 - Compiling Optimizations And Benchmarking
This post is going to be pretty long, as I will be going through the following stages:
- Setting up md5deep on x86_64 and running multiple tests of md5deep
- Bench-marking md5deep on x86_64 without optimization and with optimization
- Bench-marking md5deep on AArch64 with optimization (I already bench-marked md5deep on AArch64 without optimization in my SPO600 - Stage 1 blog)
- Download the md5deep package (tar.gz) from https://github.com/jessek/hashdeep/releases/tag/release-4.4
- Using scp or sftp, transfer that tar.gz file to the x86_64 machine
- Unpack the package with the tar -xvzf nameofthefile command
- Enter the unpacked folder and run the following commands:
- sh bootstrap.sh
- ./configure
- make install
- Generate three files of random data for testing with the dd command below. Keep bs the same each time, but change the output file name and set count to 10, 100, and 1000 respectively; this creates files of roughly 10 MB, 105 MB, and 1.0 GB. dd if=/dev/urandom of=hello2.txt bs=1048576 count=1000
- time for i in {1..100}; do ./md5deep hello.txt; done — this command runs md5deep on your file 100 times and reports the total execution time; we then divide by 100 to get the average time of a single run
After taking average
(10MB)
real 0m0.033s
user 0m0.028s
sys 0m0.007s
(105MB)
real 0m0.309s
user 0m0.284s
sys 0m0.037s
(1.0GB)
real 0m3.727s
user 0m3.556s
sys 0m0.458s
Before taking average
(10MB)
real 0m3.303s
user 0m2.875s
sys 0m0.742s
(105MB)
real 0m30.971s
user 0m28.934s
sys 0m3.921s
(1.0GB)
real 5m13.540s
user 4m55.446s
sys 0m46.945s
The tests above were run without optimization, i.e. at the default -O0. The following command changes all the optimization flags at once:
find -name Makefile | xargs sed -i 's/-O0/-O2/'
Just to make sure your changes were applied, check any one of the Makefiles; you should see -O2 wherever -O0 used to be.
-O2 flag
10 mb file 105 mb file 1 gb file
real: 0m0.041s real: 0m0.387s real: 0m4.48s
user: 0m0.011s user: 0m0.250s user: 0m3.73s
sys: 0m0.005s sys: 0m0.015s sys: 0m0.84s
-O3 flag
10 mb file 105 mb file 1 gb file
real: 0m0.035s real: 0m0.158s real: 0m2.071s
user: 0m0.010s user: 0m0.97s user: 0m1.756s
sys: 0m0.003s sys: 0m0.014s sys: 0m0.77s
Comparing the real times, you can see the -O3 flag makes the program noticeably faster, which is a nice optimization. So I would recommend replacing the -O2 flag with -O3.
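Switching from -O2 to -O3 is the same one-liner pattern as before (shown here on a scratch copy so no real tree is modified; paths are my own):

```shell
# Bump every Makefile's -O2 to -O3, then show the result.
mkdir -p /tmp/o3demo
printf 'CFLAGS = -Wall -g -O2\n' > /tmp/o3demo/Makefile
find /tmp/o3demo -name Makefile | xargs sed -i 's/-O2/-O3/'
cat /tmp/o3demo/Makefile
```

Run the real command from the md5deep source root, then `make clean && make` to rebuild with the new flags.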
Now I will benchmark and compile with optimizations on AArch64.
-O2 flag
10.5 mb file 105 mb file 1.5 gb file
real: 0m0.032s real: 0m0.355s real: 0m4.892s
user: 0m0.025s user: 0m0.333s user: 0m4.651s
sys: 0m0.001s sys: 0m0.028s sys: 0m0.500s
-O3 flag
10.5 mb file 105 mb file 1.5 gb file
real: 0m0.035s real: 0m0.333s real: 0m4.668s
user: 0m0.025s user: 0m0.325s user: 0m4.399s
sys: 0m0.001s sys: 0m0.027s sys: 0m0.326s
That finishes the optimization of md5deep on AArch64 with -O2 and -O3; there it didn't give the expected results. From the bench-marking and optimizations applied on AArch64 and x86_64, we can say that x86_64 responded as expected to the optimization flags -O2 and -O3, so I would certainly recommend using -O3 on x86_64 when building md5deep. gprof profiling also helped in finding the functions that call into other functions, for example multihash_update(unsigned char const*, unsigned int) and hash_final_sha1(void *ctx, unsigned char *sum), but it seemed hard to make any changes there. Changing optimization flags was the alternative that worked overall, so phase 3 of the project will be based on upstreaming my optimization bench-marking.
SPO 600 - Phase 2 - G Profiling
Phase 2 of the project will be pretty long, so I decided to break it down into multiple stages. This stage is about gprof profiling. What is profiling? Profiling is the process of determining how a program is using the resources it consumes. Profiling produces a clear view of the call graph -- the hierarchy of function/procedure/method calls that takes place during the execution of the program.
Resource consumption that can be analyzed during profiling includes:
Time (clock time: total real time, user time, and the amount of time the kernel spent on behalf of the program)
Memory
Temporary storage
Energy (this is a relatively new area of profiling)
There are several profiling tools available. Open Source options include
gprof
perf
oprofile
SystemTap
These tools provide different combinations of profiling capabilities and may provide additional functions. Here we will explore gprof in depth: how do we use gprof on an AArch64 system with md5deep, and how can it help with bench-marking and optimization?
Our first step is to build the software to be profiled with the -pg (profile generation) option, which enables profiling. To do that, we need to modify the Makefile or other build instructions, which can often be done through the CFLAGS or CCOPTS variables. There are two ways of modifying the Makefiles to add the -pg option.
(1) Manually change each and every Makefile in the extracted source folder. There can be multiple Makefiles.
find . -name "Makefile" -> Run this command in the source folder to locate all the Makefiles. For me there were six Makefiles and I had to modify each one of them, which was tedious. NOTE: Makefiles vary from version to version of md5deep. These were mine:
./doc/Makefile
./man/Makefile
./src/Makefile
./tests/testfiles/Makefile
./tests/Makefile
./Makefile
At the end of each Makefile you will see something similar to this
world:
@echo meta-build system.
@echo Making both Linux and Windows distributions
make distclean
./configure CFLAGS="-Wall -W -g -ggdb -O0"
make dist
make windist
The configure line is the one to change; make it:
./configure CFLAGS="-Wall -W -g -pg -ggdb -O2"
I added the -pg option and changed -O0 to -O2 for optimization purposes.
(2) The easier way is to modify all the Makefiles in one shot by running:
./configure CFLAGS="-pg -g -O2" CXXFLAGS="-pg -g -O2"
After this command I unluckily bumped into an error: configure: error: cannot guess build type; you must specify one. It failed and ended unexpectedly. Luckily the error wasn't hard to solve. After some research, I found that the error relates to outdated build metadata: ./configure could no longer recognize aarch64 once I passed it extra parameters. It is a bit strange, because the first time I ran ./configure for installation it gave no error; my guess is that the stock configure run detected aarch64 fine, but adding parameters exposed the stale config files.
NOTE: the explanation of the ./configure error above is just my assumption; there could be some other reason for it too. I researched it and found similar reasoning about the same situation.
The error message itself pointed me to a link for downloading updated config files:
ftp://ftp.gnu.org/pub/gnu/config/
On that page, open the README file; it contains two links. Open both and copy the contents one by one:
http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD
http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD
and save them as new files named config.guess and config.sub (as you can see, the file names are given at the end of each URL). SCP or SFTP those files into your md5deep folder and run the same command again; this time it worked. After that command succeeds, run:
make clean
make
Looking at the output of ./configure and make, you can see that -pg -g -O2 has been added to the g++ invocations. Running the instrumented md5deep then produces a file called gmon.out, which gprof turns into a report:
gprof md5deep > 123.txt
This creates a readable file containing the profiling info. Viewing mine was confusing, because many functions receive multiple calls yet the time recorded for them is 0 seconds. A portion of it is shown below:
granularity: each sample hit covers 4 byte(s) no time propagated
index % time self children called name
0.00 0.00 2/505 MD5Final [4]
0.00 0.00 503/505 MD5Update [2]
[1] 0.0 0.00 0.00 505 MD5Transform [1]
-----------------------------------------------
0.00 0.00 4/4 hash_context_obj::multihash_update(unsigned char const*, unsigned int) [48]
[2] 0.0 0.00 0.00 4 MD5Update [2]
0.00 0.00 503/505
The file is pretty big, and functions receive calls from several other functions, so I ended up focusing on one function:
multihash_update(unsigned char const*, unsigned int)
I don't think it can be improved by modifying the code, since it is already well written; my next approach will be to try different compiler flags, which might affect performance.
Thursday 1 March 2018
SPO600 Project Stage 1
For this project I will be working on the MD5DEEP software package, used in the computer security, system administration, and computer forensics communities to run large numbers of files through several cryptographic digests. Basically, this software computes cryptographic hashes of files to verify that their data is intact; hashing is a method for reducing a large input to a smaller, fixed-size output.
MD5DEEP also supports other digests, including SHA variants, Tiger, and Whirlpool. It has many more features, as described below:
Recursive operation - md5deep can recursively analyze an entire directory tree, that is, compute the MD5 of every file in a directory and in each of its subdirectories.
Comparison mode - md5deep can accept a list of known hashes and compare it with a set of input files. The program can display either the input files that match the list of known hashes or those that do not. Hash sets can be drawn from EnCase, the National Software Reference Library, iLook Investigator, Hashkeeper, md5sum, BSD md5, and other generic hash-generating programs. Users are welcome to add support for reading other formats too!
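As a quick illustration of that fixed-size property, using plain md5sum (which computes the same MD5 digest family md5deep does):

```shell
# Inputs of very different lengths both produce a 32-hex-character digest.
printf 'short' | md5sum
printf 'a much longer input string than the first one' | md5sum
```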
Time estimation - md5deep can produce a time estimate when it's processing very large files.
Piecewise hashing - Hash input files in arbitrary sized blocks
File type mode - md5deep can process only files of a certain type, such as regular files, block devices, etc.
credits:http://md5deep.sourceforge.net/
So my approach will be to explore MD5DEEP, understand its concepts, explain its methods, and try to optimize its code for better performance. I downloaded MD5DEEP from https://github.com/jessek/hashdeep/releases. For information, MD5DEEP is platform- and architecture-dependent, so you might come across platform- or architecture-specific issues; luckily I didn't hit any while downloading and installing it on Linux. The repository hosting the hashdeep source was last updated in 2014, which is pretty old, yet its hashdeep version 4.4 still turns out to be the latest release according to what I could find.
After downloading it to your local machine, go to the directory where the downloaded file is located; there are a few commands we need to run in order to install this software.
-> Extracting the tar.gz
tar xvzf nameofthefile.tar.gz
After extraction you will see a folder for the extracted file; go into that folder and run the following commands:
->sh bootstrap.sh
->./configure
->make
->make install
After installation is done, make a text file with some text in it in the same directory and test with the command md5deep nameofthefile.txt or hashdeep nameofthefile.txt. In the same way you can test the SHA variants, and you can benchmark the timing simply by adding time in front of md5deep, for example:
time md5deep nameofthefile.txt //would result into
5444783fea966d71ed28da359a3cae9 /home/location/nameofthefile.txt
real 0m0.030s
user 0m0.000s
sys 0m0.016s
My next step will be to move this setup to aarch64 and x86_64 and benchmark it there.
Now I have fully set up md5deep on the AArch64 system the same way I did on my local system, with one extra step to copy the tarball over first:
scp C:/path/directory/...tar.gz server@domain.com:/path/directory/
The rest of the steps are the same as before: extract with tar xvzf nameofthefile.tar.gz, go into the extracted folder, and run the following commands:
->sh bootstrap.sh
->./configure
->make
->make install DESTDIR=/home/path/directory
Here we pass a path to make install instead of installing into the default location, because I ran into a PERMISSION DENIED error; you can try installing without DESTDIR and you might succeed. 'make install' complains because you're trying to install into system directories that are only writable by the system administrator. This is actually a good thing, because it prevents you from overwriting system files with your test files.
Our next step is to create three files of different sizes, which can be done with dd: dd if=/dev/urandom of=file.txt bs=1048576 count=10 creates a file of size count*bs filled with randomly generated content (for information, the content will not be human-readable). The command above creates a 10 MB file; by changing count and bs we can create the remaining two files. Here is the output of running the command:
(10mb file)
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.125207 s, 83.7 MB/s
(105mb file)
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 13.4593 s, 7.8 MB/s
(1gb file)
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 12.4558 s, 84.2 MB/s
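The reported sizes are just count × bs. A small sanity check with a deliberately tiny file (the file name is my own; scale count up for the real test files):

```shell
# dd writes exactly count blocks of bs bytes each.
bs=1048576       # 1 MiB block size, same as the commands above
count=1
dd if=/dev/urandom of=/tmp/tiny.dat bs=$bs count=$count 2>/dev/null
wc -c < /tmp/tiny.dat    # should print bs * count = 1048576
```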
Checking the number of lines in the file:
wc -l file.txt
41031 file.txt
Running md5deep on each of the three files 100 times with the command time for i in {1..100}; do md5deep file.txt; done. The times below are totals for 100 runs, so we take the average (divide by 100) to get the time of a single run.
BEFORE AVERAGE
(10 mb file)
real 0m4.260s
user 0m3.586s
sys 0m0.786s
(100mb file)
real 0m39.260s
user 0m34.493s
sys 0m5.738s
(1gb file)
real 6m28.003s
user 5m45.722s
sys 0m51.928s
AFTER AVERAGE
(10 mb file)
real 0m0.04260s
user 0m0.03586s
sys 0m0.00786s
(100mb file)
real 0m0.39260s
user 0m0.34493s
sys 0m0.05738s
(1gb file)
real 0m3.88003s
user 0m3.45722s
sys 0m0.51928s
So my upcoming project blog will include some comparison of md5deep with the hashdeep algorithm and sha256. The main purpose of phase 2 will be to try altered build options when running md5deep, and to make changes in the md5deep code to permit better optimization by the compiler if I can. I will make sure such optimizations and changes don't adversely affect AArch64 systems. I found md5deep on GitHub, and it has files worth looking into, md5.c and md5.h, so I am still getting to grips with its coding style and will start working on it soon.
Monday 26 February 2018
Inline Assembler
This blog will guide you through inline assembler in AArch64 but first let me introduce you to What inline assembler is? In workstation programming, a inline constructing agent is An characteristic about a portion compilers that permits low-level code composed in low level computing construct will be installed inside a program, "around code that generally need been aggregated from a higher-level dialect for example, such that c alternately ada. In simple words, inline assembler allows us to apply assembly language code into our high-level code (such as C) and have it work more efficiently, due to the nature of assembly language.
First we will test the performance (build, compile, and run) of the program below.
//this file is vol_simd.c
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"
int main() {
int16_t* in; // input array
int16_t* limit; // end of input array
int16_t* out; // output array
// these variables will be used in our assembler code, so we're going
// to hand-allocate which register they are placed in
// Q: what is an alternate approach?
register int16_t* in_cursor asm("r20"); // input cursor
register int16_t* out_cursor asm("r21"); // output cursor
register int16_t vol_int asm("r22"); // volume as int16_t
int x; // array iterator
int ttl = 0; // array total
in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
srand(-1);
printf("Generating sample data.\n");
for (x = 0; x < SAMPLES; x++) {
in[x] = (rand()%65536)-32768;
}
// --------------------------------------------------------------------
in_cursor = in;
out_cursor = out;
limit = in + SAMPLES ;
// set vol_int to fixed-point representation of 0.75
// Q: should we use 32767 or 32768 in next line? why?
vol_int = (int16_t) (0.75 * 32767.0);
printf("Scaling samples.\n");
// Q: what does it mean to "duplicate" values in the next line?
__asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h
while ( in_cursor < limit ) {
__asm__ (
"ldr q0, [%[in]],#16 \n\t"
// load eight samples into q0 (v0.8h)
// from in_cursor, and post-increment
// in_cursor by 16 bytes
"sqdmulh v0.8h, v0.8h, v1.8h \n\t"
// multiply each lane in v0 by v1*2
// saturate results
// store upper 16 bits of results into v0
"str q0, [%[out]],#16 \n\t"
// store eight samples to out_cursor
// post-increment out_cursor by 16 bytes
// Q: what happens if we remove the following
// two lines? Why?
: [in]"+r"(in_cursor)
: "0"(in_cursor),[out]"r"(out_cursor)
);
}
// --------------------------------------------------------------------
printf("Summing samples.\n");
for (x = 0; x < SAMPLES; x++) {
ttl=(ttl+out[x])%1000;
}
// Q: are the results usable? are they correct?
printf("Result: %d\n", ttl);
return 0;
}
vol.h contains #define SAMPLES 500000; this is the sample size we are currently using to test the code above.
Compiling it with gcc vol_simd.c and running it with time ./a.out, we get the following output:
Generating sample data.
Scaling samples.
Summing samples.
Result: -894
real 0m0.033s
user 0m0.033s
sys 0m0.000s
Now, changing the sample size in vol.h from 500000 to 2500000, we get:
Generating sample data.
Scaling samples.
Summing samples.
Result: 249
real 0m0.145s
user 0m0.125s
sys 0m0.019s
Without any optimization flag, the plain C code appears to run faster than the inline assembler version. Compiling with the -O3 flag produces roughly comparable run-times for both programs. I think this is because -O3 produces run-times that are not solely code-dependent, whereas with no flags the run-time depends entirely on the quality of the code (and of the hand-written asm) in it.
Q: what is an alternate approach?
An alternate approach would be not to assign specific registers to the variables in_cursor and out_cursor; with this approach, the compiler allocates the registers on its own.
Q: should we use 32767 or 32768 in next line? why?
32767, because it is the maximum value of the int16_t data type.
Q: what does it mean to "duplicate" values here?
It means copying the value in register r22 (vol_int) into every 16-bit lane of the vector v1.8h.
Q: what happens if we remove the following lines? Why?
Those lines bind the program's variables to the named operands used by the assembler code for scaling. Removing them results in:
vol_simd.c: In function ‘int main()’:
vol_simd.c:69:5: error: undefined named operand ‘in’
);
^
vol_simd.c:69:5: error: undefined named operand ‘out’
Q: are the results usable? are they correct?
Yes, they are usable and correct: as we change the scaling value, the output changes accordingly.
Part 2
I have picked BusyBox for this particular section.
A bit of background for BusyBox from the official website:
"BusyBox combines tiny versions of many common UNIX utilities into a single small executable. It provides replacements for most of the utilities you usually find in GNU fileutils, shellutils, etc. The utilities in BusyBox generally have fewer options than their full-featured GNU cousins; however, the options that are included provide the expected functionality and behave very much like their GNU counterparts. BusyBox provides a fairly complete environment for any small or embedded system."
How much assembly-language code is present?
To find out, I used the egrep command and found some assembly language code in the package, mainly in the networking and include directories.
Which platform is it used on?
It builds with gcc, but does not support ARM or x86_64.
Why is it there (what does it do)?
The functions can be found in "networking/tls_pstm_montgomery_reduce.c". The file holds a handful of inline assembler routines whose purpose is to speed up operations on different platforms/architectures.
What happens on other platforms?
The networking directory checks whether the software runs properly on each platform, using its own script based on the inline assembler code.
Your opinion of the value of the assembler code VS the loss of portability/increase in complexity of the code.
As far as I can see, assembler code can be extremely efficient for projects, but it can also be very tedious and draining. The complexity of assembler code, and the need to adapt it for each architecture, make it a burden for large-scale projects and make routine tasks such as debugging and maintenance very slow and taxing. However, assembler code can definitely be essential for projects such as BusyBox that are striving to make all-encompassing operations faster, more efficient and generally better.
Wednesday 21 February 2018
Algorithm Selection using Assembly Language
This blog is based on selection of two algorithms for adjusting the volume of PCM audio samples based on bench-marking of two possible approaches. Digital sound is typically represented, uncompressed, as signed 16-bit integer signal samples.
There is one stream of samples for the left and right stereo channels, at typical sample rates of 44.1 or 48 thousand samples per second, for a total of 88.2 or 96 thousand samples per second. Since there are 16 bits (2 bytes) per sample, the data rate is 88.2 * 1000 * 2 = 176,400 bytes/second (~172 KiB/sec) or 96 * 1000 * 2 = 192,000 bytes/second (~187.5 KiB/sec).
To change the volume of sound, each sample can be scaled by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
On a mobile device, the amount of processing required to scale sound will affect battery life.
Our first step is to start on the ARMv8 AArch64 SPO600 server. Here we have two files, vol1.c and vol.h. vol.h defines 500000 samples, while vol1.c creates 500,000 random "sound samples" in an input array (the number of samples is set in the vol.h file), scales those samples by the volume factor 0.75 and stores them in an output array, sums the output array, and prints the sum.
Method 1 is the simple one: multiply each sample by the floating-point volume factor 0.75 to get the scaled sample value.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"
// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
return (int16_t) (volume_factor * (float) sample);
}
int main() {
// Allocate memory for large in and out arrays
int16_t* in;
int16_t* out;
clock_t startTime, endTime;
double totalTime;
in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
int x;
int ttl = 0;
// Seed the pseudo-random number generator
srand(-1);
// Fill the array with random data
for (x = 0; x < SAMPLES; x++) {
in[x] = (rand()%65536)-32768;
}
startTime = clock();
// ######################################
// This is the interesting part!
// Scale the volume of all of the samples
for (x = 0; x < SAMPLES; x++) {
out[x] = scale_sample(in[x], 0.75);
}
// ######################################
endTime = clock();
// Sum up the data
for (x = 0; x < SAMPLES; x++) {
ttl = (ttl+out[x])%1000;
}
totalTime= (double)(endTime-startTime)/CLOCKS_PER_SEC;
// Print the sum
printf("Result: %d\n", ttl);
printf("CPU time used to scale samples: %1f seconds\n",totalTime);
return 0;
}
The clock function is used to determine the amount of CPU processing time used to calculate the scaled sample values and store them in another array. I will first run all my tests on Aarchie. I compile my program with no optimization, and I use the time command to determine how long it takes to run. The time command shows the real time, the user CPU time, and the system CPU time. Here is the result:
Compiling the code of vol1.c with the command g++ -o vol1 vol1.c
and running it with time ./vol1, we get:
Result: -86
CPU time used to scale samples: -0.032502 seconds
real 0m0.038s
user 0m0.037s
sys 0m0.000s
An alternate approach is to pre-compute a lookup table of all possible sample values multiplied by the volume factor, and look up each sample in that table to get the scaled value.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"
// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
return (int16_t) (volume_factor * (float) sample);
}
int main() {
// Allocate memory for large in and out arrays
int16_t* in;
int16_t* out;
int16_t lookupTable[65536];
clock_t startTime, endTime;
double totalTime;
in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
int x;
int ttl = 0;
for(x=0;x<65536;x++){
lookupTable[x]=(x-32768)*.75;
}
// Seed the pseudo-random number generator
srand(-1);
// Fill the array with random data
for (x = 0; x < SAMPLES; x++) {
in[x] = (rand()%65536)-32768;
}
startTime = clock();
// ######################################
// This is the interesting part!
// Scale the volume of all of the samples
// using the pre-computed lookup table
for (x = 0; x < SAMPLES; x++) {
out[x] = lookupTable[in[x]+32768];
}
// ######################################
endTime = clock();
// Sum up the data
for (x = 0; x < SAMPLES; x++) {
ttl = (ttl+out[x])%1000;
}
totalTime= (double)(endTime-startTime)/CLOCKS_PER_SEC;
// Print the sum
printf("Result: %d\n", ttl);
printf("CPU time used to scale samples: %1f seconds\n",totalTime);
return 0;
}
Compiling and running this code with the same command as previous one we get
Result: -86
CPU time used to scale samples: -0.041233 seconds
real 0m0.046s
user 0m0.046s
sys 0m0.000s
The sum of the samples is the same, but the time to calculate them has increased by about 0.008731 seconds compared to the previous result.
The last and final approach is to convert the volume factor 0.75 to a fixed-point integer by multiplying it by a binary number representing a fixed-point value of "1". For example, you could use 0b100000000 (= 256 in decimal) to represent 1.00. Shift the result to the right by the required number of bits after the multiplication (>>8 if you're using 256 as the multiplier).
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"
// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
return (int16_t) (volume_factor * (float) sample);
}
int main() {
// Allocate memory for large in and out arrays
int16_t* in;
int16_t* out;
clock_t startTime, endTime;
double totalTime;
in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
short volumeFactor = 0b100000000 * .75;
int x;
int ttl = 0;
// Seed the pseudo-random number generator
srand(-1);
// Fill the array with random data
for (x = 0; x < SAMPLES; x++) {
in[x] = (rand()%65536)-32768;
}
startTime = clock();
// ######################################
// This is the interesting part!
// Scale the volume of all of the samples
for (x = 0; x < SAMPLES; x++) {
//out[x] = scale_sample(in[x], 0.75);
out[x]=in[x]*volumeFactor>>8;
}
// ######################################
endTime = clock();
// Sum up the data
for (x = 0; x < SAMPLES; x++) {
ttl = (ttl+out[x])%1000;
}
totalTime= (double)(endTime-startTime)/CLOCKS_PER_SEC;
// Print the sum
printf("Result: %d\n", ttl);
printf("CPU time used to scale samples: %1f seconds\n",totalTime);
return 0;
}
Result: -386
CPU time used to scale samples: 4.167165 seconds
real 0m0.035s
user 0m0.035s
sys 0m0.000s
Here the scaled result differs from the previous values because of the conversion. The time has increased too.
With -O3 optimization, the result and timing vary somewhat from run to run.
1) Result: -126
CPU time used to scale samples: -0.024013 seconds
real 0m0.027s
user 0m0.027s
sys 0m0.000s
2) Result: -942
CPU time used to scale samples: -0.024040 seconds
real 0m0.027s
user 0m0.027s
sys 0m0.000s
3) Result: -506
CPU time used to scale samples: -0.023942 seconds
real 0m0.027s
user 0m0.027s
sys 0m0.000s
The sum of the scaled samples does change after using the -O3 option. The second method took more time than the first to calculate the scaled sample values and store them in another array. The third method is nearly 2 times faster than the first method and nearly 3 times faster than the second. The third method has the shortest real and user times, and the second method the longest, which is the same result as before. After optimization is enabled with -O3, the processing time is drastically reduced for all three methods. Compared with no optimization, the first method is about 3.5 times faster, the second about 2.4 times faster, and the third about 5.4 times faster.
I used “/usr/bin/time -v” to find out how much memory is used to run my program, for all three methods compiled with the -O3 option:
Result: -126
CPU time used to scale samples: -0.023778 seconds
Command being timed: "./lab5"
User time (seconds): 0.01
System time (seconds): 0.00
Percent of CPU this job got: 96%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3240
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 551
Voluntary context switches: 1
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Result: -942
CPU time used to scale samples: -0.023968 seconds
Command being timed: "./lab5_2"
User time (seconds): 0.02
System time (seconds): 0.00
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3364
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 583
Voluntary context switches: 1
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Result: -506
CPU time used to scale samples: -0.023344 seconds
Command being timed: "./lab5_3"
User time (seconds): 0.02
System time (seconds): 0.00
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3244
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 551
Voluntary context switches: 1
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Conclusion:
From the results above, we see that the third method produces the shortest CPU processing time and the second method the longest, regardless of whether optimization is enabled when compiling. We get the same result when we compare the three methods using the real time and user time from the time command. This means the approach to adjusting volume that gives the best performance is the third method: convert the volume factor to a fixed-point number by multiplying it by a binary value representing fixed-point 1.00, multiply that factor by each sample value, and shift the result to the right by the correct number of bits to get the scaled sample value.
There is one stream of samples for the left and right stereo channels, at typical sample rates of 44.1 or 48 thousand samples per second, for a total of 88.2 or 96 thousand samples per second. Since there are 16 bits (2 bytes) per sample, the data rate is 88.2 * 1000 * 2 = 176,400 bytes/second (~172 KiB/sec) or 96 * 1000 * 2 = 192,000 bytes/second (~187.5 KiB/sec).
To change the volume of sound, each sample can be scaled by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
On a mobile device, the amount of processing required to scale sound will affect battery life.
Our first step would be to start with ARMv8 AARch64 SPO600 Server. Here we have a file vol1.c and vol.h. vol.h has define 500000 Samples in it while vol1.c creates 500,000 random "sound samples" in an input array (the number of samples is set in the vol.h file). Scales those samples by the volume factor 0.75 and stores them in an output array.Sums the output array and prints the sum.
Our method 1 would be simple to test the script below simply by multiplying each sample by the floating point volume factor 0.75 to get the scaled sample value
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"
// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
return (int16_t) (volume_factor * (float) sample);
}
int main() {
// Allocate memory for large in and out arrays
int16_t* in;
int16_t* out;
clock_t startTime, endTime;
double totalTime;
in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
int x;
int ttl;
// Seed the pseudo-random number generator
srand(-1);
// Fill the array with random data
for (x = 0; x < SAMPLES; x++) {
in[x] = (rand()%65536)-32768;
}
startTime = clock();
// ######################################
// This is the interesting part!
// Scale the volume of all of the samples
for (x = 0; x < SAMPLES; x++) {
out[x] = scale_sample(in[x], 0.75);
}
// ######################################
startTime = clock();
// Sum up the data
for (x = 0; x < SAMPLES; x++) {
ttl = (ttl+out[x])%1000;
}
totalTime= (double)(endTime-startTime)/CLOCKS_PER_SEC;
// Print the sum
printf("Result: %d\n", ttl);
printf("CPU time used to scale samples: %1f seconds\n",totalTime);
return 0;
}
The clock function is used to determine the quantity of CPU processing time this is used to calculate the scaled sample values and shop them in any other array. i will first run all my tests on Aarchie. I collect my program with no optimization. I take advantage of the time command to determine how long it takes to run my software. The time command shows the real time, the person CPU time and the system CPU time. here is the end result:
By compiling the code of vol1.c using command g++ -o vol1 vol1.c
running it with:- time ./vol1 we get
Result: -86
CPU time used to scale samples: -0.032502 seconds
real 0m0.038s
user 0m0.037s
sys 0m0.000s
An alternate approach to this approach would be to pre-create a lookup table of all possible sample values multiplies by the volume factor, and look up each sample in that table to get the scaled values
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"
// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
return (int16_t) (volume_factor * (float) sample);
}
int main() {
// Allocate memory for large in and out arrays
int16_t* in;
int16_t* out;
int16_t lookupTable[65536];
clock_t startTime, endTime;
double totalTime;
in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
int x;
int ttl;
for(x=0;x<65536;x++){
lookupTable[x]=(x-32768)*.75;
}
// Seed the pseudo-random number generator
srand(-1);
// Fill the array with random data
for (x = 0; x < SAMPLES; x++) {
in[x] = (rand()%65536)-32768;
}
for(x=0;x<SAMPLES;x++){
out[1]=lookupTable[in[1]+32768];
}
startTime = clock();
// ######################################
// This is the interesting part!
// Scale the volume of all of the samples
for (x = 0; x < SAMPLES; x++) {
out[x] = scale_sample(in[x], 0.75);
}
// ######################################
startTime = clock();
// Sum up the data
for (x = 0; x < SAMPLES; x++) {
ttl = (ttl+out[x])%1000;
}
totalTime= (double)(endTime-startTime)/CLOCKS_PER_SEC;
// Print the sum
printf("Result: %d\n", ttl);
printf("CPU time used to scale samples: %1f seconds\n",totalTime);
return 0;
}
Compiling and running this code with the same command as previous one we get
Result: -86
CPU time used to scale samples: -0.041233 seconds
real 0m0.046s
user 0m0.046s
sys 0m0.000s
The result summing of the samples is the same but time to calculate those samples has pretty much increased by 0.008731 seconds compared to previous result
The last and final approach is to convert the volume factor 0.75 to a fix-point integer by multiplying by a binary number representing a fixed-point value "1". For example, you could use 0b100000000 (= 256 in decimal) to represent 1.00. Shift the result to the right the required number of bits after the multiplication (>>8 if you're using 256 as the multiplier).
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include "vol.h"
// Function to scale a sound sample using a volume_factor
// in the range of 0.00 to 1.00.
static inline int16_t scale_sample(int16_t sample, float volume_factor) {
return (int16_t) (volume_factor * (float) sample);
}
int main() {
// Allocate memory for large in and out arrays
int16_t* in;
int16_t* out;
clock_t startTime, endTime;
double totalTime;
in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
short volumeFactor = 0b100000000 * .75;
int x;
int ttl;
// Seed the pseudo-random number generator
srand(-1);
// Fill the array with random data
for (x = 0; x < SAMPLES; x++) {
in[x] = (rand()%65536)-32768;
}
startTime = clock();
// ######################################
// This is the interesting part!
// Scale the volume of all of the samples
for (x = 0; x < SAMPLES; x++) {
//out[x] = scale_sample(in[x], 0.75);
out[x]=in[x]*volumeFactor>>8;
}
// ######################################
startTime = clock();
// Sum up the data
for (x = 0; x < SAMPLES; x++) {
ttl = (ttl+out[x])%1000;
}
totalTime= (double)(endTime-startTime)/CLOCKS_PER_SEC;
// Print the sum
printf("Result: %d\n", ttl);
printf("CPU time used to scale samples: %1f seconds\n",totalTime);
return 0;
}
Result: -386
CPU time used to scale samples: 4.167165 seconds
real 0m0.035s
user 0m0.035s
sys 0m0.000s
Here the result of scaled value is different from previous values because of conversion. Time has increased too.
By implementing -O3 optimization our result and timing quite varies each time.
1) Result: -126
CPU time used to scale samples: -0.024013 seconds
real 0m0.027s
user 0m0.027s
sys 0m0.000s
2) Result: -942
CPU time used to scale samples: -0.024040 seconds
real 0m0.027s
user 0m0.027s
sys 0m0.000s
3) Result: -506
CPU time used to scale samples: -0.023942 seconds
real 0m0.027s
user 0m0.027s
sys 0m0.000s
The entirety of all scaled tests does alter after utilizing -O3 choice. For the moment strategy, it took approximately more time than the to begin with strategy to calculate the scaled test values and store them in another cluster. For the third strategy, it is approximately nearly 2 times speedier than the to begin with strategy and nearly 3 times quicker than the moment strategy. The third strategy has the most brief genuine time and client time and the moment strategy has the longest genuine time and client time, which are the same comes about as some time recently. After optimization is empowered utilizing -O3 choice, the preparing time has been definitely diminished for all three strategies. When I compare to no optimization, the to begin with strategy is approximately 3.5 times speedier, the moment strategy is almost 2.4 times speedier and the third strategy is approximately 5.4 times quicker.
Next, I used “/usr/bin/time -v” to find out how much memory is used to run my program, for all three methods compiled with the -O3 option:
Result: -126
CPU time used to scale samples: -0.023778 seconds
Command being timed: "./lab5"
User time (seconds): 0.01
System time (seconds): 0.00
Percent of CPU this job got: 96%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3240
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 551
Voluntary context switches: 1
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Result: -942
CPU time used to scale samples: -0.023968 seconds
Command being timed: "./lab5_2"
User time (seconds): 0.02
System time (seconds): 0.00
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3364
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 583
Voluntary context switches: 1
Involuntary context switches: 0
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Result: -506
CPU time used to scale samples: -0.023344 seconds
Command being timed: "./lab5_3"
User time (seconds): 0.02
System time (seconds): 0.00
Percent of CPU this job got: 100%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.02
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3244
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 551
Voluntary context switches: 1
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Conclusion:
From the results above, we see that the third method produces the shortest CPU processing time and the second method produces the longest CPU processing time, regardless of whether optimization is enabled when compiling my program. We get the same results when we compare the three methods using the real time and user time from the time command. This means that the volume-scaling approach with the best performance is the third method. The third method converts the volume factor to a fixed-point number by multiplying it by a binary number representing a fixed point, multiplies each sample value by this factor, and shifts the result to the right by the correct number of bits to obtain the scaled sample value.
Friday 16 February 2018
Exploring single instruction/multiple data (SIMD) vectorization, and the auto-vectorization capabilities of the GCC compiler
What is SIMD Vectorization?
A vector is an instruction operand holding a set of data elements packed into a one-dimensional array. The elements can be integer or floating-point values. Most Vector/SIMD instructions operate on vector operands. Vectors are also known as SIMD operands or packed operands.
What is Auto Vectorization?
Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once. So the general purpose of this post is to show how to implement SIMD vectorization and auto-vectorization in C code, to understand it by breaking the code down into assembly language, and to compile it with the GCC compiler to learn about its auto-vectorization capabilities. We will create a short program that fills two 1000-element integer arrays with random numbers in the range -1000 to +1000, then sums those two arrays element-by-element into a third array, and finally sums the third array and prints the result. So, our first step is to create such a program.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main() {
int array1[1000];
int array2[1000];
long array3[1000];
long arraySum = 0;
srand(time(NULL));
for (int i = 0; i < 1000; i++) {
array1[i] = (rand()% 2001) - 1000;
array2[i] = (rand()% 2001) - 1000;
array3[i] = array1[i] + array2[i];
arraySum += array3[i];
}
printf("The total array sum is: %li\n", arraySum);
return 0;
}
Compiling this code with the GCC compiler using the command
gcc -O3 -fopt-info-vec-missed=vect_v0.miss vect_v0.c -o vect_v0
gives us something like this:
vector.c:14:1: note: not vectorized: loop contains function calls or data references that cannot be analyzed
vector.c:12:1: note: not vectorized: not enough data-refs in basic block.
vector.c:16:22: note: not vectorized: not enough data-refs in basic block.
vector.c:14:1: note: not vectorized: not enough data-refs in basic block.
vector.c:20:9: note: not vectorized: not enough data-refs in basic block.
But if we make a few changes in our code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int main() {
int array1[1000];
int array2[1000];
long array3[1000];
long arraySum = 0;
srand(time(NULL));
for (int i = 0; i < 1000; i++) {
array1[i] = (rand()% 2001) - 1000;
array2[i] = (rand()% 2001) - 1000;
}
for (int i = 0; i < 1000; i++) {
array3[i] = array1[i] + array2[i];
arraySum += array3[i];
}
printf("The total array sum is: %li\n", arraySum);
return 0;
}
After compilation with the same command, gcc -O3 -fopt-info-vec-missed=vect_v0.miss vect_v0.c -o vect_v0, the report now shows:
note: loop vectorized
Here we can see that our loop got vectorized: the rand() calls that previously blocked analysis are now isolated in their own loop, leaving a simple arithmetic loop the compiler can vectorize.
Now let's disassemble the auto-vectorized SIMD code using objdump -d <yourfile>:
0000000000400560 <main>:
// Here we reserve space on the stack for local variables
400560: d283f010 mov x16, #0x1f80 // #8064
400564: cb3063ff sub sp, sp, x16
400568: d2800000 mov x0, #0x0 // #0
40056c: a9007bfd stp x29, x30, [sp]
400570: 910003fd mov x29, sp
400574: a90153f3 stp x19, x20, [sp, #16]
400578: 529a9c74 mov w20, #0xd4e3 // #54499
40057c: a9025bf5 stp x21, x22, [sp, #32]
400580: 72a83014 movk w20, #0x4180, lsl #16
400584: f9001bf7 str x23, [sp, #48]
400588: 910103b5 add x21, x29, #0x40
40058c: 913f83b6 add x22, x29, #0xfe0
400590: 5280fa33 mov w19, #0x7d1 // #2001
400594: d2800017 mov x23, #0x0 // #0
400598: 97ffffd6 bl 4004f0 <time@plt>
40059c: 97ffffe9 bl 400540 <srand@plt>
4005a0: 97ffffdc bl 400510 <rand@plt>
4005a4: 9b347c01 smull x1, w0, w20
4005a8: 9369fc21 asr x1, x1, #41
4005ac: 4b807c21 sub w1, w1, w0, asr #31
4005b0: 1b138020 msub w0, w1, w19, w0
4005b4: 510fa000 sub w0, w0, #0x3e8
4005b8: b8376aa0 str w0, [x21, x23]
4005bc: 97ffffd5 bl 400510 <rand@plt>
4005c0: 9b347c01 smull x1, w0, w20
4005c4: 9369fc21 asr x1, x1, #41
4005c8: 4b807c21 sub w1, w1, w0, asr #31
4005cc: 1b138020 msub w0, w1, w19, w0
4005d0: 510fa000 sub w0, w0, #0x3e8
4005d4: b8376ac0 str w0, [x22, x23]
4005d8: 910012f7 add x23, x23, #0x4
4005dc: f13e82ff cmp x23, #0xfa0
4005e0: 54fffe01 b.ne 4005a0 <main+0x40> // b.any
4005e4: 4f000401 movi v1.4s, #0x0
4005e8: d2800000 mov x0, #0x0 // #0
4005ec: 3ce06ac0 ldr q0, [x22, x0]
4005f0: 3ce06aa2 ldr q2, [x21, x0]
4005f4: 91004000 add x0, x0, #0x10
4005f8: f13e801f cmp x0, #0xfa0
// This is what it's all for: vector addition
4005fc: 4ea28400 add v0.4s, v0.4s, v2.4s
400600: 0ea01021 saddw v1.2d, v1.2d, v0.2s
400604: 4ea01021 saddw2 v1.2d, v1.2d, v0.4s
400608: 54ffff21 b.ne 4005ec <main+0x8c> // b.any
40060c: 5ef1b821 addp d1, v1.2d
400610: 90000000 adrp x0, 400000 <_init-0x4b8>
400614: 91200000 add x0, x0, #0x800
// Move the first and second 64-bit elements from vector 1 to two separate registers
// This might be so that they can be used as arguments for printf?
400618: 4e083c21 mov x1, v1.d[0]
40061c: 97ffffcd bl 400550 <printf@plt>
400620: f9401bf7 ldr x23, [sp, #48]
400624: a94153f3 ldp x19, x20, [sp, #16]
400628: 52800000 mov w0, #0x0 // #0
40062c: a9425bf5 ldp x21, x22, [sp, #32]
400630: d283f010 mov x16, #0x1f80 // #8064
400634: a9407bfd ldp x29, x30, [sp]
400638: 8b3063ff add sp, sp, x16
40063c: d65f03c0 ret
Although GCC's auto-vectorization can raise performance, it may not be useful for every application, and it cannot be blindly trusted. There are many limitations and conditions to consider with auto-vectorization: GCC needs assurance that arrays do not overlap and that data is aligned, and code may well have to be rewritten to simplify the loop structure.