I’m going to start with a piece of generic C/C++ code and walk through the steps of working out where the time is being spent, and how to use the profiling results to speed things up.
The best place to start is with a version of the code that takes about 30 seconds to a minute to run, so that you don’t spend an eternity waiting for each run to complete. Keep the turnaround times short to begin with. If you find that for some reason you are not capturing the data properly, or that something is missing, it can be useful to extend this time; but it is not until I have gone through several iterations of profiling, optimising, and re-running that I move to much longer run times (e.g. hours).
dwyer3@lyra:~/Projects/HPC_Tutorials/Profiling/Water> make
icpc -O2 -c SPH_Water.cpp -o SPH_Water.o
icpc -O2 SPH_Water.o -o Water -limf
Binary created!!
dwyer3@lyra:~/Projects/HPC_Tutorials/Profiling/Water> ./Water
Timing: 38263 milliseconds - 249 frames
The next thing to do is to turn on the profiling flags.
Compile your program with -pg
dwyer3@lyra:~/Projects/HPC_Tutorials/Profiling/Water> make
icpc -O2 -pg -c SPH_Water.cpp -o SPH_Water.o
icpc -O2 SPH_Water.o -o Water -limf -pg
Binary created!!
Run the newly compiled code.
dwyer3@lyra:~/Projects/HPC_Tutorials/Profiling/Water> ./Water
Timing: 38057 milliseconds - 249 frames
To start getting profiling information out, run the following commands:
gprof Water gmon.out -p > p.txt
gprof Water gmon.out -q > q.txt

The -p option writes the flat profile and -q writes the call graph.
Examining gprof Outputs
It can be seen from Fig. 1 (examining p.txt) that one method is taking up most of the time: of the 38 seconds of runtime, 99.8% was spent in the function Run(). This is not helpful (for this particular case), and q.txt tells us similar information.
Don’t go thinking that gprof is useless; it isn’t. Rather, this is not a great example for it. Another gprof capture, from a different project, is presented in Fig. 2.
Figure 2 shows that the std::pow function is being called many, many times. Other interesting information here is that 3.57% of program runtime is spent just making calls to the function. All the std::pow function is doing, in this particular case, is offloading the work to the Intel math library for its implementation of the power function. The user (me) must have compiled with -limf and #include “mathimf.h” instead of the usual -lm and #include “math.h”. Is there a way to avoid calling std::pow and short-circuit directly to Intel’s version? The Intel version of pow took an additional 22.5% of program runtime, meaning 26.06% (3.57 + 22.5) of program runtime was spent in the power function.
Figure 3 is better in that it shows a breakdown across multiple methods; no one method is consuming a large proportion of compute time.
There are two methods in Figure 4 that are taking up most of the time. One is being called about 1.7 billion times, while the other is called a mere 1000 times. Which do you work with?
Getting back to Figure 1: we need more information, so we now turn to other profiling tools. The best of these is Intel’s VTune (Amplifier).
Intel VTune Amplifier
To start Intel VTune Amplifier:
module load intel
module load intel/vtune
amplxe-gui
You should be greeted with Figure 5.
When you want to profile your code for the first time, you will want to create a new project:
File -> New -> Project
You will also want to make sure that VTune can find your source code. It is not strictly necessary, but I really, really recommend it. This doesn’t work 100% of the time; VTune still contains a lot of bugs, but keep trying. Sometimes you will need to point to the source code, run the profiling, and repeat until VTune eventually locates your source. Other times you need to exit VTune and start it again, defining a new project.
Update: it seems this has mostly been fixed. I find that I have to go through every entry in the “Search directories for” combobox, selecting where the code is and ticking the subdirectory checkbox for each. Once all of that is done, don’t click OK there; instead go back to the “Target” tab and click OK.
Another thing: do not compile the binary with any optimisation flags; compile only with “-g”. (This seems to me a bit of a problem, but I’m not going into that here.)
You should have defined your project adequately for now.
Press OK and start a new tuning event by
File -> New -> Analysis
At this point you could run any of the options, but I have found that you are better off creating a tuning event yourself.
Click the “New” button at the bottom left of the GUI.
Once you have done this, click START on the right.
The results appear as shown in Figure 9.
If you double-click on the time-consuming entry (in my case |>Water), and VTune was able to locate your source code, you will be able to scroll through your source line by line, with a counter on each line indicating how much CPU time was spent there. VERY USEFUL. If you see assembly code instead, VTune was unable to locate your source and you may need to go back to Figure 7 and rerun the custom analysis.
Figure 11 is interesting: it shows the VTune results for the code from Figure 4. The thing to note is that more time is spent manipulating the points array than on the actual computation. This is where you would start examining cache misses and the like; that sort of exercise is beyond the scope of this investigation. Stay tuned.