Just did a pretty neat trick for some CPU code I had to speed up. The solution comes from GPU programming, where ‘if’ statements in code are bad. Well, branching is bad anywhere, but on a GPU the evil is really exposed.

This snippet is called a **lot**. According to gprof, 99.61% of program time is spent in the method that contains the code below. As a benchmark, for argument’s sake, we will say that the total runtime of the code below is **26 seconds**.

```
 1  for (int i = 0; i < F->sizeQ; i++)
 2  {
 3      for (int j = i+1; j < F->sizeQ; j++)
 4      {
 5          bool sum = true;
 6          // Only sum over different values for self node (ie. 1 and -1)
 7          if (F->Q[i].integer_array[index] != F->Q[j].integer_array[index])
 8          {
 9              for (int m = 0; m < index; m++)
10              {
11                  if (F->Q[i].integer_array[m] != F->Q[j].integer_array[m])
12                  {
13                      sum = false;
14                      continue;
15                  }
16              }
17              for (int n = index+1; n < F->in_node_cnt; n++)
18              {
19                  if (F->Q[i].integer_array[n] != F->Q[j].integer_array[n])
20                  {
21                      sum = false;
22                      continue;
23                  }
24              }
25          }
26          else
27          {
28              sum = false;
29              continue;
30          }
31
32          if (sum)
33          {
34              // Do some computational work that barely registers on a 1 hour profile.
35              // Code REMOVED
36          }
37      }
38  }
```

There are quite a few things here that are examples of pretty poor programming. Or rather, the author wrote the code and just left it at that once it worked.

To make this post a little faster, I’ll summarise the changes I made before I did the cool trick.

1. Change the *continue* statements on lines 14 and 22 to *break* – Effect: **16.6** seconds

2. Instead of the caching mess that is the `F->Q[i].integer_array[j]` statements (these arrays are constantly deleted and new’d elsewhere in the code), I formulate a local `int **array` – Effect: **12** seconds

3. **gcov** showed the *if* statement on line 19 being hit many more times than the one on line 11, so reorder the two inner loops – Effect: **11.3** seconds

4. The “pivot” index in the new `int **array` variable is moved to `array[0]` (on creation), so the `m = 1:index-1` and `n = index+1:end` loops can be merged into a single loop – Effect: **10.2** seconds

5. `array[0]` (the “pivot” entry) is removed completely into a separate boolean array, AND

6. Inspection of the elements of the remaining `int **array` shows that the values are only ever ‘-1’ and ‘1’ – so it becomes a boolean array, AND

7. Using the reverse of the trick I use with Binary Arrays, each boolean row is packed into a single integer (an `int *array`) – Effect: **5.0** seconds

At this stage, the code now looks like:

```cpp
/*
 * Take a boolean array of length n and create a single integer.
 *
 * Still looking for a better way to do this.
 */
int PackBools(bool* v, int n)
{
    int mask = 1, result = 0;
    for (int i = 0; i < n; i++, mask <<= 1)
    {
        if (v[i] == true)
            result |= mask;
    }
    return result;
}
```

```
 1  for (int i = 0; i < F_sizeQ; i++)
 2  {
 3      counter = 0;
 4
 5      // Add all elements on left of pivot
 6      for (int j = 0; j < index; j++)
 7      {
 8          if (F->q[i].node_vals_ptr[j] > 0)
 9              array[counter] = true;
10          else
11              array[counter] = false;
12          counter++;
13      }
14
15      // Put the pivot in the first element of the row
16      if (F->q[i].node_vals_ptr[index] > 0)
17          doWork[i] = true;
18      else
19          doWork[i] = false;
20
21      // Add all elements on right of pivot
22      for (int j = index+1; j < F_maxSize; j++)
23      {
24          if (F->q[i].node_vals_ptr[j] > 0)
25              array[counter] = true;
26          else
27              array[counter] = false;
28          counter++;
29      }
30
31      // Turn the boolean array back into a single integer
32      final[i] = PackBools(array, F_maxSize-1);
33  }
34  for (int i = 0; i < F_sizeQ; i++)
35  {
36      for (int j = i+1; j < F_sizeQ; ++j)
37      {
38          if (doWork[i] ^ doWork[j])
39          {
40              if (final[i] == final[j])
41              {
42                  // Do that barely registering compute work
43                  // Code removed
44              }
45          }
46      }
47  }
```

The VTune results say that 90% of the runtime is still being spent on lines 36 (0.95 s), 38 (2.23 s) and 40 (1.36 s).

To remove the if statements, replace all code from line 34 to 47 with:

```
34  bool toProcess;
35  int *processIndexes = new int[F_sizeQ*(F_sizeQ+1)/2];
36  counter = 0;
37  for (int i = 0; i < F_sizeQ; i++)
38  {
39      for (int j = i+1; j < F_sizeQ; ++j)
40      {
41          toProcess = (final[i] == final[j]) && (doWork[i] ^ doWork[j]);
42          processIndexes[counter] = j*F_sizeQ + i;
43
44          // Increment the counter if the result is valid
45          counter += toProcess;
46      }
47  }
48
49  int i, j;
50  for (int m = 0; m < counter; m++)
51  {
52      i = processIndexes[m] / F_sizeQ;
53      j = processIndexes[m] % F_sizeQ;
54
55      // Do the, now registering, compute work
56  }
```

Total Compute time is now down to **1.5** seconds.

I thought the counter increment and array indexing were cool. I now have an array containing only the indexes that need to be processed. On line 41, the most common ‘if’ fail short-circuits the `&&` early, so the counter is never advanced and the speculative write on line 42 is simply overwritten on the next iteration.

It should be noted that the nested loops above are still responsible for most of the runtime. When I parallelised the rest of the code on a compute node (dual Xeon X5650s), running with hyper-threading turned on made the code 25% slower than running without it – that is, 24 logical CPUs compared to 12 physical cores. One iteration of the loop on line 39 takes three (3) cycles.

The rest of the code above was then fine-tuned and parallelised with pthread.h. The result is that the jobs should take about 155 hours to complete. That is a big reduction on the original, estimated, Matlab code time of 186,000 hours (21.2 years).