Feb 17 2011
 

Vector dependence or inefficient

Dependence

Please keep in mind that there is a lot more to vector dependence than is covered here. At this stage I am only going to give the quick and nasty shortcuts, which you may or may not have success with.

Ok, so you’ve been compiling with -vec-report3 and given the following snippet:

for (int j = 0; j<N; j++)
{
    // Invert and multiply the delta_k_current
    input[j] = (S_t[i]-m_k_current[j])*delta_k_current[j];
}

the compiler will tell you:


sourcefile.cpp(231) (col. 13): remark: loop was not vectorized: existence of vector dependence.
sourcefile.cpp(234) (col. 17): remark: vector dependence: assumed FLOW dependence between (unknown) line 234 and (unknown) line 234.
sourcefile.cpp(234) (col. 17): remark: vector dependence: assumed ANTI dependence between (unknown) line 234 and (unknown) line 234.

You have examined your structures and, for the life of you, you see no such dependency.

Try this:

#pragma ivdep
for (int j = 0; j<N; j++)
{
    // Invert and multiply the delta_k_current
    input[j] = (S_t[i]-m_k_current[j])*delta_k_current[j];
}

Recompile and if you were right and lucky:

sourcefile.cpp(231) (col. 13): remark: LOOP WAS VECTORIZED.
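A common reason for the "assumed" dependence is that the compiler cannot prove the output array does not overlap the arrays being read. Below is a minimal sketch of the same loop as a standalone function, assuming the three arrays really are distinct. The function name `scale_deltas`, the `double` element type and the `__restrict` qualifiers are my own illustrative additions, not part of the original code. `__restrict` is a compiler extension (accepted by the Intel, GCC, Clang and MSVC compilers), not standard C++.

```cpp
#include <cstddef>

// Hypothetical standalone version of the loop above (names and types assumed).
// __restrict promises the compiler that the pointers never alias each other.
// If that promise is false, the behaviour is undefined, so only use it
// when you have actually checked the data structures.
void scale_deltas(double* __restrict input,
                  const double* __restrict m_k_current,
                  const double* __restrict delta_k_current,
                  double s_t_i,            // S_t[i], constant inside the j loop
                  std::size_t n)
{
#pragma ivdep  // Intel hint: discard assumed dependences; other compilers may ignore it
    for (std::size_t j = 0; j < n; j++)
    {
        input[j] = (s_t_i - m_k_current[j]) * delta_k_current[j];
    }
}
```

Both hints only discard dependences the compiler assumes; if a real dependence exists, the vectorised loop can silently produce wrong results.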

Inefficient

Another good one is in the following example:

for (int k = 0; k<N; k++)
{
    if (prob_index[k])
        fc[k] = fcb[SIZE-N+k];
}

The compiler will tell you:

sourcefile.cpp(290) (col. 17): remark: loop was not vectorized: vectorization possible but seems inefficient.

When I run this code snippet (keeping in mind there is other stuff in the program as well), my elapsed time is 12.2448 seconds.
I understand that it is possible that the loop is not big enough (in this case N=30; a low trip count) so vectorisation may seem inefficient, but let’s say that I do want it to vectorise. Add the following:

#pragma vector always
for (int k = 0; k<N; k++)
{
    if (prob_index[k])
        fc[k] = fcb[SIZE-N+k];
}

I now get (from the compiler):

sourcefile.cpp(290) (col. 17): remark: LOOP WAS VECTORIZED.

My elapsed time is now 8.7879 seconds. ~40% speedup? I’ll take that, thank you very much.
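To try this in isolation, here is a minimal sketch of the conditional copy as a standalone function. The name `masked_copy` and the element types are assumptions for illustration, since the original declarations are not shown. The result is the same with or without the pragma, so the two builds can be timed against each other.

```cpp
#include <cstddef>

// Hypothetical standalone version of the loop above (names and types assumed).
// Copies the last n elements of fcb (which holds `size` elements) into fc,
// but only where prob_index[k] is non-zero; other elements of fc are untouched.
void masked_copy(double* fc, const double* fcb, const int* prob_index,
                 std::size_t size, std::size_t n)
{
#pragma vector always  // Intel hint: override the efficiency heuristics
    for (std::size_t k = 0; k < n; k++)
    {
        if (prob_index[k])
            fc[k] = fcb[size - n + k];
    }
}
```

Whether the forced vectorisation pays off depends on the trip count and the hardware, so measure rather than assume.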

Update: 13 March 2012

I’ve just learnt something new.

Just found, tested and implemented:

#pragma simd

You can declare functions as simd(ified) by either specifying the length or opting for automatic detection.
You can do the same with loops.

Why?

Consider the following loop, which I’ve been trying to get to make full use of the hardware (X5650):

for (int i = 0 ; i < n ; i ++)
{
  a[i][0] = (b[i][0] - b[i+1][0]);
  a[i][1] = (b[i][1] - b[i+1][1]);
}

When I compile with “-vec-report2”:

loop was not vectorized: existence of vector dependence

But I know that this is fine. So I force the vectorisation:

#pragma ivdep
for (int i = 0 ; i < n ; i ++)
{
  a[i][0] = (b[i][0] - b[i+1][0]);
  a[i][1] = (b[i][1] - b[i+1][1]);
}

But I get:

loop was not vectorized: vectorization possible but seems inefficient.

Which is not actually true (from experience).
I would like to be able to chain the vectorisation pragmas together (something like: “#pragma ivdep & vector always”) but I cannot. The “#pragma vector always” will override the low trip count the compiler detects.

Try this:

#pragma simd
for (int i = 0 ; i < n ; i ++)
{
  a[i][0] = (b[i][0] - b[i+1][0]);
  a[i][1] = (b[i][1] - b[i+1][1]);
}


SIMD LOOP WAS VECTORIZED.

Loop execution time (for sample) reduced to 12.5% of original runtime. Results exact. Sick.
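`#pragma simd` is specific to the Intel compiler. OpenMP 4.0 standardised a similar directive, `#pragma omp simd`, which other compilers also understand. The sketch below shows the same loop written that way; it is a swapped-in alternative, not what I used for the timings above. The function name `row_differences` and the two-doubles-per-row layout are assumptions for illustration.

```cpp
#include <cstddef>

// Hypothetical standalone version of the loop above, assuming rows of two doubles.
// b must have at least n + 1 rows because row i + 1 is read.
void row_differences(double (*a)[2], const double (*b)[2], std::size_t n)
{
#pragma omp simd  // asserts that the iterations are safe to execute in SIMD lanes
    for (std::size_t i = 0; i < n; i++)
    {
        a[i][0] = b[i][0] - b[i + 1][0];
        a[i][1] = b[i][1] - b[i + 1][1];
    }
}
```

The directive only takes effect when OpenMP SIMD support is switched on (for example -fopenmp-simd with GCC or Clang, -qopenmp-simd with the Intel compiler). Without it the pragma is ignored and the loop is compiled as ordinary code. As with `#pragma simd`, the compiler takes your word that vectorising is safe, so the results are only correct if that is true.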


Syntax of Hint : Semantics
#pragma ivdep: discard assumed data dependences
#pragma vector always: override efficiency heuristics
#pragma vector nontemporal: enable streaming stores
#pragma vector [un]aligned: assert [un]aligned property
#pragma novector: disable vectorisation
#pragma distribute point: suggest point for loop distribution
#pragma loop count ([int]): estimate trip count
restrict: assert exclusive access through pointer
__declspec(align([int],[int])): suggest memory alignment
__assume_aligned([var],[int]): assert alignment property
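A rough sketch of how a few of these hints can sit together on one loop. Everything here is illustrative: the function name `add_arrays`, the 16-byte alignment and the trip count of 30 are assumptions, and `__restrict` stands in for `restrict`, which is not a C++ keyword. The `__assume_aligned` calls are guarded so that the code still compiles with non-Intel compilers.

```cpp
#include <cstddef>

// Hypothetical example combining three hints from the table.
// Assumes dst and src are 16-byte aligned, never overlap,
// and that n is usually around 30.
void add_arrays(float* __restrict dst, const float* __restrict src, std::size_t n)
{
#if defined(__INTEL_COMPILER)
    __assume_aligned(dst, 16);  // assert alignment property (Intel only)
    __assume_aligned(src, 16);
#endif
#pragma loop count (30)         // estimate trip count (Intel hint, ignored elsewhere)
    for (std::size_t k = 0; k < n; k++)
    {
        dst[k] += src[k];
    }
}
```

The caller has to honour the alignment claim, for example by declaring the arrays with `alignas(16)`; asserting an alignment that does not hold can crash or give wrong results.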

Extra Reading

The Software Vectorization Handbook
Applying Multimedia Extensions for Maximum Performance
by Aart J.C. Bik

 

SIMD:
Intel VEC SIMD
Good little PDF
