0:00

Test what kind of loops the Intel compiler can vectorize.

To test vectorization, we are going to write our own function.

Let's call it MyFunction;

the name really doesn't matter.

It accepts a single input, the integer n.

With this input, we will declare two arrays, A of size n and B of size n.

I will initialize these arrays using array notation, which is a Cilk Plus language extension supported by Intel compilers:

A from zero, n elements, equals B from zero, n elements, equals one, and I will return A of two.

I have to return something to make sure that this entire code is not eliminated by the compiler through dead code elimination.
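As a rough sketch, the function might look like this in standard C++. The element type double and the exact body are assumptions; the lecture's actual initialization uses Cilk Plus array notation (A[0:n] = B[0:n] = 1), which requires the Intel compiler, so a plain-loop equivalent is shown instead.

```cpp
#include <vector>

// Hedged sketch of the lecture's test function; double as the
// element type is an assumption.
double MyFunction(int n) {
    std::vector<double> A(n), B(n);
    // Plain-loop equivalent of the Cilk Plus notation A[0:n] = B[0:n] = 1.
    for (int i = 0; i < n; i++) {
        A[i] = 1.0;
        B[i] = 1.0;
    }
    // Return something so the compiler cannot remove the whole body
    // through dead code elimination.
    return A[2];
}
```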

To compile this code, I will execute a shell command invoking the Intel C++ compiler.

Instead of compiling into an object file,

I will produce assembly with -S.

I will request an optimization report

with -qopt-report, and then I indicate the source file name.
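As a sketch, the full command might look like this (the source file name worker.cc and the icpc driver name are assumptions, inferred from the worker.s file mentioned later):

```shell
# -S            produce an assembly listing (worker.s) instead of an object file
# -qopt-report  write an optimization report (worker.optrpt)
icpc -S -qopt-report worker.cc
```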

The command has succeeded, and it says that

optimization reports are generated in .optrpt files in the output location.

Let's see what new files we have.

So in addition to the code,

now we have the optimization report and the assembly listing in the .s file.

The optimization report is just a text file, and what

you find here is that the loop in line three was vectorized.

Furthermore, there was a remainder loop which was not vectorized; the remainder loop is

necessary if the length of the arrays,

the value of n, is not a multiple of the vector length.

The assembly in worker.s will tell us what exactly is happening.

So for line three,

this is the beginning of the loop,

testing the loop exit condition and setting up, and here

is the operation: mov is a vector move instruction.

In this case, it acts as a store instruction, and it uses the XMM registers,

xmm0 specifically, to store data into memory.

"ps" stands for packed single, so this is a packed vector of single precision numbers.

Well, how long is this vector?

Because this is an XMM register,

we know that the length of the vector is 128 bits.

These are legacy SSE instructions.

If we wanted to compile the code for Xeon Phi,

I would have to include one more compiler argument,

-xMIC-AVX512. With this command,

the optimization report still says that the loop is vectorized; in fact,

the remainder loop is now vectorized, and

the assembly will show that we are using ZMM registers.

So this is a vector store operation.

It's the unaligned version, for a vector of

packed double precision numbers, and ZMM registers are 512 bits long.

These are AVX-512 instructions,

the instruction set included with second-generation Intel Xeon Phi processors.
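With the same file-name assumptions as before, the Xeon Phi build would simply gain the extra flag:

```shell
icpc -S -qopt-report -xMIC-AVX512 worker.cc
```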

Okay. So, this was a trivial loop;

now let's set up a loop that does something more interesting.

So first, a plus b; let's see how that vectorizes.

That shouldn't be a problem at all.

Indeed, the compiler report says that we are vectorized, and the assembly for

line six is using

vector loads and vector stores, and this is the vector addition operation,

vector add packed double,

operating on ZMM registers. From the size of the registers and from the data type,

we can conclude that we are adding eight double precision numbers at a time.
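A minimal sketch of that addition loop (the array names and the use of std::vector are assumptions):

```cpp
#include <cstddef>
#include <vector>

// Element-wise addition; with AVX-512 the compiler can turn the body
// into vaddpd on ZMM registers, adding eight doubles per iteration.
void add_arrays(std::vector<double>& A, const std::vector<double>& B) {
    for (std::size_t i = 0; i < A.size(); i++)
        A[i] = A[i] + B[i];
}
```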

Well, how about something less trivial?

For example, here I will redefine B of i:

I will multiply it by

the double representation of the index i. Will that vectorize?

Well, we're running the compilation command.

The optimization report says it's

vectorized, and the assembly listing should also show that,

but this time we are going to see vector broadcasts and vector conversions.

This is a conversion from a vector of

integers to a vector of packed double precision numbers, and it goes from

a YMM register, which is 256 bits long, to a ZMM register.

Why is it YMM?

Well, that's because our integers are 32-bit integers,

and eight of them will only fill 256 bits.

Again, a type conversion,

and take a look at this,

this is very interesting:

this is a fused multiply-add operation on double precision numbers.

So what's happening here is that, instead of

doing the multiplication and the addition separately,

the compiler recognized that we have

a more complex pattern, where multiplications are followed by additions.

So, to compute A of i,

the compiler used multiplication and addition as a single operation.
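The loop body being described might look roughly like this (the exact form is an assumption based on the surrounding discussion):

```cpp
#include <cstddef>
#include <vector>

// Multiply B[i] by the index converted to double, then add: the
// compiler can fuse the two operations into one vfmadd per vector.
void fma_loop(std::vector<double>& A, const std::vector<double>& B) {
    for (std::size_t i = 0; i < A.size(); i++)
        A[i] = A[i] + B[i] * static_cast<double>(i);
}
```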

Okay. So, how about a more complex pattern of vectorization?

Maybe we are going to introduce a stride: for example,

if we go in i from zero to

n divided by two

and multiply the index by two on the right-hand side, will this vectorize?

So I'm recompiling. The optimization report says that the loop was vectorized,

but in a minute I will show you how to detect a red flag for this vectorization.

The assembly for line six,

as you can see, now has some new instructions:

this is a vector permutation,

followed by a vector addition.

So this is still processed with vector instructions,

but if we compile the code with

a more verbose optimization report, by setting -qopt-report equal to five,

we will find that the loop in line five is vectorized.

But there is a remark about a non-unit stride load of variable b in line six.

These non-unit stride loads and stores

are usually less efficient in terms of performance than unit stride loads.

So, we will talk later about tuning data containers

and loop patterns in such a way that you avoid strided access and maintain unit stride.
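The strided loop under discussion might be sketched like this (names and the exact indexing are assumptions):

```cpp
#include <cstddef>
#include <vector>

// Stride-2 read of B: the loop still vectorizes, but the report
// flags the non-unit stride load as a potential performance problem.
void strided_loop(std::vector<double>& A, const std::vector<double>& B) {
    for (std::size_t i = 0; i < A.size() / 2; i++)
        A[i] = A[i] + B[2 * i];
}
```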

Something else that I wanted to demonstrate is branches inside loops.

So: if i modulo three is zero.

That is, only for every third value of

i do I want to do this computation. Will that vectorize?

Now let's see: recompiling,

re-reading the vectorization report, and for line five,

we find that the loop is vectorized.

There are only unit stride loads here,

masked aligned unit stride loads, and this masking indicates

that not all results of the calculation are stored, but only some of them.

Well, how would you compute this with a vector?

What the compiler usually does is lump together eight values of i.

It performs the addition operation for all values of

i, and then separately it computes a mask,

and using this mask, only some of the results are stored.

So, we are going to do

three times the amount of arithmetic that the scalar loop would do.

But this is okay, because we get a speedup by a factor of eight due to vectorization.
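The masked loop might be sketched like this (names assumed):

```cpp
#include <cstddef>
#include <vector>

// The branch becomes a vector mask: all lanes compute the sum,
// but only the lanes where i % 3 == 0 are actually stored.
void masked_loop(std::vector<double>& A, const std::vector<double>& B) {
    for (std::size_t i = 0; i < A.size(); i++)
        if (i % 3 == 0)
            A[i] = A[i] + B[i];
}
```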

At some point, if the branches are not taken very often,

you should think about prohibiting vectorization,

because with vectorization you might have to

do a lot of mathematics that you discard.

To prohibit vectorization, you can use pragma novector,

and it might actually result

in a performance increase if you take these branches very rarely.
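A hedged sketch of using the pragma on a rarely taken branch (#pragma novector is Intel-compiler specific; other compilers simply ignore the unknown pragma, and the branch frequency here is an invented example):

```cpp
#include <cstddef>
#include <vector>

// Prohibit vectorization of a rarely taken branch: in the vectorized
// version almost all lanes would compute results that get masked away.
void rare_branch(std::vector<double>& A, const std::vector<double>& B) {
#pragma novector
    for (std::size_t i = 0; i < A.size(); i++)
        if (i % 1000 == 0)
            A[i] = A[i] + B[i];
}
```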

Finally, let me demonstrate calling functions from vector loops.

If we include the header for the standard math library,

I will be able to do things like the exponential of b.

Let's see how that works: recompiling...

the loop is hopefully still vectorized. Yes,

looks like it is. In the assembly,

let's see what we find for line eight.

This is a vector load, and this is a call to

a short vector math library function for exp

with a width of 8. SVML stands

for Short Vector Math Library, and this is the library that

connects the transcendental functions in your code to the underlying architecture.

These functions can be used with vector inputs, and this is what happened here.
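The loop might look like this (names assumed); with the Intel compiler, the std::exp call inside is lowered to a batched SVML call processing eight doubles at a time:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Calling a transcendental function from a loop: the Intel compiler
// replaces the per-element call with a short-vector SVML call.
void exp_loop(std::vector<double>& A, const std::vector<double>& B) {
    for (std::size_t i = 0; i < A.size(); i++)
        A[i] = std::exp(B[i]);
}
```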

Sometimes these functions will use analytical approximations;

sometimes they will use built-in instructions in your instruction set.

It depends on the accuracy that your code is required to

produce, and it also depends on the nature of the function. For example,

if here I use one over the square root,

then instead of a short vector math library function,

we will see an assembly instruction:

the AVX-512ER instruction for the reciprocal square root.

Recompile, check the assembly, and for line eight we have

this vector reciprocal square root,

with 28 bits of precision, on packed double precision numbers.
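A sketch of the reciprocal-square-root loop (names assumed); on AVX-512ER hardware the Intel compiler can emit the vrsqrt28pd instruction for this pattern instead of a library call:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// 1/sqrt(x) per element: a pattern the compiler can map directly
// onto the AVX-512ER reciprocal-square-root instruction.
void rsqrt_loop(std::vector<double>& A, const std::vector<double>& B) {
    for (std::size_t i = 0; i < A.size(); i++)
        A[i] = 1.0 / std::sqrt(B[i]);
}
```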

Now let's see what is not going to vectorize.

Back to a plus b.

Let's do a harmless-looking change:

a of i was equal to a of i plus b of i,

but to make it more fun, let's make it a of i minus one plus b of i.

This is actually the kernel of

the Fibonacci number calculation, and this kernel

says that I have to know a of i minus one before I can compute a of i,

so let's see how this compiles.

When I compile, I will see

that the loop in line seven is not vectorized because of a vector dependence.

The dependence is between a of i and a of i minus one, and if I look in

the assembly, then for line eight I will find scalar instructions.

And this is a scalar addition. It actually

has the prefix v, because it is going to use the vector processing units,

but the addition is on a single double precision number.

So, it is not possible to vectorize this, because it has a vector dependence,

and the compiler recognizes it,

and produces code that will give you the correct result.
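The dependent kernel might be sketched like this (names assumed):

```cpp
#include <cstddef>
#include <vector>

// True loop-carried dependence: A[i] needs the A[i-1] written by the
// previous iteration, so the compiler refuses to vectorize this loop
// and emits scalar code instead.
void dependent_loop(std::vector<double>& A, const std::vector<double>& B) {
    for (std::size_t i = 1; i < A.size(); i++)
        A[i] = A[i - 1] + B[i];
}
```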

Now, the final step, and this I will allow you to study at home:

I will change minus one to plus one,

recompile, check the optimization report,

and for the loop in line seven,

we find that the loop is vectorized.

So, here's a challenge for you.

If I have minus one,

it does not vectorize, because it's unsafe.

If I have plus one, it does vectorize.

So at home, think about it:

is it safe to vectorize this kernel?

Work out what it looks like with vectors, and see if this compiler decision is justified.
