
Today’s main stream CPU’s have SIMD (Single Instruction Multiple Data) compute capabilities. Intel has SSE, with AMD its 3DNow and ARM has NEON.
However, it seems, from my very narrow point of view, that these instruction set extensions do not nearly get the attention that they deserve in terms of opportunities to accelerate computational code. This post is about exploring and demonstrating the use of these capabilities using Intel’s SSE instruction set.
As a learning exercise, we chose to use a simple pixel alpha composition kernel. The nice thing about this kernel is that it is very simple on one hand, and has visual representation on the other. Our test function takes a RGBA ’source’ and a RGB ‘destination’ images and blends the source image into the destination image. It uses the following expression to calculate the composed pixel color:
Pdst = (Psrc * αsrc) + Pdst * (1 – αsrc)
Where:
Psrc – Pixel color of the source pixel
αsrc – Alpha value of the source pixel
Pdst – Pixel color of the destination pixel
Data layout:
The first step with designing SIMD code, is to get your data arranged in a way that is SIMD friendly. In our case this is fairly straight forward – we represent our images one channel after the other. An RGBA image looks like:

Figure 1 Image data layout
The component data element, in our case, is unsigned 8 bit.
Baseline Implementation:
As a base line, we used a simple composition loop, leaving the compiler to do it’s tricks. The code (for a single line) looks like:
for (unsigned int x = X0; x < X1; x++) {
short diff;
short tmp;
diff = *pSrc[0] - *pDst[0];
tmp = short(*pSrc[3] * diff) >> 8;
*pDst[0] = tmp + *pDst[0];
diff = *pSrc[1] - *pDst[1];
tmp = short(*pSrc[3] * diff) >> 8;
*pDst[1] = tmp + *pDst[1];
diff = *pSrc[2] - *pDst[2];
tmp = short(*pSrc[3] * diff) >> 8;
*pDst[2] = tmp + *pDst[2];
pSrc[0] += 1;
pSrc[1] += 1;
pSrc[2] += 1;
pSrc[3] += 1;
pDst[0] += 1;
pDst[1] += 1;
pDst[2] += 1;
}
The observant reader would notice that this code actually implements an alternative variant of the blending equation presented above. This modified form is:
Pdst = αsrc * (Psrc - Pdst) + Pdst
The advantage of this form is that it uses a single multiply operation. However, I never really verified that this version is more efficient than the original form.
Running this function on my laptop which has:
Intel Core 2 Duo SP9300 @ 2.26 Mhz, 2 * 32KB L1, 6 MB L2
The compiler used is Microsoft Visual Studio 2008
We get the following results:
| Test | Background Width | Background Height | Overlay Width | Overlay Height | pixRate (Mpix/sec) | pixel_time (nSec) | Line overhead |
| Serial | 2000 | 1200 | 48 | 48 | 37.67 | 26.546 | |
| Serial | 2000 | 1200 | 192 | 48 | 40.405 | 24.749 | 115.01 |
| Serial | 2000 | 1200 | 256 | 48 | 41.821 | 23.912 | 155.61 |
| Serial | 2000 | 1200 | 384 | 48 | 42.401 | 23.584 | 162.49 |
| Serial | 2000 | 1200 | 640 | 48 | 42.571 | 23.49 | 158.58 |
The column that is interesting is the pixel time, which is the pixel processing time in nanoseconds.
The ‘line overhead’ is calculated using the following expression:
L = W1 * W2 * (R1 – R2) / (W2 – W1)
Where:
L – Line overhead time
Wn – line width in test case n
Rn – pixel_time for test case n
It is left for the interested (and bored) reader to figure out why this expression represents the per-line calculation time.
SSE Implementation using Microsoft’s Visual Studio 2008 SSE Intrinsics
The next step in our journey was to implement our blending kernel using Visual Studio SSE intrinsics. In this case, the blending function processes 16 pixels per iteration using the processor’s 128 bit vector processing capabilities. The result code is:
for (unsigned x = X0; x < X1; x += 16) {
register __m128i s0, s1, d0, d1, a0, a1, r0, r1, zero;
register __m128i diff0, tmp0, diff1, tmp1, t;
zero = _mm_setzero_si128();
// load alpha
t = _mm_loadu_si128((__m128i *) pSrc[3]);
a0 = _mm_unpacklo_epi8(t, zero);
a1 = _mm_unpackhi_epi8(t, zero);
EMMX_BLEND(0);
EMMX_BLEND(1);
EMMX_BLEND(2);
pSrc[0] += 16;
pSrc[1] += 16;
pSrc[2] += 16;
pSrc[3] += 16;
pDst[0] += 16;
pDst[1] += 16;
pDst[2] += 16;
}
Where EMMX_BLEND is a macro:
#define EMMX_BLEND(comp) t = _mm_loadu_si128((__m128i *) pDst[(comp)]); d0 = _mm_unpacklo_epi8(t, zero); d1 = _mm_unpackhi_epi8(t, zero); t = _mm_loadu_si128((__m128i *) pSrc[(comp)]); s0 = _mm_unpacklo_epi8(t, zero); s1 = _mm_unpackhi_epi8(t, zero); /* A * S */ tmp0 = _mm_mullo_epi16(s0, a0); tmp1 = _mm_mullo_epi16(s1, a1); /* 255 - A */ diff0 = _mm_sub_epi16(ff, a0); diff1 = _mm_sub_epi16(ff, a1); /* (255 - A) * D */ diff0 = _mm_mullo_epi16(diff0, d0); diff1 = _mm_mullo_epi16(diff1, d1); /* r = A * S + (255 - A) * D */ r0 = _mm_add_epi16(tmp0, diff0); r1 = _mm_add_epi16(tmp1, diff1); /* shift; */ r0 = _mm_srli_epi16(r0, 8); r1 = _mm_srli_epi16(r1, 8); /* and pack */ t = _mm_packus_epi16(r0, r1); _mm_storeu_si128((__m128i *) pDst[(comp)], t)
The results for this test case are:
| Test | Background Width | Background Height | Overlay Width | Overlay Height | pixRate (Mpix/sec) | pixel_time (nSec) | Line overhead | Speedup to base | Speedup to previous |
| SSE_INTR | 2000 | 1200 | 48 | 48 | 76.623 | 13.051 | 103.40% | 103.40% | |
| SSE_INTR | 2000 | 1200 | 192 | 48 | 95.879 | 10.43 | 167.74 | 137.29% | 137.29% |
| SSE_INTR | 2000 | 1200 | 256 | 48 | 94.682 | 10.562 | 147.04 | 126.40% | 126.40% |
| SSE_INTR | 2000 | 1200 | 384 | 48 | 100.628 | 9.938 | 170.77 | 137.31% | 137.31% |
| SSE_INTR | 2000 | 1200 | 640 | 48 | 102.811 | 9.727 | 172.49 | 141.49% | 141.49% |
As can be observed, the results are somewhat disappointing. A speed up of 160%-170% is not much, given that you have 16 times more execution units. A close look at the assembler code that is generated by Visual Studio reveals the culprit. It seems that the Visual studio compiler uses only 2 registers our of the 8 SSE registers that are available in the processor. After scratching my head for while, I thought that I should be more explicit in terms of how one calculation result should be re-used for the next calculation. The downside, of course, is that it requires writing the code in a totally unreadable form (or, also known as, write-only-code). The result EMMX_BLEND macro became:
#define EMMX_BLEND(comp)
t = _mm_loadu_si128((__m128i *) pDst[(comp)]);
d0 = _mm_unpacklo_epi8(t, zero);
d1 = _mm_unpackhi_epi8(t, zero);
t = _mm_loadu_si128((__m128i *) pSrc[(comp)]);
_mm_storeu_si128((__m128i *) pDst[(comp)],
_mm_packus_epi16(_mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(_mm_unpacklo_epi8(t, zero), a0),
_mm_mullo_epi16(_mm_sub_epi16(ff, a0), d0)), 8),
_mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(_mm_unpackhi_epi8(t, zero), a1),
_mm_mullo_epi16(_mm_sub_epi16(ff, a1), d1)), 8)))
And the results:
| Test | Background Width | Background Height | Overlay Width | Overlay Height | pixRate (Mpix/sec) | pixel_time (nSec) | Line overhead | Speedup to base | Speedup to previous |
| SSE_INTR 2 | 2000 | 1200 | 48 | 48 | 92.163 | 10.85 | 144.66% | 20.29% | |
| SSE_INTR 2 | 2000 | 1200 | 192 | 48 | 122.854 | 8.14 | 173.44 | 204.04% | 28.13% |
| SSE_INTR 2 | 2000 | 1200 | 256 | 48 | 128.381 | 7.789 | 180.83 | 207.00% | 35.60% |
| SSE_INTR 2 | 2000 | 1200 | 384 | 48 | 132.865 | 7.526 | 182.35 | 213.37% | 32.05% |
| SSE_INTR 2 | 2000 | 1200 | 640 | 48 | 136.784 | 7.311 | 183.65 | 221.30% | 33.05% |
An extra ~30% over the readable version which translates into a total of up to 200% speedup over the baseline. Not quite there yet!
So, if the brute force method doesn’t work, perhaps we should be even more brutal – time for assembler. At first, I was a little hesitant about writing the code in assembly, mainly when it comes to debugging the code using the Visual Studio debugger. However, as it turned out, it wasn’t that hard and perhaps even a lesson to bare in mind. Besides the ‘mechanical’ translation of the intrinsics into their assembly syntax, all I was left to deal with was the register allocation. And the end result code looks like this:
__asm {
// initalization & load Alpha
pxor xmm0, xmm0 // xmm0 <- 0
mov eax, dword ptr [pSrc + 12]
movdqu xmm1, [eax]; xmm1 <- *pSrc[3]
movdqa xmm2, xmm1;
punpcklbw xmm2, xmm0; // xmm2 <- a0, 16bit
movdqa xmm3, xmm1;
punpckhbw xmm3, xmm0; // xxm3 <- a1, 16bit
// blending the red;
mov eax, dword ptr[pDst + 0];
movdqu xmm1, [eax]; // xmm1 = pDst[0]
movdqa xmm6, xmm1;
punpcklbw xmm6, xmm0; // xmm6 <- pDst[0] low 16bit
movdqa xmm7, xmm1;
punpckhbw xmm7, xmm0; // xmm7 <- pDst[0] high, 16 bit
// load the ff constant
movdqu xmm4, [ffconst]; // xmm4 <- ff
movdqa xmm5, xmm4;
psubw xmm5, xmm2; // xmm5 = ff - a0
pmullw xmm6, xmm5; // xmm6 = (ff - a0) * d0;
// now for the upper bits
movdqa xmm5, xmm4;
psubw xmm5, xmm3; // xmm5 = ff - a1
pmullw xmm7, xmm5; // xmm7 = (ff - a1) * d1;
// load the source;
mov eax, dword ptr[pSrc + 0];
movdqu xmm1, [eax]; // xmm1 = pSrc[0]
// low bits of pSrc[0]
movdqa xmm5, xmm1;
punpcklbw xmm5, xmm0; // xmm5 = pSrc[0], low, 16 bit;
pmullw xmm5, xmm2; // xmm5 = s0 * a0;
paddw xmm6, xmm5; // xmm6 = s0 * a0 + (ff - a0) * d0;
// high bits of pSrc[0]
movdqa xmm5, xmm1;
punpckhbw xmm5, xmm0;
pmullw xmm5, xmm3; // xmm5 = s1 * a1
paddw xmm7, xmm5; // xmm7 = s1 * a1 + (ff - a1) * d1;
// shift the results;
psrlw xmm6, 8;
psrlw xmm7, 8;
// pack back
packuswb xmm6, xmm7; // xmm6 <- xmm6{}xmm7 low bits;
mov eax, dword ptr [pDst + 0];
movdqu [eax], xmm6; // done for this component;
// Similar code goes for the green and blue channles
// see the code for details
};
As for the results:
| Test | Background Width | Background Height | Overlay Width | Overlay Height | pixRate (Mpix/sec) | pixel_time (nSec) | Line overhead | Speedup to base | Speedup to previous |
| SSE_ASM | 2000 | 1200 | 48 | 48 | 174.591 | 5.728 | 363.44% | 89.42% | |
| SSE_ASM | 2000 | 1200 | 192 | 48 | 297.48 | 3.362 | 151.42 | 636.14% | 142.12% |
| SSE_ASM | 2000 | 1200 | 256 | 48 | 318.166 | 3.143 | 152.71 | 660.80% | 147.82% |
| SSE_ASM | 2000 | 1200 | 384 | 48 | 334.824 | 2.987 | 150.36 | 689.55% | 151.96% |
| SSE_ASM | 2000 | 1200 | 640 | 48 | 356.493 | 2.805 | 151.68 | 737.43% | 160.64% |
Finally it seems that we are getting close. The improvement over the base line is 7.3 X. This is not the desired 16X yet, but given the 4 evenings that I have put into this effort, this is probably at the point where I should call it version 1 and post something!
The Test program & the code
The test program (img_ops.exe) that used for this project is available *here*. It has two running options:
Benchmark mode – Is used to measure the performance with various image sizes.
Syntax: img_ops.exe benchmark
Bubble Tank demo – Demonstrate the performance difference between the serial and the vectorized code. In this mode the blending of multiple instances of the overlay image are blended on the background images.
Syntax: img_ops.exe bubble_tank <background image> <overlay image>
Disclaimers, next steps and conclusions
Ok, there are many of them so lets get started.
First, you might have noticed that the SIMD implementation assumes that source image line width is a multiply of 16 bit. Obviously this restriction is not realistic and should be relieved by adding wind up/down code to handle the head and the tail of a line.
Next, As can be observed from the results above, the ‘per-line’ overhead is quite considerable. With the fastest version, it seems that we add overhead of ~80 pixels per line. Obviously this should be reviewed for potential optimization. In general, both the serial and the SIMD codes have not been reviewed and there is probably big potential for shaving a few cycles off the inner loops.
In terms of next steps (once I get to finish up this one) is to test this code on other architectures. Especially interesting one is ARM/Neon. The reason is that while there is a growing market desire for richer user interfaces, there are many ARM based embedded systems that can’t afford to include a GPU due to area/costs considerations. Such implementations could use the SIMD extensions to deliver a rich and compelling user experience.
To conclude this phase of the experiment, there are few takeaways:
First, as expected, the SIMD instruction set indeed offers great optimization potential. We have presented 7.3 X performance boost, with strong feeling that there is plenty of room for more.
Second, SIMD code, especially written in assembly language and using Visual Studio, is not debugging friendly. So, you better have a solid, debugged serial or vectorized implementation before you start your assembly programming.