Experiments with Intel’s SSE SIMD instruction set

May 31st, 2010 by Jacky Romano

img_ops_snapshot

Today’s main stream CPU’s have SIMD (Single Instruction Multiple Data) compute capabilities. Intel has SSE, with AMD its 3DNow and ARM has NEON.

However, it seems, from my very narrow point of view, that these instruction set extensions do not nearly get the attention that they deserve in terms of opportunities to accelerate computational code. This post is about exploring and demonstrating the use of these capabilities using Intel’s SSE instruction set.

As a learning exercise, we chose to use a simple pixel alpha composition kernel. The nice thing about this kernel is that it is very simple on one hand, and has visual representation on the other. Our test function takes a RGBA ’source’ and a RGB ‘destination’ images and blends the source image into the destination image. It uses the following expression to calculate the composed pixel color:

Pdst = (Psrc * αsrc) + Pdst * (1 – αsrc)

Where:

Psrc – Pixel color of the source pixel

αsrc – Alpha value of the source pixel

Pdst – Pixel color of the destination pixel

Data layout:

The first step with designing SIMD code, is to get your data arranged in a way that is SIMD friendly. In our case this is fairly straight forward – we represent our images one channel after the other.  An RGBA image looks like:

Figure 1 Image data layout

Figure 1 Image data layout

The component data element, in our case, is unsigned 8 bit.

Baseline Implementation:

As a base line, we used a simple composition loop, leaving the compiler to do it’s tricks. The code (for a single line) looks like:

for (unsigned int x = X0; x < X1; x++) {
	short diff;
	short tmp;

	diff = *pSrc[0] - *pDst[0];
	tmp = short(*pSrc[3] * diff) >> 8;
	*pDst[0] = tmp + *pDst[0];

	diff = *pSrc[1] - *pDst[1];
	tmp = short(*pSrc[3] * diff) >> 8;
	*pDst[1] = tmp + *pDst[1];

	diff = *pSrc[2] - *pDst[2];
	tmp = short(*pSrc[3] * diff) >> 8;
	*pDst[2] = tmp + *pDst[2];

	pSrc[0] += 1;
	pSrc[1] += 1;
	pSrc[2] += 1;
	pSrc[3] += 1;

	pDst[0] += 1;
	pDst[1] += 1;
	pDst[2] += 1;
}

The observant reader would notice that this code actually implements an alternative variant of the blending equation presented above. This modified form is:

Pdst = αsrc *  (Psrc - Pdst) + Pdst

The advantage of this form is that it uses a single multiply operation. However, I never really verified that this version is more efficient than the original form.

Running this function on my laptop which has:
Intel Core 2 Duo SP9300 @ 2.26 Mhz, 2 * 32KB L1, 6 MB L2

The compiler used is Microsoft Visual Studio 2008

We get the following results:

Test Background Width Background Height Overlay Width Overlay Height pixRate (Mpix/sec) pixel_time (nSec) Line overhead
Serial 2000 1200 48 48 37.67 26.546
Serial 2000 1200 192 48 40.405 24.749 115.01
Serial 2000 1200 256 48 41.821 23.912 155.61
Serial 2000 1200 384 48 42.401 23.584 162.49
Serial 2000 1200 640 48 42.571 23.49 158.58

The column that is interesting is the pixel time, which is the pixel processing time in nanoseconds.

The ‘line overhead’ is calculated using the following expression:

L = W1 * W2 * (R1 – R2) / (W2 – W1)

Where:

L – Line overhead time

Wn – line width in test case n

Rn – pixel_time for test case n

It is left for the interested (and bored) reader to figure out why this expression represents the per-line calculation time.

SSE Implementation using Microsoft’s Visual Studio 2008 SSE Intrinsics

The next step in our journey was to implement our blending kernel  using Visual Studio SSE intrinsics. In this case, the blending function processes 16 pixels per iteration using the processor’s 128 bit vector processing capabilities. The result code is:

for (unsigned x = X0; x < X1; x += 16) {
	register __m128i s0, s1, d0, d1, a0, a1, r0, r1, zero;
	register __m128i diff0, tmp0, diff1, tmp1, t;
	zero = _mm_setzero_si128();
	// load alpha
	t = _mm_loadu_si128((__m128i *) pSrc[3]);
	a0 = _mm_unpacklo_epi8(t, zero);
	a1 = _mm_unpackhi_epi8(t, zero);

	EMMX_BLEND(0);
	EMMX_BLEND(1);
	EMMX_BLEND(2);

	pSrc[0] += 16;
	pSrc[1] += 16;
	pSrc[2] += 16;
	pSrc[3] += 16;

	pDst[0] += 16;
	pDst[1] += 16;
	pDst[2] += 16;
}

Where EMMX_BLEND is a macro:

#define EMMX_BLEND(comp)
	t = _mm_loadu_si128((__m128i *) pDst[(comp)]);
	d0 = _mm_unpacklo_epi8(t, zero);
	d1 = _mm_unpackhi_epi8(t, zero);
	t = _mm_loadu_si128((__m128i *) pSrc[(comp)]);
	s0 = _mm_unpacklo_epi8(t, zero);
	s1 = _mm_unpackhi_epi8(t, zero);
	/* A * S */
	tmp0 = _mm_mullo_epi16(s0, a0);
	tmp1 = _mm_mullo_epi16(s1, a1);
	/* 255 - A    */
	diff0 = _mm_sub_epi16(ff, a0);
	diff1 = _mm_sub_epi16(ff, a1);
	/* (255 - A) * D */
	diff0 = _mm_mullo_epi16(diff0, d0);
	diff1 = _mm_mullo_epi16(diff1, d1);
	/* r = A * S + (255 - A) * D */
	r0 = _mm_add_epi16(tmp0, diff0);
	r1 = _mm_add_epi16(tmp1, diff1);
	/* shift; */
	r0 = _mm_srli_epi16(r0, 8);
	r1 = _mm_srli_epi16(r1, 8);
	/* and pack */
	t = _mm_packus_epi16(r0, r1);
	_mm_storeu_si128((__m128i *) pDst[(comp)], t)

The results for this test case are:

Test Background Width Background Height Overlay Width Overlay Height pixRate (Mpix/sec) pixel_time (nSec) Line overhead Speedup to base Speedup to previous
SSE_INTR 2000 1200 48 48 76.623 13.051 103.40% 103.40%
SSE_INTR 2000 1200 192 48 95.879 10.43 167.74 137.29% 137.29%
SSE_INTR 2000 1200 256 48 94.682 10.562 147.04 126.40% 126.40%
SSE_INTR 2000 1200 384 48 100.628 9.938 170.77 137.31% 137.31%
SSE_INTR 2000 1200 640 48 102.811 9.727 172.49 141.49% 141.49%

As can be observed, the results are somewhat disappointing. A speed up of 160%-170% is not much, given that you have 16 times more execution units. A close look at the assembler code that is generated by Visual Studio reveals the culprit. It seems that the Visual studio compiler uses only 2 registers our of the 8 SSE registers that are available in the processor. After scratching my head for while, I thought that I should be more explicit in terms of how one calculation result should be re-used for the next calculation. The downside, of course, is that it requires writing the code in a totally unreadable form (or, also known as, write-only-code). The result EMMX_BLEND macro became:

#define EMMX_BLEND(comp)
	t = _mm_loadu_si128((__m128i *) pDst[(comp)]);
	d0 = _mm_unpacklo_epi8(t, zero);
	d1 = _mm_unpackhi_epi8(t, zero);
	t = _mm_loadu_si128((__m128i *) pSrc[(comp)]);
	_mm_storeu_si128((__m128i *) pDst[(comp)],
            _mm_packus_epi16(_mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(_mm_unpacklo_epi8(t, zero), a0),
            _mm_mullo_epi16(_mm_sub_epi16(ff, a0), d0)), 8),
            _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(_mm_unpackhi_epi8(t, zero), a1),
            _mm_mullo_epi16(_mm_sub_epi16(ff, a1), d1)), 8)))

And the results:

Test Background Width Background Height Overlay Width Overlay Height pixRate (Mpix/sec) pixel_time (nSec) Line overhead Speedup to base Speedup to previous
SSE_INTR 2 2000 1200 48 48 92.163 10.85 144.66% 20.29%
SSE_INTR 2 2000 1200 192 48 122.854 8.14 173.44 204.04% 28.13%
SSE_INTR 2 2000 1200 256 48 128.381 7.789 180.83 207.00% 35.60%
SSE_INTR 2 2000 1200 384 48 132.865 7.526 182.35 213.37% 32.05%
SSE_INTR 2 2000 1200 640 48 136.784 7.311 183.65 221.30% 33.05%

An extra ~30% over the readable version which translates into a total of up to 200% speedup over the baseline. Not quite there yet!

So, if the brute force method doesn’t work, perhaps we should be even more brutal – time for assembler. At first, I was a little hesitant about writing the code in assembly, mainly when it comes to debugging the code using the Visual Studio debugger. However, as it turned out, it wasn’t that hard and perhaps even a lesson to bare in mind. Besides the ‘mechanical’ translation of the intrinsics into their assembly syntax, all I was left to deal with was the register allocation. And the end result code looks like this:

__asm {
	// initalization & load Alpha
	pxor xmm0, xmm0 // xmm0 <- 0
		mov eax, dword ptr [pSrc + 12]
		movdqu xmm1, [eax]; xmm1 <- *pSrc[3]
		movdqa xmm2, xmm1;
	punpcklbw xmm2, xmm0; // xmm2 <- a0, 16bit
	movdqa xmm3, xmm1;
	punpckhbw xmm3, xmm0; // xxm3 <- a1, 16bit

	// blending the red;
	mov eax, dword ptr[pDst + 0];
	movdqu xmm1, [eax]; // xmm1 = pDst[0]
	movdqa xmm6, xmm1;
	punpcklbw xmm6, xmm0; // xmm6 <- pDst[0] low 16bit
	movdqa xmm7, xmm1;
	punpckhbw xmm7, xmm0; // xmm7 <- pDst[0] high, 16 bit
	// load the ff constant
	movdqu xmm4, [ffconst]; // xmm4 <- ff
	movdqa xmm5, xmm4;
	psubw  xmm5, xmm2; // xmm5 = ff - a0
	pmullw xmm6, xmm5; // xmm6 = (ff - a0) * d0;
	// now for the upper bits
	movdqa xmm5, xmm4;
	psubw  xmm5, xmm3; // xmm5 = ff - a1
	pmullw xmm7, xmm5; // xmm7 = (ff - a1) * d1;
	// load the source;
	mov eax, dword ptr[pSrc + 0];
	movdqu xmm1, [eax]; // xmm1 = pSrc[0]
	// low bits of pSrc[0]
	movdqa xmm5, xmm1;
	punpcklbw xmm5, xmm0; // xmm5 = pSrc[0], low, 16 bit;
	pmullw xmm5, xmm2; // xmm5 = s0 * a0;
	paddw xmm6, xmm5; // xmm6 = s0 * a0 + (ff - a0) * d0;
	// high bits of pSrc[0]
	movdqa xmm5, xmm1;
	punpckhbw xmm5, xmm0;
	pmullw xmm5, xmm3; // xmm5 = s1 * a1
	paddw xmm7, xmm5; // xmm7 = s1 * a1 + (ff - a1) * d1;
	// shift the results;
	psrlw xmm6, 8;
	psrlw xmm7, 8;
	// pack back
	packuswb xmm6, xmm7; // xmm6 <- xmm6{}xmm7 low bits;
	mov eax, dword ptr [pDst + 0];
	movdqu [eax], xmm6; // done for this component;

	// Similar code goes for the green and blue channles
	// see the code for details

};

As for the results:

Test Background Width Background Height Overlay Width Overlay Height pixRate (Mpix/sec) pixel_time (nSec) Line overhead Speedup to base Speedup to previous
SSE_ASM 2000 1200 48 48 174.591 5.728 363.44% 89.42%
SSE_ASM 2000 1200 192 48 297.48 3.362 151.42 636.14% 142.12%
SSE_ASM 2000 1200 256 48 318.166 3.143 152.71 660.80% 147.82%
SSE_ASM 2000 1200 384 48 334.824 2.987 150.36 689.55% 151.96%
SSE_ASM 2000 1200 640 48 356.493 2.805 151.68 737.43% 160.64%

Finally it seems that we are getting close. The improvement over the base line is 7.3 X. This is not the desired 16X yet, but given the 4 evenings that I have put into this effort, this is probably at the point where I should call it version 1 and post something!

The Test program & the code

The test program (img_ops.exe) that used for this project is available *here*. It has two running options:

Benchmark mode – Is used to measure the performance with various image sizes.

Syntax: img_ops.exe benchmark

Bubble Tank demo – Demonstrate the performance difference between the serial and the vectorized code. In this mode the blending of multiple instances of the overlay image are blended on the background images.

Syntax: img_ops.exe bubble_tank <background image> <overlay image>

Disclaimers, next steps and conclusions

Ok, there are many of them so lets get started.

First, you might have noticed that the SIMD implementation assumes that source image line width is a multiply of 16 bit. Obviously this restriction is not realistic and should be relieved by adding wind up/down code to handle the head and the tail of a line.

Next, As can be observed from the results above, the ‘per-line’ overhead is quite considerable. With the fastest version, it seems that we add overhead of ~80 pixels per line. Obviously this should be reviewed for potential optimization. In general, both the serial and the SIMD codes have not been reviewed and there is probably big potential for shaving a few cycles off the inner loops.

In terms of next steps (once I get to finish up this one) is to test this code on other architectures. Especially interesting one is ARM/Neon. The reason is that while there is a growing market desire for richer user interfaces, there are many ARM based embedded systems that can’t afford to include a GPU due to area/costs considerations. Such implementations could use the SIMD extensions to deliver a rich and compelling user experience.

To conclude this phase of the experiment, there are few takeaways:

First, as expected, the SIMD instruction set indeed offers great optimization potential. We have presented 7.3 X performance boost, with strong feeling that there is plenty of room for more.

Second, SIMD code, especially written in assembly language and using Visual Studio, is not debugging friendly. So, you better have a solid, debugged serial or vectorized implementation before you start your assembly programming.

Leave a Reply