Smoothie – Smooth Animation Engine

June 10th, 2010 by Adi Grossman

What is Smoothie?

Smoothie is a C++ implementation of a physical simulation engine for GUI animations. It calculates the current (per-frame) position of a UI element from its final position.

What is Smoothie good for (features)?

  • Animations are smooth: velocity does not change abruptly, so motion feels natural and is easy for the eye to track.
  • Animations are parametric yet simple: only a final position is needed to start an animation.
  • The final position may be changed during an ongoing animation.
  • Animation time can be limited: a distant final position doesn’t mean a long animation.
  • Animations can be started by a drag & release action, and the resulting animation may be constrained.
  • Low CPU usage, works on mobile devices.
  • OS independent.
  • Open source (MIT license)

Usage

Setup

The following code will be used in all the examples.

#include <stdio.h>   // printf, used by show() below
#include "Smoothie.h"
// this function simulates 25 fps
unsigned int simulator25fps(void)
{
    static unsigned int ticks = 0;
    ticks += 40;
    return ticks;
}

// assign an implementation
Smoothie::getTimeMsecFuncPtr Smoothie::getTimeMsec = simulator25fps;
// print x space marks followed by an asterisk
void show(int x)
{
    while(x-- > 0)   // "> 0" keeps a negative position from looping
        printf(" ");
    printf("*\n");
}

In the following examples I will use a smoothie as defined below:

// create a smoothie positioned at 0 that traverses a distance of 180 in 2000ms (=2 seconds)
Smoothie smoothie(0, 180, 2000);

Requesting an animation over a distance of less than 180 will take less than 2000ms.

Requesting an animation over a distance greater than 180 will cause the smoothie to skip the section with the highest speed, so that the total time of the animation is still 2000ms.

Example 1: Going from 0 to 120

// start animation
smoothie.setFinalPosition(120);

unsigned int timeTillIdle;
do
{
    // calculate current position
    int x = smoothie.calcCurrentPosition(&timeTillIdle);
    show(x); // show on the screen the value of x
} while(timeTillIdle); // stop when smoothie is idle: timeTillIdle==0

Example 2: Going from 0 to 120, aborting at 60

// start animation
smoothie.setFinalPosition(120);

unsigned int timeTillIdle;
do
{
    // calculate current position
    int x = smoothie.calcCurrentPosition(&timeTillIdle);
    // request to go back to 0 if we reached 60

    if(x >= 60 && smoothie.getFinalPosition() == 120)
        smoothie.setFinalPosition(0);

    show(x); // show on the screen the value of x

} while(timeTillIdle); // stop when smoothie is idle: timeTillIdle==0

Example 3: “glide” animation after a drag-release action

// drag was released at position=0 with velocity=500.
// final position is in the range of [50..120], bounce size is 20
smoothie.setCurrentConstraint(0, 500, 50, 120, 20);

unsigned int timeTillIdle;
do
{
    // calculate current position
    int x = smoothie.calcCurrentPosition(&timeTillIdle);
    show(x); // show on the screen the value of x
} while(timeTillIdle); // stop when smoothie is idle: timeTillIdle==0

Download Smoothie Source

GTW – Rich User Interface library for Web Applications

June 1st, 2010 by Jacky Romano

This post is about GTW – A new User interface toolkit that is being developed by the Graphtech Labs team.

What is GTW?

One of the more exciting (for us in Graphtech, at least) features of the upcoming HTML5 standard is its hardware accelerated 3D graphics capabilities exposed via the WebGL API. GTW uses this API to deliver a user interface library that enables web developers to build web applications with modern and compelling user interface widgets.

In this post, we will go through an early-bird GTW demonstration that implements a cover-flow style interface to Google Images Search. If you have a WebGL enabled browser – check it out here.

Using GTW

Hopefully we got your attention to see what’s going on under the hood, so here it is. We will demonstrate the usage of GTW using the Google Images Cover Flow above.

This page includes several components:

  • A Text Input Area
  • A Search button
  • An Image View area

Now for the actual steps:

The first step in your HTML file is to define your canvases and setup your initialization function:

<BODY onload="webGLStart();">
  <CANVAS id="ilist-canvas" style="border: none;" width="1024" height="700"></CANVAS>
  <div style="padding: 5px;position:absolute; left: 200px; top: 143px; background-color: #000; opacity: 0.9;height: 20px; width: 400px; border-radius: 5px; border: 1px solid #fff;">
  <p style="margin:0;padding:0;padding-bottom: 5px;">

  <input type="text" name="myarea" id="txtInput" value="webgl"
  style="border:1px solid #888;width: 100%"  />

Note that this simple page definition includes the following:

  • ‘onload’ callback – webGLStart() in our case.
  • ‘ilist-canvas’ – our main drawing area
  • ‘txtInput’ – a text input that is used by the image viewer to build its URL

Next, you need to include a few GTW scripts, like so:

 <SCRIPT type="text/javascript" src="./gtw.js"></SCRIPT>
 <SCRIPT type="text/javascript" src="./demo.js"></SCRIPT>

There are two of these scripts:

  • gtw.js – The GTW toolkit core
  • demo.js – Classes that are specific for the demo application

Next, we need to initialize GTW. This initialization includes binding the canvas to GTW’s stage and loading a scene file (SF). This is the code that does the trick:

  <SCRIPT type="text/javascript">
    function webGLStart()
    {
        var canvas = document.getElementById("ilist-canvas");
        var stage = gtwStage.prototype.getStage();
        stage.init(canvas);
        stage.loadSceneFile("demo_imageList.json");
    }
  </SCRIPT>

The complete HTML file is available here.

The files of this demonstration will be published publicly soon. However, if you would like to have the source code right now, please drop us a line.

Finally, we get to the ‘creative’ part: defining your scene file. The GTW scene file is a JSON file that defines the hierarchy of UI elements. Each element in the hierarchy has, at a minimum, a “type” property that identifies the element, plus a set of type-specific elements and their properties. No worries, examples are forthcoming.

The root of our Scene is a ‘gtwGroup’ node. It looks like this:

{
    "type" : "gtwGroup",
    "children" : [ .... ]
}

Then, we get to the group’s children. The first child that we are going to add is the search button. Its definition looks like this:

{
        "id" : "btnSearch",
        "type" : "gtwButton",
        "position" : [0.9, 0.85, 0.0],
        "size" : [0.2, 0.08, 0.0],
        "text" : "Search",
        "control" : "btnSearch",
        "text_color" : "#000000",
        "font_size" : 30,
        "texture" : "button_rec.png",
        "texture_pressed" : "button_rec_pressed.png"
},

The button is of type ‘gtwButton’ (you are picking up on the naming convention, right?) and it takes a set of properties that define its look. The same goes for the left and right pan buttons. We will not list them here, but you are welcome to check the JSON file for details.

Now, here is where things get interesting and WebGL goes into action. The next element is a ‘gtwMirror’. The gtwMirror is a group that renders its children (just like gtwGroup) and, in addition, renders a mirror image of them. In our cover flow, it creates the reflection on the floor. Here is its definition:

{
    "type" : "gtwMirror",
    "position" : [0, 0, 0],
    "box"      : [0, 0, 0, 1.464,1 ],
    "reactive" : false,
    "children" : [ .... ]
}

Note the children property of gtwMirror, which are the children that we want to be reflected in the mirror. In our case, the children include the scene background, the stylish, wavy background decoration and the image viewer itself. All are defined as:

    {
        "type" : "gtwTexture",
        "position" : [-0.42, 0.0, -0.5],
        "size" : [2.32, 1.5, 0.0],
        "filename" : "dark-gradient-background.jpg"
    },
    {
	"type" : "gtwTexture",
        "position" : [1.263, 0.93, 0.0],
        "size" : [0.2, 0.08, 0.0],
	"filename" : "gt_logo.jpg"
    },
    {
        "type" : "gtwMovingTexture",
        "position" : [-0.42, 0.0, -0.5],
        "size" : [2.32, 1.5, 0.0],
        "filename" : "wave.png"
    },
    {
        "id" : "ivGoogle",
        "type" : "gtwImageView",
        "position" : [0, 0, 0],
        "size" : [1.464, 0.5, 0.0],
        "filter" : [".jpg",".JPG",".png"],
        "url" : "http://127.0.0.1/call/google.images/images?q=",
        "control" : "txtInput",
        "button_requery" : "btnSearch",
        "button_left" : "btnArrowLeft",
        "button_right" : "btnArrowRight"
    }

The complete scene file (SF) is available here.

What’s Next

Well, obviously the work on GTW is in its early stages and there are tons of things left to do. Over the next weeks (and months) we will be working on:

  • Adding controls such as menus, dialog boxes, scroll lists etc
  • Cool visual effects and animations
  • WYSIWYG (what you see is what you get) style design tool

If you have ideas for new features and use cases, don’t be shy, post a comment!

Credits

At times I find it hard to explain – As 3D graphics engineers, we know how to make crayons, but we don’t know how to paint. The guys that put the magic behind it all are these magnificent artists from ShugaPusher Studio

  • Ori Succary
  • Erez Bar

That said, crayons are still required. The engineers that made this happen are:

  • Yaron (Haflo) Peleg
  • Ehud Katz
  • Guy (Chief) Zadikario

Experiments with Intel’s SSE SIMD instruction set

May 31st, 2010 by Jacky Romano

Today’s mainstream CPUs have SIMD (Single Instruction Multiple Data) compute capabilities. Intel has SSE, AMD has 3DNow! and ARM has NEON.

However, it seems, from my very narrow point of view, that these instruction set extensions do not nearly get the attention that they deserve in terms of opportunities to accelerate computational code. This post is about exploring and demonstrating the use of these capabilities using Intel’s SSE instruction set.

As a learning exercise, we chose to use a simple pixel alpha composition kernel. The nice thing about this kernel is that it is very simple on one hand, and has a visual representation on the other. Our test function takes an RGBA ‘source’ image and an RGB ‘destination’ image and blends the source image into the destination image. It uses the following expression to calculate the composed pixel color:

Pdst = (Psrc * αsrc) + Pdst * (1 – αsrc)

Where:

Psrc – Pixel color of the source pixel

αsrc – Alpha value of the source pixel

Pdst – Pixel color of the destination pixel

Data layout:

The first step in designing SIMD code is to get your data arranged in a way that is SIMD friendly. In our case this is fairly straightforward: we store each image one channel after the other (a planar layout), so that all the values of a channel are contiguous in memory. An RGBA image looks like:

Figure 1 Image data layout

The component data element, in our case, is unsigned 8 bit.

Baseline Implementation:

As a baseline, we used a simple composition loop, leaving the compiler to do its tricks. The code (for a single line) looks like:

for (unsigned int x = X0; x < X1; x++) {
	// int (not short): alpha * diff can reach +/-65025, which overflows 16 bits
	int diff;
	int tmp;

	diff = *pSrc[0] - *pDst[0];
	tmp = (*pSrc[3] * diff) >> 8;
	*pDst[0] = tmp + *pDst[0];

	diff = *pSrc[1] - *pDst[1];
	tmp = (*pSrc[3] * diff) >> 8;
	*pDst[1] = tmp + *pDst[1];

	diff = *pSrc[2] - *pDst[2];
	tmp = (*pSrc[3] * diff) >> 8;
	*pDst[2] = tmp + *pDst[2];

	pSrc[0] += 1;
	pSrc[1] += 1;
	pSrc[2] += 1;
	pSrc[3] += 1;

	pDst[0] += 1;
	pDst[1] += 1;
	pDst[2] += 1;
}

The observant reader would notice that this code actually implements an alternative variant of the blending equation presented above. This modified form is:

Pdst = αsrc *  (Psrc - Pdst) + Pdst

The advantage of this form is that it uses a single multiply operation. However, I never really verified that this version is more efficient than the original form.
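To make the two forms concrete, here is a minimal scalar sketch of both for a single 8-bit component. The function names are made up for this illustration and are not part of the test program. Alpha is assumed to be in the range 0..255, and the division by 256 is approximated with a shift, as in the loop above.

```cpp
#include <cstdint>

// Two-multiply form: Pdst = (Psrc * a + Pdst * (255 - a)) / 256
inline std::uint8_t blendTwoMul(std::uint8_t src, std::uint8_t dst, std::uint8_t alpha)
{
    const int result = (src * alpha + dst * (255 - alpha)) >> 8;
    return static_cast<std::uint8_t>(result);
}

// Single-multiply form: Pdst = Pdst + (a * (Psrc - Pdst)) / 256
// The product is kept in an int because it can reach +/-65025.
// Shifting a negative product assumes an arithmetic right shift
// (guaranteed since C++20, and what mainstream compilers do anyway).
inline std::uint8_t blendOneMul(std::uint8_t src, std::uint8_t dst, std::uint8_t alpha)
{
    const int diff = static_cast<int>(src) - static_cast<int>(dst);
    return static_cast<std::uint8_t>(dst + ((alpha * diff) >> 8));
}
```

The two forms round differently, so they are not bit-identical: blending a source value of 200 over a destination value of 100 with alpha 128 gives 149 with the first form and 150 with the second.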

Running this function on my laptop, which has:
Intel Core 2 Duo SP9300 @ 2.26 GHz, 2 * 32KB L1, 6 MB L2

The compiler used is Microsoft Visual Studio 2008

We get the following results:

Test    Background Width  Background Height  Overlay Width  Overlay Height  pixRate (Mpix/sec)  pixel_time (nSec)  Line overhead
Serial  2000              1200               48             48              37.67               26.546
Serial  2000              1200               192            48              40.405              24.749             115.01
Serial  2000              1200               256            48              41.821              23.912             155.61
Serial  2000              1200               384            48              42.401              23.584             162.49
Serial  2000              1200               640            48              42.571              23.49              158.58

The column that is interesting is the pixel time, which is the pixel processing time in nanoseconds.

The ‘line overhead’ is calculated using the following expression:

L = W1 * W2 * (R1 – R2) / (W2 – W1)

Where:

L – Line overhead time

Wn – line width in test case n

Rn – pixel_time for test case n

It is left for the interested (and bored) reader to figure out why this expression represents the per-line calculation time.

SSE Implementation using Microsoft’s Visual Studio 2008 SSE Intrinsics

The next step in our journey was to implement our blending kernel using Visual Studio SSE intrinsics. In this case, the blending function processes 16 pixels per iteration using the processor’s 128-bit vector processing capabilities. The resulting code is:

for (unsigned x = X0; x < X1; x += 16) {
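	register __m128i s0, s1, d0, d1, a0, a1, r0, r1, zero;
	register __m128i diff0, tmp0, diff1, tmp1, t;
	zero = _mm_setzero_si128();
	// EMMX_BLEND uses 'ff'; its declaration is not shown in the listing, so it is
	// assumed here to hold 255 in every 16-bit lane
	const __m128i ff = _mm_set1_epi16(0x00ff);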
	// load alpha
	t = _mm_loadu_si128((__m128i *) pSrc[3]);
	a0 = _mm_unpacklo_epi8(t, zero);
	a1 = _mm_unpackhi_epi8(t, zero);

	EMMX_BLEND(0);
	EMMX_BLEND(1);
	EMMX_BLEND(2);

	pSrc[0] += 16;
	pSrc[1] += 16;
	pSrc[2] += 16;
	pSrc[3] += 16;

	pDst[0] += 16;
	pDst[1] += 16;
	pDst[2] += 16;
}

Where EMMX_BLEND is a macro:

#define EMMX_BLEND(comp) \
	t = _mm_loadu_si128((__m128i *) pDst[(comp)]); \
	d0 = _mm_unpacklo_epi8(t, zero); \
	d1 = _mm_unpackhi_epi8(t, zero); \
	t = _mm_loadu_si128((__m128i *) pSrc[(comp)]); \
	s0 = _mm_unpacklo_epi8(t, zero); \
	s1 = _mm_unpackhi_epi8(t, zero); \
	/* A * S */ \
	tmp0 = _mm_mullo_epi16(s0, a0); \
	tmp1 = _mm_mullo_epi16(s1, a1); \
	/* 255 - A    */ \
	diff0 = _mm_sub_epi16(ff, a0); \
	diff1 = _mm_sub_epi16(ff, a1); \
	/* (255 - A) * D */ \
	diff0 = _mm_mullo_epi16(diff0, d0); \
	diff1 = _mm_mullo_epi16(diff1, d1); \
	/* r = A * S + (255 - A) * D */ \
	r0 = _mm_add_epi16(tmp0, diff0); \
	r1 = _mm_add_epi16(tmp1, diff1); \
	/* shift; */ \
	r0 = _mm_srli_epi16(r0, 8); \
	r1 = _mm_srli_epi16(r1, 8); \
	/* and pack */ \
	t = _mm_packus_epi16(r0, r1); \
	_mm_storeu_si128((__m128i *) pDst[(comp)], t)

The results for this test case are:

Test      Background Width  Background Height  Overlay Width  Overlay Height  pixRate (Mpix/sec)  pixel_time (nSec)  Line overhead  Speedup to base  Speedup to previous
SSE_INTR  2000              1200               48             48              76.623              13.051                            103.40%          103.40%
SSE_INTR  2000              1200               192            48              95.879              10.43              167.74         137.29%          137.29%
SSE_INTR  2000              1200               256            48              94.682              10.562             147.04         126.40%          126.40%
SSE_INTR  2000              1200               384            48              100.628             9.938              170.77         137.31%          137.31%
SSE_INTR  2000              1200               640            48              102.811             9.727              172.49         141.49%          141.49%

As can be observed, the results are somewhat disappointing. A speedup of roughly 100%-140% is not much, given that you have 16 times more execution units. A close look at the assembler code generated by Visual Studio reveals the culprit: the compiler seems to use only 2 of the 8 SSE registers that are available in the processor. After scratching my head for a while, I thought that I should be more explicit about how one calculation result should be reused for the next calculation. The downside, of course, is that it requires writing the code in a totally unreadable form (also known as write-only code). The resulting EMMX_BLEND macro became:

#define EMMX_BLEND(comp) \
	t = _mm_loadu_si128((__m128i *) pDst[(comp)]); \
	d0 = _mm_unpacklo_epi8(t, zero); \
	d1 = _mm_unpackhi_epi8(t, zero); \
	t = _mm_loadu_si128((__m128i *) pSrc[(comp)]); \
	_mm_storeu_si128((__m128i *) pDst[(comp)], \
            _mm_packus_epi16(_mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(_mm_unpacklo_epi8(t, zero), a0), \
            _mm_mullo_epi16(_mm_sub_epi16(ff, a0), d0)), 8), \
            _mm_srli_epi16(_mm_add_epi16(_mm_mullo_epi16(_mm_unpackhi_epi8(t, zero), a1), \
            _mm_mullo_epi16(_mm_sub_epi16(ff, a1), d1)), 8)))

And the results:

Test        Background Width  Background Height  Overlay Width  Overlay Height  pixRate (Mpix/sec)  pixel_time (nSec)  Line overhead  Speedup to base  Speedup to previous
SSE_INTR 2  2000              1200               48             48              92.163              10.85                             144.66%          20.29%
SSE_INTR 2  2000              1200               192            48              122.854             8.14               173.44         204.04%          28.13%
SSE_INTR 2  2000              1200               256            48              128.381             7.789              180.83         207.00%          35.60%
SSE_INTR 2  2000              1200               384            48              132.865             7.526              182.35         213.37%          32.05%
SSE_INTR 2  2000              1200               640            48              136.784             7.311              183.65         221.30%          33.05%

An extra ~30% over the readable version, which translates into a total speedup of up to ~220% over the baseline. Not quite there yet!

So, if the brute force method doesn’t work, perhaps we should be even more brutal: time for assembler. At first, I was a little hesitant about writing the code in assembly, mainly because of having to debug it with the Visual Studio debugger. However, as it turned out, it wasn’t that hard, and perhaps that is even a lesson to bear in mind. Besides the ‘mechanical’ translation of the intrinsics into their assembly syntax, all I was left to deal with was the register allocation. The end result looks like this:

__asm {
	// initialization & load alpha
	pxor xmm0, xmm0; // xmm0 <- 0
	mov eax, dword ptr [pSrc + 12];
	movdqu xmm1, [eax]; // xmm1 <- *pSrc[3]
	movdqa xmm2, xmm1;
	punpcklbw xmm2, xmm0; // xmm2 <- a0, 16bit
	movdqa xmm3, xmm1;
	punpckhbw xmm3, xmm0; // xmm3 <- a1, 16bit

	// blending the red;
	mov eax, dword ptr[pDst + 0];
	movdqu xmm1, [eax]; // xmm1 = pDst[0]
	movdqa xmm6, xmm1;
	punpcklbw xmm6, xmm0; // xmm6 <- pDst[0] low 16bit
	movdqa xmm7, xmm1;
	punpckhbw xmm7, xmm0; // xmm7 <- pDst[0] high, 16 bit
	// load the ff constant
	movdqu xmm4, [ffconst]; // xmm4 <- ff
	movdqa xmm5, xmm4;
	psubw  xmm5, xmm2; // xmm5 = ff - a0
	pmullw xmm6, xmm5; // xmm6 = (ff - a0) * d0;
	// now for the upper bits
	movdqa xmm5, xmm4;
	psubw  xmm5, xmm3; // xmm5 = ff - a1
	pmullw xmm7, xmm5; // xmm7 = (ff - a1) * d1;
	// load the source;
	mov eax, dword ptr[pSrc + 0];
	movdqu xmm1, [eax]; // xmm1 = pSrc[0]
	// low bits of pSrc[0]
	movdqa xmm5, xmm1;
	punpcklbw xmm5, xmm0; // xmm5 = pSrc[0], low, 16 bit;
	pmullw xmm5, xmm2; // xmm5 = s0 * a0;
	paddw xmm6, xmm5; // xmm6 = s0 * a0 + (ff - a0) * d0;
	// high bits of pSrc[0]
	movdqa xmm5, xmm1;
	punpckhbw xmm5, xmm0;
	pmullw xmm5, xmm3; // xmm5 = s1 * a1
	paddw xmm7, xmm5; // xmm7 = s1 * a1 + (ff - a1) * d1;
	// shift the results;
	psrlw xmm6, 8;
	psrlw xmm7, 8;
	// pack back
	packuswb xmm6, xmm7; // xmm6 <- xmm6{}xmm7 low bits;
	mov eax, dword ptr [pDst + 0];
	movdqu [eax], xmm6; // done for this component;

	// Similar code goes for the green and blue channels
	// see the code for details

};

As for the results:

Test     Background Width  Background Height  Overlay Width  Overlay Height  pixRate (Mpix/sec)  pixel_time (nSec)  Line overhead  Speedup to base  Speedup to previous
SSE_ASM  2000              1200               48             48              174.591             5.728                             363.44%          89.42%
SSE_ASM  2000              1200               192            48              297.48              3.362              151.42         636.14%          142.12%
SSE_ASM  2000              1200               256            48              318.166             3.143              152.71         660.80%          147.82%
SSE_ASM  2000              1200               384            48              334.824             2.987              150.36         689.55%          151.96%
SSE_ASM  2000              1200               640            48              356.493             2.805              151.68         737.43%          160.64%

Finally, it seems that we are getting close. The speedup over the baseline peaks at 737%, i.e. about 8.4 times the baseline pixel rate. This is not the desired 16X yet, but given the 4 evenings that I have put into this effort, this is probably the point where I should call it version 1 and post something!

The Test program & the code

The test program (img_ops.exe) that was used for this project is available here. It has two running options:

Benchmark mode – Is used to measure the performance with various image sizes.

Syntax: img_ops.exe benchmark

Bubble Tank demo – Demonstrates the performance difference between the serial and the vectorized code. In this mode, multiple instances of the overlay image are blended onto the background image.

Syntax: img_ops.exe bubble_tank <background image> <overlay image>

Disclaimers, next steps and conclusions

OK, there are many of them, so let’s get started.

First, you might have noticed that the SIMD implementation assumes that the source image line width is a multiple of 16 pixels. Obviously this restriction is not realistic and should be lifted by adding lead-in/lead-out code to handle the head and the tail of a line.
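A minimal sketch of what such tail handling could look like is shown below. It is not taken from the test program: the function names are hypothetical, and blendBlock16 is a plain scalar stand-in for the 16-pixel SSE kernel so that the sketch stays portable. Since the kernel uses unaligned loads and stores, a tail loop alone should be enough; head handling would only be needed if the code switched to aligned accesses.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar blend of one component: (S * A + D * (255 - A)) >> 8
inline std::uint8_t blendPixel(std::uint8_t s, std::uint8_t d, std::uint8_t a)
{
    return static_cast<std::uint8_t>((s * a + d * (255 - a)) >> 8);
}

// Stand-in for the 16-pixel SIMD kernel (hypothetical name). In the real code
// this would be the intrinsics or assembly block that blends 16 pixels at once.
inline void blendBlock16(const std::uint8_t* src, const std::uint8_t* alpha, std::uint8_t* dst)
{
    for (int i = 0; i < 16; ++i)
        dst[i] = blendPixel(src[i], dst[i], alpha[i]);
}

// Blend one channel of one line of arbitrary width.
void blendLine(const std::uint8_t* src, const std::uint8_t* alpha,
               std::uint8_t* dst, std::size_t width)
{
    std::size_t x = 0;
    // main loop: whole blocks of 16 pixels
    for (; x + 16 <= width; x += 16)
        blendBlock16(src + x, alpha + x, dst + x);
    // tail: the remaining 0..15 pixels, one at a time
    for (; x < width; ++x)
        dst[x] = blendPixel(src[x], dst[x], alpha[x]);
}
```

With this structure a 37-pixel line is processed as two blocks of 16 plus a 5-pixel scalar tail.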

Next, as can be observed from the results above, the ‘per-line’ overhead is quite considerable. With the fastest version, it seems that we add an overhead of ~80 pixels per line. Obviously this should be reviewed for potential optimization. In general, neither the serial nor the SIMD code has been reviewed, and there is probably big potential for shaving a few cycles off the inner loops.

The next step (once I get to finish up this one) is to test this code on other architectures. ARM/NEON is an especially interesting one. The reason is that while there is a growing market desire for richer user interfaces, there are many ARM-based embedded systems that can’t afford to include a GPU due to area/cost considerations. Such implementations could use the SIMD extensions to deliver a rich and compelling user experience.

To conclude this phase of the experiment, there are a few takeaways:

First, as expected, the SIMD instruction set indeed offers great optimization potential. We have presented a speedup of more than 700% (roughly 8.4 times the baseline pixel rate), with a strong feeling that there is plenty of room for more.

Second, SIMD code, especially written in assembly language and using Visual Studio, is not debugging friendly. So, you better have a solid, debugged serial or vectorized implementation before you start your assembly programming.

Simple network emulation

May 23rd, 2010 by Yuval Drori

I needed to set up a test environment for a client-server application that emulates different network speeds and latencies. Running a Google query for “wan emulation” or “network emulation” returns many different results, most of them paid apps and boxes. Then I remembered a podcast I listened to recently about pfSense.

I downloaded the nightly build of the 2.0 beta and installed it on an old box with 2 network cards that we had lying around. The installation process was very straightforward. After installing, I had to tell pfSense which network interface is the WAN side (in my case, connected to the GraphTech internal LAN) and which is the LAN side (in my case, connected to the server I was testing).

From there it is all a matter of setting up the system using the simple web interface from the server machine on the LAN side. I set up a 1:1 NAT between the pfSense box and the server I am testing, and a firewall rule to map all traffic to the server. I also set up a few limiters (the queues that fake different network speeds) and added them to the firewall rule as In/Out queues. All I need to do now in order to test different upload and download speeds, latency and even out-of-order packets is to change the In/Out queues in the firewall rule!

Image shift detection

May 17th, 2010 by Stas Gurtovoy

This post will discuss some of the theory of image shift detection. What’s shift detection for images and what is it good for?

I think it’s best to explain by example, which will be our reference for the rest of the post. We’ll refer to a typical server-client distributed visualisation environment, in which the server runs the applications, the image is grabbed from the server, processed and encoded if needed, and then sent to the client, where it’s decoded and displayed. The time-consuming parts are server & client processing and network limitations. See this post for more details.

Now let’s consider a scenario where the server runs an application like a browser or image gallery, and the user scrolls down the screen. The entire screen is updated, but actually only a small part of it changes; most of it is just shifted upwards. Another example could be a map application with a small blinking location mark on it, where the application updates the entire map every time the small mark blinks or changes. In these examples it seems intuitive to save some processing time and bandwidth by “reusing” the images from previous frames. This is where we could use shift detection.

Seeing as some parts of the frame change while others don’t, we need to divide the full frame into tiles. Tiling usually causes some overhead, so the size of the tiles should be optimised for the environment. If, for example, we’re using libjpeg to encode frames, there is overhead for each tile, therefore we don’t want to make them too small. The essence of shift detection is trying to find tiles from the current frame in the previous frames (possibly with some shift) and then reusing them to save time. In order to do this, we need copies of the previous and current frames on the server and client. We need to allocate memory for these and maintain them correctly and consistently (the server and client copies should be identical).

The shift detection process has 3 main tasks:
1. Create an anchor – a reference point for matching.
2. Find shift by matching the anchor.
3. Match tiles, based on the found shift.

Note: the shift detection discussed here only handles vertical and horizontal shifting. Random direction shift detection is much more expensive, therefore not always useful.

1. Creating an anchor

We need an efficient way to detect a potential shift. Comparing entire tiles is very slow, so we need to define some unique fragments which will give us a good reference point. For example, we could use “dirty” stripes: stripes with lots of pixel changes in them. An anchor could be defined by N dirty stripes of length L. The anchor shouldn’t be too simple. It should represent a unique area in the frame, otherwise we’re prone to mismatches, where the anchor is matched but the tiles actually aren’t. We also want the anchor to be somewhere close to the center of the frame, to ensure, with good probability, that it’s not scrolled out. Creating an anchor could be integrated into the encoding in order to save time. For example, for every tile we encode we could try to create a potential anchor by counting different pixels and creating dirty stripes from them (we don’t need to save the stripes’ data, only their coordinates). If we managed to create a potential anchor in the encoding process, we can compare it to the current anchor, if one exists, and replace it if the new one is closer to the center.
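The anchor search could be sketched roughly as follows. This is only an illustration under several assumptions of mine, not production code: the frame is a single-channel 8-bit image, a stripe is a horizontal run of pixels, “dirty” is interpreted as the number of neighbouring pixels within the stripe that differ, and all names (Frame, Stripe, createAnchor) are invented for the example.

```cpp
#include <cstdint>
#include <vector>

struct Frame {
    int width, height;
    std::vector<std::uint8_t> pixels;   // row-major, one byte per pixel
    std::uint8_t at(int x, int y) const { return pixels[y * width + x]; }
};

// A stripe is a horizontal run of `length` pixels starting at (x, y).
// Only the coordinates are stored, not the pixel data.
struct Stripe { int x, y, length; };

// Count how many neighbouring pixel pairs inside the stripe differ.
int countChanges(const Frame& f, const Stripe& s)
{
    int changes = 0;
    for (int i = 1; i < s.length; ++i)
        if (f.at(s.x + i, s.y) != f.at(s.x + i - 1, s.y))
            ++changes;
    return changes;
}

// Collect up to n dirty stripes, visiting rows outward from the vertical
// center (mid, mid+1, mid-1, mid+2, ...) so the anchor stays near the middle.
std::vector<Stripe> createAnchor(const Frame& f, int n, int length, int minChanges)
{
    std::vector<Stripe> anchor;
    const int x = (f.width - length) / 2;   // horizontally centered stripe
    const int mid = f.height / 2;
    if (x < 0)
        return anchor;                      // frame narrower than a stripe
    for (int d = 0; d < f.height && static_cast<int>(anchor.size()) < n; ++d) {
        for (int sign : {+1, -1}) {
            if (d == 0 && sign < 0)
                continue;                   // the middle row is visited once
            const int y = mid + sign * d;
            if (y < 0 || y >= f.height || static_cast<int>(anchor.size()) >= n)
                continue;
            const Stripe s{x, y, length};
            if (countChanges(f, s) >= minChanges)
                anchor.push_back(s);
        }
    }
    return anchor;
}
```

An empty result means no sufficiently detailed area was found near the center, in which case there is simply no anchor for this frame.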

2. Shift matching

When processing a new frame, first of all we want to see if it’s possible to reuse data from previous frames. If we have a valid anchor from previous frames, it might be possible. We want to see where the anchor from the previous frame is located in our current frame, and thus calculate the shift from the previous frame. As mentioned, we’re only handling vertical & horizontal shifts.

Note: Zero shift, is also a shift – therefore “diff” between two frames is also covered by shift detection.

So if our anchor is N stripes, we can start by matching the first stripe, vertically or horizontally (we could save the previous shift direction as a hint). If we matched the first stripe, we can then use the found shift to match the rest of the anchor’s stripes. If we matched the entire anchor, then we can declare we’ve found a shift; otherwise we have no match, and we encode the frame normally.
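For the vertical case, this search could look something like the sketch below. Again, the types and names are made up for the illustration, both frames are assumed to be single-channel 8-bit images of the same size, and the anchor is assumed to lie inside the frame. The search tries small shifts first and only checks the remaining stripes once the first one matches.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Frame {
    int width, height;
    std::vector<std::uint8_t> pixels;   // row-major, one byte per pixel
};

struct Stripe { int x, y, length; };    // horizontal run of pixels

// True if the stripe taken from `prev` appears in `cur` at the same x,
// dy rows further down (dy may be negative).
bool stripeMatches(const Frame& prev, const Frame& cur, const Stripe& s, int dy)
{
    const int y = s.y + dy;
    if (y < 0 || y >= cur.height)
        return false;
    return std::memcmp(&prev.pixels[s.y * prev.width + s.x],
                       &cur.pixels[y * cur.width + s.x],
                       static_cast<std::size_t>(s.length)) == 0;
}

// Look for a vertical shift of at most maxShift rows in either direction.
// Returns true and stores the shift in dyOut if the whole anchor matched.
bool findVerticalShift(const Frame& prev, const Frame& cur,
                       const std::vector<Stripe>& anchor, int maxShift, int& dyOut)
{
    if (anchor.empty())
        return false;
    for (int d = 0; d <= maxShift; ++d) {
        for (int dy : {d, -d}) {
            if (!stripeMatches(prev, cur, anchor[0], dy))
                continue;               // first stripe not found at this shift
            bool all = true;
            for (std::size_t i = 1; i < anchor.size() && all; ++i)
                all = stripeMatches(prev, cur, anchor[i], dy);
            if (all) {
                dyOut = dy;
                return true;
            }
        }
    }
    return false;                       // no shift found: encode normally
}
```

A horizontal search would follow the same pattern with the offset applied to x instead of y. A negative result for dy means the content moved up, as it does when the user scrolls down.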

3. Tile matching

Now that we’ve detected a potential shift, we still need to make sure the shifted tiles in the current frame match the previous frame, because even if we matched the anchor, the tiles can still have some changes between them (like a mouse cursor, or a letter in an editor application). So now, for every tile in the current frame, we try to match an area in the previous frame, based on the shift we’ve detected. We have to go over all the pixels and compare them to make sure we can match the entire tile. If we have a match, then instead of sending tile data to the client, we can send only the shift coordinates, and the client can reproduce the tile from its copy of the previous frame. If the shift is zero, then we don’t have to send anything, because the area is unchanged. If we don’t have a match, then we process the tile normally and send the new image data to the client.
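The per-tile decision could be sketched as below. As before, the names are invented for the example and the frames are assumed to be single-channel 8-bit images of equal size, with the tile lying fully inside the current frame. The shift (dx, dy) is defined so that content at (x, y) in the previous frame is expected at (x + dx, y + dy) in the current one.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Frame {
    int width, height;
    std::vector<std::uint8_t> pixels;   // row-major, one byte per pixel
};

struct Tile { int x, y, w, h; };        // tile rectangle in the current frame

enum class TileAction {
    Unchanged,   // zero shift and identical pixels: send nothing
    Shifted,     // found in the previous frame: send only the shift
    Encode       // no match: process and send the tile normally
};

TileAction classifyTile(const Frame& prev, const Frame& cur,
                        const Tile& t, int dx, int dy)
{
    // Where this tile's content would have been in the previous frame.
    const int sx = t.x - dx;
    const int sy = t.y - dy;
    if (sx < 0 || sy < 0 || sx + t.w > prev.width || sy + t.h > prev.height)
        return TileAction::Encode;      // source area was scrolled out of view

    // Full pass over the tile, one row at a time.
    for (int row = 0; row < t.h; ++row) {
        const std::uint8_t* a = &prev.pixels[(sy + row) * prev.width + sx];
        const std::uint8_t* b = &cur.pixels[(t.y + row) * cur.width + t.x];
        if (std::memcmp(a, b, static_cast<std::size_t>(t.w)) != 0)
            return TileAction::Encode;
    }
    return (dx == 0 && dy == 0) ? TileAction::Unchanged : TileAction::Shifted;
}
```

Tiles whose source area falls outside the previous frame (the newly revealed part after a scroll) always end up as Encode, which is exactly the part of the screen that really is new.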

Performance-wise, the shift detection takes at least one full pass over all pixels per tile (the shift matching is cheap, because of the vertical & horizontal restriction), but it’s still significantly cheaper than most image compression algorithms, and if our target application matches the “shift” scenario, it could be a very effective strategy. We can also save a lot of bandwidth by sending only shift coordinates instead of full image data, even in compressed form.

GTLabs demo of 3D glasses for Second Life

April 23rd, 2010 by Yuval Drori

In this video we see people using our pilot demo of 3D
glasses for Second Life. The demo took place in the
Metaverse.org internal meeting in Nancy, France.

Look particularly at the head tracking and the movement
of the image in the small laptop screen.

We used the Vuzix iWear VR920, 3D glasses. The glasses
come with an SDK which we then linked to the Second Life
(SL) open client 1.24.4 (2956)

Using the Vuzix SDK, when we detect connected glasses and Second Life is in MouseView mode, we use the glasses sensors to move the camera around.

Note:
- Any current Second Life client must be compatible with
the 2.0 viewer. There is a self certification process
for this.
- One could do 3D stereo.
iWear is fully iWear® 3D compliant and supports NVIDIA stereo drivers.
Second Life, on the other hand, is not. A patch for 3D stereo was proposed but broken on later builds.

- The cable connected to the glasses interferes with one’s
movement. We are looking for ways to transmit the data
wirelessly.

A similar set up five years ago would have cost around $10K.
With the commercialization and maturity of hardware like
3D glasses we expect to see many more uses of this type of
technology for virtual and augmented reality in the near
future.

How to create your own media center – Choosing your hardware.

April 13th, 2010 by Gil Shapira

Ok, so you have been convinced that a Media Center is something you want, now it’s time to choose your hardware.
First, let’s start with some questions that need to be answered:

  • What kind of videos are you planning to watch? (1080p, 720p, SD).
  • What other devices (receiver, TV, speakers, headphones) are you planning to connect to your media center and how are you planning to connect them? (optical, S-Video, component, composite, HDMI)
  • What file formats are you using? (avi, mp3, mkv, flac)
  • How are you planning to control your media center? (remote control, keyboard, mobile phone)
  • Where are you planning to put the media center? Does it need to be small? quiet? look good?
  • Do you want to be able to record content from the TV?
  • Do you want to be able to play CDs, DVDs and Blu-ray discs?
  • Do you need it to support languages other than English (like Hebrew) when browsing the menus or watching movies with subtitles?
  • What kind of network do you have? (LAN, wireless)
  • Do you want to have full (or at least as high as possible) control over the menus and be able to adapt them to your needs?
  • Do you want to run other applications besides the Media Center app? (like file sharing, ftp and web server apps)
  • Do you have the skills (and will) to mess with computer hardware/software? Some setups require high maintenance and some need none.
  • Is it going to be turned on 24 hours a day? If that is the case, power consumption may be something you should take into consideration.
  • What is your budget?

It is a good idea to mention now that there may be no hardware that suits you perfectly, so try to answer these questions and keep in mind that you may need to compromise on some things.

Now, after  you answered these questions, let’s have a look on the available hardware.
The hardware can be divided into two groups:

  • Dedicated hardware (sometimes called streamers).
  • Computers in disguise (HTPCs).

The streamers are small boxes with special hardware and embedded software which serve the most common basic needs of media centers. They may or may not contain hard disks, may or may not use a wireless network, or, generally speaking, may or may not contain whatever you may think of.

The HTPCs are computers as we all know them. Since a computer can run any piece of software, you may have almost full control over the media center (more on software in the next chapter). Of course you still need hardware to support the software (more on computer hardware in a minute).

Before we continue, let’s summarize these two types of hardware:

Streamers:
Pros:

  • No need to have computer skills.
  • Just plug and play.
  • Usually small and quiet.
  • Usually come with a remote control.
  • Less expensive.
  • Low power consumption.

Cons:

  • No way to customize the look and feel.
  • No ability to record TV.
  • Limited file format support.
  • Limited ability to update the software.
  • Limited input/output connectors.

Things to check before you buy:

  • What input/output connectors do you need?
  • What file formats does it support?
  • What languages does it support?
  • Is it able to play 1080p content?
  • How does it connect to the network?

HTPCs:
Pros:

  • Full (or almost full) ability to customize the look and feel.
  • Ability to record TV content.
  • Fully upgradable.
  • May play any kind of format.
  • May contain an optical drive.
  • May have rich input/output connectors.
  • Actually may have everything that can be connected to a computer.

Cons:

  • Need some computer knowledge.
  • High maintenance.
  • More expensive.
  • Higher power consumption.

Things to check before you buy:

  • What kind of hardware does it contain?

If you still take an interest in HTPCs, here is the list of hardware components that you may want:

Motherboard and graphics card:
The motherboard and graphics card are the heart of the media center. Unless you want to play high-end computer games on your HTPC, it is a good idea to choose a motherboard with an integrated graphics card. Make sure the motherboard has all the input/output connectors you need. If you’re planning to watch 1080p video content, it is highly recommended that the video card has hardware support for decoding (this removes the need for a strong CPU). Most motherboards also have an on-board sound card. If you’re planning to use digital output, these on-board sound cards should be sufficient. Make sure the motherboard has enough SATA connectors (and RAID support if needed), because you may need more than one hard disk if you keep HD content. Some USB connectors are always good to have, as are one or two PCI-e slots if you’re planning to add a TV card (in order to record content). Pay attention to the motherboard size: there is ATX and there is mATX, which is much smaller. Make sure you don’t buy a motherboard too big to fit your chassis or your needs.

CPU
If your video card has hardware support for decoding 1080p video, you don’t need a strong CPU. Two or more cores are always good to have.

RAM
2GB of RAM should be sufficient. Don’t buy high-end, super-fast memory; it’s not needed here.

Hard disks
The bigger you buy, the more space you’ll have to store your content. You may buy internal hard disks and keep them inside your chassis. You may buy external ones and connect them via USB. You can also use remote disks and make the HTPC access the data over the network, but keep in mind that you’ll need at least one hard disk in your HTPC and that remote hard disks are much slower than local ones.
Most hard disk vendors have different series for their hard disks. I recommend the power efficient ones since they are much quieter and consume less power.

Network card
If you don’t have one on-board, buy one according to your needs (100Mb, 1000Mb, wireless).

TV card
You’ll need one in order to record content from TV.

Remote control
You may buy a remote control USB sensor in order to control your HTPC. You may then replace the remote control with a “smart” one and use it to control your other appliances. Another option is to use a Bluetooth/wireless/Wi-Fi remote control or even your mobile phone.

PSU
If you’re planning to keep the HTPC turned on, make sure your PSU is highly efficient and silent. Around 400W should be sufficient.

Chassis
Surprisingly, the chassis may be one of the more expensive components in your setup. You may want it to be good looking. The sky (and the budget) is the limit ;). The regular height for a chassis is about 17 cm. There are low-profile cases with a height of about 9 cm. If you decide to buy a low-profile chassis, make sure your cards are also low profile. Make sure your chassis is big enough to contain the motherboard and maybe a few hard disks, and has enough space for an optical drive.
It must be highly ventilated. Make sure its fans are big and quiet.
Remember: Low quality cases will be noisy.

Heat sinks
You may want to replace the CPU and GPU fans with heat sinks in order to make the HTPC even quieter.

Last words.
Recently, you can find Atom-based HTPC boxes using Ion technology. These boxes tend to be small PCs which have most of a computer’s abilities, and you can use them without the need to mess with computer hardware. I personally haven’t had a chance to play with such a toy, but if you want to give it a try, use the list above to check if it really meets your needs.

I hope this short hardware introduction was helpful and I’ll be more than happy to read your comments.

See you in the next chapter.

Gil

Extending libjpeg to new formats

March 31st, 2010 by Stas Gurtovoy

I’m working on a project on Android, and one of the tasks in the project is to capture an image from the frame buffer and perform a quick JPEG compression on it. Android has a working libjpeg, so it shouldn’t be a big problem. There’s only one little issue: it needs to be quick. The problem is that my device is a G1, and its frame buffer is in RGB_565 format (16 bits per pixel), which libjpeg doesn’t support. So I have to do a (slow) blit of the frame buffer to local memory, together with a format conversion to, for example, RGB888 (which is supported by libjpeg), and then pass the buffer to libjpeg. I would like to avoid this conversion blit and pass the original RGB_565 buffer as input.

Note: Actually, accessing frame buffer memory on the device is slow, so in practice it’s more effective to do a fast copy (memcpy) of the frame buffer data to local memory, and perform the conversion from there, but this is not the point of this post.

libjpeg has two main functions: encoding and decoding. In both cases there are two formats (or color spaces, as libjpeg refers to them) involved: the input format (for encoding) or output format (for decoding), and the internal format used by the library to represent the data. In this post we’ll concentrate on the encoding part, and on how to extend the JPEG encoder to support new formats, specifically RGB_565.

Note: Android’s version of libjpeg is actually extended to support RGB_565 (for G1) and RGB_8888 (for Nexus1) as output format (for the decoder), but not as input format.

So how do we extend libjpeg to support RGB_565?
Warning: this post includes modifications of the actual source code of libjpeg. It assumes that you have the source, are able to build it and generally know what you’re doing. Continue at your own risk.

First we need to define the enum value for the new type. Android has already done this, in jpeglib.h:

/* Known color spaces. */
 typedef enum {
     JCS_UNKNOWN,        /* error/unspecified */
     JCS_GRAYSCALE,      /* monochrome */
     JCS_RGB,        /* red/green/blue */
     JCS_YCbCr,      /* Y/Cb/Cr (also known as YUV) */
     JCS_CMYK,       /* C/M/Y/K */
     JCS_YCCK,       /* Y/Cb/Cr/K */
 #ifdef ANDROID_RGB
     JCS_RGBA_8888,  /* red/green/blue/alpha */
     JCS_RGB_565     /* red/green/blue in 565 format */
 #endif
 } J_COLOR_SPACE;

Next we need to modify the jcparam.c file, specifically the jpeg_default_colorspace function. This function sets the default internal color space based on the input color space. We add these lines:

case JCS_RGB_565:
     jpeg_set_colorspace(cinfo, JCS_YCbCr);
     break;

Note: JPEG performs best on YCbCr data: it compresses faster and gives a more compact result than on RGB data (again, that’s not the topic of this post), so we set the default inner color space to JCS_YCbCr (and not JCS_RGB). Please note that libjpeg defines JCS_YCbCr as the default for JCS_RGB as well.

After we define the default behaviour, we need to modify one last file: jccolor.c . This file holds the code for all encoding color conversions. It’s important to understand that we don’t add JCS_RGB_565 as an internal jpeg work format, but rather as an input format, therefore we need to add methods that convert it to existing inner formats.
It’s easier (but, as mentioned, less efficient) to convert RGB_565 to RGB_888, so we start with this. We add the following function to jccolor.c:

METHODDEF(void)
rgb_565_rgb_convert(j_compress_ptr cinfo,
           JSAMPARRAY input_buf, JSAMPIMAGE output_buf,
           JDIMENSION output_row, int num_rows)
{
  register unsigned short * inptr;
  register JSAMPROW outptr_r;
  register JSAMPROW outptr_g;
  register JSAMPROW outptr_b;
  register JDIMENSION col;
  JDIMENSION num_cols = cinfo->image_width;

  while (--num_rows >= 0)
  {
    inptr = (unsigned short*)*input_buf;
    outptr_r = output_buf[0][output_row];
    outptr_g = output_buf[1][output_row];
    outptr_b = output_buf[2][output_row];

    for (col = 0; col < num_cols; col++)
    {
      /* RGB_565: red in bits 15..11, green in bits 10..5, blue in bits 4..0 */
      unsigned short rgb565 = *inptr;
      outptr_r[col] = (JSAMPLE)((rgb565 & 0xF800) >> 8);
      outptr_g[col] = (JSAMPLE)((rgb565 & 0x07E0) >> 3);
      outptr_b[col] = (JSAMPLE)((rgb565 & 0x001F) << 3);
      inptr++; /* advance one 16-bit pixel */
    }

    input_buf++;
    output_row++;
  }
}
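
As a side note, the channel unpacking can be tried out in isolation. The following is only a minimal sketch in plain C, not libjpeg code: the function name rgb565_to_rgb888 is made up for illustration, and it assumes the usual RGB_565 layout with red in the top five bits. Unlike a plain shift, it also copies the top bits of each channel into the vacated low bits, so that full intensity maps to 255 rather than 248 or 252:

```c
#include <stdint.h>

/* Hypothetical helper, not part of libjpeg: expand one RGB_565 pixel
 * (red in bits 15..11, green in bits 10..5, blue in bits 4..0)
 * into three 8-bit samples. */
void rgb565_to_rgb888(uint16_t pixel, uint8_t *r, uint8_t *g, uint8_t *b)
{
    uint8_t r5 = (uint8_t)((pixel >> 11) & 0x1F);
    uint8_t g6 = (uint8_t)((pixel >> 5) & 0x3F);
    uint8_t b5 = (uint8_t)(pixel & 0x1F);

    /* Shift each channel up to 8 bits and replicate its top bits into
     * the low bits, so 0x1F (or 0x3F for green) becomes 0xFF. */
    *r = (uint8_t)((r5 << 3) | (r5 >> 2));
    *g = (uint8_t)((g6 << 2) | (g6 >> 4));
    *b = (uint8_t)((b5 << 3) | (b5 >> 2));
}
```

Whether the extra bit replication is worth its cost inside a per-pixel loop depends on how much the slightly darker whites of the plain shift matter for your images.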

Now what we really want is JCS_RGB_565 to JCS_YCbCr conversion, so we need to add the following method as well:

METHODDEF(void)
rgb_565_ycc_convert (j_compress_ptr cinfo,
         JSAMPARRAY input_buf, JSAMPIMAGE output_buf,
         JDIMENSION output_row, int num_rows)
{
  my_cconvert_ptr cconvert = (my_cconvert_ptr) cinfo->cconvert;
  register int r, g, b;
  register INT32 * ctab = cconvert->rgb_ycc_tab;
  register unsigned short * inptr;
  register JSAMPROW outptr0, outptr1, outptr2;
  register JDIMENSION col;
  JDIMENSION num_cols = cinfo->image_width;

  while (--num_rows >= 0) {
    inptr = (unsigned short*) *input_buf++;
    outptr0 = output_buf[0][output_row];
    outptr1 = output_buf[1][output_row];
    outptr2 = output_buf[2][output_row];
    output_row++;
    for (col = 0; col < num_cols; col++) {

      /* RGB_565: red in bits 15..11, green in bits 10..5, blue in bits 4..0 */
      unsigned short rgb565 = *inptr;
      r = (JSAMPLE)((rgb565 & 0xF800) >> 8);
      g = (JSAMPLE)((rgb565 & 0x07E0) >> 3);
      b = (JSAMPLE)((rgb565 & 0x001F) << 3);

      inptr++; /* advance one 16-bit pixel */
      /* If the inputs are 0..MAXJSAMPLE, the outputs of these equations
       * must be too; we do not need an explicit range-limiting operation.
       * Hence the value being shifted is never negative, and we don't
       * need the general RIGHT_SHIFT macro.
       */
      /* Y */
      outptr0[col] = (JSAMPLE)
        ((ctab[r+R_Y_OFF] + ctab[g+G_Y_OFF] + ctab[b+B_Y_OFF])
         >> SCALEBITS);
      /* Cb */
      outptr1[col] = (JSAMPLE)
        ((ctab[r+R_CB_OFF] + ctab[g+G_CB_OFF] + ctab[b+B_CB_OFF])
         >> SCALEBITS);
      /* Cr */
      outptr2[col] = (JSAMPLE)
        ((ctab[r+R_CR_OFF] + ctab[g+G_CR_OFF] + ctab[b+B_CR_OFF])
         >> SCALEBITS);
    }
  }
}

We also need to let libjpeg know when to invoke these functions. This is done in the jinit_color_converter function. The cases for the JCS_RGB and JCS_YCbCr internal formats should look like this:

   case JCS_RGB:
     if (cinfo->num_components != 3)
       ERREXIT(cinfo, JERR_BAD_J_COLORSPACE);
     if (cinfo->in_color_space == JCS_RGB && RGB_PIXELSIZE == 3)
       cconvert->pub.color_convert = null_convert;
     else if (cinfo->in_color_space == JCS_RGB_565 && RGB_PIXELSIZE == 3)
       cconvert->pub.color_convert = rgb_565_rgb_convert;
     else
       ERREXIT(cinfo, JERR_CONVERSION_NOTIMPL);
     break;

   case JCS_YCbCr:
     if (cinfo->num_components != 3)
       ERREXIT(cinfo, JERR_BAD_J_COLORSPACE);
     if (cinfo->in_color_space == JCS_RGB) {
       cconvert->pub.start_pass = rgb_ycc_start;
       cconvert->pub.color_convert = rgb_ycc_convert;
     } else if (cinfo->in_color_space == JCS_RGB_565) {
        cconvert->pub.start_pass = rgb_ycc_start;
        cconvert->pub.color_convert = rgb_565_ycc_convert;
     } else if (cinfo->in_color_space == JCS_YCbCr)
       cconvert->pub.color_convert = null_convert;
     else
       ERREXIT(cinfo, JERR_CONVERSION_NOTIMPL);
     break;

If the inner format is JCS_RGB, we set the color_convert function pointer to rgb_565_rgb_convert, and for JCS_YCbCr we set it to rgb_565_ycc_convert. The color_convert function is called in the pre_process_data phase, for every encoded scan line. Now what do those functions do? They basically perform format conversion. One thing worth mentioning is that the input buffer stores pixels in interleaved format: RGBRGBRGB…
whereas the output buffer holds the data per channel:
RRRRRR…
GGGGG…
BBBBB…
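
To make the layout change concrete, here is a minimal sketch in plain C (not libjpeg code; the name split_rgb_row and its parameters are made up for illustration) that splits one row of interleaved 8-bit RGB pixels into three per-channel rows:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split `count` interleaved pixels (RGBRGB...)
 * into three separate channel rows (RRR..., GGG..., BBB...). */
void split_rgb_row(const uint8_t *interleaved, size_t count,
                   uint8_t *row_r, uint8_t *row_g, uint8_t *row_b)
{
    size_t i;
    for (i = 0; i < count; i++) {
        row_r[i] = interleaved[3 * i + 0];
        row_g[i] = interleaved[3 * i + 1];
        row_b[i] = interleaved[3 * i + 2];
    }
}
```

For RGB_565 input the idea is the same, except that each pixel is a single 16-bit value that has to be unpacked with masks and shifts instead of three separate bytes.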

For the rgb->ycc conversion we also define the start_pass function, which creates tables of fixed values per channel, which are later used in the rgb_565_ycc_convert function. This function is called only once per encoder, and we haven’t implemented a new version for RGB_565, but rather use the one for RGB_888.
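
The table idea can be sketched outside of libjpeg as follows. This is a simplified illustration, not the library's code: the table and function names are made up, the tables are separate arrays rather than one array indexed by offsets, and the coefficients are the usual JFIF (CCIR 601) ones in 16-bit fixed point. Each table is filled once, and the per-pixel work is then three lookups, two additions and a shift per output channel:

```c
#include <stdint.h>

#define SCALEBITS 16
#define ONE_HALF  ((int32_t)1 << (SCALEBITS - 1))
#define FIX(x)    ((int32_t)((x) * (1L << SCALEBITS) + 0.5))

/* Illustrative table names, one per distinct coefficient. */
static int32_t tab_r_y[256], tab_g_y[256], tab_b_y[256];
static int32_t tab_r_cb[256], tab_g_cb[256], tab_half[256];
static int32_t tab_g_cr[256], tab_b_cr[256];

/* Fill the tables once, before any pixel is converted. */
void ycc_tables_init(void)
{
    int32_t i;
    for (i = 0; i < 256; i++) {
        tab_r_y[i]  = FIX(0.29900) * i;
        tab_g_y[i]  = FIX(0.58700) * i;
        tab_b_y[i]  = FIX(0.11400) * i + ONE_HALF;   /* rounding term */
        tab_r_cb[i] = -FIX(0.16874) * i;
        tab_g_cb[i] = -FIX(0.33126) * i;
        /* 0.5 * x is shared by B->Cb and R->Cr; this table also carries
         * the +128 chroma offset and the rounding term. */
        tab_half[i] = FIX(0.50000) * i
                    + ((int32_t)128 << SCALEBITS) + ONE_HALF - 1;
        tab_g_cr[i] = -FIX(0.41869) * i;
        tab_b_cr[i] = -FIX(0.08131) * i;
    }
}

/* Convert one 8-bit RGB pixel to YCbCr using only lookups and adds. */
void rgb_to_ycc(uint8_t r, uint8_t g, uint8_t b,
                uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    *y  = (uint8_t)((tab_r_y[r]  + tab_g_y[g]  + tab_b_y[b])  >> SCALEBITS);
    *cb = (uint8_t)((tab_r_cb[r] + tab_g_cb[g] + tab_half[b]) >> SCALEBITS);
    *cr = (uint8_t)((tab_half[r] + tab_g_cr[g] + tab_b_cr[b]) >> SCALEBITS);
}
```

Because the tables are indexed by 8-bit sample values, they do not depend on how the input pixels were packed, which is why the RGB_888 start_pass can be reused for RGB_565 input once the channels have been expanded to 8 bits.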

Now you can use libjpeg with JCS_RGB_565 as the input color space. All you need to change in your application is something like this:

compressor.input_components = 2;
compressor.in_color_space = JCS_RGB_565;

Everything else is standard (see example.c in the libjpeg source for more details).

Anyway, this gave us a nice performance boost for our task. I think I’ll propose this as an extension for libjpeg.

Adding colors to emails

March 23rd, 2010 by Sagi Ben-Akiva

A couple of days ago I found this blog post, which explains how to add colors to your emails using the Thunderbird email client:

make-your-emails-more-colorful

Enjoy,
Sagi.

Android G1 system installation

March 21st, 2010 by Ehud Katz


So, what happens if you mistakenly delete necessary files while developing on Android? Well, I’ll tell you what happens – the device will automatically boot into recovery mode, and since some critical files are missing (like sh and ls), you cannot even enter the device’s shell. If you cleverly try to run an “adb sync” command to copy the necessary files while in recovery mode, they will be deleted after reboot.

To rescue the device, you need to flash the system.img, userdata.img and boot.img files – basically reflash the device. To do so, reboot the device into the bootloader (hold the power button and camera button together while the device is shut down, or use the “adb reboot bootloader” command). When the bootloader screen comes up, connect the device to the PC. If the screen does not show “fastboot” instead of “serial 0” (or something like this), try pressing the back button. If the text “fastboot” still doesn’t show up, you will need to “upgrade” the bootloader by following the instructions here: http://android-dls.com/wiki/index.php?title=Engineering_Bootloader. You might need to update the recovery partition first: http://android-dls.com/wiki/index.php?title=Replace_Recovery_Partition.

When the bootloader is in “fastboot” state, you will need to enter the following commands, to reflash the device:

fastboot flash system system.img

fastboot flash userdata userdata.img

fastboot flash boot boot.img

That’s it! Just reboot the device, and it should load like new.  :)

* You may get the “.img” files by compiling the Android source code.

For more information refer to: http://androidcommunity.com/forums/f10/adb-fastboot-for-rookies-18124/.