YUV conversion on CELL

Converting YUV to RGB using SPU SIMD code

Converting YUV to RGB (approximately) is a fairly simple process but requires a lot of arithmetic, and is something that converts to SIMD code incredibly well.

As described in the YUV conversion article on fourcc.org, an approximate calculation for each pixel is given by:

  R = Y + 1.402 (Cr-128)
  G = Y - 0.34414 (Cb-128) - 0.71414 (Cr-128)
  B = Y + 1.772 (Cb-128)

Note: (Y, Cb, Cr) == (Y, U, V) in our notation
Note: There is much discussion on the linked page as to which numbers are right - these numbers seem to work well enough for me for now.
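Just as a quick sanity check of the formula (my own worked numbers, not taken from the article), take a strongly red sample of (Y, Cb, Cr) = (81, 90, 240):

  R = 81 + 1.402 (240-128)                      = 81 + 157.0       = 238
  G = 81 - 0.34414 (90-128) - 0.71414 (240-128) = 81 + 13.1 - 80.0 = 14
  B = 81 + 1.772 (90-128)                       = 81 - 67.3        = 14

(after rounding) - a saturated red, as you'd expect.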

Scalar example

This is a straightforward implementation using fixed point arithmetic in C.

// Using 2.14 fixed point format
#define SHIFT (14)
// Let the compiler do the constant formation
#define YSCALE  ((int) (( 1.0    ) * (1<<SHIFT)))
#define BUSCALE ((int) (( 1.772  ) * (1<<SHIFT)))
#define GVSCALE ((int) ((-0.71414) * (1<<SHIFT)))
#define GUSCALE ((int) ((-0.34414) * (1<<SHIFT)))
#define RVSCALE ((int) (( 1.402  ) * (1<<SHIFT)))

void yuv2rgb(unsigned char *yp, unsigned char *up, unsigned char *vp, unsigned int *argbp, int count)
{
    int x;

    for (x = 0;x<count;x++) {
        int y = yp[x];
        int u = up[x/2];
        int v = vp[x/2];
        int r, g, b;

        // pre-calculate common subexpressions
        y = y * YSCALE;
        u = u - 128;
        v = v - 128;

        // perform basic calculation
        r = y + RVSCALE * v;
        g = y + GUSCALE * u + GVSCALE * v;
        b = y + BUSCALE * u;

        // fix point
        r >>= SHIFT;
        g >>= SHIFT;
        b >>= SHIFT;

        // clamp to maximum
        r = r > 255 ? 255 : r;
        g = g > 255 ? 255 : g;
        b = b > 255 ? 255 : b;

        // clamp to minimum
        r = r < 0 ? 0 : r;
        g = g < 0 ? 0 : g;
        b = b < 0 ? 0 : b;

        // pack and write argb value
        argbp[x] = ( r << 16) | ( g << 8 ) | b;
    }
}

There are probably a few rounding error/loss of precision issues above - it is an ok approximation at this point though. Also, this is uncompiled and untested.

In most (?) video formats, the U and V data is specified at 1/2 the horizontal resolution of Y, so in the above code we take that into account. Now, this is one of those embarrassingly parallelisable functions, so converting it to SPU SIMD code is almost trivial.
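To make the indexing concrete, this is the layout the up[x/2] and vp[x/2] accesses above imply - just an illustration of the half-resolution chroma, not a statement about any particular container format:

  pixel:  0     1     2     3     4     5     6     7    ...
  Y:      y[0]  y[1]  y[2]  y[3]  y[4]  y[5]  y[6]  y[7] ...
  U:      u[0]  u[0]  u[1]  u[1]  u[2]  u[2]  u[3]  u[3] ...
  V:      v[0]  v[0]  v[1]  v[1]  v[2]  v[2]  v[3]  v[3] ...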

SIMD - integer x 4

So, let's have a look at how one goes about converting it to SIMD code.

The only really tricky bit is a detail hidden in the example above - the casting of the input values to an int increases their size. When loading packed byte data this casting and size expansion must be handled manually. But apart from that, the conversion is straightforward.

First, initialise the constants - we can use the same values, but they need to be in vectors since we cannot multiply an immediate value. Similarly, we need a clamping value as a vector type so we can clamp 4 values at once.

void yuv2rgb_vec(vector unsigned char *yp,
                 vector unsigned char *up, vector unsigned char *vp,
                 vector unsigned int *argbp, int count)
{
    int x;
    const vec_int4 yscale  = spu_splats(YSCALE);
    const vec_int4 buscale = spu_splats(BUSCALE);
    const vec_int4 gvscale = spu_splats(GVSCALE);
    const vec_int4 guscale = spu_splats(GUSCALE);
    const vec_int4 rvscale = spu_splats(RVSCALE);
    const vec_int4 clamp   = spu_splats( 255 );

Next come some shuffle control words. We use two different ones, and they are used to convert the packed byte data into integers. One is used for the Y data, which is used as is, and the other for the U and V data, which also requires 2x expansion before we can use it.

We want to minimise memory loads, so we load one full quad-word (i.e. 16 bytes) each time (not that we have a choice!) and process as much of it as we can. If we are using integers (32 bit), that means one memory load of Y data is expanded into 4 quad-word registers, and one load of U or V data into 8. Instead of requiring a new shuffle pattern for every set of registers we can just use the same one and increment it - adding is fast and runs on a different pipeline to the shuffles and loads, so it is faster than loading a pre-calculated shuffle pattern (given that we don't need to do much other arithmetic in the meantime). Because I was lazy and didn't want to cut and paste the whole lot of code twice, let's just use some logic to make sure that the UV shuffle mask updates at the required rate.

We can also use a shuffle pattern to pack the 3 registers of R, G and B values back into ARGB format in fewer instructions than using shifts and ORs. The shuffle performs the equivalent of r<<16 | g<<8 in one step.

    const vec_uint4 shuffle_y_0  = (vec_uint4) { 0x80808000, 0x80808001, 0x80808002, 0x80808003 };
    const vec_uint4 shuffle_uv_0 = (vec_uint4) { 0x80808000, 0x80808000, 0x80808001, 0x80808001 };

    const vec_uint4 shuffle_rg   = (vec_uint4) { 0x80031380, 0x80071780, 0x800b1b80, 0x800f1f80 };

The first one just converts the first (next) 4 bytes into unsigned integers, and the second expands the first 2 bytes into 4 unsigned integers. To get to the next 4 bytes in the loaded quadword, we just add 0x04 to the first one (and 0x02 to the second one), and repeat 4 (or 8) times to cover everything we actually loaded.
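To spell one of those out (my reading of the shufb semantics): in a shuffle control byte, 0x00-0x0f selects that byte of the first operand, 0x10-0x1f selects that byte of the second operand, and 0x80 produces a constant 0x00 byte. So the first word of shuffle_y_0, 0x80808000, builds the output word { 0x00, 0x00, 0x00, y[0] } - byte 0 of the Y register zero-extended to a 32 bit integer - and adding 4 to the whole pattern turns this word into 0x80808004, so the updated pattern picks up y[4]..y[7] for the next group of 4 values.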

The loop thus becomes:

    vec_uint4 shuffle_uv = shuffle_uv_0;

    for (x=0;x<count/16;x+=1) {
        vec_int4 y = (vec_int4)yp[x];
        vec_int4 u = (vec_int4)up[x/2];
        vec_int4 v = (vec_int4)vp[x/2];
Note the use of casts - this just means I have to do fewer casts later. A case where C just gets in the way of the instructions - the SPU doesn't give a shit about types.
        vec_int4 y0, y1, y2, y3;
        vec_int4 u0, u1, u2, u3;
        vec_int4 v0, v1, v2, v3;

        vec_int4 r0, r1, r2, r3;
        vec_int4 g0, g1, g2, g3;
        vec_int4 b0, b1, b2, b3;

        vec_uint4 shuffle_y  = shuffle_y_0;

	shuffle_uv = (x & 1) == 0 ? shuffle_uv_0 : shuffle_uv;

        // convert 16 YUV values into 4x4 value vectors
	y0 = spu_shuffle(y, y, (vec_uchar16)shuffle_y);
	shuffle_y = spu_add(shuffle_y, 4);
	u0 = spu_shuffle(u, u, (vec_uchar16)shuffle_uv);
	v0 = spu_shuffle(v, v, (vec_uchar16)shuffle_uv);
	shuffle_uv = spu_add(shuffle_uv, 2);

        ... repeat this 3 more times for yn, un, and vn
        // 5*4 = 20

y0, u0, and v0 (and 1,2,3 of each) now contain 4 YUV values, stored as a vector. The arithmetic can just be applied as it was before, but using intrinsics (although the compiler can do some of it automatically it can be easier just doing it manually). And all we have to do is repeat it 4 times.

        // pre-calculate common subexpression
        y0 = spu_mulo((vec_short8)y0, (vec_short8)yscale);
        v0 = spu_add(v0, -128);
        u0 = spu_add(u0, -128);
        .. repeat for 1-3
        // 3*4 = 12

        // perform basic calculation (note use of multiply and add)
        r0 = spu_madd((vec_short8)v0, (vec_short8)rvscale, y0);
        g0 = spu_madd((vec_short8)v0, (vec_short8)gvscale, y0);
        g0 = spu_madd((vec_short8)u0, (vec_short8)guscale, g0);
	b0 = spu_madd((vec_short8)u0, (vec_short8)buscale, y0);
	.. repeat for 1-3
        // 4*4 = 16
The SPU only has a 16x16 bit to 32 bit multiply - but it can also add to an existing result, which removes the need for extra adds. The parameters must be cast to shorts, although it only uses each odd short - i.e. the lower 16 bits of each integer - so it's all good.
	// fix point (a rotate and mask algebraic by a negative count is an arithmetic shift right)
	r0 = spu_rlmaska(r0, -SHIFT);
	g0 = spu_rlmaska(g0, -SHIFT);
	b0 = spu_rlmaska(b0, -SHIFT);
	.. repeat for 1-3
        // 3*4 = 12

        // clamp to maximum (compare and select makes this very simple)
	r0 = spu_sel(r0, clamp,  spu_cmpgt(r0, clamp));
	g0 = spu_sel(g0, clamp,  spu_cmpgt(g0, clamp));
	b0 = spu_sel(b0, clamp,  spu_cmpgt(b0, clamp));
	.. repeat for 1-3

	// clamp to minimum (we use and here since cmp already returns the right value.
	//                   This way we don't need a register of zeros either.)
	r0 = spu_and(r0, spu_cmpgt(r0, 0));
	g0 = spu_and(g0, spu_cmpgt(g0, 0));
	b0 = spu_and(b0, spu_cmpgt(b0, 0));
	.. repeat for 1-3
        // 12*4 = 48

        // pack and write argb value
	argbp[x*4+0] = (vec_uint4)spu_or(b0, spu_shuffle(r0, g0, (vec_uchar16)shuffle_rg));
	.. repeat for 1-3
        // 2*4 = 8
    }
}

The compiler has a nice lot of independent calculations to work with here - it can reschedule things nicely with few stalls. The above only uses 1/2 of the u and v values - and repeats for the other half, duplicating the loads again. This reduces code size at the expense of performance - 2 redundant loads and some unnecessary logic could be removed.

The output here is a packed ARGB value - but if we're going to be doing other processing it might make sense to store the data as unpacked planes, of various bit sizes. e.g. from 8 bits, 16/32 bits integer/fixed point, or floats. The core algorithm and unpacking would be the same.

I like short shorts ...

So far we've looked at an obvious direct translation of a scalar algorithm to a SIMD/vectorised one. It will be a huge improvement over the scalar version, particularly on an SPU, but we should be able to do better.

An obvious idea would be to convert this to perform the calculations using shorts. This way much of the calculation can be performed on 8 values at a time rather than 4. We can still only multiply 4 values at once, so the multiply-accumulate isn't (much) different, but it halves much of the rest.

This is the sort of thing which can't very easily be represented in C, so although most of the intrinsic use in the previous example was optional, here we have no choice.

The constants can just be converted to a short type. Although each multiply only uses half of the elements, we need all of them populated since we use different multiply instructions to get the different intermediate results.

void yuv2rgb_vec(vector unsigned char *yp,
                 vector unsigned char *up, vector unsigned char *vp,
                 vector unsigned int *argbp, int count)
{
    int x;
    const vec_short8 yscale  = spu_splats((short) YSCALE);
    const vec_short8 buscale = spu_splats((short) BUSCALE);
    const vec_short8 gvscale = spu_splats((short) GVSCALE);
    const vec_short8 guscale = spu_splats((short) GUSCALE);
    const vec_short8 rvscale = spu_splats((short) RVSCALE);
    const vec_short8 clamp   = spu_splats((short) 255);

Then, we need different and more shuffle patterns. The y shuffle pattern gets 8 values rather than 4 at a time, and the uv shuffle pattern 4. We can also use another shuffle pattern to re-interleave the shorts after the multiplies, and we need 2 ARGB packing shuffles, since we now generate 8 pixels rather than 4. Also we cannot simply OR in the blue anymore and need an additional set of shuffles for that.

    const vec_ushort8 shuffle_y_0  = (vec_ushort8) { 0x8000, 0x8001, 0x8002, 0x8003, 0x8004, 0x8005, 0x8006, 0x8007 };
    const vec_ushort8 shuffle_uv_0 = (vec_ushort8) { 0x8000, 0x8000, 0x8001, 0x8001, 0x8002, 0x8002, 0x8003, 0x8003 };

    const vec_ushort8 shuffle_toshort = (vec_ushort8) { 0x0203, 0x1213, 0x0607, 0x1617, 0x0a0b, 0x1a1b, 0x0e0f, 0x1e1f };

    const vec_uint4 shuffle_rg0    = (vec_uint4) { 0x80011180, 0x80031380, 0x80051580, 0x80071780 };
    const vec_uint4 shuffle_rg1    = (vec_uint4) { 0x80091980, 0x800b1b80, 0x800d1d80, 0x800f1f80 };

    const vec_uint4 shuffle_rgb0    = (vec_uint4) { 0x80010211, 0x80050613, 0x80090a15, 0x800d0e17 };
    const vec_uint4 shuffle_rgb1    = (vec_uint4) { 0x80010219, 0x8005061b, 0x80090a1d, 0x800d0e1f };
And our new loop:
    vec_ushort8 shuffle_uv = shuffle_uv_0;

    for (x=0;x<count/16;x+=1) {
        vec_short8 y = (vec_short8)yp[x];
        vec_short8 u = (vec_short8)up[x/2];
        vec_short8 v = (vec_short8)vp[x/2];

        vec_short8 y0, y1;
        vec_short8 u0, u1;
        vec_short8 v0, v1;

        vec_int4 y0odd, y0even, y1odd, y1even;
        vec_int4 u0odd, u0even, u1odd, u1even;
        vec_int4 v0odd, v0even, v1odd, v1even;

        vec_short8 r0, r1;
        vec_short8 g0, g1;
        vec_short8 b0, b1;

        vec_int4 r0odd, r0even, r1odd, r1even;
        vec_int4 g0odd, g0even, g1odd, g1even;
        vec_int4 b0odd, b0even, b1odd, b1even;

        vec_ushort8 shuffle_y  = shuffle_y_0;

	shuffle_uv = (x & 1) == 0 ? shuffle_uv_0 : shuffle_uv;

        // convert 16 YUV values into 2x8 value vectors
	y0 = spu_shuffle(y, y, (vec_uchar16)shuffle_y);
	shuffle_y = spu_add(shuffle_y, 8);
	u0 = spu_shuffle(u, u, (vec_uchar16)shuffle_uv);
	v0 = spu_shuffle(v, v, (vec_uchar16)shuffle_uv);
	shuffle_uv = spu_add(shuffle_uv, 4);

        // ... repeat again for y1, u1, v1

y0, u0, and v0 (and y1, etc) now contain 8 YUV values stored as a vector. Now we have to adjust the arithmetic and use intermediate 32 bit integers - because we need the bits and because we can only do 4x16 bit multiplies at a time.

        // pre-calculate common subexpressions
	y0odd = spu_mulo(y0, yscale);
	y0even = spu_mule(y0, yscale);
        v0 = spu_add(v0, -128);
        u0 = spu_add(u0, -128);
        .. repeat again

        // perform basic calculation
        r0odd = spu_madd(v0, rvscale, y0odd);
        g0odd = spu_madd(v0, gvscale, y0odd);
        g0odd = spu_madd(u0, guscale, g0odd);
	b0odd = spu_madd(u0, buscale, y0odd);

        r0even = spu_mhhadd(v0, rvscale, y0even);
        g0even = spu_mhhadd(v0, gvscale, y0even);
        g0even = spu_mhhadd(u0, guscale, g0even);
	b0even = spu_mhhadd(u0, buscale, y0even);
	.. repeat again
But we don't need to move the data around to get it in the right format, as there are 2 different multiply add instructions we can use. The first we used before, but there is also a version which multiplies the high 16 bits of each word instead. Unfortunately, we still need to fix the point on all 12 values before we continue.
	// fix point
        r0odd = spu_rlmaska(r0odd, -SHIFT);
        r0even = spu_rlmaska(r0even, -SHIFT);
        g0odd = spu_rlmaska(g0odd, -SHIFT);
        g0even = spu_rlmaska(g0even, -SHIFT);
        b0odd = spu_rlmaska(b0odd, -SHIFT);
        b0even = spu_rlmaska(b0even, -SHIFT);
	.. repeat
And we need an extra shuffle step - but this halves the clamping code required (this could also be done in the previous code as well).
        // convert back to shorts
        r0 = (vec_short8)spu_shuffle(r0even, r0odd, (vec_uchar16)shuffle_toshort);
        g0 = (vec_short8)spu_shuffle(g0even, g0odd, (vec_uchar16)shuffle_toshort);
        b0 = (vec_short8)spu_shuffle(b0even, b0odd, (vec_uchar16)shuffle_toshort);
        .. repeat

        // clamp to maximum
	r0 = spu_sel(r0, clamp,  spu_cmpgt(r0, clamp));
	g0 = spu_sel(g0, clamp,  spu_cmpgt(g0, clamp));
	b0 = spu_sel(b0, clamp,  spu_cmpgt(b0, clamp));
	.. repeat

	// clamp to minimum
	r0 = spu_and(r0, spu_cmpgt(r0, 0));
	g0 = spu_and(g0, spu_cmpgt(g0, 0));
	b0 = spu_and(b0, spu_cmpgt(b0, 0));
	.. repeat

        // pack and write argb values
	argbp[x*4+0] = (vec_uint4)spu_shuffle(spu_shuffle(r0, g0, (vec_uchar16)shuffle_rg0), b0, (vec_uchar16)shuffle_rgb0);
	argbp[x*4+1] = (vec_uint4)spu_shuffle(spu_shuffle(r0, g0, (vec_uchar16)shuffle_rg1), b0, (vec_uchar16)shuffle_rgb1);
	.. repeat
    }
}
Ignoring the loads and stores which are the same, this takes significantly fewer instructions - 84 vs 116. And there should still be enough instructions for the compiler scheduler to do a decent job. Given the better mix of pipeline 0 and 1 instructions, it may even reduce the cycle count further.
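As an aside, it's easy to get lost in the even/odd multiply behaviour that makes this work, so here is a scalar sketch of my understanding of what the four multiply variants used above do per 32 bit result slot (just notation, not code from the listing):

/* For vec_short8 a, b and vec_int4 acc, with result slot i = 0..3:
 *
 *   spu_mulo(a, b)[i]        = a[2i+1] * b[2i+1]            (odd shorts, the low halves)
 *   spu_mule(a, b)[i]        = a[2i]   * b[2i]              (even shorts, the high halves)
 *   spu_madd(a, b, acc)[i]   = a[2i+1] * b[2i+1] + acc[i]
 *   spu_mhhadd(a, b, acc)[i] = a[2i]   * b[2i]   + acc[i]
 *
 * Between them the odd and even variants cover all 8 shorts, and
 * shuffle_toshort just interleaves the two sets of 32 bit results back
 * into a single vec_short8 in the original element order.
 */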

Float? Or sink?

So what about floats? The SPU is bloody fast at doing floating point calculations, so will it actually be faster than using fixed point? We could potentially save some fixed-point rotates, but unless we change the API, we're just wasting lots of time doing numerical conversions.

#define YSCALEF  ( 1.0f    )
#define BUSCALEF ( 1.772f  )
#define GVSCALEF (-0.71414f)
#define GUSCALEF (-0.34414f)
#define RVSCALEF ( 1.402f  )

void yuv2rgb_vec(vector unsigned char *yp,
                 vector unsigned char *up, vector unsigned char *vp,
                 vector unsigned int *argbp, int count)
{
    int x;
    const vec_float4 yscale  = spu_splats(YSCALEF);
    const vec_float4 buscale = spu_splats(BUSCALEF);
    const vec_float4 gvscale = spu_splats(GVSCALEF);
    const vec_float4 guscale = spu_splats(GUSCALEF);
    const vec_float4 rvscale = spu_splats(RVSCALEF);
    const vec_int4   clamp   = spu_splats( 255 );
    const vec_float4 half    = spu_splats( 0.5f );

We just need to convert the scale factors to floats. We cannot do an immediate add for floats so we also need 0.5 as a vector (the bytes are converted with a scale of 2^8, so 128 becomes 128/256 = 0.5). We could change the clamp factor to a float too, but see below for details. All of the YUV unpacking is the same as for the integer version, as is the RGB packing.

    const vec_uint4 shuffle_y_0  = (vec_uint4) { 0x80808000, 0x80808001, 0x80808002, 0x80808003 };
    const vec_uint4 shuffle_uv_0 = (vec_uint4) { 0x80808000, 0x80808000, 0x80808001, 0x80808001 };

    const vec_uint4 shuffle_rg   = (vec_uint4) { 0x80031380, 0x80071780, 0x800b1b80, 0x800f1f80 };
Now the loop is:
    for (x=0;x<count/16;x+=1) {
        vec_int4 y = (vec_int4)yp[x];
        vec_int4 u = (vec_int4)up[x/2];
        vec_int4 v = (vec_int4)vp[x/2];

        vec_float4 y0, y1, y2, y3;
        vec_float4 u0, u1, u2, u3;
        vec_float4 v0, v1, v2, v3;

        vec_float4 r0, r1, r2, r3;
        vec_float4 g0, g1, g2, g3;
        vec_float4 b0, b1, b2, b3;

        vec_uint4 shuffle_y  = shuffle_y_0;
        vec_uint4 shuffle_uv = shuffle_uv_0;

        // convert 16 YUV values into 4x4 value vectors
	y0 = spu_convtf(spu_shuffle(y, y, (vec_uchar16)shuffle_y), 8);
	shuffle_y = spu_add(shuffle_y, 4);
	u0 = spu_convtf(spu_shuffle(u, u, (vec_uchar16)shuffle_uv), 8);
	v0 = spu_convtf(spu_shuffle(v, v, (vec_uchar16)shuffle_uv), 8);
	shuffle_uv = spu_add(shuffle_uv, 2);
        ... repeat this 3 more times for yn, un, and vn
We need to convert the integers to floats before we can use them - a relatively expensive extra step, even if it saves us some shifting later on.
        // pre-calculate common subexpression
        y0 = spu_mul(y0, yscale);
        v0 = spu_sub(v0, half);
        u0 = spu_sub(u0, half);
        .. repeat for 1-3

        // perform basic calculation
        r0 = spu_madd(v0, rvscale, y0);
        g0 = spu_madd(v0, gvscale, y0);
        g0 = spu_madd(u0, guscale, g0);
	b0 = spu_madd(u0, buscale, y0);
	.. repeat for 1-3
Now comes a choice - do we clamp the floats, or clamp the integers afterwards? If we're storing the RGB values as floats then there's no question - clamp and we're done. Maybe we could pack them into an ARGB quadword, but storing them as planes would be more efficient.

If we're going to pack them back into integers, then we could go either way. Clamping the integers would save us an instruction - otherwise we need to subtract 1, since when we scale 1.0f back to an 8 bit integer we will get 256, not 255. Although I can't really see any point to it, I'll re-pack the data anyway, just for comparison's sake. So we'll clamp at the ints.

        vec_int4 ir0, ir1, ir2, ir3;
        vec_int4 ig0, ig1, ig2, ig3;
        vec_int4 ib0, ib1, ib2, ib3;

	// back to integers
	ir0 = spu_convtu(r0, 8);
	ig0 = spu_convtu(g0, 8);
	ib0 = spu_convtu(b0, 8);
	.. repeat for 1-3
And now it's the same as the integer version.
        // clamp to maximum
	ir0 = spu_sel(ir0, clamp,  spu_cmpgt(ir0, clamp));
	ig0 = spu_sel(ig0, clamp,  spu_cmpgt(ig0, clamp));
	ib0 = spu_sel(ib0, clamp,  spu_cmpgt(ib0, clamp));
	.. repeat for 1-3

	// clamp to minimum
	ir0 = spu_and(ir0, spu_cmpgt(ir0, 0));
	ig0 = spu_and(ig0, spu_cmpgt(ig0, 0));
	ib0 = spu_and(ib0, spu_cmpgt(ib0, 0));
	.. repeat for 1-3

        // pack and write argb value
	argbp[x*4+0] = (vec_uint4)spu_or(ib0, spu_shuffle(ir0, ig0, (vec_uchar16)shuffle_rg));
	.. repeat for 1-3
    }
}
I haven't actually bothered to compile this version, so it's probably buggy.

Clearly, with all of the extra conversions, this is just going to be slower - we not only use more instructions, some of them are slower (e.g. float converts rather than nothing or shifts).

However in reality we might want to do multiple passes on the data while it is in the SPU - local accesses are very cheap, so let's make hay whilst the sun is shining. Since working with floats may simplify some of the other calculations we might want to do, we could store the values as floats here (or even fixed point integers), and only perform the byte packing once we've finished. Maybe the YUV unpacking could be done separately too, although that might just overload the load/store pipeline with all of the loads/stores and shuffles going on.

Comparing the compiled code

I wrote up complete versions of the integer and short versions above and started looking at the assembly language output from gcc. It showed up some interesting results, most of which agree with my comments above.

I used the -S flag to gcc to dump the intermediate assembly language generated by the compiler.

spu-gcc -S -O2 -fno-inline test-yuv.c
This generates a file test-yuv.s, and then executing
<path-to-ibm-sdk>/spu_timing test-yuv.s  (TO BE CHECKED)

Generates a file test-yuv.s.timing, including static timing and pipeline results from the generated assembly language. From this file you can easily see how instructions are scheduled and executed, any data dependency stalls, and so forth.

Using -fno-inline just helps analyze the basic code - if the code is inlined things could change significantly, but we'd hope only for the better, so this should show the worst case.

yuv-int

If we just consider the main loop, we see that the scheduler in gcc has done a reasonable job with the ordering, so we only get 3 data dependency stalls. I didn't bother analyzing the instructions to see if it could've done better - often it can.

So let's look at the timing and try to break it down roughly by where it corresponds to the C source.

   ; loop book keeping/address calculation/start some char->int conversion
                                                               .L4:
0D                     0123                                    	rotmi	$21,$27,-31
1D                     012345                                  	lqr	$19,.LC0
0D                      12                                     	andi	$39,$27,1
1D                      1                                      	hbrp	# 1
0D                       2                                     	nop	127
1D                       234567                                	lqd	$18,0($26)
0D                        34                                   	ai	$25,$25,-1
1D                        345678901234567                      	hbrr	.L11,.L4
0                          45                                  	ceqi	$20,$39,0
0                           56                                 	a	$9,$21,$27
0D                           67                                	ai	$27,$27,1
1D                           6789                              	fsm	$4,$20
0d                            7890                             	rotmai	$12,$9,-1
1d                            -8901                            	shufb	$11,$18,$18,$36
0D                              9                              	nop	127
1D                              9012                           	shufb	$13,$18,$18,$37
0D                               01                            	selb	$75,$7,$19,$4
1D                               0123                          	shufb	$10,$18,$18,$38
0D                                1                            	nop	127
1D                                1                            	hbrp	# 2
0D                                 2345                        	shli	$22,$12,4
   ; get stuck into the char-> int conversion, start pipelining the initial calculations
1D                                 2345                        	shufb	$6,$18,$18,$35
0                                   34                         	ai	$73,$75,2
0                                    4567890                   	mpy	$57,$13,$32
0D                                    56                       	ai	$71,$73,2
1D                                    5                        	lnop
0D                                     6789012                 	mpy	$55,$10,$32
1D                                     678901                  	lqx	$79,$22,$33
0D                                      7890123                	mpy	$53,$6,$32
1D                                      789012                 	lqx	$70,$22,$34
0                                        89                    	ai	$22,$71,2
0                                         9012345              	mpy	$51,$11,$32
1                                          --2345              	shufb	$78,$79,$79,$75
1                                             3456             	shufb	$74,$79,$79,$73
1                                              4567            	shufb	$72,$79,$79,$71
1                                               5678           	shufb	$76,$79,$79,$22
0D                                               67            	ai	$61,$78,-128
1D                                               6789          	shufb	$69,$70,$70,$75
0D                                                78           	ai	$60,$74,-128
1D 0                                              789          	shufb	$68,$70,$70,$73
0D                                                 89          	ai	$59,$72,-128
1D 01                                              89          	shufb	$67,$70,$70,$71
0D 0                                                9          	ai	$58,$76,-128
1D 012                                              9          	shufb	$66,$70,$70,$22
   ; finishes the conversions, now it's the arithmetic of the algorithm
0  0123456                                                     	mpya	$65,$61,$30,$57
0   1234567                                                    	mpya	$64,$60,$30,$55
0    2345678                                                   	mpya	$63,$59,$30,$53
0     3456789                                                  	mpya	$62,$58,$30,$51
0      45                                                      	ai	$56,$69,-128
0       56                                                     	ai	$54,$68,-128
0        67                                                    	ai	$52,$67,-128
0         78                                                   	ai	$50,$66,-128
0          8901234                                             	mpya	$49,$56,$29,$65
0           9012345                                            	mpya	$48,$54,$29,$64
0            0123456                                           	mpya	$47,$52,$29,$63
0             1234567                                          	mpya	$17,$50,$29,$62
0              2345678                                         	mpya	$46,$61,$28,$57
0               3456789                                        	mpya	$8,$60,$28,$55
0                4567890                                       	mpya	$45,$59,$28,$53
0                 5678901                                      	mpya	$44,$58,$28,$51
0                  6789012                                     	mpya	$43,$56,$31,$57
0                   7890123                                    	mpya	$5,$54,$31,$55
0                    8901234                                   	mpya	$2,$52,$31,$53
0                     9012345                                  	mpya	$3,$50,$31,$51
   ; fixed-point reset (most of it)
0                      0123                                    	rotmai	$14,$49,-14
0                       1234                                   	rotmai	$15,$48,-14
0                        2345                                  	rotmai	$16,$47,-14
0                         3456                                 	rotmai	$21,$17,-14
0                          4567                                	rotmai	$18,$8,-14
0                           5678                               	rotmai	$10,$44,-14
0                            6789                              	rotmai	$9,$46,-14
0                             7890                             	rotmai	$4,$45,-14
0                              8901                            	rotmai	$7,$43,-14
0                               9012                           	rotmai	$79,$5,-14
   ; start clamping ... 
0                                01                            	cgti	$42,$14,255
0                                 12                           	cgti	$41,$15,255
0                                  23                          	cgti	$40,$16,255
0                                   34                         	cgti	$39,$21,255
0                                    4567                      	rotmai	$77,$2,-14
0                                     5678                     	rotmai	$75,$3,-14
0                                      67                      	selb	$73,$14,$23,$42
0                                       78                     	selb	$71,$15,$23,$41
0                                        89                    	selb	$69,$16,$23,$40
0                                         90                   	selb	$67,$21,$23,$39
0                                          01                  	cgti	$19,$18,255
0                                           12                 	cgti	$13,$10,255
0                                            23                	cgti	$20,$9,255
0                                             34               	cgti	$12,$4,255
0                                              45              	selb	$63,$18,$23,$19
0                                               56             	selb	$59,$10,$23,$13
0                                                67            	selb	$65,$9,$23,$20
0                                                 78           	selb	$61,$4,$23,$12
0                                                  89          	cgti	$6,$7,255
0  0                                                9          	cgti	$11,$79,255
0  01                                                          	cgti	$74,$73,0
0   12                                                         	cgti	$72,$71,0
0    23                                                        	cgti	$70,$69,0
0     34                                                       	cgti	$68,$67,0
0      45                                                      	cgti	$78,$77,255
0       56                                                     	cgti	$76,$75,255
0        67                                                    	selb	$53,$7,$23,$6
0         78                                                   	selb	$51,$79,$23,$11
0          89                                                  	selb	$49,$77,$23,$78
0           90                                                 	selb	$47,$75,$23,$76
0            01                                                	and	$58,$73,$74
0             12                                               	and	$57,$71,$72
0              23                                              	and	$56,$69,$70
0               34                                             	and	$55,$67,$68
0                45                                            	cgti	$66,$65,0
0                 56                                           	cgti	$64,$63,0
0                  67                                          	cgti	$62,$61,0
0                   78                                         	cgti	$60,$59,0
0                    89                                        	and	$46,$65,$66
0                     90                                       	and	$45,$63,$64
0                      01                                      	and	$44,$61,$62
0                       12                                     	and	$43,$59,$60
0                        2345                                  	shli	$42,$58,8
0                         3456                                 	shli	$15,$57,8
0                          4567                                	shli	$16,$56,8
0                           5678                               	shli	$17,$55,8
0                            67                                	cgti	$54,$53,0
0                             78                               	cgti	$52,$51,0
0                              89                              	cgti	$50,$49,0
0                               90                             	cgti	$48,$47,0
0                                01                            	and	$41,$53,$54
0                                 12                           	and	$40,$51,$52
0                                  23                          	and	$39,$49,$50
0                                   34                         	and	$21,$47,$48
   ; form result
0                                    4567                      	shli	$20,$46,16
0                                     5678                     	shli	$8,$45,16
0                                      6789                    	shli	$13,$44,16
0                                       7890                   	shli	$10,$43,16
0                                        89                    	or	$19,$41,$42
0                                         90                   	or	$18,$40,$15
0                                          01                  	or	$12,$39,$16
0                                           12                 	or	$11,$21,$17
0                                            23                	or	$7,$19,$20
0                                             34               	or	$14,$18,$8
0D                                             45              	or	$6,$12,$13
   ; write results
1D                                             456789          	stqd	$7,0($24)
0D                                              56             	or	$5,$11,$10
1D 0                                            56789          	stqd	$14,16($24)
0D                                               67            	ai	$7,$22,2
1D 01                                            6789          	stqd	$6,32($24)
1  012                                            789          	stqd	$5,48($24)
1  01                                              89          	biz	$25,$lr
0  0                                                9          	ai	$26,$26,16
0  01                                                          	ai	$24,$24,64
                                                               .L11:
1   1234                                                       	br	.L4
                                                               	.size	yuv2rgb, .-yuv2rgb
(131 cycles to the loop branch)

Book keeping

The initial section spends a fair bit of time dealing with loop book keeping, addressing and the ternary operator for selecting the shuffle_uv value.

Just for interest's sake, let's break it out. The first (and last) part is the loop book keeping and address calculation. The yp[x] addressing is calculated using a simple add, but up[x/2] requires a bunch of arithmetic - it could definitely be simplified. Strangely, the compiler is also including the branch hint inside the loop, which isn't required.

0D                     0123                                    	rotmi	$21,$27,-31
0D                        34                                   	ai	$25,$25,-1
1D                        345678901234567                      	hbrr	.L11,.L4
0                           56                                 	a	$9,$21,$27
0D                           67                                	ai	$27,$27,1
0d                            7890                             	rotmai	$12,$9,-1
1D                                1                            	hbrp	# 2
0D                                 2345                        	shli	$22,$12,4   ; x/2 ready for indexing up and vp
 -- rest of code --
0  0                                                9          	ai	$26,$26,16  ; yp
0  01                                                          	ai	$24,$24,64  ; output pointer
                                                               .L11:
1   1234                                                       	br	.L4
The ternary operator also frobs around a bit - it isn't bad, and is branchless, but this was a known size/speed trade-off anyway.
0D                      12                                     	andi	$39,$27,1
0                          45                                  	ceqi	$20,$39,0
1D                           6789                              	fsm	$4,$20
0D                               01                            	selb	$75,$7,$19,$4

Unpacking Data

The data unpacking is straightforward - it can dual issue with some of the initial arithmetic as well. But since it only stores 4 results per register, it has to do quite a few operations.

Main arithmetic

There's nothing redundant in the multiply and add section, or fixing the point of the result, so nothing needs to be said.

Clamping

Again, because we're only storing 4 values per register, clamping just takes a lot of instructions - 4 per word. This is something using shorts will help with tremendously.

Packing result

Note that this is an older version which uses shifts to form the result. Although there are no stalls in this section there are no dual-issue options either. By using shuffles this calculation can be made almost entirely 'free', so long as we have time to start writing anyway.

yuv-short

Now I'm going to have a look at the short version. This should be noticeably better because it halves the number of unpacking and clamping instructions required.

And not surprisingly, it is.

   ; loop book keeping/address calculation/start some char->int conversion
                                                               .L4:
0D                        3456                                 	rotmi	$56,$22,-31
1D                        345678                               	lqd	$53,0($20)
0D                         45                                  	andi	$57,$22,1
1D                         4                                   	hbrp	# 1
0D                          5                                  	nop	127
1D                          567890                             	lqr	$52,.LC0
0D                           67                                	ai	$18,$18,-1
1D                           678901234567890                   	hbrr	.L11,.L4
0                             78                               	ceqi	$55,$57,0
0                              89                              	a	$54,$56,$22
0D                              90                             	ai	$22,$22,1
1D                              9012                           	fsm	$51,$55
0D                               0123                          	rotmai	$50,$54,-1
1D                               0123                          	shufb	$47,$53,$53,$34
1                                 1234                         	shufb	$48,$53,$53,$35
0                                  -34                         	selb	$21,$21,$52,$51
0D                                   4567                      	shli	$49,$50,4
1D                                   4                         	hbrp	# 2
   ; already getting stuck into some of the maths while still loading/format conversion
0D                                    5678901                  	mpyhh	$74,$48,$27
1D                                    5678                     	shlqbyi	$43,$21,0
0                                      6789012                 	mpyhh	$70,$47,$27
0                                       78                     	ahi	$46,$21,4
0D                                       8901234               	mpy	$11,$48,$27
1D                                       890123                	lqx	$45,$49,$32
0D                                        9012345              	mpy	$15,$47,$27
1D                                        901234               	lqx	$42,$49,$33
0D                                         01                  	ahi	$21,$46,4
1D                                         0                   	lnop
0D                                          -23                	ori	$10,$74,0
1D                                           2345              	shlqbyi	$5,$74,0
0D                                            34               	ori	$78,$70,0
1D                                            3456             	shlqbyi	$3,$70,0
1                                              4567            	shufb	$41,$45,$45,$46
1                                               5678           	shufb	$40,$45,$45,$43
1                                                6789          	shufb	$38,$42,$42,$46
1  0                                              789          	shufb	$39,$42,$42,$43
   ; finish the maths proper
0                                                  89          	ahi	$36,$41,-128
0  0                                                9          	ahi	$37,$40,-128
0  0123456                                                     	mpya	$14,$36,$25,$15
0   1234567                                                    	mpya	$13,$37,$25,$11
0    2345678                                                   	mpyhha	$78,$36,$25
0     3456789                                                  	mpyhha	$10,$37,$25
0      45                                                      	ahi	$9,$38,-128
0       56                                                     	ahi	$4,$39,-128
0        6789012                                               	mpya	$7,$37,$23,$11
0         7890123                                              	mpya	$6,$36,$23,$15
0          8901234                                             	mpyhha	$5,$37,$23
0           9012345                                            	mpya	$12,$4,$24,$13
0            0123456                                           	mpya	$79,$9,$24,$14
0             1234567                                          	mpyhha	$78,$9,$24
0              2345678                                         	mpyhha	$10,$4,$24
0               3456789                                        	mpyhha	$3,$36,$23
0                4567890                                       	mpya	$71,$9,$26,$15
0                 5678901                                      	mpyhha	$70,$9,$26
0                  6789012                                     	mpya	$75,$4,$26,$11
0                   7890123                                    	mpyhha	$74,$4,$26
   ; fix the point and start converting back to shorts
0                    8901                                      	rotmai	$8,$5,-14
0                     9012                                     	rotmai	$2,$7,-14
0                      0123                                    	rotmai	$77,$12,-14
0                       1234                                   	rotmai	$76,$10,-14
0D                       2345                                  	rotmai	$73,$6,-14
1D                       2                                     	lnop
0D                        3456                                 	rotmai	$72,$3,-14
1D                        3456                                 	shufb	$62,$8,$2,$17
0D                         4567                                	rotmai	$69,$79,-14
1D                         4                                   	lnop
0D                          5678                               	rotmai	$68,$78,-14
1D                          5678                               	shufb	$60,$76,$77,$17
0D                           6789                              	rotmai	$67,$75,-14
1D                           6                                 	lnop
0D                            7890                             	rotmai	$66,$74,-14
1D                            7890                             	shufb	$58,$72,$73,$17
0D                             8901                            	rotmai	$65,$71,-14
1D                             8                               	lnop
0D                              9012                           	rotmai	$64,$70,-14
1D                              9012                           	shufb	$56,$68,$69,$17
   ; ... and start on the clamping
0D                               01                            	cgthi	$63,$62,255
1D                               0                             	lnop
0D                                12                           	cgthi	$61,$60,255
1D                                1234                         	shufb	$54,$66,$67,$17
0D                                 23                          	cgthi	$59,$58,255
1D                                 2                           	lnop
0D                                  34                         	cgthi	$57,$56,255
1D                                  3456                       	shufb	$52,$64,$65,$17
0                                    45                        	selb	$50,$62,$19,$63
0                                     56                       	selb	$48,$60,$19,$61
0                                      67                      	selb	$46,$58,$19,$59
0                                       78                     	selb	$44,$56,$19,$57
0                                        89                    	cgthi	$55,$54,255
0                                         90                   	cgthi	$53,$52,255
0                                          01                  	cgthi	$51,$50,0
0                                           12                 	cgthi	$49,$48,0
0                                            23                	cgthi	$47,$46,0
0                                             34               	cgthi	$45,$44,0
0                                              45              	selb	$40,$54,$19,$55
0                                               56             	selb	$36,$52,$19,$53
0                                                67            	and	$42,$50,$51
0                                                 78           	and	$43,$48,$49
0D                                                 89          	and	$38,$46,$47
1D                                                 8           	lnop
0D 0                                                9          	and	$39,$44,$45
   ; start packing result
1D 012                                              9          	shufb	$15,$42,$43,$29
0D 01                                                          	cgthi	$41,$40,0
1D 0123                                                        	shufb	$13,$42,$43,$31
0D  12                                                         	cgthi	$37,$36,0
1D  1234                                                       	shufb	$14,$38,$39,$29
0D   23                                                        	and	$7,$40,$41
1D   2345                                                      	shufb	$11,$38,$39,$31
0d    34                                                       	and	$12,$36,$37
1d    -4567                                                    	shufb	$10,$15,$7,$28
1       5678                                                   	shufb	$9,$14,$12,$28
1        6789                                                  	shufb	$5,$13,$7,$30
1         7890                                                 	shufb	$6,$11,$12,$30
1          890123                                              	stqd	$10,16($16)
1           901234                                             	stqd	$9,48($16)
1            012345                                            	stqd	$5,0($16)
1             123456                                           	stqd	$6,32($16)
1              2345                                            	biz	$18,$lr
0               34                                             	ai	$20,$20,16
0                45                                            	ai	$16,$16,64
                                                               .L11:
1                 5678                                         	br	.L4
This is just 92 cycles to the loop branch, and it does the same thing.

Book keeping

This is basically the same as the previous version.

Unpacking Data

Because we're only unpacking to half as many results, this is only half the size. There still isn't a great deal of overlapping with arithmetic though - mainly because it's waiting for the results of the address calculation/uv load before it can get stuck into it.

Main arithmetic

This is more or less the same. Even though we're working with a very different format, short vs int, we use different instructions that hide that detail, so we don't need to play around with shifts.

Re-packing shorts

Although this requires an extra step of re-packing the shorts before we get to the clamping, it is basically for free since it can interleave (dual issue) with the point fixing code.

Clamping

Clamping is much faster - we only have half as much to do.

Packing result

This uses the shuffle packer - which means the compiler can dual-issue some of it with the clamping.

yuv-int-2

So there are a couple of opportunities for improvements in the code. One is the data loads - it currently loads the u and v data twice, and although the actual load isn't bad, it also has to do the address calculation twice, which increases the latency a lot. The other is the ternary operator. So let's see what happens if I just get rid of those by copying the loop contents once and fixing things up.

It's an improvement - 12 cycles saved per 16 pixels: 238 cycles for twice the work vs 131.

yuv-int-2-a

Still, looking at the timing output something looks wrong - there are a bunch of stalls between each section of the unrolled loop. The problem is probably something to do with C pointer aliasing rules - it seems to be keeping the pointer writes and reads in the same order as given, even though it makes no difference here.

If I just move the next load of y to just after the last usage of it, I get another improvement - 226 cycles total now (or 18 per 16 pixels).

I then remembered that this is the sort of thing the restrict keyword was for. So I tried just adding restrict to the definitions of all the pointers in the original version. But it made no difference to the compiled output. Either I'm using it wrong, gcc isn't taking advantage of it, or I'm misunderstanding its use.

Actually I just had another thought - in my simple example I'm passing the same value to all of the pointers, so maybe gcc is taking that into account - although as far as I know it shouldn't make any difference. I will have to verify.
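For reference, all I mean by 'adding restrict' is qualifying the pointer parameters, something like the following - whether or not spu-gcc actually makes any use of it:

void yuv2rgb_vec(vector unsigned char * restrict yp,
                 vector unsigned char * restrict up,
                 vector unsigned char * restrict vp,
                 vector unsigned int * restrict argbp, int count);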

yuv-short-2, yuv-short-2-a

And I did the same treatment for the short versions. Again, with about the same fixed amount of improvement (which is a larger relative amount).

yuv-short-2 takes 168 cycles, and with the re-ordering of the second y load, it's down to 150.

yuv-short-3

The last thing to look at is the nasty address calculations. There's no reason the compiler should be performing (x/2)<<4 every loop iteration - it knows the full state of x throughout the loop. It has already done it for y, so why not u and v too? Well, let's see what can be done about it.

We have 4 basic housekeeping calculations required per loop (of the -2 version), but the compiler is generating rather more than that. We can fix the y issue easily, argbp is already fine, and we can totally remove the uv address calculation as well.

The y-doing-silly-things issue should be fixed easily. The following should change it to a single add, with the array addressing changed to use immediate offsets.

    y = yp[0];
    ...
    y = yp[1];
    yp += 2;

We could just do the same for u and v, but since the calculation is the same for both, we can use the lqx instruction to help us. That will load a value relative to the sum of two registers. We could try using array indexing and incrementing a normal int by 1 every time, but the compiler might still do some silly things, so rather than wasting time, let's just tell it what we really want.

   vec_uint4 uvoffset = spu_promote((unsigned int)0, 0);

   loop {

     u = si_lqx((qword)up, (qword)uvoffset);
     v = si_lqx((qword)vp, (qword)uvoffset);

     ...

     uvoffset = spu_add(uvoffset, 16);
   }

And finally we could use any of the incrementing values for a loop condition. uvoffset is handy and requires less calculation, so use that. To make it explicitly clear to the compiler what we want to do, we do it all manually, directly on vector types again.

  vec_uint4 loopend = spu_rlmask(spu_promote(count, 0), -1);

  while (spu_extract(spu_cmpgt(loopend, uvoffset), 0)) {
   ...
  }

Of course none of this should be necessary, but the compiler is being stubborn. So what did the compiler do with it?

Fucked it up, at least partially. It's still using 2 pointers and thus 2 sets of pointer arithmetic for the y accesses. Sigh.

0D               45                                            	ai	$38,$38,32
1D               456789                                        	lqx	$61,$53,$44
1                 567890                                       	lqd	$63,0($39)
1                  6                                           	hbrp	# 1
0D                  78                                         	ai	$39,$39,32
1D                  789012                                     	lqd	$73,-32($38)
1                    890123                                    	lqx	$70,$52,$44
0                     90                                       	ai	$44,$44,16

It's also done something silly with the loop - moved the loop test to the end of the loop - ok, not great but ok - but then that forces the compare and branch to run consecutively and adds another stall. Ok, it's only 1 cycle, but ...

At least the pre-loop setup is nice and compact, it's dual-issued almost everything.

0D 0123                                                        	rotmi	$45,$7,-1
1D 012345678901234                                             	hbrr	.L8,.L2
0D  12                                                         	ori	$39,$3,0
1D  123456                                                     	lqr	$51,.LC0
0D   23                                                        	ori	$52,$4,0
1D   234567                                                    	lqr	$50,.LC1
0D    34                                                       	ori	$53,$5,0
1D    345678                                                   	lqr	$48,.LC2
0D     45                                                      	ori	$30,$6,0
1D     456789                                                  	lqr	$49,.LC3
0D      56                                                     	ilh	$37,16384
1D      567890                                                 	lqr	$31,.LC4
0D       67                                                    	ilh	$36,29032
1D       678901                                                	lqr	$43,.LC5
0D        78                                                   	ilh	$35,-11700
1D        789012                                               	lqr	$42,.LC6
0D         89                                                  	ilh	$34,-5638
1D         890123                                              	lqr	$41,.LC7
0D          90                                                 	ilh	$33,22970
1D          9                                                  	hbrp	# 3
0D           01                                                	ilh	$32,255
1D           012345                                            	lqr	$40,.LC8
0D            12                                               	il	$44,0
1D            123456                                           	lqr	$47,.LC9
0D             23                                              	ai	$38,$3,16
1D             234567                                          	lqr	$46,.LC10
0D              3                                              	nop	127
                                                               .L8:
1D              3456                                           	br	.L2

yuv-short-4

So at this point I tried a few other things and eventually had enough, so I had a guess at what would work best: use direct loads for all the loads, and let the compiler do some of the loop stuff just using ints - using a vector for the loop counter is just too unreadable anyway.

The promote/extract stuff really does nothing but keep the compiler happy, which means it just gets in the way of the user ... but it's easier than working on vectors directly the whole time.

    for (x=0;x<count>>1;x+=16) {

        ...

        y = (vec_short8)si_lqd((qword)spu_promote((unsigned int)yp, 0), 0);
	u = (vec_short8)si_lqx((qword)spu_promote((unsigned int)up, 0), (qword)spu_promote(x, 0));
	v = (vec_short8)si_lqx((qword)spu_promote((unsigned int)vp, 0), (qword)spu_promote(x, 0));

        ...

	y = (vec_short8)si_lqd((qword)spu_promote((unsigned int)yp, 0), 16);
	yp += 2;

       ...
    }

And hurrah - it just about did the right thing. Now it's got a nice tight load section, so that's all good - although it didn't make any difference to the execution time. And it got rid of the test-branch, or at least it tried - it moved the loop test to well before it's needed, but it moved it too early, so it still stalls 1 cycle. Oh well. By this stage I've had enough, so it'll have to do (actually I did try some manual tweaking of the assembly, but that is difficult as re-ordering or inserting instructions messes up all of the dual-issue stuff).

1                       123456                                 	lqd	$62,0($39)
1                        234567                                	lqd	$59,16($39)
0D                        34                                   	ai	$39,$39,32
1D                        3                                    	hbrp	# 1
1                          456789                              	lqx	$76,$48,$40
1                           567890                             	lqx	$72,$50,$40
0D                           67                                	ai	$40,$40,16
1D                           6                                 	lnop
0D                            -89                              	cgt	$30,$45,$40

It also made a mess of the pre-loop setup. Before, it interleaved large constant loads and small ones - so they executed in half the time. Now it's got a short-circuit 'quit quickly' test (i.e. a useless one) that messes all of that up. Maybe I need to use __builtin_expect() or something - but really, the compiler should be doing this sort of shit for me. I could always make the API demand a count of > 0 and just use a do {} while loop (sketched at the end of this section), although who knows what the compiler will do with that.

 
                                                              yuv2rgb_short:
0D 0123                                                        	rotmai	$45,$7,-1
1D 0123                                                        	shlqbyi	$39,$3,0
0D  12                                                         	ori	$31,$6,0
1D  1                                                          	hbrp	# 1
0    23                                                        	ilh	$38,16384
0     34                                                       	ilh	$37,29032
0      45                                                      	cgti	$2,$45,0
0       56                                                     	ilh	$36,-11700
0        67                                                    	ilh	$35,-5638
0         78                                                   	ilh	$34,22970
0D         89                                                  	ilh	$33,255
1D         8901                                                	biz	$2,$lr
0D          90                                                 	ori	$50,$4,0
1D          901234                                             	lqr	$52,.LC0
0D           01                                                	ori	$48,$5,0
1D           012345                                            	lqr	$51,.LC1
0D            1                                                	nop	127
1D            1                                                	hbrp	# 2
0D             23                                              	il	$40,0
1D             234567                                          	lqr	$53,.LC2
1               345678                                         	lqr	$49,.LC3
1                456789                                        	lqr	$32,.LC4
1                 567890                                       	lqr	$44,.LC5
1                  678901                                      	lqr	$43,.LC6
1                   789012                                     	lqr	$42,.LC7
1                    890123                                    	lqr	$41,.LC8
1                     901234                                   	lqr	$46,.LC9
1                      012345                                  	lqr	$47,.LC10

It's not in the loop at least - so it's not the end of the world. A bit silly though. If the biz just happened near the end of all the loads it could halve this time and still avoid the unconditional branch to the loop test.
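And for completeness, the do {} while shape I mentioned earlier would look something like this - just a sketch, I haven't checked what the compiler makes of it:

    // caller must guarantee count > 0 (and a multiple of 32 pixels),
    // so the test can live at the bottom of the loop and the useless
    // 'quit quickly' check at the top goes away
    x = 0;
    do {
        // ... loop body as in yuv-short-4 ...
        x += 16;
    } while (x < count >> 1);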

Timing Results And Conclusions

I wrote the following test program:
static vec_uchar16 buffer[2048];

int main()
{
	int i;

	for (i = 0;i<100000;i++)
		yuv2rgb(buffer, buffer, buffer, buffer, 2048);
}
And timed completed versions of the integer code above using the following command:
time elfspe ./yuv2rgb
And here are the results.
  version             description                          time (s)

   yuv2rgb *           4xint based throughout               0.535
   yuv2rgb_2           4xint based with 1 loop unroll       0.489
   yuv2rgb_2-a         yuv2rgb_2 with manual reorder        0.467

   yuv2rgb_short       8xshort based where possible         0.382
   yuv2rgb_short_2     8xshort based with 1 loop unroll     0.347
   yuv2rgb_short_2-a   yuv2rgb_short_2 with manual reorder  0.313
   yuv2rgb_short_2-b   yuv2rgb_short_2 with restrict        0.347

   yuv2rgb_short-3     different loop/index logic           0.314
   yuv2rgb_short-4     all manual loads                     0.304

* - this is actually a slightly older version that does shift/or instead
    of shuffle for forming the ARGB result - I'm too lazy to rerun the
    tests/compiles.
Conclusions:
  1. Using shorts is the way to go - it is significantly faster and mathematically identical, with only a minor increase in code complexity.
  2. Unrolling the loop once is definitely worth it as well.
  3. The restrict keyword doesn't seem to be doing what I thought it should be, or I'm using it wrong.
  4. Looking at the compiled assembly and tweaking the C source in even tiny ways can have a measurable impact on performance.
  5. The compiler often seems to mess up array references in some way or another.
  6. Using C intrinsics kinda sucks. You end up fighting with the compiler too much, even when you give it what should be more than enough of a hint of what to do.
  7. Straight ASM is very hard - the dual issue rules are hard to take advantage of without tedious instruction counting (and they make a huge difference), not to mention tracking all of the registers. Maybe some sort of rescheduling/nop inserting register tracking macro assembler would be nice.

Appendix A - Sources

All of the C source and annotated assembly listings. These are all in the public domain.

C source              Annotated Assembly

 test-yuv.c            test-yuv.s.timing
 test-yuv-2.c          test-yuv-2.s.timing
 test-yuv-2-a.c        test-yuv-2-a.s.timing

 test-yuv-short.c      test-yuv-short.s.timing
 test-yuv-short-2.c    test-yuv-short-2.s.timing
 test-yuv-short-2-a.c  test-yuv-short-2-a.s.timing
 test-yuv-short-2-b.c  test-yuv-short-2-b.s.timing

 test-yuv-short-3.c    test-yuv-short-3.s.timing
 test-yuv-short-4.c    test-yuv-short-4.s.timing

References

YUV conversion article on fourcc.org.

Cell Broadband Engine master index page at IBM.

Note that these documents update fairly frequently, but the latest should be available at the above index if these links are stale.

C/C++ Language Extensions for Cell Broadband Architecture

Synergistic Processing Unit Instruction Set Architecture

Cell Broadband Engine Programming Handbook

Copyright 2008 Michael Zucchi, All Rights Reserved.