About Me

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as Zed
  to his mates & enemies!

notzed at gmail >
fosstodon.org/@notzed >


Monday, 10 June 2013, 03:32

Clamping, scaling, format conversion

Got to spend a few hours poking at the photo-effects app I'm doing in conjunction with 'ffts'. I ended up having to use some NEON for performance.

One interesting solution along the way was code that took two 2-channel float sequences (i.e. two arrays of complex numbers) and re-wound them back to 4-channel bytes, including scaling and clamping.

I used the fixed-point variant of the VCVT instruction, which performs the scaling to 8 bits and clamps below 0 in one step. For the upper bound I used VQMOVN, the saturating variant of move-with-narrow.
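As a rough scalar model of what those two steps do to a single lane, here is a sketch in plain C. The function names are invented for illustration, and the behaviour assumed is that the fixed-point VCVT multiplies by 2^8, rounds toward zero and saturates to the unsigned range, while VQMOVN saturates when narrowing. It is not the NEON code itself, only a way to reason about it.

```c
#include <stdint.h>

/* Assumed model of vcvt.u32.f32 with #8 fraction bits: scale by 256,
   truncate, clamp negatives (and NaN) to 0, saturate at UINT32_MAX. */
static uint32_t model_vcvt_u32_f32_fix8(float v)
{
    double scaled = (double)v * 256.0;

    if (!(scaled > 0.0))
        return 0;
    if (scaled >= 4294967295.0)
        return UINT32_MAX;
    return (uint32_t)scaled;
}

/* Assumed model of vqmovn.u32: narrow 32 -> 16 bits with saturation. */
static uint16_t model_vqmovn_u32(uint32_t v)
{
    return v > 0xFFFFu ? 0xFFFFu : (uint16_t)v;
}

/* Assumed model of vqmovn.u16: narrow 16 -> 8 bits with saturation. */
static uint8_t model_vqmovn_u16(uint16_t v)
{
    return v > 0xFFu ? 0xFFu : (uint8_t)v;
}

/* One lane through the whole chain: float in [0,1) -> byte in 0..255. */
uint8_t model_float_to_byte(float v)
{
    return model_vqmovn_u16(model_vqmovn_u32(model_vcvt_u32_f32_fix8(v)));
}
```

With this model an input of 0.5 comes out as 128, anything negative as 0, and anything at or above 1.0 as 255.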

I haven't run it through the cycle counter (or looked the details up) so it could probably do with some jiggling or widening to 32 bytes/iteration but the current main loop is below.

        vld1.32         { d0[], d1[] }, [sp]

        vld1.32         { d16-d19 },[r0]!
        vld1.32         { d20-d23 },[r1]!     
1:
        vmul.f32        q12,q8,q0               @ scale
        vmul.f32        q13,q9,q0
        vmul.f32        q14,q10,q0
        vmul.f32        q15,q11,q0

        vld1.32         { d16-d19 },[r0]!       @ pre-load next iteration
        vld1.32         { d20-d23 },[r1]!

        vcvt.u32.f32    q12,q12,#8              @ to int + clamp lower in one step
        vcvt.u32.f32    q13,q13,#8
        vcvt.u32.f32    q14,q14,#8
        vcvt.u32.f32    q15,q15,#8

        vqmovn.u32      d24,q12                 @ to short, clamp upper
        vqmovn.u32      d25,q13
        vqmovn.u32      d26,q14
        vqmovn.u32      d27,q15

        vqmovn.u16      d24,q12                 @ to byte, clamp upper
        vqmovn.u16      d25,q13

        vst2.16         { d24,d25 },[r3]!

        subs    r12,#1
        bhi     1b

This was the first time I've loaded all elements of q0 directly from the stack:

        vld1.32         { d0[], d1[] }, [sp]

Last time I needed this I think I did a load to a single-precision register or an ARM register and then moved it across, which I thought was unnecessarily clumsy. It isn't terribly obvious from the manual how the various versions of VLD1 differ unless you look closely at the register lists. d0[],d1[] loads a single 32-bit value into every lane of the two registers, i.e. all lanes of q0.

The VST2 line:

        vst2.16         { d24,d25 },[r3]!

This performs a neat trick of shuffling the 8-bit values back into the correct order, although it relies on the machine operating in little-endian mode.

The data flow is something like this:

 input bytes:        ABCD ABCD ABCD
 float AB channel:   AAAA BBBB AAAA BBBB
 float CD channel:   CCCC DDDD CCCC DDDD   
 output bytes:       ABCD ABCD ABCD
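To make the shuffle concrete, here is a small C sketch of what the 2-element interleaving store does with the two narrowed registers. The function name is made up for illustration, and the registers are modelled simply as byte arrays laid out the way they would appear in memory, which is an assumption of this sketch rather than anything taken from the routine above.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of "vst2.16 {d24,d25},[r3]": write 16-bit elements
   alternately from the two 8-byte registers.
   d24 holds A B A B A B A B, d25 holds C D C D C D C D. */
void model_vst2_16(const uint8_t d24[8], const uint8_t d25[8], uint8_t out[16])
{
    for (int i = 0; i < 4; i++) {
        memcpy(out + i * 4,     d24 + i * 2, 2);   /* A B */
        memcpy(out + i * 4 + 2, d25 + i * 2, 2);   /* C D */
    }
}
```

Each 16-bit element carries one pair of bytes, so alternating elements from the AB register and the CD register gives ABCD ABCD ABCD ABCD without any separate shuffle instruction.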

As a forward then inverse FFT scales the result by the number of elements (i.e. by width*height), the output stage needs scaling by 1/(width*height) anyway. This routine needs a further scale by 1/255 so that the fixed-point 8-bit conversion works, and that is folded into the same multiplies 'for free'.
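The arithmetic of the folded scale can be sketched in C as follows. Both function names are hypothetical, and the second one only models the multiply by 2^8 that a conversion with 8 fraction bits implies; truncation and clamping are left to the caller.

```c
/* Combined factor: undo the FFT round-trip gain (width*height) and
   bring 0..255 pixel values down to 0..1 for the fixed-point convert. */
float combined_scale(int width, int height)
{
    return (1.0f / ((float)width * (float)height)) * (1.0f / 255.0f);
}

/* Value seen by the conversion before truncation: the scaled sample
   times 2^8 (the #8 fraction bits). */
float fixed8_value(float raw, float scale)
{
    return raw * scale * 256.0f;
}
```

For a 64x64 image a round-tripped pixel value of 128 arrives as 128*4096; the combined scale brings it to about 0.502, and the multiply by 256 gives about 128.5, which truncates to 128. A value of 255 lands at roughly 256 and is left for the saturating narrow to clamp.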

This is the kind of stuff that is much faster in NEON than C, and compilers are a long way from doing it automatically.

The loop in C would be something like:

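#include <complex.h>
#include <stdint.h>

/* clamp v to the range [l, u] */
static float clampf(float v, float l, float u) {
   return v < l ? l : (v < u ? v : u);
}

/* a, b: one row each of 2-channel (complex) data; d: 4-channel byte output.
   The function name and argument list are only a wrapper for illustration. */
void convert_row(const complex float *a, const complex float *b, uint8_t *d,
                 int width, int height) {
    float scale = 1.0f / (width * height);

    for (int i=0;i<width;i++) {
       complex float A = a[i] * scale;
       complex float B = b[i] * scale;

       float are = clampf(crealf(A), 0, 255);
       float aim = clampf(cimagf(A), 0, 255);
       float bre = clampf(crealf(B), 0, 255);
       float bim = clampf(cimagf(B), 0, 255);

       d[i*4+0] = (uint8_t)are;
       d[i*4+1] = (uint8_t)aim;
       d[i*4+2] = (uint8_t)bre;
       d[i*4+3] = (uint8_t)bim;
    }
}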

And it's interesting to me that the NEON isn't much bulkier than the C - despite performing 4x the amount of work per loop.

I set up a GitHub account today, which was a bit of a pain as it doesn't work properly with my main browser machine, but I haven't put anything there yet. I want to bed down the basic data flow and user-interaction first.

Tagged android, beagle, code, hacking, picfx.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!