Vectors and Bits again

Well I fixed the `c long' version of the rect-fill from the update mentioned a couple of posts ago ... and a bit more besides.

After sleeping in a bit I worked on some MMU code so I can start using the CPU cache. Most of that was just gaining a deeper understanding of the permission and memory type bits, which are a little confusing in places. It looks like it's been extended a couple of times whilst keeping compatability so there's multiple combinations that appear to do the same thing but with different nomenclature. Hmm, I have it more or less worked out ... I think. So once I got the MMU code working, it allowed me to enable caches and play a bit with various options. I used only section and super-section pages - 1MB or 16MB, so i'm probably only using a couple of TLB entries to run everything (= no page table walks).

I was assuming the caches were on when i enabled the MMU ... oh but they weren't, of course ... stupid me. Wow does that make a difference ... Wow.

Ok, pause to run a few more timings. ... Here goes.

Code                   Total    Slowest Fastest
C short             36097442    0.89    5.22
C long              40526536    1.00    5.86
ARM asm             15801430    0.38    2.28
NEON                 9654736    0.23    1.39
NEON2                9982542    0.24    1.44
NEON3                9421366    0.23    1.36
NEON4                9467262    0.23    1.37
sDMA                 6904794    0.17    1.00

(see 2 posts ago, or render-rect.c for what they mean)

This is the original scenario from a previous post, but with a 'fixed' C long version. Strangely, it runs slower than the short version. A cursory look at the assembly looks like it's doing the right thing - but it's not worth looking deeper. My guess is the extra logic required for the un-aligned edges is throwing it out or the pointer aliasing is making the compiler angry. Oddly, the performance monitor is registering the same number of data writes too.

Anyway, who cares. Lets turn the MMU on and set the memory regions up properly and and see what happens. Even with the caches off things happen, although not much.

With MMU on, graphics = wt                    With MMU on, graphics = wb

Code           Total    Slowest Fastest       Code           Total    Slowest Fastest
C short     36058684    0.89    7.33       C short     36233408    0.89    5.23
C long      40496404    1.00    8.23       C long      40584664    1.00    5.86
ARM asm      9367578    0.23    1.90       ARM asm     15811204    0.38    2.28
NEON         5332580    0.13    1.08       NEON         9653676    0.23    1.39
NEON2        4917308    0.12    1.00       NEON2       10057086    0.24    1.45
NEON3        5598968    0.13    1.13       NEON3        9555816    0.23    1.38
NEON4        5685246    0.14    1.15       NEON4        9431842    0.23    1.36
sDMA         6908602    0.17    1.40       sDMA         6917612    0.17    1.00

We're starting to beat the system DMA - I presume that even with the cache off this enables some sort of write-combining/write-buffering. It's interesting that the NEON2 code speeds up the most (nearly 2x) - probably given it has the smallest loop the CPU isn't in contention for memory bandwidth as much. You'd never use a write-back cache for video memory, but I timed it anyway. I really have no idea how or why using it is making any difference whatsoever though, since the global cache bits are all off!

Ok, so ... der, lets turn on the caches properly.

The way I set the MMU up is to have the first bank of memory - where all code and data resides - as write-back write-allocate (writes also read a cache-line), and the second - where the frame-buffer resides - as write-through no-write-allocate. For the `graphics = wb' case, I also set write-back write-allocate on the second bank of memory (in a separate run). All the IO devices are using shared-device mode.

First, with unrolled loops.

MMU on, graphics = wt, -O3 -funroll-loops     MMU on, graphics = wb, -O3 -funroll-loops
             -- lots of artifacts
Code           Total    Slowest Fastest       Code           Total    Slowest Fastest
C short       957743    0.14    1.02       C short      1816546    0.28    1.00
C long        956818    0.14    1.02       C long       1992627    0.30    1.09
ARM asm       933198    0.14    1.00       ARM asm      1871829    0.28    1.03
NEON          930448    0.14    1.00       NEON         1857085    0.28    1.02
NEON2         945969    0.14    1.01       NEON2        1862711    0.28    1.02
NEON3         946522    0.14    1.01       NEON3        1848473    0.28    1.01
NEON4         945739    0.14    1.01       NEON4        1861538    0.28    1.02
sDMA         6456313    1.00    6.93       sDMA         6455228    1.00    3.55

Ahh, now this is more like it. Getting over 800MB/S (if my timing calculations are right).

Even the basic crappy C code is within a whisker of everything else - even though it executes about 3.5x as many instructions to get the same work done. The system DMA has fallen right off; but run asynchronously it would probably still be worth using since it is basically `free', and the CPU can do a lot more than just write memory. This code also polls the DMA status in a tight loop, I don't know if that is having any bandwidth effects

The write-back timing is all out of whack - the C short version is the first to run, so it gets a benefit of having an empty cache and nothing to write-back. You also get to see the CPU write stuff back to the screen when it feels the need - lots of weird visual artifacts. And the explicit cache flushing required would only make it slower on top of that. In short - useless for a framebuffer. Any performance issues you might expect a write-back cache to address are handled much better by using proper algorithms. I saw it mentioned on the beagleboard list, so it seemed worthy of comment ...

And lastly, just with -O3, a typical compile flag (-funroll-loops generates much bigger code so might not always be desirable). I also added in a `hyper-optimised' memset implementation for good measure.

MMU on, graphics = wt, -O3

Code                   Total    Slowest Fastest
C short              1372096    0.21    1.47
C long               1038868    0.16    1.11
ARM asm               948600    0.14    1.02
NEON                  929968    0.14    1.00
NEON2                 939165    0.14    1.00
NEON3                 946102    0.14    1.01
NEON4                 945702    0.14    1.01
msNEON               1309313    0.20    1.40   (see memset_armneon())
sDMA                 6462071    1.00    6.94

The C is still ok, if a bit slower, but barely worth `optimising' in this trivial case.

The msNEON code is from the link indicated ... interesting that a more complex C loop beats it somewhat; the msNEON code is only writing the same amount of memory linearly not as a rectangle, and with severe alignment restrictions.

The NEON2 code has such a simple inner loop, yet is the most consistently top performer. Good to see that KISS sometimes still works.

 // write out 32-byte chunks
2: subs r6,#1
 vst1.64 { d0, d1, d2, d3 }, [r5, :64]! // ARM syntax is `r5 @ 64'
 bgt 2b

The ARM code is quite a mess by comparison:

 // write out 32-byte chunks
2: strd r2,[r5]
 strd r2,[r5, #8]
 strd r2,[r5, #16]
 subs r6,#1
 strd r2,[r5, #24]
 add r5,r5,#32
 bgt 2b

(FWIW I tried a similar trivial loop in ARM, a direct translation of the `C long' code, and that wasn't terribly fast).

Anyway, I think i've done memory fill/rect fill to bloody death (and beyond!) now. It's just not a terribly interesting problem - particularly for a SIMD unit. Apart from evaluating raw memory performance. Actually it is kind of handy for that since it will easily show if things aren't configured properly.

PS Code changes not committed yet.

About Me

Tags

Vectors and Bits again