64 core Parallela & ARM A9 Zynq

Well it looks like the Parallella endeavour did get funded after-all (just minutes ago). They only really got organised in the last 3 days so I really thought it was too late but they managed to get the word out and excitement up enough to make it.

Well done.

As I said previously, the 16-core chip is a teaser and the 64-core is where the action is ... So I was pleased they offered a guaranteed 64-core version once it was clear that the $3M target was a bit optimistic. So even I, the cynical old c**t that I am, got caught up a bit in it myself and went for the 64 core chip plus the early 16 core one, cases and a t-shirt. A bit of an indulgence but I can afford it.

I'm sure most people don't really understand what they're getting into (that's pretty much the modus operandi of Kickstarter), but a Zynq board for $100 is still good value apparently even without the floating-point accelerator tacked on.

Although they just made it now, it's still open for a few hours, so get over there and have a poke if you're interested in a fully documented open embeddable low-power platform - i.e. ALL of the components will be documented and include ALL the free software to access it, including the accelerator. (well, this is what is promised, i'm not sure how far it stretches to the ARM/Zynq, but I presume that is already covered elsewhere).

This is in stark contrast to other 'open' boards such as:

Raspberry PI - GPU is completely closed. The graphics driver is still just a big binary blob despite the simply fraudulent recent announcement that it was now 'fully open'. One wonders at the timing of that announcement.
Beagleboard & friends - GPU is completely closed. The DSP and most of the on-board hardware is well documented but the OS it runs and the compiler is proprietary (last i looked).
Allwinner A10 - GPU and VPU are both completely closed. Even if the lima driver ever gets there, it wont be thanks to the vendor.
Any other ARM SOC you care to think of: programmable GPU ALL closed, VPU ALL closed.

As the basic board comes with a 'Zynq' processor, which is a dual-core A9 plus a FPGA on chip, it opens up more than just parallel processing to 'the masses' to include reconfigurable hardware too. I don't know much about these but I have it on good authority that they are very cool chips and i'm looking forward to investigating that aspect as well - if i ever get the time to (the lack of free tools there might impede too).

Given this open nature i've been a bit bummed by some of the hostile reception it's received in some of the 'open hardware/software' forums and mailing lists. Come on fellas, the world is big enough for more players - no need to get so defensive. And given how much of a whinge they've all had about vendor documentation, GPL violations, tainting buggy binary driver blobs, and everything else the cool reception is more than a little baffling. If nothing else some competition has to help making progress with other vendors who have all closed ranks.

Scalar vs SIMD, not all FLOPS are equal

I think some just don't see what the big deal is - it's just a chip not a solution, so and so have a chip that does x flops too, blah has something coming that will blow it all away, or those total flops just aren't that much ...

The problem with marketing numbers is that they're just marketing numbers. Peak FLOPS are impossible to achieve with any cpu and any algorithm - but the main avenue for increasing the FLOP count for the last 20 years - SIMD - only makes this much harder to achieve.

GPU's only make this worse. They throw so much hardware at it you still get very good results - but they aren't efficient at many tasks, and difficult enough to programme for the ones they are.

I'm sure you'd have to be living under a rock to miss the fact that when the Playstation 3 came out, a lot of developers made a lot of noise about how difficult it was to programme for. If you put in the time - and you really had to resort to assembly language - you could get phenomenal through-put through the SPUs, but if you didn't, all you had were 6 fairly gutless cores which were on top of that - a bit tricky to use. And it used a lot of power to get there.

Although the Eiphany shares some of the trickiness of use that the SPU's do (including the cache-less local memory, although it's easier to access off-core memory), simply because it is scalar a higher flop utilisation rate should be achievable for normal code. Without having to resort to assembly language or even worse - intrinsics. Not to mention the power differential: 90 odd gflops for the 64-core version ... in 5w system power.

A floating point MUL only has a latency of 4 cyles too - rather than the 7 on the CELL or 6 (iirc) for NEON, which makes the compiler or assembly language writer's job of scheduling that bit easier as well. Although assembly is an absolute must for NEON, the instruction set it so simple and there are so many registers i'd be surprised if it was needed in practice for the epiphany core.

Another point about competitors - ziilabs thing looks awesome! An embedded chip with a programmable multi-core co-processor! Yay! Oh, I can't actually get a machine with one in it? Oh. It only runs Android - a cut down, appliance version of Linux? Boo. Even if you could get one the grid-cpu is proprietary and secret and only we're allowed to use it, and you must go through the framework we provide? Blah, who cares.

Nothing's perfect

Engineering is not mathematics or science. Mathematics is absolute. Science is knowing to within a known degree of knowing. Engineering is a constant compromise. The real world has a habit of getting in the way. Cost, time, knowledge, physics, they all conspire to prevent the attainment of mathematical perfection.

The human curse that we all bear is that if we ever actually got what we truly wished for, we'd just think of something else we wanted.

PS This list is just my take on a fairly quick reading of the architecture documents and instruction set, it may contain wildly inaccurate misreadings and other mistakes.

Software

Well they're opening everything up for a reason, the 4 (or 5?) man team just doesn't have the resources to fill out everything. It's cheap enough that there should be no real barriers to entry to poking around, and the more that poke the more that gets done for free. This will be an interesting test to see how a loose group fares against multinational corporations and commercial standards bodies in coming up with usable solutions.

So it might be a while before X is accelerated, if ever.

The thing that most gives me the willies is that the sdk is based on eclipse, but it uses gcc as the backend anyway.

Memory

The 16 core version only has 32K SDRAM per core and 512K total per chip. And that includes both instruction and data. This per-core amount is the same as the cache on most ARM chips and will be a bit tricky to deal with. However this isn't a hard limit, just the limit on what is cheapest to access. OpenCL kernels are usually a lot smaller than this though, and so you can certainly get real work done with it. Not being confined to the OpenCL programming model would also enable efficient implementation of streaming (which is another way to save memory use).

The low latency instructions means loops wont have to be unrolled so much to hide them, so it should be able to achieve a higher code density anyway (not to mention the 16-bit versions of every instruction).

No cache

Only local memory. Programmers do hate this ... but the benefits you get from not having one are worth it here. A lot less power and silicon on the hardware side, and even though it might be a bit tricker to write efficient code, you're not getting hit with weird an unexpected results either because some data size hit cache tag aliasing. It goes a bit further than that too - no need for hardware memory barriers either, a write or read is a write or read to or from the target memory, not some half-way house. No cache snooping required.

I thought this (LDS) was one of the coolest features of SPUs, and it's a must-have in OpenCL too.

Latency

As one goes further from your local cell, the latency of memory access goes up quite quickly because as far as I can tell, each lane only goes one over and it requires multiple hops. But application-accessible DMA can be used to hide this and since you'll use it with the small local memory size anyway, it kind of comes for free.

Memory protection, virtual memory

None at all whatsoever on the accelerator. This is another bullet point as to how it achieves such a high flops/watt ratio.

Hardware threads

None. Rather than hide latency using multiple threads, one uses DMA.

Synchronisation primitives

None none that I noticed beyond a test and set instruction. This is a bit of a bummer actually as this kind of stuff can be very cool and very fast - but unfortunately it is also a gigantic patent minefield so i'm not surprised none is included. I'm talking about mailbox queues and mark/release type instructions for non-blocking primitives.

Since a core can only talk to its neighbours, this is probably not so useful or important anyway now I think about it.

About Me

Tags

64 core Parallela & ARM A9 Zynq

Scalar vs SIMD, not all FLOPS are equal

Nothing's perfect