Cell Programming Difficulties

July 26, 2008 8:06 am

So when IBM said that they traded programming difficulty for speed, they meant it, but probably not in the most obvious of ways.

The current philosophy for x86 architecture to increase speed has been to throw as much hardware as possible into the core to improve performance for existing programs and more importantly programming style. This has worked out somewhat well as GCC has matured over the years to provide highly optimized code to take advantage of the hardware but places a bottleneck on total throughput as the number of ALU's on a chip are quite small because of all the extra hardware added by out-of-order and speculative exectution, branch prediction, etc. IBM has decided to eschew the additional hardware to implement these features to use that room for more ALU's.

The SPU's on the Cell processor for example does not have any out of order execution, branch prediction or cache. This space for rename registers has been traded for 128 architectural 128-bit vector registers. The cache has been traded for a programmer accessible 256KB local store. And all the other space left over is used to fill the die with enough ALU's to get 25.6 GFLOPS of single precision performance.

So why's this an issue? For a program I coded to work on x86 to apply a filter to a 6min WAV file, it took a little over 10 seconds Although if run with -mno-sse2 it will take about a 1 min 10. A direct port of this code to my PS3 also took a 1 min 10 secs to execute. This shows how well GCC is optimized to run on x86. Unlike on x86, GCC is not quite optimized to automatically vectorize the code into Altivec. It gets even worse when run on the SPU, they have no scalar registers so the code took 7 mins to execute. All scalar operation are done by using a splat (copies one value into all places in a vector register) does the operation, then masks the value. Also since there is no branch prediction with a 21 stage pipeline, all my branches guessed incorrectly would incur a ~23 cycle penalty.

However despite all these shortcomings it is quite simple to achieve high performance but requires the programmer to think differently. If the programmer vectorizes as much code as possible, the program runs quite quickly. I was able to vectorize just the arithmetic and achieved a 30 second run time. The biggest trick to optimization is to avoid branch penalties by unrolling (which I've done somewhat) and using predication (which I haven't). Predication is where you execute both sides of the branch and select the right result. On a scalar processor this is time consuming because this requires 2 operations when you could do just one. However, on the Cell, scalar operations are costly as every instruction is vectorized so a vector operation that does executes both sides of the branch then using a vector select takes up far fewer cycles than branching incorrectly, flushing, then executing.

Currently I'm having quite a few issues with context switching as I tried to thread my filter program, which is an optimization I'll save for another post. Once I find the solution, I'll post my findings.

Comments

Huan said on August 28, 2008 3:05 am

Still haven't made any progress on this.

Leave a Reply

Name:

Website (optional):

Comment:

CAPTCHA Image
Reload Image

Enter code: