Cell Programming Difficulties

July 26, 2008 8:06 am

So when IBM said that they traded programming difficulty for speed, they meant it, but probably not in the most obvious of ways.

The current philosophy for x86 architecture to increase speed has been to throw as much hardware as possible into the core to improve performance for existing programs and more importantly programming style. This has worked out somewhat well as GCC has matured over the years to provide highly optimized code to take advantage of the hardware but places a bottleneck on total throughput as the number of ALU's on a chip are quite small because of all the extra hardware added by out-of-order and speculative exectution, branch prediction, etc. IBM has decided to eschew the additional hardware to implement these features to use that room for more ALU's.

The SPU's on the Cell processor for example does not have any out of order execution, branch prediction or cache. This space for rename registers has been traded for 128 architectural 128-bit vector registers. The cache has been traded for a programmer accessible 256KB local store. And all the other space left over is used to fill the die with enough ALU's to get 25.6 GFLOPS of single precision performance.

So why's this an issue? For a program I coded to work on x86 to apply a filter to a 6min WAV file, it took a little over 10 seconds Although if run with -mno-sse2 it will take about a 1 min 10. A direct port of this code to my PS3 also took a 1 min 10 secs to execute. This shows how well GCC is optimized to run on x86. Unlike on x86, GCC is not quite optimized to automatically vectorize the code into Altivec. It gets even worse when run on the SPU, they have no scalar registers so the code took 7 mins to execute. All scalar operation are done by using a splat (copies one value into all places in a vector register) does the operation, then masks the value. Also since there is no branch prediction with a 21 stage pipeline, all my branches guessed incorrectly would incur a ~23 cycle penalty.

However despite all these shortcomings it is quite simple to achieve high performance but requires the programmer to think differently. If the programmer vectorizes as much code as possible, the program runs quite quickly. I was able to vectorize just the arithmetic and achieved a 30 second run time. The biggest trick to optimization is to avoid branch penalties by unrolling (which I've done somewhat) and using predication (which I haven't). Predication is where you execute both sides of the branch and select the right result. On a scalar processor this is time consuming because this requires 2 operations when you could do just one. However, on the Cell, scalar operations are costly as every instruction is vectorized so a vector operation that does executes both sides of the branch then using a vector select takes up far fewer cycles than branching incorrectly, flushing, then executing.

Currently I'm having quite a few issues with context switching as I tried to thread my filter program, which is an optimization I'll save for another post. Once I find the solution, I'll post my findings.

Why we like H!P Kids

July 31, 2008 4:15 am

So closely listening to the new C-ute single, Edo no Temari II (embedded below), I'm finding out why the Kids, esp. C-ute have been much more popular nowadays. Vocally they sound somewhat similar to old Momusu.

The current Momusu, though decently vocally talented (with the exception of Sayumi) sound very homogeneous. It's difficult to pick out their solo lines at times, but in chorus it's near impossible. Their sound has become choral like. However, the first and second gen members had very different tonalities that were able to harmonize with each other rather than blend.

The Kids are similar to old Momusu in that its members are quite tonally different and it's possible to determine voices in a chorus. I'm a big fan of harmonizing and there are just some dissonant parts in Edo no Temari that make me feel good. It's not the same level as Memory Seishun no Hikari, but I've enjoyed this song more already than Resonant Blue, which I liked a lot. Another thing I think that has helped this song has been how the line distribution is a bit more balanced, Airi doesn't have all the lines (thank you) and even Erika has a few.

As for the video, I really like it. It's quite well done, the costumes are pretty nice, the neon fits with the feeling of the song, and the closeups aren't too close. The closeups are probably one of my favorite parts of the PV; it's like a nice solo shot where I can see more than just their face. And as always, they're really good dancers, but from what I can see it's probably not an overly difficult dance, but I should probably wait for the dance version to make a final verdict there. (Considering how I was wrong about CC). Tsunku, you need to start treating Momusu like you treat the Kids or otherwise C-ute could become the H!P flagship group!