No direct experience (no ps3 to play on), but from what I have read, most parallel algorithms drop down easily from a conceptual point of view, but not so from an implementation point of view. To get the true power from the CELL CPU, you need to get your hands dirty with issues like loop-unrolling, bad branch prediction, memory coalescing (a GPU issue as well). You really have to make sure your code is appropriately vectorized. You also need to watch out for the hard memory limit.
Basically, you're going to need to redesign your algorithms to fit into the constraints of the processor, but also to make sure that they are taking full advantage of the processors capabilities. Job management and what-not.
From a Stack-Overflow thread. Issues you need to deal with when handling a SPU:
* Atomic operations (lock-free try-discard style).
* Strong distinction between memory areas. You have to know which pointer is pointing to which memory area or you'll screw everything up.
* No enforced hardware distinction between data and code. This is actually a fun thing, you can setup dynamic code loading and essentially stream subroutines in and out. Self-modifying code is possible but not necessarily practical on SPU.
* Lack of hardware debugging aids.
* Limited memory size.
* Fast memory access.
* Instruction set balanced toward SIMD operations.
* Floating point "gotchas".
My experience with GPUs is basically that getting a naturally parallel algorithm over is pretty simple, and as long as the memory transfer isn't your bottleneck (i.e. you are giving it enough to work with), you get a free speed-up. The basics are easy. However, really getting nitty gritty can be difficult, and getting the MOST out of your GPU is quite difficult. I've mainly used it for back-testing and that sort of stuff.