I have found that moving calculations from the CPU (multithreaded C++) to the GPU (OpenCL 1.2) can decrease execution times by a factor of 30 to 60.
However, porting code so that it makes good use of a GPU is not always easy. For example, instead of three nested loops with sizes M, N, and O, one might be able to have...
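
One common restructuring of this kind, offered only as an illustration and not necessarily what is meant above, is to collapse the three loops into a single flat range of M*N*O independent iterations. That is roughly the shape a GPU kernel launch wants, since each work-item can decode its own (i, j, k) from one id. Below is a minimal host-side C++ sketch of that idea. The names `Index3`, `unflatten`, `fill_nested`, and `fill_flat`, and the arithmetic in the loop body, are all hypothetical placeholders.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical index triple; the names are illustrative only.
struct Index3 {
    std::size_t i, j, k;
};

// Decode a flat id in [0, M*N*O) into (i, j, k), with k varying fastest.
inline Index3 unflatten(std::size_t flat, std::size_t M, std::size_t N, std::size_t O) {
    assert(flat < M * N * O);
    return Index3{flat / (N * O), (flat / O) % N, flat % O};
}

// Reference version: three nested loops of sizes M, N, and O.
std::vector<int> fill_nested(std::size_t M, std::size_t N, std::size_t O) {
    std::vector<int> out(M * N * O);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            for (std::size_t k = 0; k < O; ++k)
                out[(i * N + j) * O + k] = static_cast<int>(i + 2 * j + 3 * k);
    return out;
}

// Flattened version: one loop over M*N*O iterations that do not depend on
// each other, so each one could be handed to a separate GPU work-item.
std::vector<int> fill_flat(std::size_t M, std::size_t N, std::size_t O) {
    std::vector<int> out(M * N * O);
    for (std::size_t flat = 0; flat < out.size(); ++flat) {
        const Index3 idx = unflatten(flat, M, N, O);
        out[flat] = static_cast<int>(idx.i + 2 * idx.j + 3 * idx.k);
    }
    return out;
}
```

In an OpenCL kernel the outer `for` over `flat` would disappear and `flat` would come from `get_global_id(0)`. Alternatively, a three-dimensional NDRange gives i, j, and k directly from `get_global_id(0)`, `get_global_id(1)`, and `get_global_id(2)`. This only works this simply when the iterations are independent, which is often the hard part of the port.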