GPU-accelerated high-frequency trading

cashcow · Nov 18, 2009

I bit the bullet and got some of these cards inside a PC for development alongside a good Nvidia card. Here are my thoughts:

1. Call of Duty 2 runs amazingly - if you are interested in doing any development on these cards do not buy COD2, because you will not get any work done.

2. Development for these cards is not easy - the average developer will find it difficult to optimize for them. However, speed-ups of between 12% and 100%+ are possible for a range of algorithms without too much work.

3. Reaction speed is not necessarily fast. Loading the data into the card, processing and returning the data takes some time. For highly reactive systems, these cards are not particularly great.

4. What makes the difference are algorithms which can be split into (a high number of) distinct sub-units each with a high number of arithmetic operations per unit and little branching.

5. Double-precision floating-point performance sucks at the moment. If you need it, I would not bother yet.

6. Development/Profiling tools included are ok, but imho require more work.

7. It is not possible to create a coherent 'link' between device and system memory with current architectures. All atomic operations MUST be performed on the device and then all results transferred from the device.

Overall view: if you are in the top 10% of developers and understand the following:

- How to optimize algorithms (decomposition, branch removal, etc.)
- Memory models

then you can probably expect some success on algorithms which involve LOTS of computation. Development will be much slower than with conventional programming, and it is done in C++, which is inherently more costly.

If you have a high speed trading system which has a few calculations and you want to make it faster, don't even bother, these cards will make no difference whatsoever.
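
A back-of-envelope sketch of the trade-off in points 3 and 4 above: the card only pays off when the arithmetic saved outweighs the cost of moving data to and from it. The function names and every default figure here (CPU and GPU throughput, bus bandwidth, fixed overhead per round trip) are illustrative assumptions, not measurements of any particular hardware.

```python
# Back-of-envelope model: is it worth shipping a job to the GPU?
# All default figures are illustrative assumptions, not measurements.

def cpu_time_s(flops, cpu_flops_per_s=10e9):
    """Time to run the job on the host CPU."""
    return flops / cpu_flops_per_s


def gpu_time_s(flops, bytes_moved,
               gpu_flops_per_s=500e9,
               bus_bytes_per_s=5e9,
               fixed_overhead_s=20e-6):
    """Time to copy data to the card, compute, and copy results back."""
    transfer = bytes_moved / bus_bytes_per_s
    compute = flops / gpu_flops_per_s
    return fixed_overhead_s + transfer + compute


def speedup(flops, bytes_moved):
    """Values above 1.0 mean the GPU wins under this model."""
    return cpu_time_s(flops) / gpu_time_s(flops, bytes_moved)


if __name__ == "__main__":
    # A few calculations on a small message: overhead dominates.
    print("small job:", speedup(flops=1e3, bytes_moved=1e3))
    # Heavy arithmetic per byte moved: the card pays off.
    print("large job:", speedup(flops=1e10, bytes_moved=1e6))
```

Under these assumed figures the small job comes out far slower on the GPU than on the CPU, because the fixed round-trip overhead dwarfs the work itself, while the compute-heavy job comes out well ahead.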

cashcow · Nov 18, 2009

If anyone out there does want to make their trading system faster, drop me a message. I am so confident that I can improve the (speed) performance of ANY system that I would work on a no performance improvement - no fee basis.
C++ only, as C# performance work only really applies at a higher architectural level. From experience, I have managed to get an original algo written in C# that executed in 30 milliseconds down to an average execution time of 120 nanoseconds in C++.
London only.

rockbrain · Dec 11, 2009

Hey Cashcow, I would like to talk to you about optimizing trading algos.

mcgene4xpro · Aug 25, 2010

Quote from cashcow:

If anyone out there does want to make their trading system faster, drop me a message. I am so confident that I can improve the (speed) performance of ANY system that I would work on a no performance improvement - no fee basis.
C++ only, as C# performance work only really applies at a higher architectural level. From experience, I have managed to get an original algo written in C# that executed in 30 milliseconds down to an average execution time of 120 nanoseconds in C++.
London only.

I have some code that needs to be sped up, but I am worried about protecting the logic. Any suggestions?

WinstonTJ · Aug 26, 2010

There are quad socket motherboards that accept FOUR, SIX-CORE CPUs... that's 24 cores on one motherboard, why would you ever want to go this route when you can put 2-4 Xeon/AMD CPUs with 2-6 cores each into a single motherboard...???

Equalizer · Aug 26, 2010

Quote from WinstonTJ:

There are quad socket motherboards that accept FOUR, SIX-CORE CPUs... that's 24 cores on one motherboard, why would you ever want to go this route when you can put 2-4 Xeon/AMD CPUs with 2-6 cores each into a single motherboard...???

- Price per teraflop
- Power consumption (ask the risk guys at JPM about this!)
- Algos that map to the architecture

Both approaches have +ves and -ves. In the end it depends on what you want to do, i.e., do you need a Cray or a Connection Machine to solve a particular problem?

Certainly CUDA is still close enough to the metal to keep out all those brought up on "quiche" languages. I expect this to change in the near future...

propseeker · Aug 26, 2010

Quote from WinstonTJ:

There are quad socket motherboards that accept FOUR, SIX-CORE CPUs... that's 24 cores on one motherboard, why would you ever want to go this route when you can put 2-4 Xeon/AMD CPUs with 2-6 cores each into a single motherboard...???

nvidia's new fermi board has around 450 cores. you can fit 4 in a 4-slot pci server. so 1800 highly specialized cores vs 24 general purpose. if you have problems that are highly parallel and aren't too memory intensive (ie trading problems), it can be worth it to muck around in specialized code.

having said that, i think once intel gets sandy bridge out, nvidia's days of providing cheap super-computing power are numbered. unless nvidia can pull a rabbit out of its hat, end of 2011 will be the beginning of the end for cuda.
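
One hedged way to put numbers on the 1800-versus-24 comparison is Amdahl's law, which caps the speed-up from extra cores by the fraction of the work that can run in parallel. This is a minimal sketch, not a benchmark: the function name is hypothetical, and it assumes every core runs at the same speed, which is not true when comparing GPU cores with CPU cores.

```python
# Amdahl's law: upper bound on speed-up from n_cores when only
# parallel_fraction of the work can be spread across them.
# Assumption: all cores are equally fast (not true for GPU vs CPU cores).

def amdahl_speedup(parallel_fraction, n_cores):
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


if __name__ == "__main__":
    # Compare 24 and 1800 cores at several assumed parallel fractions.
    for p in (0.50, 0.95, 0.999):
        print(p, amdahl_speedup(p, 24), amdahl_speedup(p, 1800))
```

The extra cores only matter when the parallel fraction is very close to 1, which is why the "highly parallel" condition carries so much weight.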

Equalizer · Aug 26, 2010

Hmm... are they really comparable? We are talking completely different architecture and number of cores here. No?

dloyer · Aug 26, 2010

GPUs are just chips optimized for the problem they are designed to solve. If your problem fits, then it is hard to beat.

Much of the space on a general-purpose CPU is dedicated to cache. Most of the space on a GPU is dedicated to cores. Basic trade-off.

A lot of this is about memory bandwidth and latency. General CPU loads have lots of random memory access, so they need the big cache. GPU loads can stream data, so they can use a wide path to memory and hide latency by executing other threads while waiting for a memory request.

The CUDA Programming Guide covers this in detail.
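
The latency-hiding idea above can be sketched with a simplified model: if each group of threads (a "warp" in CUDA terms) computes for some cycles and then waits on memory, the core stays busy as long as the other resident warps have enough work to cover that wait. The function name and the cycle counts below are illustrative assumptions, not figures from any specific GPU.

```python
import math

# Simplified model: each warp alternates between compute_cycles_per_request
# cycles of arithmetic and one memory request taking memory_latency_cycles.
# With N warps the core stays busy if (N - 1) * compute >= latency.

def warps_to_hide_latency(memory_latency_cycles, compute_cycles_per_request):
    return 1 + math.ceil(memory_latency_cycles / compute_cycles_per_request)


if __name__ == "__main__":
    # Assumed figures: 400-cycle memory latency, varying arithmetic per request.
    for compute in (10, 20, 100, 400):
        print(compute, warps_to_hide_latency(400, compute))
```

The more arithmetic each thread does per memory request, the fewer threads are needed to keep the cores fed, which ties back to the earlier point about favouring algorithms with many arithmetic operations per unit.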

Equalizer · Aug 27, 2010

Quote from dloyer:

GPUs are just chips optimized for the problem they are designed to solve. If your problem fits, then it is hard to beat.

My point exactly. I just don't buy the prediction that Intel's new releases will be the end for CUDA and/or NVIDIA. OpenCL, maybe, but for our specific applications which map nicely (and which I can't get into on a public forum), I just can't see how, say, using even a 64-core machine (multi-core-multi-CPU) can beat a GPU/multi-GPU solution.