Trading System Development

IAS_LLC · Jan 21, 2015

AdrianHagh81 said:
You know linux allocator have a set limit of 64k i think before they go hit up.

By this, you mean that only 64k of data can be stored in cache?

AdrianHagh81 · Jan 21, 2015

IAS_LLC said:
By this, you mean that only 64k of data can be stored in cache?

run cat /proc/slabinfo

and check the results.

volpunter · Jan 21, 2015

IAS_LLC, you should really profile your application to pinpoint where the problem actually lies. I am pretty sure it has nothing to do with your cache. First of all, it would be useful to know whether the issues arise during data acquisition/loading, during streaming and injection into strategies, or elsewhere. Have you already been able to pinpoint the exact problem?
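As a rough illustration of the kind of stage-by-stage profiling suggested here, the sketch below accumulates wall-clock time per pipeline stage. It is only a minimal sketch: `timed`, `load_ticks`, `run_strategy`, and `profile_pipeline` are hypothetical names, and the two stage functions are placeholders for whatever the real application does.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named stage.
stage_seconds = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the time spent inside the with-block to stage_seconds[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


def load_ticks(n):
    # Placeholder for data acquisition/loading (hypothetical).
    return [(i, 100.0 + i * 0.01) for i in range(n)]


def run_strategy(ticks):
    # Placeholder for strategy logic (hypothetical): sum of prices.
    total = 0.0
    for _, price in ticks:
        total += price
    return total


def profile_pipeline(n=100_000):
    """Run both stages once, timing each separately."""
    with timed("load"):
        ticks = load_ticks(n)
    with timed("strategy"):
        result = run_strategy(ticks)
    return result


if __name__ == "__main__":
    profile_pipeline()
    # Print the slowest stage first.
    for stage, seconds in sorted(stage_seconds.items(), key=lambda kv: -kv[1]):
        print(f"{stage:10s} {seconds * 1e3:8.2f} ms")
```

If one stage dominates the total, that is where to dig further with a real profiler.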

IAS_LLC · Jan 21, 2015

No, but I haven't put a lot of effort into it yet. It's low on my priority list right now as I'm more concerned with strategy development than software optimization. I know it's related to getting the data from the feed handler to my "trading platform". I use shared memory to do this, so I'm fairly certain it's a cache hit problem or the shared memory mutex is blocking the other thread more often than I'd like.
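One way to check the mutex theory is to record how long each side waits to acquire the lock. The sketch below is an illustration only, under loud assumptions: two Python threads and a `threading.Lock` stand in for the feed handler, the platform, and the shared memory mutex, and `TimedLock` and `run_demo` are hypothetical names. The same timestamp-around-acquire idea carries over to whatever language the real system uses.

```python
import threading
import time


class TimedLock:
    """Lock wrapper that records how long callers wait to acquire it."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0
        self.wait_seconds = 0.0

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()
        waited = time.perf_counter() - start
        # Counters are updated while holding the lock, so no extra
        # synchronization is needed for them.
        self.acquisitions += 1
        self.wait_seconds += waited
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False


def run_demo(n=10_000):
    """Pass n items from a writer thread to a reader thread under one lock."""
    lock = TimedLock()
    queue = []
    received = []

    def writer():
        # Stand-in for the feed handler.
        for i in range(n):
            with lock:
                queue.append(i)

    def reader():
        # Stand-in for the trading platform: drain whatever is queued.
        while len(received) < n:
            with lock:
                if queue:
                    received.extend(queue)
                    queue.clear()

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lock, received


if __name__ == "__main__":
    lock, received = run_demo()
    print(f"items received: {len(received)}")
    print(f"acquisitions:   {lock.acquisitions}")
    print(f"total wait:     {lock.wait_seconds * 1e3:.2f} ms")
```

If the total wait is a large share of the run time, contention is a plausible culprit; if it is negligible, the problem is probably elsewhere.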

volpunter said:
IAS_LLC, you should really profile your application to pinpoint where the problem actually lies. I am pretty sure it has nothing to do with your cache. First of all, it would be useful to know whether the issues arise during data acquisition/loading, during streaming and injection into strategies, or elsewhere. Have you already been able to pinpoint the exact problem?

volpunter · Jan 21, 2015

that is what I suspect, that one thread blocks the other...

IAS_LLC said:
No, but I haven't put a lot of effort into it yet. It's low on my priority list right now as I'm more concerned with strategy development than software optimization. I know it's related to getting the data from the feed handler to my "trading platform". I use shared memory to do this, so I'm fairly certain it's a cache hit problem or the shared memory mutex is blocking the other thread more often than I'd like.

hft_boy · Jan 23, 2015

AdrianHagh81 said:
Hmm, unless you've developed a new OS and filesystem I've never heard about, I find this hard to believe.

Anyone using the standard linux kernel fread() will not accomplish this.

Guys, listening to volpunter makes you think that retail automated trading is hopeless. Well, it's not. I'm proof of it.

fread() is not a Linux kernel call, or even a standard system call. It's a standard C library call. And yes, you can load millions or even hundreds of millions of ticks per second on a commodity quad core if you know what you are doing and are willing to get your hands a little dirty. Just wanted to clear up those two points.
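To make the bulk-loading point concrete, here is a minimal sketch of reading fixed-size binary tick records many at a time instead of one per call. It is written in Python rather than C, and the record layout (`TICK`), the helper names `write_ticks` and `read_ticks`, and the chunk size are all assumptions for illustration, not anyone's actual file format. Python will not reach the rates quoted above; the sketch only shows the access pattern.

```python
import os
import struct
import tempfile

# Assumed record layout: int64 timestamp, float64 price, uint32 size,
# little-endian with no padding (20 bytes per tick).
TICK = struct.Struct("<qdI")


def write_ticks(path, ticks):
    """Write (timestamp, price, size) tuples as packed binary records."""
    with open(path, "wb") as f:
        f.write(b"".join(TICK.pack(*t) for t in ticks))


def read_ticks(path, chunk_records=1 << 16):
    """Yield ticks, reading chunk_records records per read() call."""
    chunk_bytes = TICK.size * chunk_records
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_bytes)
            # Ignore a trailing partial record, e.g. from a truncated file.
            usable = len(buf) - len(buf) % TICK.size
            if usable == 0:
                break
            yield from TICK.iter_unpack(buf[:usable])


if __name__ == "__main__":
    sample = [(1_421_800_000_000 + i, 100.0 + i * 0.25, 1 + i % 5)
              for i in range(200_000)]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ticks.bin")
        write_ticks(path, sample)
        count = sum(1 for _ in read_ticks(path))
        print(f"read {count} ticks of {TICK.size} bytes each")
```

In C the equivalent would be a single fread() into an array of structs per chunk, which avoids per-record call overhead in the same way.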

As an aside, it's true that you don't necessarily need to be able to do this to be successful at trading, in the same sense that you don't need a computer to do accounting and end of year taxes. You could use paper and pen, or an abacus. But it certainly makes certain processes a lot smoother.

volpunter · Jan 24, 2015

Technically there is no reason why one should not be able to load tens of millions of ticks per second; at least, the limitation at the moment is not throughput on the memory, bus, or cache side. Given that dated 1066 MHz main memory has a throughput of about 7 GB/s, L3 about 3x that of main memory, L2 1.5x that of L3, and L1 1.5x that of L2, neither memory nor bus throughput poses a serious challenge to loading many tens of millions of data points. The work involved in deserializing data, on the other hand, and in other computationally expensive operations that tax the CPU or GPUs, depends heavily on the quality of the software implementation of the algorithms.

But those points are moot because, in my experience, the bottleneck is not the loading or ordering/sorting of ticks but the time and resources spent running the actual algorithmic strategies. (I am strictly limiting the discussion to iterating over historical tick-based data, not digressing into handling live data feeds.)
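The throughput figures quoted above can be turned into a rough upper bound on ticks per second. The sketch below does that arithmetic. The 7 GB/s figure and the 3x / 1.5x / 1.5x ratios come from the post; the 16-byte tick size is purely an assumption for illustration, and the result ignores deserialization and any other CPU work.

```python
MAIN_MEMORY_GB_S = 7.0  # figure quoted in the post for 1066 MHz memory
BYTES_PER_TICK = 16     # assumption: e.g. 8-byte timestamp + 4-byte price + 4-byte size


def bandwidths_gb_s(main=MAIN_MEMORY_GB_S):
    """Apply the ratios from the post: L3 = 3x main, L2 = 1.5x L3, L1 = 1.5x L2."""
    l3 = 3.0 * main
    l2 = 1.5 * l3
    l1 = 1.5 * l2
    return {"main": main, "L3": l3, "L2": l2, "L1": l1}


def ticks_per_second(gb_per_s, bytes_per_tick=BYTES_PER_TICK):
    """Upper bound on ticks streamed per second at a given bandwidth."""
    return gb_per_s * 1e9 / bytes_per_tick


if __name__ == "__main__":
    for level, gb_s in bandwidths_gb_s().items():
        millions = ticks_per_second(gb_s) / 1e6
        print(f"{level:5s} {gb_s:6.2f} GB/s  ~{millions:8.1f} M ticks/s")
```

Under these assumptions main memory alone allows roughly 437 million 16-byte ticks per second, well above the tens of millions under discussion.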

hft_boy said:
fread() is not a Linux kernel call, or even a standard system call. It's a standard C library call. And yes, you can load millions or even hundreds of millions of ticks per second on a commodity quad core if you know what you are doing and are willing to get your hands a little dirty. Just wanted to clear up those two points.

As an aside, it's true that you don't necessarily need to be able to do this to be successful at trading, in the same sense that you don't need a computer to do accounting and end of year taxes. You could use paper and pen, or an abacus. But it certainly makes certain processes a lot smoother.

hft_boy · Jan 24, 2015

volpunter said:
Technically there is no reason why one should not be able to load tens of millions of ticks per second; at least, the limitation at the moment is not throughput on the memory, bus, or cache side. Given that dated 1066 MHz main memory has a throughput of about 7 GB/s, L3 about 3x that of main memory, L2 1.5x that of L3, and L1 1.5x that of L2, neither memory nor bus throughput poses a serious challenge to loading many tens of millions of data points. The work involved in deserializing data, on the other hand, and in other computationally expensive operations that tax the CPU or GPUs, depends heavily on the quality of the software implementation of the algorithms.

But those points are moot because, in my experience, the bottleneck is not the loading or ordering/sorting of ticks but the time and resources spent running the actual algorithmic strategies. (I am strictly limiting the discussion to iterating over historical tick-based data, not digressing into handling live data feeds.)

Agreed on all points. As perverse as it may seem, I have toyed with the idea of writing a custom compression / decompression scheme so that you can in fact significantly exceed the memory bus bandwidth by decompressing in cache using only a few instructions per tick. My back of the envelope calculation says that you can achieve something like 50-75 GB/s per machine, hence my quote of hundreds of millions or perhaps billions of ticks per second. But like you said, once you hit a few million ticks per second, the overhead comes from actually doing something interesting with those ticks. So there is really not much point in implementing this (maybe I will someday, when I am on vacation). And of course, with enough work, you can get most queries to run with arbitrarily low overhead, but you run into the programmer time / CPU time tradeoff.

EDIT: note, I did my back of envelope calculation using DDR3-1600. Using the newest DDR4-2400 or whatever, the theoretical upper bound increases by a factor of five or so.
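For readers wondering what a cheap-to-decode tick compression scheme might look like, here is one minimal sketch. It is not the scheme described in the post, which is never specified. It assumes integer timestamps and integer prices (e.g. in minimum price increments), stores each field as a delta from the previous tick, and encodes deltas as zigzag varints. All function names are hypothetical, and Python is used only to show the format, not the speed.

```python
def zigzag(n):
    """Map signed ints to unsigned so small magnitudes get small codes."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(u):
    """Inverse of zigzag()."""
    return (u >> 1) if (u & 1) == 0 else -((u + 1) >> 1)


def put_varint(out, u):
    """Append unsigned int u to bytearray out, 7 bits per byte, low bits first."""
    while u >= 0x80:
        out.append((u & 0x7F) | 0x80)
        u >>= 7
    out.append(u)


def get_varint(buf, pos):
    """Read one varint from buf at pos; return (value, next_pos)."""
    shift = 0
    value = 0
    while True:
        b = buf[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if b < 0x80:
            return value, pos
        shift += 7


def compress(ticks):
    """Delta-encode a list of (timestamp, price) integer pairs into bytes."""
    out = bytearray()
    prev_ts = prev_px = 0
    for ts, px in ticks:
        put_varint(out, zigzag(ts - prev_ts))
        put_varint(out, zigzag(px - prev_px))
        prev_ts, prev_px = ts, px
    return bytes(out)


def decompress(data):
    """Rebuild the (timestamp, price) list produced by compress()."""
    ticks = []
    pos = 0
    ts = px = 0
    while pos < len(data):
        delta, pos = get_varint(data, pos)
        ts += unzigzag(delta)
        delta, pos = get_varint(data, pos)
        px += unzigzag(delta)
        ticks.append((ts, px))
    return ticks


if __name__ == "__main__":
    sample = [(1000, 50000), (1003, 50001), (1003, 49999), (1010, 49999)]
    packed = compress(sample)
    print(f"{len(sample)} ticks in {len(packed)} bytes")
    print(decompress(packed) == sample)
```

Because consecutive ticks usually differ by small amounts, most deltas fit in a single byte, so far fewer bytes cross the memory bus per tick than with fixed-width records.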