How to build an automated system

Makis · Feb 7, 2013

Quote from SeventhCereal:

o rly? stop firing off buzzwords like its an exam. boost has been around for over 5 years now. fail.

I want to thank you, and everyone else who uses boost on low latency HFT projects on Wall Street, for promoting boost. When I join such a project and my mandate is to improve latency, my first action is to remove any boost dependency, and I produce an immediate drop in latency numbers and an increase in throughput. You have made my career so much better and easier.

clearinghouse · Feb 7, 2013

Quote from SeventhCereal:

o rly? stop firing off buzzwords like its an exam. boost has been around for over 5 years now. fail.

boost doesn't solve all the problems the standard solves.

Scoped enums are a language feature, not a boost feature. The threading library in the standard is more full featured. Variadic templates had to be done using the hacky typelist pattern.

... and on and on.

Counter-fail.

hft_boy · Feb 7, 2013

Quote from SeventhCereal:

moron.

It was funny the first time. Now you are really starting to hurt my feelings! Seriously though let's keep this thread on topic. Attacks aimed at OP only, please. No need to harass other thread contributors.

2rosy · Feb 9, 2013

so when are you going to tell us "How to build an automated system"

hft_boy · Feb 9, 2013

Okay, here is my first real post.

In this post I'm going to cover some IPC (inter-process communication), specifically how to share memory between processes. I'm including a bit of code which transfers data from one process to another. I'm not going to go into the general architecture of building a system; I'll leave that for a different day.

Obviously multiprocessing is a good thing. You can split up compute power between different cores and thus make it go faster. In a trading system, you generally split up computing between some different modules. E.g. feed handler -> strategies -> output (order placement). What I've done before is to have the feed handler listen on whatever data source it has and then the strategies can subscribe to various symbols etc. The question is how to implement this thing.

Multiprocessing is incredibly tricky to get right. Humans aren't good at thinking about the logic of concurrent computing, and there are all sorts of things which lead to unintended consequences. Different threads overwrite each other's memory and everything gets all messed up. Or they starve each other by locking each other out of things. And then bugs are hard to reproduce because of timing issues, which makes them nearly impossible to debug.

So don't use locks. Locked data structures are slow! Google for some benchmarks. Not only does the locking call have overhead, but locks can also lead to things like deadlocking, which is no fun. One solution is to use sockets. I've used sockets in the past to split up computing power between I/O and trading logic, which works, although it has some overhead. What I'm going to discuss here is using ring buffers.

Some usage scenarios lend themselves nicely to multiprocessing, and one of them is the one-producer, one-consumer scenario. Luckily, this happens all the time in trading, where you get some inputs, pass them to some logic thing, and then spit things out. A ring buffer in this scenario is lock free because there are only two shared objects: the reader offset and the writer offset (actually you can make it just one shared object, the distance between them, but it's easier to keep two different ones). Each offset is written by only one side, and all operations on these two objects are atomic, which means that they are not interruptible, so you don't need to lock them! (Okay, I'm kind of lying since there is actually locking done at the hardware level, but we'll sweep that under the rug for now.)
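To make the two-offset idea concrete, here is a minimal single-process sketch of a one-producer, one-consumer ring using C11 atomics. It is not the attached code: the names `spsc_ring`, `ring_push`, `ring_pop` and the capacity `RING_SIZE` are made up for illustration, and it assumes a compiler with `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024u /* hypothetical capacity, chosen for illustration */
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be a power of two");

struct spsc_ring {
    _Atomic size_t head;  /* write offset: only the producer stores to it */
    _Atomic size_t tail;  /* read offset: only the consumer stores to it */
    int slots[RING_SIZE];
};

/* Producer side. Returns false when the ring is full. */
static inline bool ring_push(struct spsc_ring *r, int value)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;
    r->slots[head & (RING_SIZE - 1)] = value;
    /* release: the slot write becomes visible before the new head does */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
static inline bool ring_pop(struct spsc_ring *r, int *out)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[tail & (RING_SIZE - 1)];
    /* release: the slot read finishes before the producer may reuse it */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The offsets here are free-running counters that are masked only when indexing, which is one of several possible designs; the important part is that each offset has exactly one writer, so no lock is needed.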

Here's how to set this thing up, and it's actually pretty simple. The hard part is writing a good API so that you can abstract away the inner workings and so it looks like a file descriptor or socket or something, which I won't go into here. Note: I've only tested this on Linux (Debian Wheezy). It'll probably work on Windows with some trivial tweaking; let me know if you're able to get it to work. It also probably needs an x86 processor (I'm using an i5).

First, you open a file descriptor which points to a region of shared memory, using the call shm_open. Then you memory map the thing so that it looks just like a region of memory to the program. Then you reserve part of that region (past the end of the ring buffer) for the read and write pointers. Then you set up the ring buffer, with the read and write pointers. Use the macros in <linux/circ_buf.h> to do some calculations for you (read more here). And then that's basically it. The trickiest part is that you have to put in an instruction telling the CPU not to reorder the read/write instruction with the instruction to increment the read/write pointer.
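The setup steps above can be sketched roughly as follows. This is an illustration under assumptions, not the attached code: `shm_ring`, `shm_ring_map`, `shm_ring_put` and `BUF_BYTES` are hypothetical names, the two macros are copied locally with the same arithmetic as the kernel's `CIRC_CNT` and `CIRC_SPACE`, and the ordering is done with the GCC/Clang `__atomic` builtins. On older glibc versions you may need to link with `-lrt`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>     /* O_CREAT, O_RDWR */
#include <stddef.h>
#include <sys/mman.h>  /* shm_open, shm_unlink, mmap, munmap */
#include <sys/stat.h>
#include <unistd.h>    /* ftruncate, close */

#define BUF_BYTES 4096u /* hypothetical ring size */
static_assert((BUF_BYTES & (BUF_BYTES - 1)) == 0, "BUF_BYTES must be a power of two");

/* Local copies with the same arithmetic as the kernel's circ_buf.h macros. */
#define CIRC_CNT(head, tail, size)   (((head) - (tail)) & ((size) - 1))
#define CIRC_SPACE(head, tail, size) CIRC_CNT((tail), ((head) + 1), (size))

struct shm_ring {
    unsigned char data[BUF_BYTES];
    unsigned head; /* write offset, stored past the end of the buffer */
    unsigned tail; /* read offset, stored past the end of the buffer */
};

/* Open (or create) the named region and map it. Returns NULL on failure.
 * A freshly created region is zero-filled, so head == tail == 0. */
struct shm_ring *shm_ring_map(const char *name)
{
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, sizeof(struct shm_ring)) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, sizeof(struct shm_ring), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (struct shm_ring *)p;
}

/* Producer: write one byte. Returns -1 when there is no free space. */
int shm_ring_put(struct shm_ring *r, unsigned char byte)
{
    unsigned head = r->head;
    unsigned tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
    if (CIRC_SPACE(head, tail, BUF_BYTES) == 0)
        return -1;
    r->data[head] = byte;
    /* release store: the data write may not be reordered after the
     * offset update, which is the "trickiest part" mentioned above */
    __atomic_store_n(&r->head, (head + 1) & (BUF_BYTES - 1), __ATOMIC_RELEASE);
    return 0;
}
```

Both the producer and the consumer process would call `shm_ring_map` with the same name (for example `"/my_ring"`, a made-up name) and then see the same bytes. Note that the `CIRC_SPACE` arithmetic always leaves one slot unused so that a full ring can be told apart from an empty one.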

So it took a lot longer than I thought to write the code and make it readable and stuff. It's GPL'ed since I copied some code from the Linux kernel. Don't redistribute it improperly or Linus will come after you! I'm just going to attach it and let you figure it out for yourselves. It includes four files: ipc_build.sh (which builds it), ipc_consumer.c, ipc_producer.c and ipc.h. Once it is built, two processes need to be run, ./produce and ./consume. ./produce will open up the shared memory and write a bunch of ints to it. Once it writes a zero, it will quit. ./consume will open it up and read a bunch of ints. Once it receives a zero, it will quit. On my machine it achieves about 145MB / sec, or 40M writes / sec, in terms of raw throughput. Throughput in bytes can probably be increased by transferring in chunks of 8, 16, 32 or 64 bytes, although writes / sec will probably remain constant. Considering that I can get about 500M order book messages a day, that is quite fast. Chunks must be powers of two, otherwise they will break the shared memory and the ring buffer.
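One likely reason for the power-of-two rule (this is a reading of the circ_buf-style arithmetic, not something confirmed by the attached code) is that offsets wrap with a bit mask instead of a modulo, and the two only agree for power-of-two sizes. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if n is a power of two (1, 2, 4, 8, ...). */
static inline bool is_pow2(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Wrap an offset with a mask, as circ_buf-style code does.
 * Only equal to (i % size) when size is a power of two. */
static inline size_t wrap_mask(size_t i, size_t size)
{
    return i & (size - 1);
}

/* A chunk never straddles the end of the buffer if it divides the size. */
static inline bool chunk_fits(size_t buf_bytes, size_t chunk_bytes)
{
    return chunk_bytes != 0 && buf_bytes % chunk_bytes == 0;
}
```

For example, `wrap_mask(9, 8)` gives 1, the same as `9 % 8`, but `wrap_mask(9, 6)` also gives 1 while `9 % 6` is 3, so a size of 6 would silently corrupt the offsets. Power-of-two chunks no larger than a power-of-two buffer always pass `chunk_fits`.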

Open it up and let me know if you can use it! I'd appreciate feedback, especially critique.

hft_boy · Feb 9, 2013

Hmm, here's a couple more implementations.
LockFree++
LMAX's Disruptor

The first is 100 ns, and the second is 52 ns per message, both on better hardware (i7). So I'd say that my <30 ns per 4 byte message is quite fast.
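For anyone checking the numbers, the per-message figure follows directly from the throughput quoted earlier. A small sketch of the arithmetic, with hypothetical function names and assuming "MB" means 10^6 bytes:

```c
#include <assert.h>

/* Nanoseconds spent per message at a given message rate. */
static inline double ns_per_message(double msgs_per_sec)
{
    return 1e9 / msgs_per_sec;
}

/* Message rate implied by a byte rate and a fixed message size. */
static inline double msgs_per_sec_from_bytes(double bytes_per_sec, double msg_bytes)
{
    return bytes_per_sec / msg_bytes;
}
```

At 40M writes / sec, `ns_per_message(40e6)` is 25 ns. Starting from 145MB / sec with 4 byte messages instead, the rate is 36.25M messages / sec, or about 27.6 ns each. Both are under 30 ns.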

vincegata · Feb 9, 2013

@hft_boy Excellent post and special thanks for the code.

Some time ago 2rosy (I think it was him) mentioned ZeroMQ, which looks like it is using TCP underneath. Shared memory should be faster. I am using named pipes right now, which take ~900 microseconds, hence I am looking for something to replace them.

hftvol · Feb 9, 2013

Doing IPC in-memory is a VERY BAD IDEA. You restrict yourself to processes on the same machine. What is the point of IPC other than just splitting up processes? They still share the same resources: memory, threads, CPU cores. No point!!! Memory mapping has its applications, but I do not see the point in using it for trading framework architectures. Someone who knows how to program concurrently and asynchronously will beat the crap out of any app that achieves the same through segregated processes, as long as resources are constrained by same-machine hardware.

Instead, make IPC communicate over sockets. TCP is fine, but other sockets are also used. UDP is a bad idea because it's not reliable, albeit faster. I use ZeroMQ and I transport about 16 million 16-bit messages per second over TCP ports. The advantage here is that you can fire to another machine and forget. For example, you can have a dedicated machine that handles all logging of quote, trade and order data for later analysis without impacting the actual machine on which your execution module runs. I wrote a wrapper around ZeroMQ which can pub/sub, request/reply and filter messages, all in a brokerless environment.

Quote from hft_boy:

Okay, here is my first real post. [...]

hftvol · Feb 9, 2013

Again, those are both derivatives of memory mapping. They are fast, yes, but that's it. To be honest with you, in this age of task processing, async programming and data flow (actor model...) programming, it's actually wasting resources to split up modules that run ON THE SAME MACHINE and have them communicate via IPC. IPC really comes in handy when you want to communicate with modules that run on different machines. For example, it is a very bad idea to run a database server instance on the same machine as your strategy algorithms.

Quote from hft_boy:

Hmm, here's a couple more implementations.
LockFree++
LMAX's Disruptor

The first is 100 ns, and the second is 52 ns per message, both on better hardware (i7). So I'd say that my <30 ns per 4 byte message is quite fast.

hftvol · Feb 10, 2013

It was me, and ZeroMQ also supports UDP and multiplex, though it's not the recommended way in trading applications where you cannot afford to lose messages.

Quote from vincegata:

@hft_boy Excellent post and special thanks for the code.

Some time ago 2rosy (I think it was him) mentioned ZeroMQ, which looks like it is using TCP underneath. Shared memory should be faster. I am using named pipes right now, which take ~900 microseconds, hence I am looking for something to replace them.
