Okay, here is my first real post.
In this post I'm going to talk about IPC (inter-process communication), specifically how to share memory between processes. I'm including a bit of code which transfers data from one process to another. I'm not going to go into the general architecture of building a system; I'll leave that for a different day.
Obviously multiprocessing is a good thing. You can split the work across different cores and make the whole thing go faster. In a trading system, you generally split the computation into modules, e.g. feed handler -> strategies -> output (order placement). What I've done before is to have the feed handler listen on whatever data source it has, and then the strategies subscribe to various symbols etc. The question is how to implement this thing.
Multiprocessing is incredibly tricky to get right. Humans aren't good at thinking about the logic of concurrent computing, and there are all sorts of things which lead to unintended consequences. Different threads overwrite each other's memory and everything gets all messed up. Or they starve each other by locking each other out of things. And then bugs are hard to reproduce because of timing issues, which makes them nearly impossible to debug.
So don't use locks. Locked data structures are slow! Google for some benchmarks. Not only does the locking call have overhead, but locks can also lead to things like deadlock, which is no fun. One solution is to use sockets. I've used sockets in the past to split up the work between I/O and trading logic, which works, although it has some overhead. What I'm going to discuss here is using ring buffers.
There are some usage scenarios which lend themselves nicely to multiprocessing, and one of them is the one-producer, one-consumer scenario. Luckily, this happens all the time in trading, where you get some inputs, pass them to some logic, and then spit things out. The reason a ring buffer can be lock free here is that there are only two shared control objects, the reader offset and the writer offset (actually you can make it just one shared object, the distance between them, but it's easier to keep two different ones). Each offset is written by only one side, and all operations on these two objects are atomic, which means they are not interruptible, so you don't need to lock them! (Okay, I'm kind of lying, since there is actually locking done at the hardware level, but we'll sweep that under the rug for now.)
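To make the two-offset idea concrete, here's a minimal single-producer, single-consumer sketch using C11 atomics. It is not the code attached to this post: the names `spsc_ring`, `ring_push`, `ring_pop` and `RING_SIZE` are made up for illustration, and it assumes a compiler with `<stdatomic.h>`. The release store on an offset is what keeps the data access from being reordered past the offset update.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8u /* hypothetical capacity; must be a power of two */
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be a power of two");

struct spsc_ring {
    _Atomic size_t head;  /* write offset: stored only by the producer */
    _Atomic size_t tail;  /* read offset: stored only by the consumer */
    int slots[RING_SIZE];
};

/* Producer side. Returns false (and writes nothing) if the ring is full. */
static inline bool ring_push(struct spsc_ring *r, int value)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)
        return false;
    r->slots[head & (RING_SIZE - 1)] = value;
    /* Release: the slot write above becomes visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the ring is empty. */
static inline bool ring_pop(struct spsc_ring *r, int *out)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[tail & (RING_SIZE - 1)];
    /* Release: the slot read above finishes before the slot is handed back. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

A zero-initialized `struct spsc_ring` is an empty ring. The offsets here are free-running counters that are masked only when indexing, which lets all `RING_SIZE` slots be used; the kernel-style macros instead keep the offsets wrapped and leave one slot empty.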
Here's how to set this thing up, and it's actually pretty simple. The hard part is writing a good API so that you can abstract away the inner workings and so it looks like a file descriptor or socket or something, which I won't go into here. Note: I've only tested this on Linux (Debian Wheezy). It'll probably work on Windows with some trivial tweaking; let me know if you're able to get it to work. It also probably needs an x86 processor (I'm using an i5).
First, you open a file descriptor which points to a region of shared memory, using the call shm_open. Then you memory map the thing (with mmap) so that it looks just like a region of memory to the program. Then you reserve part of that region (past the end of the ring buffer) for the read and write pointers. Then you set up the ring buffer, with the read and write pointers. Use the macros in <linux/circ_buf.h> to do some calculations for you (read more here). And then that's basically it. The trickiest part is that you have to put in an instruction (a memory barrier) telling the CPU not to reorder the read/write of the data with the instruction that increments the read/write pointer.
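The setup steps above can be sketched roughly like this. It's an illustration under assumptions, not the attached ipc.h: `shm_ring`, `shm_ring_open` and `RING_BYTES` are hypothetical names, the offsets are C11 atomics placed right after the data area, and error handling is kept to the bare minimum.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define RING_BYTES 4096u /* hypothetical buffer size; power of two */
static_assert((RING_BYTES & (RING_BYTES - 1)) == 0, "RING_BYTES must be a power of two");

/* Layout of the shared region: the ring data first, then the two
 * offsets reserved just past the end of the ring buffer. */
struct shm_ring {
    unsigned char data[RING_BYTES];
    _Atomic size_t head; /* write offset, advanced by the producer */
    _Atomic size_t tail; /* read offset, advanced by the consumer */
};

/* Open (creating if needed) the named shared memory object and map it.
 * Both processes call this with the same name. Returns NULL on failure. */
static inline struct shm_ring *shm_ring_open(const char *name)
{
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    /* A freshly created object has length 0; growing it zero-fills it,
     * so both offsets start at 0. Setting the same size again is harmless. */
    if (ftruncate(fd, (off_t)sizeof(struct shm_ring)) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, sizeof(struct shm_ring), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (struct shm_ring *)p;
}
```

When you're done, `munmap` the region and call `shm_unlink` on the name so the object doesn't linger. On older glibc versions you need to link with `-lrt` to get `shm_open`.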
So it took a lot longer than I thought to write the code and make it readable and stuff. It's GPL'ed since I copied some code from the Linux kernel. Don't redistribute it improperly or Linus will come after you! I'm just going to attach it and let you figure it out for yourselves. It includes four files: ipc_build.sh (which builds it), ipc_consumer.c, ipc_producer.c and ipc.h. Once it is built, two processes need to be run, ./produce and ./consume. ./produce will open up the shared memory and write a bunch of ints to it. Once it writes a zero, it will quit. ./consume will open it up and read a bunch of ints. Once it receives a zero, it will quit.
On my machine it achieves about 145MB / sec, or 40M writes / sec, in terms of raw throughput. Throughput in bytes can probably be increased by transferring in chunks of 8, 16, 32 or 64 bytes, although writes / sec will probably remain constant. Considering that I can get about 500M order book messages a day, that is quite fast. Chunk sizes must be powers of two, otherwise they will break the shared memory and the ring buffer.
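Here's a small sketch of why powers of two matter. The helpers are hypothetical (`ring_count` and `ring_space` are modelled on the CIRC_CNT and CIRC_SPACE macros from <linux/circ_buf.h>, but they are not the attached code). The wrap-around is done by masking with size - 1, which only equals a modulo when the size is a power of two. My assumption about chunks is that the same logic applies: a power-of-two chunk divides a power-of-two buffer evenly, so a chunk never straddles the end of the buffer.

```c
#include <assert.h>
#include <stdbool.h>

static inline bool is_power_of_two(unsigned int n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Number of items waiting to be read. The mask trick is only a valid
 * modulo when size is a power of two. */
static inline unsigned int ring_count(unsigned int head, unsigned int tail,
                                      unsigned int size)
{
    assert(is_power_of_two(size));
    return (head - tail) & (size - 1);
}

/* Free slots left for the writer; one slot is always kept empty so that
 * head == tail can only mean "empty". */
static inline unsigned int ring_space(unsigned int head, unsigned int tail,
                                      unsigned int size)
{
    return ring_count(tail, head + 1, size);
}

/* True if a chunk written at offset head would run past the end of the
 * buffer and have to be split in two. */
static inline bool chunk_straddles_end(unsigned int head, unsigned int chunk,
                                       unsigned int size)
{
    return (head & (size - 1)) + chunk > size;
}
```

For example, with a 64-byte buffer, 16-byte chunks land at offsets 0, 16, 32 and 48 and always fit, while a 24-byte chunk written at offset 48 would run off the end.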
Open it up and let me know if you can use it! I'd appreciate feedback, especially critique.