Small sample of ARCA FAST data

nitro · Jul 27, 2009

Quote from squeeze:

I can see the slightly geeky pleasure in achieving this, but given that exchanges don't have 10 Gbps trading interfaces, I can't see that this is that useful.

http://www.nyse.com/press/1245839193594.html

squeeze · Jul 27, 2009

Well, I suppose it was only a matter of time. Happy to leave all this to someone else.

propseeker · Jul 27, 2009

Quote from rufus_4000:

..And this is much faster than our own parser for the flat format, which is only around 500-600k msg/sec...

is this because you're I/O bound here, or by 'flat' are you referring to the non-FAST protocol?

rufus_4000 · Jul 28, 2009

Quote from propseeker:

is this because you're I/O bound here, or by 'flat' are you referring to the non-FAST protocol?

We started with the CME RLC legacy and FIXFAST feeds. I wrote the original RLC code; it was your standard fixed-length string cut-up routine, nothing fancy. In fact, we were about 80% of the way into a fairly conventional (table lookup) implementation of FIXFAST, and for the life of me I couldn't understand why we were only barely breaking the 1.5M msg/sec mark. Then one day, looking at a fragment, it hit me: FIXFAST encoding/decoding is really a lot like a bit-shifting state machine (if you think about it right), and there are fairly new techniques that use SIMD (single instruction, multiple data) to do stream processing with data-level parallelism. On Intel, that means SSE2/SSE3.

After that "aha" moment it was pretty straightforward: just a matter of experimenting with some primitives, seeing how they stack up, and fitting the different bits into one processing "clock". I mean, this is not rocket science. We also took advantage of some compiler trickery to minimize data movement, template initialization, etc.

Funny you mention I/O bound: we actually had a substantially harder time building a simulator that could pump a few hundred MB of captured FIXFAST data into our system.
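To make the SIMD remark above concrete, here is a minimal sketch in C of one reason SSE2 fits this problem. It is not the poster's implementation, which the thread does not show. FAST uses stop-bit encoding: each byte carries 7 data bits, and the high bit is set on the last byte of a field. The SSE2 intrinsic `_mm_movemask_epi8` collects the high bit of 16 bytes into one integer, so a single instruction locates every field boundary in a 16-byte block. The function names `stop_mask16` and `decode_block16` are hypothetical, and the sketch handles only unsigned integer fields, with no templates, presence maps, or operators.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper: bit i of the result is set when buf[i] has its
 * high (stop) bit set. buf must point to at least 16 readable bytes. */
static uint32_t stop_mask16(const uint8_t *buf)
{
#ifdef __SSE2__
    /* One unaligned load plus one movemask covers all 16 bytes. */
    __m128i v = _mm_loadu_si128((const __m128i *)buf);
    return (uint32_t)_mm_movemask_epi8(v);
#else
    /* Portable fallback for builds without SSE2. */
    uint32_t m = 0;
    for (int i = 0; i < 16; i++)
        m |= (uint32_t)(buf[i] >> 7) << i;
    return m;
#endif
}

/* Hypothetical helper: decode the unsigned stop-bit integers that end
 * inside a 16-byte block. Returns the number of complete fields written
 * to out (at most 16). Trailing bytes with no stop bit belong to an
 * incomplete field and are ignored here; a real decoder would carry
 * them into the next block. No overflow check is done on long fields. */
static size_t decode_block16(const uint8_t *buf, uint64_t *out)
{
    assert(buf != NULL && out != NULL);

    uint32_t mask = stop_mask16(buf);
    uint64_t value = 0;
    size_t n = 0;

    for (int i = 0; i < 16; i++) {
        /* Shift in the 7 payload bits of this byte. */
        value = (value << 7) | (uint64_t)(buf[i] & 0x7F);
        if (mask & (1u << i)) { /* stop bit: field is complete */
            out[n++] = value;
            value = 0;
        }
    }
    return n;
}
```

For example, the bytes `0x81, 0x07, 0xAE, 0x80, 0x01` followed by zero padding give a stop mask of `0x0D` and decode to three complete fields: 1, 942 and 0. The trailing `0x01` has no stop bit, so it is left as an incomplete field. The byte loop here is still scalar; the sketch only shows the boundary detection step being done in one instruction.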

propseeker · Jul 28, 2009

yeah, that's funny. those second numbers you posted looked a lot like my i/o bound simulator, so thought i'd ask.

thanks for the tip on sse2. i've never thought of using it for this type of problem, looks like a very nice fit. hopefully i can make some time to attempt something... although i'll need to brush up on my assembler a bit first.

regards

nitro · Jul 28, 2009

Very nice. Impressive.
