Small sample of ARCA FAST data

Quote from rufus_4000:

...And this is much faster than our own parser for the flat format, which is only around 500-600k msg/sec...
is this because you're IO bound here, or by 'flat' are you referring to the non-fast protocol?
 
Quote from propseeker:

is this because you're IO bound here, or by 'flat' are you referring to the non-fast protocol?

We started with the CME RLC legacy and FIXFAST feeds. I wrote the original RLC code; it was your standard fixed-length string cut-up routine, nothing fancy. In fact, we were about 80% into a fairly conventional (table lookup) implementation of FIXFAST, and for the life of me, I couldn't understand why we were only barely breaking the 1.5M msg/sec mark. Then, one day, looking at a fragment, it hit me: FIXFAST encoding/decoding is really a lot like a bit-shifting state machine (if you think about it right), and there are fairly new techniques that use SIMD (single instruction, multiple data) to do stream data processing with data-level parallelism, known on Intel as SSE2/SSE3.
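To make the "bit-shifting state machine" remark concrete, here is a minimal scalar sketch in C of FAST stop-bit decoding for a single unsigned integer field: each byte carries 7 data bits, and a set high bit (0x80) marks the last byte of the field. The name fast_decode_u32 is hypothetical, and this is the plain byte-at-a-time approach, not the SIMD implementation described in this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one stop-bit-encoded unsigned integer starting at buf[0].
 * Returns the number of bytes consumed, or 0 if no stop bit is found
 * within len bytes (truncated input). No overflow check is done: fields
 * longer than 5 bytes silently wrap, which a real decoder would reject. */
size_t fast_decode_u32(const uint8_t *buf, size_t len, uint32_t *out)
{
    assert(buf != NULL && out != NULL);
    uint32_t value = 0;
    for (size_t i = 0; i < len; i++) {
        /* Shift in the 7 payload bits of this byte. */
        value = (value << 7) | (uint32_t)(buf[i] & 0x7F);
        if (buf[i] & 0x80) {        /* stop bit: field ends here */
            *out = value;
            return i + 1;
        }
    }
    return 0;                       /* ran out of input before a stop bit */
}
```

For example, the three bytes 0x39 0x45 0xA3 decode to 942755, consuming 3 bytes. The loop carries one small piece of state (the accumulated value) from byte to byte, which is the state-machine shape the post alludes to.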

After that "aha" moment, it was pretty straightforward. It was just a matter of experimenting with some primitives, seeing how they stack up, and fitting the different bits into one processing "clock". I mean, this is not rocket science. We also took advantage of some compiler trickery to minimize data movement, template initialization, etc.
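The post does not say which primitives were used, so the following is only a guess at what one might look like: a portable C sketch that finds the stop bits of 8 bytes in one go using plain 64-bit arithmetic (SIMD within a register). It stands in for the SSE2 intrinsic _mm_movemask_epi8, which does the same job for 16 bytes. The names load_le64, stop_bit_mask8 and first_stop_index are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble 8 bytes into a word with p[0] in the lowest byte, independent
 * of host endianness. Optimizing compilers often reduce this to one load
 * on little-endian targets. */
static uint64_t load_le64(const uint8_t *p)
{
    uint64_t w = 0;
    for (int i = 0; i < 8; i++)
        w |= (uint64_t)p[i] << (8 * i);
    return w;
}

/* Bit i of the result is set when p[i] has its stop bit (0x80) set. */
unsigned stop_bit_mask8(const uint8_t *p)
{
    assert(p != 0);
    /* Move each byte's top bit down to bit 0 of that byte. */
    uint64_t hi = (load_le64(p) >> 7) & 0x0101010101010101ULL;
    /* The multiply gathers those 8 scattered bits into the top byte. */
    return (unsigned)((hi * 0x0102040810204080ULL) >> 56);
}

/* Index of the first byte that ends a field, or 8 if none of the 8 does. */
unsigned first_stop_index(unsigned mask)
{
    unsigned i = 0;
    while (i < 8 && !(mask & (1u << i)))
        i++;
    return i;
}
```

With the field boundaries available as a bitmask, a decoder can work out how many bytes each field occupies before touching the payload bits, instead of testing one byte per loop iteration.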

Funny you mentioned I/O bound: we actually had a substantially more difficult time constructing a simulator that pumps a few hundred MB of captured FIXFAST data into our system.
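One common way to keep such a replay from being I/O bound is to load the capture into memory first and time only the pass over the buffer. The sketch below assumes that setup. It counts stop-bit-terminated fields rather than messages, since message boundaries depend on the templates, and the names count_fields and fields_per_sec are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Count stop-bit-terminated fields in an in-memory capture:
 * every byte with its high bit set ends one field. */
size_t count_fields(const uint8_t *buf, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (size_t)(buf[i] >> 7);
    return n;
}

/* Replay the same buffer `passes` times and return fields per second of
 * CPU time, or 0.0 if the elapsed time is too small for clock() to see. */
double fields_per_sec(const uint8_t *buf, size_t len, int passes)
{
    assert(buf != NULL && passes > 0);
    volatile size_t sink = 0;   /* keeps the result from being discarded */
    clock_t t0 = clock();
    for (int p = 0; p < passes; p++)
        sink += count_fields(buf, len);
    clock_t t1 = clock();
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return secs > 0.0 ? (double)sink / secs : 0.0;
}
```

An optimizing compiler may still compute the count once and reuse it across passes, so a serious benchmark would feed different data on each pass or use a finer-grained timer.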
 
Quote from rufus_4000:

We started with the CME RLC legacy and FIXFAST feeds. I wrote the original RLC code; it was your standard fixed-length string cut-up routine, nothing fancy. In fact, we were about 80% into a fairly conventional (table lookup) implementation of FIXFAST, and for the life of me, I couldn't understand why we were only barely breaking the 1.5M msg/sec mark. Then, one day, looking at a fragment, it hit me: FIXFAST encoding/decoding is really a lot like a bit-shifting state machine (if you think about it right), and there are fairly new techniques that use SIMD (single instruction, multiple data) to do stream data processing with data-level parallelism, known on Intel as SSE2/SSE3.

After that "aha" moment, it was pretty straightforward. It was just a matter of experimenting with some primitives, seeing how they stack up, and fitting the different bits into one processing "clock". I mean, this is not rocket science. We also took advantage of some compiler trickery to minimize data movement, template initialization, etc.

Funny you mentioned I/O bound: we actually had a substantially more difficult time constructing a simulator that pumps a few hundred MB of captured FIXFAST data into our system.

yea that's funny, those second numbers you posted looked a lot like my i/o bound simulator, so thought i'd ask.

thanks for the tip on sse2, i've never thought of using it for this type of problem, looks like a very nice fit. hopefully i can make some time to attempt something... although, i'll need to brush up on my assembler a bit first ;). regards
 
Very nice. Impressive.

 