Server options

i960 · Jun 20, 2015

hft_boy said:
Logging to disk shouldn't be too much of a performance hit. The main thing to be aware of is to use buffered writes (which is conceptually similar to keeping a queue of structs but is more standard) because calling the write() syscall is expensive. On Unix you can achieve 'background'** writes by piping the output to an intermediate buffer (e.g. mbuffer, pv, buffer). Another thing you might want to consider is to use a very light compressor (lzop or lz4) in between the application and the disk. Counter-intuitively you can actually get higher throughput because the cost of using CPU cycles to do compression is offset by the gain in disk bandwidth usage.

write(2) will still be buffered by the kernel unless one explicitly opens the FD it uses with O_DIRECT. The main performance hit will be the context switch from crossing the userland/kernel boundary on each syscall. None of that is a concern though, as the typical stdio.h routines are all buffered (fwrite, fprintf, etc). I honestly don't think the disk I/O will even be a concern here, as the data will have been reduced from the network side into smaller units. Even when the stdio routines (and write() for that matter) flush their buffers, there's still the filesystem cache, which has its own buffering. In short, the data being written out will pass through multiple buffers and be handled efficiently by the OS for the most part.
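To make the buffering point concrete, here is a minimal sketch in Python (chosen only for brevity; the same idea applies to stdio in C or a BufferedStream in C#). The CountingRaw class is a made-up in-memory stand-in for a file descriptor, so what gets counted are calls into the layer below the buffer, not real syscalls.

```python
import io


class CountingRaw(io.RawIOBase):
    """Hypothetical in-memory stand-in for an FD; counts write() calls."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.data = bytearray()

    def writable(self):
        return True

    def write(self, b):
        self.calls += 1          # each call here would be one write(2)
        self.data += bytes(b)
        return len(b)


def write_records(sink, n=10_000, record=b"x" * 32):
    """Write n small fixed-size records, then flush."""
    for _ in range(n):
        sink.write(record)
    sink.flush()


# Every record goes straight to the "descriptor".
unbuffered = CountingRaw()
write_records(unbuffered)

# Same records, but through a 64 KiB user-space buffer first.
raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=64 * 1024)
write_records(buffered)

print("unbuffered calls:", unbuffered.calls)
print("buffered calls:  ", raw.calls)
```

The unbuffered path makes one underlying write per record, while the buffered path makes only a handful for the same bytes. That is the whole saving being discussed; the page cache then sits below both.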

**Occurring in the 'background' is a bit misleading here. Technically writes are performed to the pipe, typically implemented in the kernel as a ring buffer. Some memory copies are therefore being performed but the overhead is much lower and more deterministic than hitting the disk. If you were really enterprising you could avoid the overhead: manage the buffers in the application and use non-blocking writes, or multiple threads with some machinery for synchronization like some sort of ring buffer, or abuse the garbage collector by maintaining some sort of linked list, or get more creative. Sky's the limit really.

Any reasonably modern Unix kernel will buffer pages to be written to backing store via a page cache, and the kernel will handle that on its own (Linux users can see metrics on this in /proc/meminfo). Where it would be an issue is if the buffers cannot be flushed as fast as the caller is writing to them. I doubt that'll be the case. WRT non-blocking file I/O, the options are either threads or async I/O (aio.h), since regular files are already "non-blocking" as far as select/poll are concerned (they always report ready).
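For anyone who wants to watch the page cache while a recorder runs, a rough Python sketch follows. It pulls the Dirty and Writeback fields out of /proc/meminfo on Linux and falls back to a small sample elsewhere; the sample numbers are invented purely for illustration.

```python
from pathlib import Path


def parse_meminfo(text):
    """Return {field: value} for 'Name:   123 kB' style lines.

    Values are as reported by the kernel (kB for most fields).
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


# Illustrative sample only; these numbers are made up.
SAMPLE = """MemTotal:       16307560 kB
Dirty:              1284 kB
Writeback:             0 kB
"""

proc = Path("/proc/meminfo")
text = proc.read_text() if proc.exists() else SAMPLE
info = parse_meminfo(text)

# Dirty = pages waiting to be written out, Writeback = pages being written now.
print({key: info.get(key) for key in ("Dirty", "Writeback")})
```

If Dirty keeps growing for as long as the recorder runs, the disk is not keeping up with the writer, which is the one case flagged above as a real problem.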

Best bet here is to write a coarse prototype, get it working, make it correct, then profile it to see where the actual bottlenecks are. My bet is it'll be entirely I/O bound on the network.

i960 · Jun 20, 2015

cjbuckley4 said:
I have also decided that it may be best to skip parsing the IQFeed data into a struct all together and simply add the data as it comes off the sockets to the queue and then write to binary files in a separate thread. This way I can do away with an entire (not very intensive) step.

Don't do this. For one, you're already pre-optimizing here. Secondly, that intermediate struct is your friend. Serialization from the read buffers into an intermediate struct abstracts the data into an atomic unit you can pass around at very little cost. Writing it direct to disk via a separate thread would already require a struct or class to be placed into a thread-aware queue anyway, and if you just pass it a buffer of temporary bytes or even the FD itself you're not really saving any time or resources here you're just shifting them around. Remember, the stack is queuing everything being received by the network driver into a receive queue (netstat -an on a Linux host and you'll see it [recv-q/send-q]). The size remaining in this queue is communicated to the sender via TCP windowing to provide flow control, you don't need to worry about it unless you're filling up queues consistently.

If you instead don't use a separate thread and also don't use a separate struct but just write it right to storage from the function processing the input data off the stack you now have a function that's basically highly coupled to network input and storage output and that's not good design nor is it really saving *that* much in the grand scheme of things.
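A minimal sketch of the "intermediate struct" idea, in Python for brevity. The comma-separated message and the 28-byte record layout are both hypothetical; they are not IQFeed's actual wire format or the TeaFiles layout. The point is only that parsing, the in-memory unit, and the on-disk encoding stay separate.

```python
import struct
from typing import NamedTuple


class Tick(NamedTuple):
    """The intermediate unit that gets passed around (fields are assumed)."""
    symbol: str
    price: float
    size: int
    ts_ns: int


# Hypothetical on-disk layout: 8-byte symbol, float64 price,
# uint32 size, uint64 timestamp, little-endian, no padding.
TICK_FORMAT = struct.Struct("<8sdIQ")


def parse_tick(line: bytes) -> Tick:
    """Network side: text message -> Tick. Knows nothing about storage."""
    symbol, price, size, ts_ns = line.decode("ascii").strip().split(",")
    return Tick(symbol, float(price), int(size), int(ts_ns))


def pack_tick(tick: Tick) -> bytes:
    """Storage side: Tick -> fixed-width record. Knows nothing about sockets."""
    return TICK_FORMAT.pack(
        tick.symbol.encode("ascii"), tick.price, tick.size, tick.ts_ns
    )


def unpack_tick(blob: bytes) -> Tick:
    symbol, price, size, ts_ns = TICK_FORMAT.unpack(blob)
    return Tick(symbol.rstrip(b"\0").decode("ascii"), price, size, ts_ns)


raw = b"ESU5,2095.25,3,1434801600123456789\n"  # made-up message
tick = parse_tick(raw)
blob = pack_tick(tick)
print(len(raw), "bytes as text ->", len(blob), "bytes packed")
```

Swapping the feed (a different parse_tick) or the file format (a different pack_tick) then touches one function each, which is the decoupling being argued for.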

Write it straightforwardly. Since you haven't worked directly with Berkeley style sockets and non-blocking I/O I can *guarantee you* that the initial 90% of your time is going to be spent trying to figure out how to even do it correctly. It's not rocket science, but it's not hello world stuff. There are multiple ways you can hang yourself while learning it, and that's going to take the vast majority of the initial time.

cjbuckley4 said:
Forgive me if my understanding is off here, but even if I do get the full 1 Gbs data that you can fit through a standard port, I could store it all in a queue and write it to disk at about 200 mbs, so as long as I don't see sustained bursts of multiple seconds like that, it will be a total nonissue.

You will not see line rate from an off-the-shelf network card. You may see 600-800Mbps at best, but you'll never see flat out 1Gbps (nor will you even be able to receive that without direct switch port connectivity). Not gonna happen. On top of that, the amount of data written over the wire (actual bandwidth required) is not the same amount of bandwidth which will be needed for storage. You're going to be taking data that is of lower density and higher frequency and distilling it down to a more efficient final form. In fact what benefits you highly here is if the provider is able to give you raw binary data and a protocol API of some sort. This reduces the amount of assembly/reassembly and bandwidth needed on both sides. It is more proprietary though so finding a "retail" provider who will do this might be more difficult.
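The units are easy to mix up here, so a quick back-of-the-envelope check in Python. It assumes the quoted "200 mbs" disk figure means 200 megabytes per second; that reading is a guess, not something stated in the thread.

```python
def mbps_to_megabytes_per_s(megabits_per_s):
    """Convert a network rate in Mbps to MB/s (8 bits per byte)."""
    return megabits_per_s / 8


line_rate_mb_s = mbps_to_megabytes_per_s(1000)                        # full 1 Gbps
realistic_mb_s = [mbps_to_megabytes_per_s(r) for r in (600, 800)]     # the 600-800 Mbps range
disk_mb_s = 200                                                       # assumed: "200 mbs" == 200 MB/s

print("1 Gbps line rate :", line_rate_mb_s, "MB/s")
print("600-800 Mbps     :", realistic_mb_s, "MB/s")
print("disk headroom    :", disk_mb_s / line_rate_mb_s, "x line rate")
```

Under that assumption even a saturated gigabit link delivers less than the disk can absorb, before counting any reduction from parsing the feed into a denser binary form.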

Also, might consider talking to the Nanex people about this - as I'm sure they have direct experience on this type of thing (NxCore?). On top of that, consider this as well: http://www.cmegroup.com/market-data/distributor/market-data-platform.html

cjbuckley4 · Jun 20, 2015

Thanks for the link from CME. I obviously am not watching the full depth nor every future in each subcategory, but it still helps me establish expectations.

With regard to parsing to a struct, you may be right, but first let me make sure we're clear here. I'm using System.Net.Sockets to receive the byte[], and I could write that directly to a plain unstructured binary file with BinaryWriter. Alternatively, I could call ASCIIEncoding.GetString(message, 0, bytesRead) to convert the incoming message into a string, which I would then parse into a struct and write to a TeaFile structured binary file under .NET. My concern was not actually whether parsing into a struct would be too computationally intensive, it was whether teafile.Write() would be able to achieve the same IO as BinaryWriter, since I could find no info on it but plenty on the performance of BinaryWriter. I realize that Teafile.Write() is probably just a thin wrapper around the underlying C# or even the C mechanisms for writing to a file, but I was still unsure so I wanted to play it safe.

Parsing server side is undoubtedly easier since I already wrote all the code, and as I start to understand the buffering of sockets above and the max possible network IO vs what sort of write performance is easily achievable, I've become much less concerned about where I parse it and how much horsepower I actually need. I've also found some server providers who will let me scale up without much trouble and quit with no commitment, so I can start small and take baby steps toward more horsepower as needed. I spoke to Rithmic, and their feed doesn't have any parsing involved, which to me says that the data arrives in a structured format already. Based on that, it might be easier to keep everything I get via FTP from the server normalized to one format anyway, so using some kind of structure in my IQFeed reader is gonna be necessary if I go that route.
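One detail worth sketching for the bytes-to-string step: TCP is a byte stream, so a single read can end in the middle of a message, and decoding each read on its own would then split a record. Below is a minimal Python sketch of a line framer. It assumes messages are newline-terminated ASCII, and the sample messages are made up; the C# equivalent would keep the leftover bytes between Receive calls in the same way.

```python
class LineFramer:
    """Accumulates raw socket bytes and yields only complete lines."""

    def __init__(self):
        self._pending = bytearray()  # bytes of a not-yet-complete message

    def feed(self, chunk: bytes):
        """Add one read's worth of bytes; return any complete messages."""
        self._pending += chunk
        *complete, rest = bytes(self._pending).split(b"\n")
        self._pending = bytearray(rest)  # keep the partial tail for next time
        return [line.rstrip(b"\r").decode("ascii") for line in complete]


framer = LineFramer()
first = framer.feed(b"Q,ESU5,20")                 # read ends mid-message
second = framer.feed(b"95.25\r\nQ,GEZ5,99.5")     # completes one, starts another
third = framer.feed(b"10\r\n")                    # completes the second

print(first, second, third)
```

Only the strings returned by feed() are safe to hand to the parser; the partial tail stays in the framer until the rest arrives.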

With regard to NxCore, I've received several recommendations to go that route in the past, but have been hesitant for a number of reasons. I believe Rithmic is the best feed available to retail traders for futures data that arrives timestamped, and because I plan to use IQFeed to ultimately watch 500-1000 backup orderbooks on the futures I watch, futures options, futures Rithmic doesn't cover, ETPs with futures as their basis, and the equity principal components of a few index futures, I see no reason to pay more for the NxCore feed which timestamps at a lower granularity than IQFeed. My opinion may change there as I decide to watch more instruments and my pockets get deeper (remember we have a college student here). Thanks for your excellent advice!

hft_boy · Jun 20, 2015

i960 said:
Any reasonably modern Unix kernel will buffer pages to be written to backing store via a page cache, and the kernel will handle that on its own (Linux users can see metrics on this in /proc/meminfo). Where it would be an issue is if the buffers cannot be flushed as fast as the caller is writing to them. I doubt that'll be the case. WRT non-blocking file I/O, the options are either threads or async I/O (aio.h), since regular files are already "non-blocking" as far as select/poll are concerned (they always report ready).

Best bet here is to write a coarse prototype, get it working, make it correct, then profile it to see where the actual bottlenecks are. My bet is it'll be entirely I/O bound on the network.

Good point about the buffer cache. Totally agree, it is best to implement first -- and then if the disk [slash insert X here] is a bottleneck go crazy trying to optimize.

volpunter · Jun 21, 2015

You are concerning yourself with the wrong issues. There are no hardware or network bandwidth issues you need to worry about. Instead, any potential bottleneck is on the software side and is to be avoided there. Any standard .NET TCP client does fine for your purposes. The main issue that I would focus on is how to store data in memory and infrequently persist it to disk. I have not yet read any posts other than this first one of yours.

cjbuckley4 said:
I'm hoping someone here could offer me some guidance with regard to servers. Here is my problem:

I've written a C# TCP/IP Protocol application that receives both level I trade and quote data as well as depth of market data from IQFeed. This program takes these incoming messages, parses them into a struct, and writes them to a structured binary file. I've tested this program on my home machine with a 3.4ghz Intel ivy bridge processor and a cheap SSD. When watching between 10-20 EMinis, it consistently uses less than 7% of my processor, but it's not like I'm watching this for all possible market scenarios or during peak hours as I have an internship and school and whatnot. Consider 8% my best conservative guess. The program is not yet multithreaded, but I hope to learn about how to do this properly soon. Additionally, I hope to add a similar depth of market feed handler for Rithmic in the near future, but since I'm still hammering out the final details before I deploy this program, I see no reason to spend more on multiple feeds. This is more of a "case study" about how to do this properly at the moment.

The purpose of this program is to persist incoming depth of market and trade and quote data for research purposes. I am doing this because depth of market historical data is quite expensive, difficult to find, and impossible to know the quality of unless you record it yourself. Having spoken to many HFT folks (some on this forum whose time I greatly appreciate), they assure me that this is the way to go. To do this, I must deploy this program to a server. I am not interested in a colocation solution. The discussions here that are labelled "colocation" do not even pertain to real colocation. Call or email someone to get a quote and you'll see what I'm talking about. This will simply be a server solution perhaps *proximity hosted* at Cermak because I would like to avoid the public internet as much as possible. IQFeed doesn't disseminate from Chicago, so my reason for wanting to host there is mainly to avoid possible points of failure and make the transition to Rithmic or TT/CQG/CTS smoother when it happens. If hosting elsewhere is dramatically cheaper, I'm open to that as well. I do record the latency of every message coming in, but that's only as accurate as the Windows system clock and whatever IQFeed does to normalize time, so I don't put a whole lot of stock into it...more of a heuristic. Additionally, my program and IQFeed are written in C# and must run on Windows Server 2008 or higher--I promise I like Linux as much as anyone here, but I'm even less interested in discussing that than I am colocation, so let's just assume that I'm not flexible on my OS. My main concern is handling the feeds without missing a tick and keeping good uptime before latency.

My usage will be as follows: I will start by tracking only EMinis and some major commodity futures as well as exchange supported spreads. I will likely scale up to the full allowance of 500 symbols at some point. Additionally, once I add Rithmic, one could assume the demands would move even higher as I plan to track the full allowable depth of the book on at least 100 different contracts at a time, as well as my original IQFeed subscriptions.

So now that I've laid out what I have going on, here's my question: what kind of server accommodations do I need to pull this off in a scalable yet cost effective manner? Although the name doesn't inspire much confidence, this site has a variety of VPS, VDS, and dedicated server options that might be a launching point for this discussion, although few are in Cermak. If anyone here has any experience with this sort of stuff, please offer whatever info you can. I'm hoping that by writing to binary files incrementally and keeping the file size fairly small, I can keep this rather cheap, but I simply have no idea what is necessary other than what I've told you above about my experiences watching Eminis at home. Also, if anyone has any experience with logging server resource usage, I'd be interested in hearing about it. I know there's "Performance Monitor" in Windows Server, but as you can see, I'm just trying to get an idea of how all this stuff works. Finally, would it be better to store the IQFeed message structures in a queue and then write them to the binary file in a dedicated thread versus writing them directly as they come in?

volpunter · Jun 21, 2015

Why should he re-write anything in another language? C# is perfectly capable of handling this. He can write to memory, while another thread infrequently makes a copy of a certain chunk of data, serializes it and writes it to disk. A circular buffer comes to mind but there are tons of other simple solutions. Why make life complicated when it can be easy? (I will refer to .NET TPL Dataflow in later comments, which is the perfect solution for such a scenario.)
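The "receive into memory, persist from another thread" pattern can be sketched in a few lines. This is Python for brevity, not C#, and it uses a plain thread-safe queue instead of a circular buffer or TPL Dataflow; the batch size and the in-memory sink are arbitrary stand-ins.

```python
import io
import queue
import threading

STOP = object()  # sentinel telling the writer to drain and exit


def writer_loop(q, sink, batch_size=256):
    """Pull records off the queue and write them to the sink in batches."""
    batch = []
    while True:
        item = q.get()
        if item is STOP:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            sink.write(b"".join(batch))  # one write per batch, not per record
            batch.clear()
    if batch:                            # flush whatever is left at shutdown
        sink.write(b"".join(batch))
    sink.flush()


q = queue.Queue(maxsize=10_000)  # bounded, so a stalled disk applies back-pressure
sink = io.BytesIO()              # stand-in for open("ticks.bin", "ab")
writer = threading.Thread(target=writer_loop, args=(q, sink))
writer.start()

# Stand-in for the socket handler: enqueue already-serialized records.
for i in range(1000):
    q.put(i.to_bytes(4, "little"))

q.put(STOP)
writer.join()
print(len(sink.getvalue()), "bytes persisted")
```

In C# the same shape falls out of a BlockingCollection or a Dataflow block; the receive path only ever touches the queue, and the disk is touched by exactly one thread.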

i960 said:
This is pretty much nothing for a modern CPU to handle. You're literally just processing incoming socket data and serializing to binary. Even making it multithreaded is not going to result in a significant improvement because you're I/O bound on network. The network stack is also queuing while you're serializing so there's already some implicit concurrency going on that you're not directly in control of. A typical event loop with a reasonable select/poll() derivative (which Windows has) is a decent approach here.

I know you don't want to hear about Linux but you *should* rewrite this in POSIX C using standard Berkeley sockets and then you'll have no issue running on any platform (including linux). Otherwise you're stuck using a Windows solution and C#.

Where you will probably start running into issues is with 500+ symbols and pure network I/O load.
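For reference, the select/poll style event loop mentioned in the quote looks roughly like this. It is a Python sketch using the standard selectors module, with a local socket pair standing in for the feed connection; the payload is made up.

```python
import selectors
import socket


def drain(sock, on_bytes):
    """Single-threaded event loop: read whenever the socket is ready, until EOF."""
    sel = selectors.DefaultSelector()   # picks epoll/kqueue/select as available
    sock.setblocking(False)
    sel.register(sock, selectors.EVENT_READ)
    connected = True
    while connected:
        for key, _events in sel.select(timeout=1.0):
            chunk = key.fileobj.recv(4096)
            if chunk:
                on_bytes(chunk)         # hand off to parsing/queueing
            else:                       # empty read: the peer closed the connection
                sel.unregister(key.fileobj)
                connected = False
    sel.close()


# A connected pair of local sockets stands in for the feed and the client.
feed, client = socket.socketpair()
feed.sendall(b"tick-1\ntick-2\n")
feed.close()

received = bytearray()
drain(client, received.extend)
client.close()
print(bytes(received))
```

While the loop is busy in on_bytes, the kernel keeps queuing incoming data in the socket receive buffer, which is the implicit concurrency described above.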

volpunter · Jun 21, 2015

Don't waste time re-inventing the wheel when you already have something in C#. Have you looked at TPL Dataflow? It would be the perfect approach for your situation. It is probably as performant as any C++ solution, given you are not latency dependent. You can post your incoming socket data to a dataflow component and can do all deserialization/serialization work there, you can with one simple command specify whether you want to do all of that on multiple threads at the same time or not, you can fan out data, merge data, or simply perform operations on the data and move the results to the next data block. You should really take a look at tpl dataflow because this is exactly what this technology aims to solve.

cjbuckley4 said:
Thanks for your reassuring reply. You're correct about the sockets. I dispensed with their significantly easier COM API library (which I now understand to be just a wrapper around the sockets) because I believed this would be (foremost) a useful exercise and marginally faster.

My hands are pretty much tied with IQFeed as they don't support linux, else I would've gone that route if only because servers are cheaper. I'm a young CS major...not yet that experienced with real world problems like this, and frankly using WINE with sockets scares me. I am however open to/considering doing my Rithmic implementation on a separate linux server in C++ as my anecdotal research gives me the impression they're platform independent. In that case, I would use POSIX sockets if possible, but I don't even know how their feed works so I don't want to speculate as to whether they have a TCP/IP Protocol or what. Obviously the easy route is to just stick with one server and keep everything in C#, but I haven't really gotten to the Rithmic bridge yet. I'm taking a course on C/C++ in a linux environment that covers Berkeley sockets in fall so maybe I'll be better prepared to tackle that aspect of this project then.

If I was to look at 500+ symbols, what sort of hardware would you speculate would be necessary? I hear conflicting things between "it will run on your laptop" and "you need a direct line and a network card that costs more than your car," so I'm obviously a bit concerned. From what I've seen, I can get four dedicated cores of 3.0+ghz Xeon processors at a reasonable price, but that might be overkill...I really don't know.

I think my next move will be to simply build a list of any old active symbols between futures and stocks and add them to my watch list and just see what happens.

volpunter · Jun 21, 2015

exactly, no need to talk about latency here.

i960 said:
To clarify the latency concern: If the packets you're receiving contain a server side timestamp (usually 8 bytes for epoch+nsec) then the latency simply doesn't matter. You could receive the packets a day later and would have the exact same accuracy and precision as if you received them 1msec ago. You're not measuring how quickly you can receive DOM updates, you're simply storing time series data.
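A small Python sketch of why a server-side timestamp makes delivery latency irrelevant for recording. The packet layout (a little-endian uint64 of nanoseconds since the epoch, followed by the payload) is an assumption for illustration only; a real feed defines its own encoding.

```python
import struct
import time

# Assumed layout: 8-byte little-endian uint64, nanoseconds since the epoch.
TIMESTAMP = struct.Struct("<Q")


def make_packet(server_ts_ns, payload):
    """What the server would send: its own timestamp, then the payload."""
    return TIMESTAMP.pack(server_ts_ns) + payload


def read_packet(packet):
    """What the recorder stores: (server timestamp, payload)."""
    (server_ts_ns,) = TIMESTAMP.unpack_from(packet)
    return server_ts_ns, packet[TIMESTAMP.size:]


packet = make_packet(1_434_801_600_123_456_789, b"ES bid 2095.25")  # made-up values

ts_immediate, _ = read_packet(packet)
time.sleep(0.01)                 # stand-in for any amount of delivery delay
ts_delayed, payload = read_packet(packet)

print(ts_immediate == ts_delayed, payload)
```

The stored timestamp is identical however late the packet is read; only a locally measured receive time would depend on latency and on the local clock.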

cjbuckley4 · Jun 21, 2015

Thanks @volpunter , you're correct about where I should be concerning myself I think. No need to read it all really. The rest of the thread is pretty much me coming to that realization with the help of others on here. Although IQFeed doesn't send any data that I'm aware of/interested in over the weekend, I got a cheap starter Virtual "Dedicated" Server up and running and everything seems to be working as planned. The real testing will start in a few hours when things heat up and I get my first real load on the system server side. I'm not even 100% sure what my full watch list will include at the end of the day, but I have some ES and GE contracts on there that I'm sure will give me a good indication of how I'll do with more active contracts. As you can see from reading through though, I'm really not worried at all at this point. I think I'll probably be able to handle everything IQFeed throws at me as well as Rithmic most likely on an 80-120 dollar a month virtual "dedicated" server. The only concern I have with the VDS approach is that I don't know how exactly the writing to hard disk will work in this kind of environment. I will move it to a dedicated server if it becomes an issue though.

EDIT: just saw your last post about TPL. I have not looked into it, but it does sound like a possibility. I'll look into it. Thanks for the suggestion.

volpunter · Jun 21, 2015

Agree, it's very important to deserialize the data and create your own data structure (struct or class objects) which you then later serialize and persist to disk.

I can show how this whole solution is done with C# TPL Dataflow in less than 200 lines of code and without the slightest bottleneck on the implementation side. It sounds like you (i960) make it sound a lot harder than it really is. With TPL DF you can simply have the socket write incoming data async to the first dataflow component and the rest is done right away on different threads. You would never even have to start to concern yourself with blocking I/O. With dataflow blocks in C# I am absolutely sure that a stock i5 machine with 8 GB of memory and an SSD will be fully sufficient to handle any traffic up to the limits of the network card's capacity.

i960 said:
Don't do this. For one, you're already pre-optimizing here. Secondly, that intermediate struct is your friend. Serialization from the read buffers into an intermediate struct abstracts the data into an atomic unit you can pass around at very little cost. Writing it direct to disk via a separate thread would already require a struct or class to be placed into a thread-aware queue anyway, and if you just pass it a buffer of temporary bytes or even the FD itself you're not really saving any time or resources here you're just shifting them around. Remember, the stack is queuing everything being received by the network driver into a receive queue (netstat -an on a Linux host and you'll see it [recv-q/send-q]). The size remaining in this queue is communicated to the sender via TCP windowing to provide flow control, you don't need to worry about it unless you're filling up queues consistently.

If you instead don't use a separate thread and also don't use a separate struct but just write it right to storage from the function processing the input data off the stack you now have a function that's basically highly coupled to network input and storage output and that's not good design nor is it really saving *that* much in the grand scheme of things.

Write it straightforwardly. Since you haven't worked directly with Berkeley style sockets and non-blocking I/O I can *guarantee you* that the initial 90% of your time is going to be spent trying to figure out how to even do it correctly. It's not rocket science, but it's not hello world stuff. There are multiple ways you can hang yourself while learning it, and that's going to take the vast majority of the initial time.

You will not see line rate from an off-the-shelf network card. You may see 600-800Mbps at best, but you'll never see flat out 1Gbps (nor will you even be able to receive that without direct switch port connectivity). Not gonna happen. On top of that, the amount of data written over the wire (actual bandwidth required) is not the same amount of bandwidth which will be needed for storage. You're going to be taking data that is of lower density and higher frequency and distilling it down to a more efficient final form. In fact what benefits you highly here is if the provider is able to give you raw binary data and a protocol API of some sort. This reduces the amount of assembly/reassembly and bandwidth needed on both sides. It is more proprietary though so finding a "retail" provider who will do this might be more difficult.

Also, might consider talking to the Nanex people about this - as I'm sure they have direct experience on this type of thing (NxCore?). On top of that, consider this as well: http://www.cmegroup.com/market-data/distributor/market-data-platform.html