Originally posted by metooxx
God bless you if you did that, but I have been doing this for a long time, with a lot of different groups, and I do not believe it unless you are talking about a minor subset, say last sale on NYSE.
Your technique falls apart on multiple systems trading multiple markets or a large number of symbols.
I'm processing every trade for every issue in the TAQ database. I start with 36 months of data (49 GB), which contains 14,958 unique CUSIP numbers. I filter this down to about 120 statistical channels per day per issue (2.4 GB of cached data). Most of these statistics are derived from time and sales (T&S). That operation takes about an hour. The data gets loaded into Matlab as needed to design indicators, which are cached between simulations along with the results of any calculations that would otherwise be repeated. These additional caches occupy a few more GB. My simulations only examine the top 3000 to 6000 stocks by liquidity. But to avoid survivorship issues, I rebalance that set, and smaller sets, frequently.
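The rebalancing idea can be sketched in a few lines. This is Python rather than Matlab, the tickers and liquidity figures are invented, and it assumes (the post does not say) that each period's universe is picked using only the previous period's liquidity, so that issues which later fade or delist still appear in the periods where they would actually have been traded.

```python
# Hypothetical data: average daily dollar volume per issue, per rebalance period.
# Tickers and numbers are invented for illustration only.
liquidity_by_period = {
    "2003-01": {"AAA": 9.0e6, "BBB": 4.0e6, "CCC": 7.5e6, "DDD": 1.0e6},
    "2003-02": {"AAA": 8.0e6, "BBB": 6.5e6, "CCC": 0.5e6, "EEE": 7.0e6},
    "2003-03": {"AAA": 3.0e6, "BBB": 9.0e6, "EEE": 8.5e6, "FFF": 5.0e6},
}


def top_n_universe(liquidity, n):
    """Return the n most liquid issues, ties broken alphabetically."""
    ranked = sorted(liquidity.items(), key=lambda kv: (-kv[1], kv[0]))
    return [symbol for symbol, _ in ranked[:n]]


def rebalanced_universes(liquidity_by_period, n):
    """Map each period to the universe chosen from the PREVIOUS period's data.

    Assumption: selection uses only information available before the period
    starts, which is one common way to avoid survivorship bias.
    """
    periods = sorted(liquidity_by_period)
    universes = {}
    for prev, cur in zip(periods, periods[1:]):
        universes[cur] = top_n_universe(liquidity_by_period[prev], n)
    return universes


universes = rebalanced_universes(liquidity_by_period, n=2)
```

Here CCC stays in the 2003-02 universe even though its liquidity collapses during that month, because it was selected on 2003-01 figures. Choosing the set with hindsight would quietly drop it.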
The TAL data interface is very similar. In any case, I like to use my own database and file formats, structured as simple 2D or 3D arrays and ordered according to how the data is accessed. If the access pattern changes, I just transpose, select, or sort rows or columns and cache the result again. Matlab can do this in one line of code.
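As a rough illustration of re-laying-out a cached array to match the access pattern, here is a minimal sketch in plain Python. The layout (days as rows, issues as columns), the names, and the numbers are all assumptions made up for the example; in Matlab the same steps would be a transpose or an index expression.

```python
# Hypothetical cache: rows are trading days, columns are issues.
days = ["d1", "d2", "d3"]
issues = ["AAA", "BBB"]
by_day = [
    [10.0, 20.0],
    [11.0, 19.5],
    [10.5, 21.0],
]


def transpose(table):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(col) for col in zip(*table)]


def select_columns(table, keep):
    """Keep only the listed column indices, in the given order."""
    return [[row[j] for j in keep] for row in table]


# Issue-major layout: one row per issue, so each issue's history is contiguous.
by_issue = transpose(by_day)

# Reorder columns (here BBB before AAA) before caching under the new layout.
reordered = select_columns(by_day, [1, 0])
```

The day-major form suits cross-sectional work (all issues on one day), while the issue-major form suits per-issue time series. Caching both trades disk space for not having to reshuffle on every read.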
Originally posted by saschabr
You will find that many other brilliant coders have tried what you want to do now, and they failed, or at least found nothing really exciting.
I knew of a software firm here in Germany (forgot its name, sorry); they employed 3 really brilliant math guys who developed a neural net strategy game, of course without any measurable success.
It's not easy! But you can always learn something, as long as you truly understand issues of statistical significance, hindsight bias (overtraining), survivorship bias, and the problems relating to singular systems (inputs that cannot explain outputs). Neural networks and high-level AI methods can easily get confused trying to predict aspects of price changes that are inherently unpredictable. System complexity is not the answer. If you cannot generate a reliable forecast for a future variable A, it may help to find some future variables B and C that can be forecast accurately, then find a route from B and C to A. The point is to maximize the statistical significance of all forecast quantities, and to allow the designer to understand what is going on. A lot of AI methods have convergence issues, and thus can never solidly explain what they are learning.
The most important thing I have learned from designing quantitative systems is:
If you don't know how stable your statistics are over time, and over different issues, you are inviting the market to evolve and break your systems. I always start with an over-determined linear system and understand the statistics before I add non-linearities. Otherwise, it's a black-box system and you learn nothing except by brute-force trial and error.
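One simple way to look at stability over time is to fit the same small linear model on consecutive windows and compare the coefficients. The sketch below is only an illustration in plain Python with invented data: the one-predictor model, the non-overlapping windows, and the function names are all assumptions, not a description of any particular system.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x (with intercept).

    With more observations than unknowns, this is an over-determined fit.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def windowed_slopes(xs, ys, window):
    """Fit the same model on consecutive non-overlapping windows."""
    return [
        ols_slope(xs[i:i + window], ys[i:i + window])
        for i in range(0, len(xs) - window + 1, window)
    ]


# Invented data: y = 2x in the first half, y = -x in the second half.
xs = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0, -1.0, -2.0, -3.0, -4.0]

slopes = windowed_slopes(xs, ys, window=4)
full_sample_slope = ols_slope(xs, ys)
```

On this toy data the two window slopes are 2.0 and -1.0, while the pooled fit reports 0.5, a number that describes neither regime. A coefficient that flips like this between windows, or between issues, is the kind of instability worth finding before any non-linear terms are layered on top.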
I'll use whatever third-party tool gives me the most design dexterity, elegance and power. Matlab seems to be best in this regard. As for database issues, those have taken about 2% of my time because the TAQ files are rather simple in structure.
I don't mean to sound argumentative or arrogant. This stuff really fascinates me. I also want to invite criticism and refine my ideas.
Jay