SQL for trading...?

How do you feel about SQL?

  • Good tool for trading.

    Votes: 27 47.4%
  • Awful tool for trading.

    Votes: 11 19.3%
  • Never tried it / no opinion.

    Votes: 18 31.6%
  • Heard so many bad things I'd never want to try it.

    Votes: 1 1.8%

  • Total voters
    57
Hi Bob,
thanks for your response.


...You are saying quite a bit in a short amount of discussion...



It seems that much of what you are describing is actually expressing an opinion about what types of data you think is useful for creating a stock trading strategy.

I believe what we can create as a trading strategy can based only the available data. The list above covers a lot of different angles and gives a good enough picture for creating a trading strategy based on 3-5 days period from buying to selling.


...Other strategy researchers I am sure will have other ideas about what input data they need as input to their backtesting and scans. ...

I believe if we are basing our decisions on ~same data, the trading decisions are a matter of intepretation. It depends of money management issues too. So potentially there can be a trading strategies as many traders there are.

My idea is that everybody creates its own trading strategy. What is needed is a easy way to try and test different strategies. At the end it comes to the input data you have. If you slice and dice enough the different independent sources of data
you can later play with more variables.

The problem with existing tools is you cannot do this efficiently.


Also (in your full original post that I trimmed down quoted above), you discuss a specific analysis strategy, as well as some specific user interface.
While indeed out of topic this is important too. You have to be able to analyse, interpret and visualize all of this data.


All of this is a very wide discussion, where the topic I am interested in discussing is specifically the data handling, computation speed, and strategy backtesting (and scanning) capability part of this. That seems to be the main focus of this thread. It is the part of the thread that interests me.

So, into that topic...


I agree


It seems you are saying that what you do is to generate large quantities of data (the ~1,000 columns), then you query the data to do scans and backtests. Certainly this is an approach, but I think not the only approach to the task.

Yes this is one approach. If you pre generate ~1,000 columns in advance - the permutations you will need later - it is much easier for the next phase - the actual screening and back test.
There is not needed to generate all possible permutation, just that make a sense later. If you have them pre computer performance wise this is the fastest way later to screen and back test. Why? Because you pre generate one time, but screen and back test many times. If you generate your data dynamically every time when you screen and back test then the performance is not the best. At the end it comes down for the need of better database engine specific for these kind of tasks.





Let me say that something which has always helped me with coding the rules of trading strategies. I tell myself, think bar by bar like the decisions are being made in real time actual trading. Sometimes there are decisions to be made at multiple points in the bar. On the open, during the bar, after the close. Write the trading rules code to do processing in sequential order as if you were trading the strategy in real time. After all, the only goal of this is the ability to trade a strategy in real time. Any decision that actually can be made in real time trading can by definition also be made at a point in time in simulation of trading.

This is very true. I would add that when I have all data sliced and diced with our software I can go back in time and inspect bar by bar all factors I pre calculated in advance for every trading day as they occurred.


Do this one day at a time. Move forward to the next day. I find that thinking in this way helps me to code complex strategies.

Absolutely - this is a very simple and effective way to analyse the history and that is how we are working too.

What I needed too is after analyzing the factors day bay day to try and find similar events using the same factors - e.g. to make a screener with the factors as parameters. What is more I needed a screener working on the all history available. The screeners we analyzed are working only on today's data. That is because the amount of data is too much to store for all the history available. The problem is that I cannot know if the screening criteria are working if I cannot go back in history to see what happened next. This is different than back testing. For back testing you are testing a specific strategy after you formulated it. I consider the screening the phase where you are still not committed to specific strategy but are still researching for formulating a better strategy. With back testing you can test it and they have to help each other and to easily switch between screening and back testing.




Write code to process only the specific data needed for a specific backtest or scan (or write code to process lots of various data with user interface options enabling processing of a subset of data for a specific backtest, similar to your "cabd", 5% Gain 5 Days after D-Day" options). Write code to move through the data forward from first date to last date as if you were actually trading. With this approach, much of your 1,000 columns of data could be dynamically created on the fly, processed, discarded. Then move on to the next bar. This would greatly reduce the quantity of stored data.

I agree, the only difference is that our approach is data driven. I don't want to write a code for each strategy I am testing. It is much more flexible and more productive If the end user just picks for example with check boxes which pre calculated data he will will use and formulate the relationships between these data. All these ~1,000 columns - it is not practical and not efficient to use all of them in specific screener and back testing, but they are lot enough to pick from to test one or other idea. The trick is to choose to compute just enough permutations not all of them. This is made one time, but used many times and only part of this data is used in one particular time. The discussion about SQL was that SQL prevents this way of data to be organized and used effectively.



...enabling processing of a subset of data for a specific backtest, similar to your "cabd", 5% Gain 5 Days after D-Day" options)...

What I am describing is how most backtesting software does it.

I believe screener and back testing working together on much more data as I described them is better idea.



So the alternative approach might be to dynamically calculate only the data desired for the specific backtest or scan bar by bar as I describe above (as well as some of the fundamental data you mention will come from the disk). Move forward, calculate much of this input data on the fly, consume the calculated data, discard calculated data you are finished with, move on to the next bar. This is how I have approached this type of task, and it works very well and with very good performance.

I agree it is a fast and effective approach for small amount of data. If you calculate data on the fly you eliminate the DB data handling in case of pre generated data.

From other side it is slow. That is why we do it using parallel computing SIMD on CPU and CUDA on NVIDIA. The latest NVIDIA cards have a lot of memory and we tested an approach holding all data in memory. The computations are in order of magnitude faster than CPU so that this is a feasible approach. The problems is - not all kind of computations are appropriate for GPU, some of them are more suited for SIMD CPU vectorized computation. But it is not so fast as NVIDIA CUDA based. So we have a hybrid approach - pre calculated part of data using fast parallel SIMD CPU programming, leave most often used on the fly computations on NVIDIA CUDA and that way we have the best if both worlds. Combining this with an effective hybrid column based database we cover the effective data storage optimized for this kind of data too.



What you refer to as "sliding window calculations of 5,10,20,40,80,160,240,360 trading days" is I believe what I would term an indicator of these various lengths. It is no problem to calculate multiple indicator lengths on the fly bar by bar.

It not a problem but when you have more instruments and more history it is not fast enough and you cannot make a screener that way. If you make a screener that way this is basically sequential scanning day by day, instrument by instrument and it it would be slow.
 

This whole thread strikes me as a very detailed explanation based upon extensive experience why SQL is not the right tool for the job of processing portfolios of market data. I have encountered this before talking to others. SQL seems like a great idea at first because much of the low level nuts and bolts work is already done for you. But what you point out in detail is what I have heard before, that performance is a huge problem.

I believe it was good idea to describe our experience because this happens often and it takes a lot of time to understand what actually happens and why. That explains better why we made a decision to create a better solution which we eventually can offer to others too.

Basically this is the reason why currently we don't have good enough understanding what happens in the market - we cannot process the data fast enough to make an educated decision:

1) There is a data explosion. I believe the decimalization of stock market prices is a part of the problem. Now the possible prices are much more than before - possibly 10 times more. This is a huge increase of the data which just cannot be handled and processed fast enough.

2) The OS we mostly use is getting slower and slower, not faster. MS is adding a features and features and features, but it gets more bloated every single month with patches, etc.

3) The microprocessor vendors cannot just increase the clock speed anymore - otherwise the computers will burn in flames. They are talking about water cooling, etc. to solve the power problem. A bright light comes from NVIDIA CUDA and it will solve a lot of specific problems.

It is all related.

That means a new approach is needed to handle all of this.


...For example, how many stocks do you run the strategy against?...

...How many years is the time span of the backtest? How long does that backtest take, and on what kind of hardware is that backtest running?...

We wanted to be able to test at least
the 500 S&P stocks. It turned out the extensive tests with all factors we wanted are not possible to finish in a reasonable time. We are in Orange County, CA. The data vendor send us the processed data at 4:00 PM, they are very good - CSI and have reliable service. Until we download the data it takes say 15 minutes it is as fast as you get it, you cannot get it faster.

Now plus minus other things our processing can begin at 4:30PM. It has to finish at reasonable time in order to have a list of possible instruments for the next day. We need a very fast screener and back testing too.

If the processing and analysis takes a lot of time it simply is not possible to finish at time. You cannot sit till 10:00 PM to do that every day. So it turns out the basic requirements are these - no matter what the processing and analysis are - it has to finish up to 1 hour plus max 2 hours. It turned out with the hardware mentioned - a high speed server Dell PowerEdge 2850 2XCPU XEON 3 GHZ, 4 GB RAM, RAID 2 x 75 GB HDD, which is a reasonable investment, you cannot do a lot with the MS or even Sybase tools - it is too slow. If you calculate all these data ~1,000 columns just for DJIA 50 years history with MS tools it takes 1 hour, Sybase - 5 min.

If you pre calculate, store this history up to last year end and just calculate only Year to day data and merge with pre calculated history, the merging only takes 5-10 min. And that is just for DJIA. So we got rid of all these tools and created our own. Also we decided on hybrid approach to pre calculate a part of it and use NVIDIA CUDA for on the fly computations.

For comparison if you take just one operation - LOAD data store - MS tools (MS SQL , .NET) show exponential dependency from the amount of data you LOAD in the DB, Sybase shows relatively linear dependency and is 8-10 times faster than MS, with THStart (our software) it is linear and 10 times faster than Sybase and you can figure out about how much faster relative to MS.
 
1) There is a data explosion. I believe the decimalization of stock market prices is a part of the problem. Now the possible prices are much more than before - possibly 10 times more. This is a huge increase of the data which just cannot be handled and processed fast enough.

2) The OS we mostly use is getting slower and slower, not faster. MS is adding a features and features and features, but it gets more bloated every single month with patches, etc.

Both these arguments are basically - bullocks. Sorry.

To the first: I dont see a 10 times data explosion by decimalization of stocks. Maybe one should get a grip of how to represent stock prices. NxCore (high end data feed) does that now still in a nice way - every price is a 32 bit integer representing the TICKS of the price. The symbol knows how to transpose this into a "float value". That said, for most maths it is good to keep the calculations in float or double, and that has been the same before, too - math libraries never were designed to properly deal with "1/xth". SO, sorry, decimalization has done nothing.

The OS you are using is not getting slower and slower. In fact, the OS is totally aside for anything CPU heavy. If you properly install it (server core for server not using .net, until R2 is out this month then it can handle .net, or at least non-aero on the server) the CPU and memory usage of the OS has - as share of the machine - decreased tremendously. In fact, it is barely noticable. My cpu allocation on a non-used system normally stays below 1% for the OS, and the memory usage is a joke, too, compared to what I have in the computers these days.

The OS may have a lot of things that are nice, but note most of that is either irrelevant for a duty server, and / or not actually using a lot of resources when not in use. And yes, my bars are pretty high - 1 have nothing against the OS using 512Mb. After all, a heavy duty workstation has between 8g and 16gb of memory to start with ;)

That said, Iwish Server 2008 R2 comes out soon, then the .NET framework is installable on the server core ;) Btw., "slow" is not an issue of the .NET framework, and HPC server would incidentally not help at all (the comment made further up basically says the poster has no clue what HPC is all about - total clueless).

Oh, and processors are getting more and more powerfull. One just has to be smart enough (and that is not always simple) to program an algorythm / platform in a way that allows all those cores to be used. Not always feasible. But if I look at today's 6 core processors and compare them to the dual core I still use in my workstation.... then.... there is a LOT more power.

Blaming crappy programming on "the OS" or "decimalization" is not objective. As is blaming faster and faster processors.

Same to the whole SQL debate - pretty obvious that this was simply the wrong tool for the job. Bad news: using the wrong tool is never satisfactory. SQL Is a generic data store. Generalization has costs.

CUDA / offloading processing to a higher end graphics card (which have interesting amounts of extremely high speed memory these days) may though really be a solution. SIMD is a joke compared to the gains that one can get from actually using the calculation units on a graphics card. I would still be doubtfull of the memory (especially: graphics cards really shine on floats / doubles, not integers), but fact is that a IF the analysis can be offloaded to the graphics card.... there are a TON of procesing units there that can work in parallel.

What I would see as possible negatives of CUDA In this scenario:
* Data amount. I am not sure how much data we talk about, but.... the memory in the graphics cards is pretty limited ;)
* It does not store the data storage problem. That said, this is a separate issue at all (storage can happen in the background).

For most cases I would say CUDA is total overhead and wasted. I do day trading, and my trading platform reall never spikes cpu for calculations. But a screener checking many many many data series.... is really sensible. This may be a lot of parallel data operations happening. Definitely worth a try ;)
 
Quote from NetTecture:

...To the first: I dont see a 10 times data explosion by decimalization of stocks.

It has to do with compression. 10 times more possible numbers - less compression. The more data is similar - better compression.


Maybe one should get a grip of how to represent stock prices. NxCore (high end data feed) does that now still in a nice way - every price is a 32 bit integer representing the TICKS of the price. The symbol knows how to transpose this into a "float value".

32 bit is still a lot.

That said, for most maths it is good to keep the calculations in float or double, and that has been the same before, too - math libraries never were designed to properly deal with "1/xth". SO, sorry, decimalization has done nothing.

Float and double in financial computation is not the best idea.

The OS you are using is not getting slower and slower. In fact, the OS is totally aside for anything CPU heavy.

OS is making things such as read and write from disk too. The disk being the slowest operation this is the most important factor when you are dealing with a lot of data which cannot be handled in the memory. Currently the OS is optimized for some typical tasks, the most effective of which a sequential read and write. which forces for specific data access patters if you want to get the max performance.

...And yes, my bars are pretty high - ...

Mine are very high too. ;)

... .NET framework is installable on the server core ...

This is not the best idea for high performance but we have to live with that.

Oh, and processors are getting more and more powerfull. One just has to be smart enough (and that is not always simple) to program an algorythm / platform in a way that allows all those cores to be used. Not always feasible. But if I look at today's 6 core processors and compare them to the dual core I still use in my workstation.... then.... there is a LOT more power.

I was referring that the processor clock cannot be increased anymore, so they are adding more cores. But it needs a better way to program. The today's compilers are not as good to write a parallel programs to extract the potential of many cores.


Same to the whole SQL debate - pretty obvious that this was simply the wrong tool for the job. Bad news: using the wrong tool is never satisfactory. SQL Is a generic data store. Generalization has costs.
The was the original subject of thread
and we tried to fit in this box ;)
Also it is obvious now, not so obvious in the beginning.

CUDA / offloading processing to a higher end graphics card (which have interesting amounts of extremely high speed memory these days) may though really be a solution.

It gives a good results for us for some specific tasks.

SIMD is a joke compared to the gains that one can get from actually using the calculation units on a graphics card.

Some tasks are still good for SIMD too. Some - better for NVIDIA CUDA, we try to use the best of both worlds.

I would still be doubtfull of the memory (especially: graphics cards really shine on floats / doubles, not integers), but fact is that a IF the analysis can be offloaded to the graphics card.... there are a TON of procesing units there that can work in parallel.

Float and integer calculations takes the same amount of time on NVIDIA CUDA - please do your homework firstly.

What I would see as possible negatives of CUDA In this scenario:
* Data amount. I am not sure how much data we talk about, but.... the memory in the graphics cards is pretty limited ;)

There are a reasonable priced CUDA cards with enough memory. Try FX 1800 - it has 64 processors, ~800MB memory it takes only 59W and can fit in most of computers without needing bigger power requirements. To the other end is Tesla 4 GB, 240 processors. Be careful when you choose the computer to install the card to. Be sure it has at least 2 16xPCIe slots and better be latest model because older PCIe interfaces actually are working at half the possible I/O speed. Theoretically you can get 6-7GB/s data transfer speed between the card and motherboard, in practice you get 4-5GB/s.

But you need a software for a general computing on CUDA otherwise it will be just a GPU for you.

* It does not store the data storage problem. That said, this is a separate issue at all (storage can happen in the background).

For most cases I would say CUDA is total overhead and wasted. I do day trading, and my trading platform reall never spikes cpu for calculations.

But a screener checking many many many data series.... is really sensible. This may be a lot of parallel data operations happening. Definitely worth a try ;)

You don't need this for the trading itself, but for screening and back testing it will help a lot.
 
Quote from NetTecture:

Same to the whole SQL debate - pretty obvious that this was simply the wrong tool for the job. Bad news: using the wrong tool is never satisfactory. SQL Is a generic data store. Generalization has costs.

Not obvious at all...that's why I started the thread. I'm still not convinced that people are fully exploring what SQL has to offer. For instance, there are potential solutions to the "disk access" problem that don't require you to abandon SQL.

**I'm not saying SQL will ever be faster in a head to head race against a completely custom solution. But writing that custom solution has big costs too.

My trading needs are not at the millisecond level and I wouldn't recommend SQL for that job. If most people reading this are trying to scalp ticks or play market maker, then building something like thstart suggests would probably be worth the time.

But for anything in the minute range or beyond, SQL can get it done and quite fast IF you are willing to experiment and get creative with how you're storing and accessing data. For my needs, the difference in speed wouldn't be very noticable and it doesn't justify me going down the C++, GPU road. YMMV

Also, I wasn't originally thinking of backtesting applications since I don't do much backtesting myself (flames welcome:p)...but I'm still interested in reading about it.
 
Quote from CloroxCowboy:

Not obvious at all...that's why I started the thread.

That is right.

I'm still not convinced that people are fully exploring what SQL has to offer. For instance, there are potential solutions to the "disk access" problem that don't require you to abandon SQL.

This is true. By SQL I was referring mostly to the SQL data management implementation mostly, not the SQL as a language.

My trading needs are not at the millisecond level and I wouldn't recommend SQL for that job. If most people reading this are trying to scalp ticks or play market maker, then building something like thstart suggests would probably be worth the time.

Screening and back testing is where my focus was in the entire discussion. There is a lot of amount of data generated and it needs better performance.

But for anything in the minute range or beyond, SQL can get it done and quite fast IF you are willing to experiment and get creative with how you're storing and accessing data. For my needs, the difference in speed wouldn't be very noticable and it doesn't justify me going down the C++, GPU road. YMMV

For moderate data sizes it is fine. If the amount of data fits in memory the free SQLite is better solution - it has in-memory feature. But SQLite needs more low level programming.

Also, I wasn't originally thinking of backtesting applications since I don't do much backtesting myself (flames welcome:p)...but I'm still interested in reading about it.

Screening and back testing are essential for experimenting and testing your trading strategy. The trading itself is relatively simple and boring - if you have a discipline you get the signals according your trading strategy and just execute, or have somebody to execute for you, it does not matter much who executes if you follow a procedure.
 
Quote from thstart:

Screening and back testing are essential for experimenting and testing your trading strategy. The trading itself is relatively simple and boring - if you have a discipline you get the signals according your trading strategy and just execute, or have somebody to execute for you, it does not matter much who executes if you follow a procedure.

I partially agree. Not to get too far off-topic, but my personal feeling is that after a certain point (which varies with every strategy), backtesting quickly devolves into curve-fitting. For most of my strategies, which I intentionally keep very simple, that point is reached pretty early and I haven't had the need for elaborate back tests that SQL can't handle. So I think we may be on the same page...just that our needs and strategies seem to be different.
 
Quote from CloroxCowboy:

I partially agree. Not to get too far off-topic, but my personal feeling is that after a certain point (which varies with every strategy), backtesting quickly devolves into curve-fitting. For most of my strategies, which I intentionally keep very simple, that point is reached pretty early and I haven't had the need for elaborate back tests that SQL can't handle. So I think we may be on the same page...just that our needs and strategies seem to be different.

The simpler is always better, and no curve fitting of course.


...that point is reached pretty early and I haven't had the need for elaborate back tests that SQL can't handle....

You said it ;)

That point is reached pretty early because SQL engines cannot handle more of it. We don't wanted to be limited from SQL or whatever.
 
thstart:

This is an interesting discussion because it seems you and I are in exactly the same line of work. We have both spent years working in the subject matter of software for large scale strategy backtesting.

Quote from thstart:
I believe it was good idea to describe our experience because this happens often and it takes a lot of time to understand what actually happens and why. That explains better why we made a decision to create a better solution which we eventually can offer to others too.

Absolutely! There is great interest value and information value in this thread, at least for me. It is a topic I have not previously heard discussed in so much detail and with so much experience. It is clearly an important topic for anyone interested in large scale strategy backtesting. What we are discussing is software design for large scale backtesting (and scanning), and I imagine there are a lot of people on this board interested in large scale backtesting.

This has bugged me for a long time. I have talked to people using SQL for this kind of processing, and have been told that SQL is slow, but you have really dug into the topic and are discussing your findings in detail. As I said in my last post "This whole thread strikes me as a very detailed explanation based upon extensive experience why SQL is not the right tool for the job of processing portfolios of market data.". You are saying that you agree with this, that (to use your term) Conventional Off the Shelf databases is the wrong tool. That conclusion led you to develop an alternative approach which you are only partly disclosing because of non-disclosure agreement, which is fine.

However, we still have significantly different viewpoints about implementations of alternative approaches. You have approached the task in a completely different way than I did.

Quotes from PowerST / thstart:
PowerST: So the alternative approach might be to dynamically calculate only the data desired for the specific backtest or scan bar by bar as I describe above (as well as some of the fundamental data you mention will come from the disk). Move forward, calculate much of this input data on the fly, consume the calculated data, discard calculated data you are finished with, move on to the next bar. This is how I have approached this type of task, and it works very well and with very good performance.

thstart: I agree it is a fast and effective approach for small amount of data. If you calculate data on the fly you eliminate the DB data handling in case of pre generated data... From other side it is slow. That is why we do it using parallel computing SIMD on CPU and CUDA on NVIDIA....

PowerST: What you refer to as "sliding window calculations of 5,10,20,40,80,160,240,360 trading days" is I believe what I would term an indicator of these various lengths. It is no problem to calculate multiple indicator lengths on the fly bar by bar.

thstart: It not a problem but when you have more instruments and more history it is not fast enough....

The approach to backtesting that I was describing in my last post (with your responses quoted above) is not a theoretical discussion. I have personally developed backtesting software named PowerST based upon this approach that has been in use by real world customers for years, and the performance is excellent (as I will detail below).

Your approach requires a database engine where my approach does not. Remember that was the source of your problems. It was the performance of the database engine that drew you into three years of work how to solve performance problems, not to mention that your approach still requires expensive hardware (high speed server Dell PowerEdge 2850 2XCPU XEON 3 GHZ, 4 GB RAM, RAID 2 x 75 GB HDD, SIMD, NVIDIA CUDA GPU parallel computing, up to 240 processors, SSDs).

On the other hand, there are real world users of the software that I developed getting excellent performance with this kind of testing on low cost mainstream computers. Here is a recent quote from a user:

I would safely say that the calculation engine is probably the most powerful and debugged engine out there in terms of capabilities, efficiency and speed. We test more than 7000 markets simultaneously from a 22,000 database of symbols including defunct and delisted stocks. This takes about 3 minutes.

Also, there is another real world PowerST user that runs the software against the entire universe of US stocks past and present every trading day. I was mentioning to him a while back that I has some ideas for performance improvements and he said don't bother to do this for him because performance is not a concern for him. Originally, a number of years ago, he was running on a laptop with 1 GB of memory. More recently for reliability and availability they moved to a shared virtual Windows server in a data center, which I understand costs $100/month. He said that after moving to the server that his daily running of PowerST is now... I forget the exact number but it was in the range of 5 minutes to 15 minutes. His point was that the data updates takes much longer then running the backtesting software, and that the running of the backtesting software is fast enough that any potential performance improvement isn't worth much effort.

I need to break this into multiple posts due to post text length limits. The next post continues...
 
Continuing my previous post:

However, you talk about "sliding window calculations of 5,10,20,40,80,160,240,360 trading days", which I term as indicator calculations of these various lengths. The users I refer to above are not running against hundreds or 1,000 columns of input data that you talk about. I think it would be extremely unlikely for any one strategy to be backtested or a scan to use this many input variables, and you have indicated agreement with this saying your goal is to have 1,000 columns of input data available but that a strategy will use some subset. You gave an example "cabd, 5% Gain 5 Days after D-Day" options that I quoted in my last post.

However, even though I think it is extreme, to stress the performance capability I put together a test strategy that does a large quantity of calculations. I took the moving average crossover system that you used as an example in a previous post and added additional moving average calculations of length 5 to 360 step by 1 to be similar to your "sliding window calculations of 5,10,20,40,80,160,240,360 trading days". These moving averages are calculated each day. So this is 5 to 360 inclusive plus the original two moving average for the crossover system = (360-5)+1+2 = 358 simple moving average indicator calculations per day. In your terminology this is 358 columns of sliding window calculations.

I ran this on an 1800 stocks subset of Russell 5000 stocks, with date range from November, 1994 through December, 2007 (don't ask me why! it is the data that I happen to have available). So this is 13 years of EOD stock data. I ran this on my day to day Windows Vista 32-bit laptop dual-core 3 GB memory that I paid $479.99 for in November, 2008 (I am making the point this is a very low end consumer level laptop).

Calculation time for 358 simple moving average sliding window calculations per stock each day, 1,800 stocks, 13 years of data: It takes around 2 minutes.

As I say, the performance is excellent. And this is on a cheap consumer level laptop. I understand that a Core 2 Duo processor and/or 64 bit with a little more memory will improve this performance substantially (I believe that is the environment the above quote about "7000 markets from a 22,000 database of symbols takes about 3 minutes" came from).

With this kind of performance there is no need for a database and all of this huge performance problems that has caused you.

However, it is true that a simple moving average is not a calculation intensive task. What about a more calculation intensive task? You mentioned the approach of pre-calculating to save time in later backtesting, scans, and trade signal generation.

It isn't hard to pre-calculate calculation intensive indicators and put the results into a flat file (or flat files). I understand that one PowerST user is doing this. I don't know exactly why because their strategy is proprietary and I am not involved in the programming, and they don't give me details. My assumption is that they must have some kind of highly calculation intensive indicators. I assume the point is that certain parts of their strategy is already fixed so there is no need to calculate that part of the strategy with each backtest. So instead they decided to calculate those input variables once and save the results, then read those results from disk as input to subsequent backtests. This is similar to what you describe that you are doing, but it can be done with flat files which are then very fast to read back into subsequent backtests.

To explain some more, PowerST has facilities to calculate indicators and log the data items to file(s). It would only be a few lines of code to do that. Then there are also facilities to help read the input data and make that data part of subsequent backtests. It would be a very small amount of code for the generation and later consumption of the pre-calculated data because the backtesting software is already suited to this kind of task of processing time series data. It doesn't need a database. Rather, you can run the same portfolio one time to pre-calculate into flat files, then backtest reading the pre-calculated input from the files. The trick to this is to order the data in the flat file in the order that it will be consumed in the subsequent backtests, and the backtesting software has capability to do that. Once the pre-calculated data is then in a flat file there is nothing faster than reading a flat file intermixed with the backtesting calculation processing.

However, the customer I mention above is the only PowerST user I have heard of that has bothered to pre-calculate indicators. The reason is because unless there is some extreme amount of number crunching, calculating everything on the fly (the approach I described in my first post on this topic) is plenty fast enough. Again going back to my benchmark test of 1800 stocks, 13 years, 358 various length indicator calculations per day for each stock, on a consumer level laptop takes only around 2 minutes.

Quote from thstart:
If you generate your data dynamically every time when you screen and back test then the performance is not the best. At the end it comes down for the need of better database engine specific for these kind of tasks.

Quote from thstart:
It not a problem but when you have more instruments and more history it is not fast enough....

I believe that I have demonstrated above that I am actually getting much better performance calculating on the fly than you are getting with your database approach. Besides, if necessary for performance reasons, it is easy enough to pre-calculate to flat files then consume the pre-calculate input data in subsequent backtests.

- Bob Bolotin
 
Back
Top