Let's suppose we mine price patterns from a sample of price data.
We end up with a set of patterns that conforms to our preset requirements.
These requirements could be:
1) support - how many times the pattern showed up in our data sample: s(A)=50 would mean that pattern A showed up 50 times.
2) confidence - % hit rate: the percentage of the pattern's occurrences in which it predicted the target correctly, i.e. A->B (pattern A led to target B), so it's basically s(A,B)/s(A). For example, if A showed up 50 times and was followed by B 40 times, confidence is 40/50 = 80%.
3) interest - how much better than random guessing the confidence has to be before the pattern is of interest to you. I'll try to explain it with a simple example. Suppose we're analysing a data sample containing 1000 elements, and we mark 400 of them as our targets.
Now, if you tried simple random prediction (guessing), you would expect an accuracy of 400/1000 = 40%.
Let's say we mine patterns and get three that conform to our minimum support requirement: pattern A has a confidence of 48%, pattern B 65% and pattern C 32%.
How do you know if these patterns are significant? They should be better than random by some preset threshold. Random guess accuracy is 40%, so pattern A has an advantage of 48%-40%=8%, pattern B has an advantage of 65%-40%=25% and pattern C has a negative advantage of 32%-40%=-8%, so we automatically reject pattern C.
If we had preset the interest threshold to 10%, pattern A is rejected (8%<10%) and pattern B is accepted (25%>10%).
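To make the three requirements concrete, here is a rough Python sketch of the filter described above. The `Pattern` class, the `accept` helper and the support counts (50, 100, 50) are my own assumptions for illustration; only the confidences, the 40% random-guess rate and the 10% interest threshold come from the example.

```python
from dataclasses import dataclass


@dataclass
class Pattern:
    name: str
    support: int  # s(A): how many times the pattern showed up
    hits: int     # s(A,B): how many of those times the target followed

    @property
    def confidence(self) -> float:
        # confidence = s(A,B) / s(A)
        return self.hits / self.support


def interest(pattern: Pattern, base_rate: float) -> float:
    # Advantage of the pattern over random guessing.
    return pattern.confidence - base_rate


def accept(patterns, base_rate, min_support, min_interest):
    # Keep only patterns that pass both the support and the interest threshold.
    return [
        p.name
        for p in patterns
        if p.support >= min_support and interest(p, base_rate) >= min_interest
    ]


# 400 targets in 1000 elements, so random guessing is right 40% of the time.
base_rate = 400 / 1000

# Support counts are hypothetical; they are chosen to give 48%, 65% and 32%.
patterns = [
    Pattern("A", support=50, hits=24),
    Pattern("B", support=100, hits=65),
    Pattern("C", support=50, hits=16),
]

accepted = accept(patterns, base_rate, min_support=50, min_interest=0.10)
print(accepted)  # ['B']
```

Pattern C drops out because its interest is negative, and A because its advantage is below the threshold.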
The problem I see here is that validating patterns this way is not enough, because data mining for patterns produces a large amount of garbage: patterns that meet the requirements in the mined sample purely by chance. So the subject I would like to discuss is PATTERN VALIDATION.
One known technique is out-of-sample testing: test the patterns on data that was not used for mining and see if they still conform to our preset requirements.
Even here it's still unclear how much data there should be in the training sample (where we mine) and in the testing (out-of-sample) sample. What ratio? In our example we used 1000 data elements to mine patterns, but we could have 3000 data elements in total, so the out-of-sample data set would hold 2000 elements and the training:testing sample size ratio would be 1:2. It's clear that the smaller this ratio, the better, since more data is left to test on, but on the other hand the training sample has to be big enough that you can actually mine something meaningful out of it. So what's the optimal ratio? And of course, the training sample should be wide enough to cover different market conditions (uptrend, downtrend, ranging, low volatility, high volatility etc.).
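Here is a minimal sketch of the out-of-sample check with the 1000/2000 split from the example. Everything in it is an assumption made for illustration: the data is synthetic noise from a seeded generator (the pattern fires about 5% of the time, independently of a 40% target), each observation is a hypothetical `(pattern_fired, is_target)` pair, and the function names and the thresholds 30 and 10% are mine.

```python
import random


def chronological_split(observations, train_size):
    # Time-ordered split: mine on the first train_size elements, hold out the rest.
    return observations[:train_size], observations[train_size:]


def evaluate(observations):
    # Return (support, confidence, interest) for (pattern_fired, is_target) pairs.
    if not observations:
        raise ValueError("empty sample")
    support = sum(1 for fired, _ in observations if fired)
    hits = sum(1 for fired, target in observations if fired and target)
    base_rate = sum(1 for _, target in observations if target) / len(observations)
    confidence = hits / support if support else 0.0
    return support, confidence, confidence - base_rate


def meets_requirements(observations, min_support, min_interest):
    # Same thresholds are applied in-sample and out-of-sample.
    support, _, interest = evaluate(observations)
    return support >= min_support and interest >= min_interest


# Synthetic, purely random data: the pattern has no real predictive power.
rng = random.Random(0)
observations = [(rng.random() < 0.05, rng.random() < 0.40) for _ in range(3000)]

train, test = chronological_split(observations, train_size=1000)

for label, sample in (("in-sample", train), ("out-of-sample", test)):
    support, confidence, interest = evaluate(sample)
    passed = meets_requirements(sample, min_support=30, min_interest=0.10)
    print(f"{label}: support={support} confidence={confidence:.1%} "
          f"interest={interest:+.1%} passes={passed}")
```

The split is by position rather than shuffled, on the assumption that the elements are in time order, so the test sample is always later than the data the patterns were mined from.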
This one technique is widely known and used. But are there any other pattern validation techniques out there? Anyone experienced in statistical data analysis and/or data mining care to share their knowledge?
