Data mining challenge

Indrionas · Dec 18, 2010

I believe there are a few people here who do this or a similar kind of work as part of their model building.

Here is the deal:

The data is synthetic, meaning I generated it and I know the rules (the real model) that produced it.

The values are separated by semicolons.
There are 3000 rows - 3000 data points.
Each column is one variable. The first column is the target variable; the other 300 are input variables.

All variables are binary. For the target it's either 1 (true) or 0 (false). For the inputs it's 1 (true) or -1 (false).
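For anyone who wants to read the file programmatically, here is a minimal sketch of a parser for the format described above. The function name `load_rows` and the tiny sample string are made up for illustration (the real file has 300 inputs per row), and the sketch assumes there is no header line.

```python
import csv
import io

def load_rows(text):
    """Parse semicolon-separated rows into (target, inputs) pairs."""
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter=";"):
        if not record:
            continue  # skip blank lines
        values = [int(v) for v in record]
        target, inputs = values[0], values[1:]
        # Validate against the encoding described in the post.
        if target not in (0, 1):
            raise ValueError(f"unexpected target value: {target}")
        if any(v not in (-1, 1) for v in inputs):
            raise ValueError("inputs must be 1 or -1")
        rows.append((target, inputs))
    return rows

# Hypothetical sample with only 3 inputs per row.
sample = "1;1;-1;1\n0;-1;-1;1\n"
rows = load_rows(sample)
```

To use it on the actual file, read the file contents into a string and pass that string to `load_rows`.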

The challenge is to find a set of patterns that generated this data set.
A pattern can be in the form of:
if (variable #35 is true) AND (variable #184 is false) then target is "true".
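A pattern of this form can be checked against the data with a few lines of code. This is only a sketch: the dict representation of a rule, the names `rule_matches` and `rule_stats`, and the assumption that variable numbers are 1-based over the input columns are all mine, not part of the challenge statement.

```python
def rule_matches(inputs, rule):
    """rule maps a 1-based input number to the truth value it must have."""
    return all((inputs[n - 1] == 1) == want for n, want in rule.items())

def rule_stats(rows, rule):
    """Return (rows covered by the rule, covered rows whose target is 1)."""
    covered = [target for target, inputs in rows if rule_matches(inputs, rule)]
    return len(covered), sum(covered)

# Made-up rows in (target, inputs) form, 3 inputs each.
rows = [
    (1, [1, -1, 1]),
    (0, [-1, -1, 1]),
    (0, [1, -1, -1]),
    (1, [1, 1, 1]),
]

# "if (variable #1 is true) AND (variable #2 is false) then target is true"
rule = {1: True, 2: False}
covered, hits = rule_stats(rows, rule)
```

A candidate rule is interesting when the share of hits among the covered rows is well above the overall share of true targets.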

I am not disclosing the complexity of patterns.

There are 300 data points with target = true, and 2700 data points with target = false.
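One consequence of these counts is that plain accuracy is a weak yardstick here. The arithmetic below, using only the numbers stated above, shows what a model that always answers "false" would score.

```python
n_true, n_false = 300, 2700
n_total = n_true + n_false

# Share of rows with target = true.
base_rate = n_true / n_total

# A model that always predicts "false" gets every false row right
# and every true row wrong.
baseline_accuracy = n_false / n_total
baseline_recall = 0 / n_true  # it never finds a true target

print(f"base rate: {base_rate:.2f}")
print(f"always-false accuracy: {baseline_accuracy:.2f}")
print(f"always-false recall: {baseline_recall:.2f}")
```

So any result should probably be judged by how many true targets it finds and how often it is right when it says "true", not by accuracy alone.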

A few notes:
Not all targets are predictable. Some of them were added as random noise.

Not all input variables are relevant. In fact, most of them are irrelevant.
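A simple first pass at finding the relevant inputs is to compare, for each input and each polarity, how often the target is true against the overall base rate. This is a generic screening sketch, not the method used to generate or solve the data, and the name `single_variable_lift` is made up. It looks at one variable at a time, so it can miss inputs that only matter in combination with others.

```python
def single_variable_lift(rows):
    """Rank (input, value) pairs by P(target=1 | input=value) minus the base rate."""
    base = sum(target for target, _ in rows) / len(rows)
    n_inputs = len(rows[0][1])
    results = []
    for j in range(n_inputs):
        for value in (1, -1):
            targets = [t for t, x in rows if x[j] == value]
            if not targets:
                continue  # this polarity never occurs for input j
            rate = sum(targets) / len(targets)
            # Store the 1-based input number to match the "#35" style.
            results.append((rate - base, j + 1, value, len(targets)))
    results.sort(reverse=True)
    return results

# Made-up rows in (target, inputs) form: input 1 tracks the target, input 2 does not.
rows = [
    (1, [1, 1]),
    (1, [1, -1]),
    (0, [-1, 1]),
    (0, [-1, -1]),
]
ranking = single_variable_lift(rows)
```

With 300 inputs and only 300 true targets, some irrelevant inputs will show a nonzero lift by chance, so the top of the ranking is a list of candidates and nothing more.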

This data is in no way related to any real financial time series (it's synthetic).

I've been toying around (actually doing quite serious work) with neural networks, testing how powerful they are in feature selection and modelling this kind of data. So far I couldn't find a single method/training technique to crack this problem (with NNs).

If you manage to crack this problem, please feel free to post your results. I will reveal the real model that generated the data so we can compare it with your results. Also, if you don't want to disclose your technique/method/algorithm, don't. I understand this can be proprietary (and very expensive) information. My first priority is to find out whether it's possible to crack the problem. The method itself is secondary.

Indrionas · Dec 18, 2010

The data.

jack hershey · Dec 18, 2010

Cracking this correctly stated problem is possible, and when completed, the solution declares that there is neither noise nor anomalies in the correct market model.

For convenience you may wish to refine the binary variable description or the target description. Presently you cannot write one in terms of the other. This means you are violating paradigm theory, and as a consequence you will remain stymied for the duration of your effort.

The text beginning with and ending before "I've been toying with..." is a dead end.

Obviously there are many ways to handle the opportunity of having a problem with which to work. You have that.

Converting the problem to a mathematical one would be a good first step. Above I suggested your data statement needs to be tweaked to satisfy paradigm theory. (Read Keynes.)

Most binary math follows Boole. Yours can't, as it is.

If you like history, you can drop back in history and observe a solution to THE problem that was qualified three ways (Boole, Keynes, and Carnap). I like your problem since it is a corollary to the general solution that was achieved historically.

THE problem turns out to have a solution that goes beyond establishing that there is no noise and there are no anomalies. Going beyond means having high utility for optimizing by effectiveness and utility.

You are stymied by your process. The reason is that you have drawn a conclusion, apparently, from what you say.

I am not judging you, but when your statements are compared to accomplished historical work, you diverge onto a branch that bears no fruit. If a person were doing problem solving and they went astray by drawing a conclusion different from an established solution, then they could consider changing course to deal with a correctly stated problem.

I looked at the data; most anyone can tell you that you have errors simply based on your data description.

I weigh the process of getting the solution as more important than the solution. I have both.

Their value, you have undervalued greatly.

The neatest thing you have proposed is that the binary values of the independent variables are vectors. Unfortunately you cannot create a paradigm that contains the target as stated, because of the peculiar shift from one kind of data to another between the independent and dependent variables. Too bad.

If you put your data in a mathematical order you will find it is incomplete. How it did not get finished and expressed completely is a failure of your acquisition process. It was inductive in lieu of being deductive.

THE solution (historical) was deductively based and achieved a paradigm status (not formally stated, however), but it lacked completing the full process of deduction to arrive at a full set of corollaries which establish a full hierarchical solution.

Somewhere along the way if you emerge from induction you will see a hierarchy of functions. To include the degrees of freedom you introduced, you may want to consider if you went afield from the data source for the problem. It looks like you only have one level of data and not a hierarchy of data.

Most often hierarchies of information are generated to gain degrees of freedom which contribute more information to the database.

intradaybill · Dec 18, 2010

P = NP?

goodgoing · Dec 18, 2010

If you want people to take you seriously you have to describe the problem in detail and provide an excel file they can work with. The way you described it is not intelligible. It makes no sense. You are talking about random entries without explaining why they are there in the first place, for example.

jack hershey · Dec 18, 2010

Quote from goodgoing:

If you want people to take you seriously you have to describe the problem in detail and provide an excel file they can work with. The way you described it is not intelligible. It makes no sense. You are talking about random entries without explaining why they are there in the first place, for example.

lol...

Indrionas · Dec 19, 2010

Quote from goodgoing:

If you want people to take you seriously you have to describe the problem in detail and provide an excel file they can work with. The way you described it is not intelligible. It makes no sense. You are talking about random entries without explaining why they are there in the first place, for example.

The data is in semicolon separated format. You can easily rename the file to .csv and open it with Excel if that's what you really want. But the data is for machine learning, there's no point in looking at it through Excel.

What particular part of the explanation did you find hard to understand? It's an inverse problem for machine learning. There are input (independent) variables and one target (dependent) variable.

Then again, if you really don't get it, then it's most likely not for you.

goodgoing · Dec 19, 2010

Quote from Indrionas:

Then again, if you really don't get it, then it's most likely not for you.

Yep, it is not for me. You said it. Listen you imbecile, I asked you a specific question and you came back attacking me. I asked you what the purpose of the random entries is:

(1) Is there a random process in the pattern that generated them?

or

(2) You put them there yourself to make the problem harder for people?

Excel files, you imbecile, help people to see the data in an orderly way. FYI, you can output the data in .txt format from Excel.

Now, I am not going to work on your homework problem. I know which class you are taking you imbecile. You are soooooooooooooooooo stupid you cannot solve a trivial problem.

"Introduction to Data Mining"

kut2k2 · Dec 19, 2010

http://www.jurikres.com/catalog/ms_ddr.htm#top

http://www.jurikres.com/faq/faq_ddr.htm#top

goodgoing · Dec 19, 2010

Quote from kut2k2:

http://www.jurikres.com/catalog/ms_ddr.htm#top

http://www.jurikres.com/faq/faq_ddr.htm#top

From the naive website you linked to:

"However, Chande and Kroll in "The New Technical Trader" (page 9) have shown that common indicators like momentum, RSI, stochastics, ADX, etc. have correlations with each other ranging from 0.77 to 0.93 (r-squared values). Consequently, these forecast models are being fed an empty diet of numbers mostly saying the same thing. "

Can I call the authors (burning) Candle and Troll?

Why do people who do not understand TA write books? For one, what is important when using indicators is not so much their values and their correlations but the patterns of divergences. These do not correlate that highly; actually, they may not correlate at all.

Then, the main objective of DDR is to de-correlate the inputs. Why? If I feed the same indicator data to a model twice, does the result get worse although the data are 100% correlated? Obviously not, and this, along with the above comment about divergences, shows that these tools actually remove valuable information along with redundant information.
