Data mining challenge

ronblack · Dec 31, 2010

Quote from Indrionas:

total: 58'049'982 patterns that have 1 to 5 attributes, that can have values both "true" and "false".

Why just 1 to 5 attributes? Did you pull that out of nowhere or it is related to the problem in a way you only knew? If you know in advance how many attributes are significant in a mining problem yes you can reduce the search space but in reality you don't know.

Indrionas · Dec 31, 2010

Quote from ronblack:

Why just 1 to 5 attributes? Did you pull that out of nowhere or it is related to the problem in a way you only knew? If you know in advance how many attributes are significant in a mining problem yes you can reduce the search space but in reality you don't know.

Yes, that's what I am talking about. In this example problem I know in advance that there are 5 attributes max.

The problem is what to do when it is not known in advance.
One approach could be to use this "max number of attributes" as a training parameter. Then run training gradually increasing this parameter until you either reach satisfactory result or computation becomes too expensive.

And even if that's the case, there's usually not that many patterns in the data as the theoretical size of search space. It all depends on the density of data.
For example, you don't have to explore patterns with attributes (A=1 AND B=1 AND C=1 AND X=x) if (A=1 AND B=1 AND C=1) occurs only a few times and adding complexity will reduce occurrence below the desired level of significance.

HowardCohodas · Dec 31, 2010

Quote from Indrionas:

I believe there are a few people here who do this or similar kind of work in their process of model building.

I've been toying around (actually doing quite serious work) with neural networks, testing how powerful they are in feature selection and modelling this kind of data. So far I couldn't find a single method/training technique to crack this problem (with NNs).

Several years back (7, I think) I was working on an AI project where I built a genetic algorithms tool kit. I know this will sound like the maxim, "If the tool you know is a hammer, every problem looks like a nail." From my recollection, this would be a good means of attach.

By now there are lots of these kits around. I found some free ones on SourceForge.net. Check them out.

I wish I had the time to pursue your challenge.

Indrionas · Jan 7, 2011

Quote from HowardCohodas:

Several years back (7, I think) I was working on an AI project where I built a genetic algorithms tool kit. I know this will sound like the maxim, "If the tool you know is a hammer, every problem looks like a nail." From my recollection, this would be a good means of attach.

By now there are lots of these kits around. I found some free ones on SourceForge.net. Check them out.

I wish I had the time to pursue your challenge.

Evolutionary algorithms are not suitable for this kind of problem. They rely on a fitness function. In order for an evolutionary algorithm (such as GA) to work productively, the fitness function must be monotonic. This is not the case with searching for exact patterns. A simple example:

Pattern ABCDE is significant. Meanwhile patterns ABCD, BCDE, DEF, ABDE, BDEFG are insignificant, therefore their fitness function values will be low and they won't be picked as parents to produce a new generation of patterns, even though they would have been good candidates.

intradaybill · Jan 7, 2011

Quote from Indrionas:

Evolutionary algorithms are not suitable for this kind of problem. They rely on a fitness function. In order for an evolutionary algorithm (such as GA) to work productively, the fitness function must be monotonic. This is not the case with searching for exact patterns.

Good points about evolutionary algorithms. What do you mean by exact patterns?

HowardCohodas · Jan 7, 2011

Quote from Indrionas:

Evolutionary algorithms are not suitable for this kind of problem. They rely on a fitness function. In order for an evolutionary algorithm (such as GA) to work productively, the fitness function must be monotonic. This is not the case with searching for exact patterns. A simple example:

Pattern ABCDE is significant. Meanwhile patterns ABCD, BCDE, DEF, ABDE, BDEFG are insignificant, therefore their fitness function values will be low and they won't be picked as parents to produce a new generation of patterns, even though they would have been good candidates.

It's been a while since I was intimate with this field so my memory could be faulty. However, I don't believe the fitness function had to be monotonic with proper gene pool management.

GA's lack of fitness for pattern recognition is beyond my recollection.

rdg · Jan 7, 2011

Quote from Indrionas:

Evolutionary algorithms are not suitable for this kind of problem. They rely on a fitness function. In order for an evolutionary algorithm (such as GA) to work productively, the fitness function must be monotonic. This is not the case with searching for exact patterns. A simple example:

Pattern ABCDE is significant. Meanwhile patterns ABCD, BCDE, DEF, ABDE, BDEFG are insignificant, therefore their fitness function values will be low and they won't be picked as parents to produce a new generation of patterns, even though they would have been good candidates.

What makes you say this? I can only see that being the case when your patterns are constructed with nonlinearities (ie, xor).

Hugin · Jan 10, 2011

Quote from Indrionas:

Evolutionary algorithms are not suitable for this kind of problem. They rely on a fitness function. In order for an evolutionary algorithm (such as GA) to work productively, the fitness function must be monotonic. This is not the case with searching for exact patterns. A simple example:

Pattern ABCDE is significant. Meanwhile patterns ABCD, BCDE, DEF, ABDE, BDEFG are insignificant, therefore their fitness function values will be low and they won't be picked as parents to produce a new generation of patterns, even though they would have been good candidates.

The problem is not constrained to evolutionary algorithms (and generally there is no requirement on the fitness function being a monotone function).

I think all optimization/classification algorithms require some structural information to guide the search. Traditional algorithms often use the gradient (e.g. in back-propagation training of neural networks).

Very simplified, genetic algorithms relies on some similarity (using some measure) between strings corresponding to good solutions. All that is required is a fitness function that can tell whether a solution is better than another. But, and I guess this is what you refer to, in order for the search to be efficient the fitness functions must be able to order the different solutions. In a âsparseâ solution space, e.g. where only a few solutions have a non zero fitness values, the search can degenerate to a (slow) random search.

You could argue that if there is no structure present then random search is the best method.

What method to use depends on how you know about the underlying structure and in academic examples, like this one, how much of this knowledge you are allowed to use when creating the search algorithm. I can definitely see GA/GP being used to solve this kind of problem provided that there are enough examples to explain the underlying generating algorithm. Whether this is the most efficient method is different question (probably not).

goodgoing · Jan 10, 2011

Quote from HowardCohodas:

It's been a while since I was intimate with this field so my memory could be faulty. However, I don't believe the fitness function had to be monotonic with proper gene pool management.

GA's lack of fitness for pattern recognition is beyond my recollection.

Why don't you read his post carefully before pulling the trigger? He said for the GA's to work productively the FF must be monotic. It doesn't have to be.

HowardCohodas · Jan 10, 2011

Quote from goodgoing:

Why don't you read his post carefully before pulling the trigger? He said for the GA's to work productively the FF must be monotic. It doesn't have to be.

One of us needs remedial English classes.

I agree that it doesn't have to be.