Frequent pattern mining from itemset databases is a widely researched topic, and mature algorithms and tools have been developed. However, some problems remain open. One of the most important is that the extracted pattern set can be much larger than we really want; sometimes it is even larger than the original database. This is called the pattern redundancy problem, and it has become a mainstream direction in this research community: people are working to reduce the size of the pattern set by finding criteria beyond raw frequency to measure the goodness of patterns.
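To see the redundancy problem concretely, here is a toy sketch (a plain brute-force enumeration, not an efficient miner like Apriori) showing that even a five-transaction database yields more frequent itemsets than transactions:

```python
from itertools import combinations

# A toy transaction database: five transactions over items a-e.
transactions = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "c"},
    {"a", "b", "d", "e"},
    {"a", "b", "c", "d", "e"},
]

def frequent_itemsets(db, min_support):
    """Brute-force enumeration of all itemsets with support >= min_support."""
    items = sorted(set().union(*db))
    result = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in db if set(candidate) <= t)
            if support >= min_support:
                result.append((candidate, support))
    return result

patterns = frequent_itemsets(transactions, min_support=2)
print(len(patterns), "frequent itemsets from only", len(transactions), "transactions")
# -> 27 frequent itemsets from only 5 transactions
```

The output set is over five times larger than the database itself, and most of the 27 itemsets are redundant sub-patterns of a few large ones.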

One successful method is the minimum description length (MDL) principle, where the goodness of a pattern is measured by how well it compresses the raw data. Given the data, we proceed roughly as follows: we find a model (a dictionary, or, in the context of pattern mining, a pattern set) that summarizes the original data. To fully express the data we also need an encoding scheme, so that the original data can be losslessly decoded from the model plus the encoding. The size of the model plus the size of the encoding should be much smaller than that of the raw data; in other words, the combination of the model and the encoding is a compact representation of the original data, and we want to minimize this size to obtain the most compact representation. This is the usual MDL workflow, and it has a theoretical foundation: the size of the model plus the encoding is called the description length of the data, and minimizing this length approximates the Kolmogorov complexity of the data, an intrinsic property of the data in information theory.
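The two-part score can be sketched in a few lines. The following is a deliberately crude toy (one cost unit per item and per code word, with a greedy cover), not the refined encoding used by real MDL miners such as Krimp, but it shows how adding a good pattern to the model shrinks the total description length:

```python
# Toy database and two candidate models (pattern sets).
db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c", "d"}, {"d", "e"}]

def description_length(db, patterns):
    """Crude two-part MDL score: model cost + encoding cost.
    Model cost: one unit per item stored in the pattern set.
    Encoding cost: greedily cover each transaction with the largest
    fitting patterns, paying one unit per code word used."""
    model_cost = sum(len(p) for p in patterns)
    encoding_cost = 0
    for t in db:
        remaining = set(t)
        for p in sorted(patterns, key=len, reverse=True):
            if p <= remaining:
                remaining -= p
                encoding_cost += 1
        encoding_cost += len(remaining)  # leftovers paid per item
    return model_cost + encoding_cost

singletons = [frozenset(x) for x in "abcde"]
with_pattern = singletons + [frozenset("abc")]
print(description_length(db, singletons))     # -> 17
print(description_length(db, with_pattern))   # -> 14
```

Even though the second model is larger (it stores the extra pattern {a, b, c}), the encoding it enables is so much shorter that the total description length drops, so MDL prefers it.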

This method has been successfully applied to itemset databases, sequence databases, and even data streams. For sequences, things are more complex than for itemsets, since we must account for sequential structure, allow gaps in patterns, and handle pattern overlap; for example, patterns in a sequence can be interleaved with each other. This problem is usually addressed with statistical methods such as dependency tests. For streaming data, not only the description length but also the computational complexity is a critical issue, since the large volume of data is dynamic and arrives continuously at high speed. Even quadratic complexity is intolerable in this case, and additional constraints, such as processing the data in a single pass, must be addressed.
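To make the gap issue concrete, here is a minimal matcher (an illustrative sketch, not any specific published sequence-mining algorithm) that finds an occurrence of a pattern in a sequence while bounding the number of events skipped between consecutive pattern symbols:

```python
def match_with_gaps(sequence, pattern, max_gap):
    """Find one occurrence of `pattern` as a subsequence of `sequence`,
    allowing at most `max_gap` skipped events between consecutive
    pattern symbols. Returns the matched positions, or None."""
    def search(start, i):
        if i == len(pattern):
            return []
        # The first symbol may start anywhere; later ones are gap-bounded.
        limit = len(sequence) if i == 0 else min(len(sequence), start + max_gap + 1)
        for pos in range(start, limit):
            if sequence[pos] == pattern[i]:
                rest = search(pos + 1, i + 1)
                if rest is not None:
                    return [pos] + rest
        return None
    return search(0, 0)

seq = list("axbyczab")
print(match_with_gaps(seq, list("abc"), max_gap=1))   # -> [0, 2, 4]
print(match_with_gaps(seq, list("abc"), max_gap=0))   # contiguous only -> None
```

Note that the same positions can participate in several overlapping or interleaved matches, which is exactly why deciding which occurrences are "real" usually requires a statistical test rather than matching alone.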

Above is the big picture of recent progress in the pattern mining research area. In fact, minimum description length deserves attention beyond pattern mining. The power of this method lies in the fact that it uses the computer itself as a measuring instrument, so it can discover intrinsic features of the data from the machine's perspective, and it can therefore eliminate many parameters that would otherwise have to be tuned by a human. Say we want to find the regularity in a time series. This can of course be done with human intervention, but that can be unreliable. One research work discovers the intrinsic dimensionality of time series objects by incorporating the MDL idea: essentially, the computer determines the dimension by measuring how much information the object requires to be stored. Another example is change detection, where we usually use the sliding window model to keep track of the most recently occurred events. It is always difficult to set the window size, but with minimum description length, people can get rid of this parameter setting by letting the computer make the decision.
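The change-detection idea can be sketched in the same two-part spirit. The following toy (my own illustration, not the specific published method) scores a candidate change point by the bits needed to encode the two segments separately plus a fixed penalty for the extra model; a split is reported only when it actually pays for itself:

```python
from math import log2

def bits(seq):
    """Entropy-based cost (in bits) of encoding a symbol sequence
    under its empirical symbol distribution."""
    n = len(seq)
    cost = 0.0
    for symbol in set(seq):
        count = seq.count(symbol)
        cost -= count * log2(count / n)
    return cost

def mdl_changepoint(seq, model_cost=8.0):
    """Pick the split point minimizing total description length:
    encoding cost of the two segments plus a fixed penalty for the
    second model. Returns None if no split beats the single model."""
    best, best_dl = None, bits(seq)
    for cut in range(1, len(seq)):
        dl = bits(seq[:cut]) + bits(seq[cut:]) + model_cost
        if dl < best_dl:
            best, best_dl = cut, dl
    return best

stream = [0] * 20 + [1] * 20   # distribution shifts at position 20
print(mdl_changepoint(stream))  # -> 20
```

No window size or threshold is tuned by hand here: the data either compresses better as two regimes or it does not, and that comparison is the entire decision rule.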
