Archive for the ‘Algorithms’ Category

We know that frequent pattern mining from itemset databases is a widely researched topic, and good algorithms and tools have been developed. However, some problems remain unsolved. One of the most important is that the extracted pattern set can be much larger than we really want; sometimes it can even be larger than the original database. This is called the pattern redundancy issue, and it has become a mainstream direction of research on this topic. People are working to reduce the size of the pattern set, looking for criteria other than raw frequency to measure the goodness of patterns.

One successful method is the minimum description length (MDL) principle, where the goodness of a pattern set is measured by how well it compresses the raw data. Given the data, we usually proceed as follows: we find a model (a dictionary, or what you could call a pattern set in the context of pattern mining) that summarizes the original data. To fully express the data, we also need an encoding scheme, so that the original data can be losslessly decoded from the model plus the encoding. The size of the model plus the size of the encoding is much smaller than that of the raw data; in other words, the combination of model and encoding is a compact representation of the original data. We minimize this size to obtain the most compact representation, and this is what we usually do with minimum description length. There is a theoretical foundation for this method: the size of the model plus the encoding is called the description length of the data, and minimizing this length approximates the Kolmogorov complexity of the data, which is an intrinsic feature of the data in information theory.
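The model-plus-encoding accounting described above can be sketched on a toy itemset database. This is a rough illustration under simplifying assumptions, not a real MDL encoder such as Krimp: symbol counts stand in for actual code lengths, and the greedy largest-pattern-first cover is my own choice.

```python
def description_length(database, patterns):
    """Toy MDL score: cost of the pattern dictionary (model) plus cost of
    covering each transaction with patterns, falling back to single items."""
    model_cost = sum(len(p) for p in patterns)       # items stored in the dictionary
    data_cost = 0
    for transaction in database:
        remaining = set(transaction)
        # greedily cover with the largest applicable pattern first
        for p in sorted(patterns, key=len, reverse=True):
            if set(p) <= remaining:
                remaining -= set(p)
                data_cost += 1                       # one code word per pattern use
        data_cost += len(remaining)                  # one code word per leftover item
    return model_cost + data_cost

db = [{"a", "b", "c"}, {"a", "b", "c", "d"}, {"a", "b"}]
baseline = description_length(db, [])                # items only: 3 + 4 + 2 = 9
with_pattern = description_length(db, [("a", "b")])  # model 2 + data 6 = 8
print(baseline, with_pattern)                        # → 9 8
```

Even in this tiny example, storing the frequent pattern {a, b} in the dictionary shortens the total description, which is exactly the signal MDL-based pattern miners optimize.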

This method has been successfully applied to itemset databases, sequence databases, and even data streams. For sequences, things are more complex than for itemsets, since we have to consider sequential structure, gaps within patterns, and pattern overlap; for example, patterns in a sequence can be interleaved with each other. This problem is usually solved with statistical methods such as dependency tests. For streaming data, not only the description length but also the computational complexity is a critical issue, since the data are dynamic and arrive continuously at high speed. Even quadratic complexity is not tolerable in this case, and other constraints, such as making only a single pass over the data, must be addressed.

Above is the big picture of the most recent progress in the pattern mining research area. In fact, minimum description length deserves more attention than just its use in pattern mining. The power of this method lies in the fact that it takes the computer itself as a measuring tool, so it can discover intrinsic features of the data from the machine's perspective, and it can therefore eliminate many parameters that would otherwise be tuned by a human. Say we want to find the regularity in a time series. This can of course be done by human inspection, but that can be unreliable. There is research work that discovers the dimensionality of time series objects by incorporating the idea of minimum description length: essentially, the computer determines the dimension by measuring how much information must be stored. Another example is change detection, where we usually use the sliding window model to keep track of the most recently occurred events, but it is always difficult to set the window size. With minimum description length, people can get rid of the parameter setting by letting the computer make the decision.



Graph problem list

1. Single-source shortest paths (directed (simpler for undirected), weighted)
NB. Single-pair shortest path is as hard as SSSP.
a. Dijkstra’s algorithm (O(|V|·log|V|+|E|) with a Fibonacci heap, O((|V|+|E|)·log|V|) with a common binary heap); works only for graphs with non-negative edge weights.
b. Bellman-Ford (O(|V|·|E|)); can handle graphs with negative-weight edges, and negative cycles reachable from the source are detected and reported.
Special case: DAG, O(|V|+|E|) via topological order; can handle negative-weight edges.
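The two general-purpose SSSP algorithms above can be sketched compactly. The adjacency-dict and edge-list representations are my own choices for illustration; Dijkstra uses the binary-heap variant, and Bellman-Ford raises on a reachable negative cycle:

```python
import heapq

def dijkstra(adj, source):
    """adj: {u: [(v, w), ...]} with non-negative w. Returns distances from
    source to reachable vertices. O((|V|+|E|)·log|V|) with a binary heap."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                       # stale heap entry, skip
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def bellman_ford(vertices, edges, source):
    """edges: [(u, v, w)]; handles negative weights. O(|V|·|E|).
    Raises if a negative cycle is reachable from the source."""
    dist = {v: float("inf") for v in vertices}
    dist[source] = 0
    for _ in range(len(vertices) - 1):     # |V|-1 rounds of relaxation
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:                  # one extra round detects cycles
        if dist[u] + w < dist[v]:
            raise ValueError("negative cycle reachable from source")
    return dist
```

For a DAG, replacing the heap in Dijkstra with one pass over a topological order gives the O(|V|+|E|) special case.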

2. All-pairs shortest paths
a. Dynamic-programming algorithm (repeated squaring), O(|V|^3 ·log|V|)
b. Floyd-Warshall algorithm, O(|V|^3)
c. Johnson’s algorithm (O(|V|^2 ·log|V|+|V|·|E|)); reweights the edges so that Dijkstra’s algorithm can be run from every vertex.
NB1. Graph modification (a useful idea).
NB2. None of the above algorithms can handle graphs with negative cycles.
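Floyd-Warshall is short enough to write out in full. This sketch stores distances in a dict keyed by vertex pairs (a representation chosen for clarity, not efficiency); it is correct with negative edge weights as long as there is no negative cycle:

```python
def floyd_warshall(vertices, weight):
    """weight: {(u, v): w} for existing directed edges.
    Returns shortest-path distances for all ordered pairs. O(|V|^3)."""
    INF = float("inf")
    # initialize: 0 on the diagonal, edge weight or infinity elsewhere
    dist = {(u, v): (0 if u == v else weight.get((u, v), INF))
            for u in vertices for v in vertices}
    # allow intermediate vertex k on paths from i to j
    for k in vertices:
        for i in vertices:
            for j in vertices:
                if dist[(i, k)] + dist[(k, j)] < dist[(i, j)]:
                    dist[(i, j)] = dist[(i, k)] + dist[(k, j)]
    return dist
```

A negative cycle would show up as a negative value on the diagonal, which is how implementations usually detect and report that unsupported case.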

3. Maximum flow
a. Ford-Fulkerson method, based on the max-flow min-cut theorem.
The number of iterations in Edmonds-Karp is O(|V|·|E|); therefore, the running time is O(|V|·|E|^2).
1). Edge(vertex) disjoint path problem
2). Maximum matching in bipartite graphs
3). Capacity limitation for vertices
4). Point-to-rectangle problem
5). Edge (vertex) connectivity number problem (directed and undirected versions). The edge connectivity of an undirected graph can be computed by solving the edge-disjoint-path problem |V|-1 times.
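The applications above all reduce to max flow, so a small Edmonds-Karp sketch covers them. The nested-dict capacity format is an assumption for illustration; the BFS finds a shortest augmenting path in the residual graph each round, which is what bounds the iteration count:

```python
from collections import deque

def edmonds_karp(capacity, source, sink):
    """capacity: {u: {v: c}} for directed edges. Returns the max-flow value.
    O(|V|·|E|^2) overall."""
    # residual capacities, with zero-capacity reverse edges added
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph
        parent = {source: None}
        q = deque([source])
        while q and sink not in parent:
            u = q.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if sink not in parent:
            return flow                      # no augmenting path left
        # recover the path and its bottleneck capacity
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:                    # push flow, update residuals
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck
```

Bipartite matching, disjoint paths, and connectivity numbers then follow by building the right unit-capacity network (splitting vertices into in/out copies when vertex capacities or vertex-disjointness are required) and reading off the flow value.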

4. NP-Complete problems:
1). Clique
2). Vertex cover
3). Hamiltonian Cycle
4). TSP
5). Minimal vertex(edge) disjoint cycle covers
Minimum-Weight Cycle Cover Problem (MWCCP)
NB1. The proof of NP-hardness can proceed by following the reduction path 1->2->3->4.
NB2. Approximation algorithms exist for vertex cover, and for TSP if certain conditions (e.g., the triangle inequality) are satisfied.

5. Bisimulation

6. Modular decomposition
a. Incremental method, O(|V|^2)
b. Linear-time algorithm, O(|V|+|E|), proposed in 2005.

Above are the graph-related problems I have met and studied so far. To learn more about them, you may refer to the book Introduction to Algorithms by T. H. Cormen et al. Also, the slides given by Dr. Mark de Berg are quite helpful, and can be found here:
