Archive for May, 2013

My Second Spring in the Netherlands

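Spring in the Netherlands showed up for a few days and then went back into hiding. After a spell of flirting around, I went into hiding too.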

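In early March I returned to the company, but to a different group. Who could have guessed that things were about to change? "This girl seems rather interesting" was the impression I was left with the first time I met Liu Guniang (六姑娘). We only really got acquainted later, when the Chinese folks in the group went out to eat together. After that I more or less dragged her into our lunch circle, where we eat together and then sit around updating each other on our own news, big and small, and on other people's gossip. She seemed happy enough to join, though she did not much like talking about herself. As the feeling grew I made my move, bringing her things every day under pretexts that were obviously pretexts. Eventually she could not take it any more, and after the last time I pressed soy milk on her she simply confiscated my thermos. Along with that she gave me a verdict: "You know, you are both sensitive and petty." It came about because, after a long chat following lunch that day, she said she was not interested in my affairs. Testing the waters, I asked whether she was shooting me down before I had even finished expressing my affection, and that verdict was what I got in return. Sensitive and petty, I thought: nobody had ever rated me so highly, so I accepted it gladly. The following week I kept bringing things on all sorts of excuses. Exasperated, she turned down several of my invitations in a row, each made on a different excuse. Given how hard we usually tease each other, I figured she would not refuse a meal or a ball game so flatly. But a refusal is a refusal, nobody's temper is that easy to read, and little by little I had to rein myself in. Now that things have gotten busy, we still tease each other as usual, but I start far fewer of our exchanges.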

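When I was chatting with my brother he joked that I was being a flirt, and thinking it over, he was right. In the spring of 2013 I had my flirtatious moment, recorded here in a few lines of 梨花体 (Lihua-style) verse:
A girl called Liu Guniang,
quiet and very cute,
with a temper I cannot read,
teases me with real gusto,
but sadly not with as much gusto as mine.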

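Oh, and I want to take up music again: learn an instrument, write lyrics, arrange songs. How nice it would be if clumsy words like these could be polished and sung.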

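I am writing this down here, still hoping you will see it.
I like you, very, very much.
How much? Enough that I can see your smiling face clearly in my mind,
the eyes darting about in it, the cheeks squeezed round when you laugh.
But what does liking amount to?
Liking does not mean understanding, and does understanding mean being right for each other?
Getting to know you made me as happy as seeing a beautiful night sky,
I feel the same way when I try to understand you,
and I hope that once you know me you will still give me the smile I remember.
And there is more...
I am only holding on to the best of wishes, waiting for you...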

Frequent pattern mining from itemset databases is a widely researched topic, and good algorithms and tools have been developed for it. However, some problems remain. One of the most important is that the extracted pattern set can be much larger than we really want, sometimes even larger than the original database. This is called the pattern redundancy issue, and it has become a mainstream research topic in the community. Researchers are working to reduce the size of the pattern set by looking for criteria other than frequency alone to measure how good a pattern is.
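To see how quickly the pattern set can outgrow the data, here is a minimal brute-force sketch. The function name `frequent_itemsets` and the toy database are made up for illustration, and a real miner would use something like Apriori or FP-growth rather than enumerating every candidate.

```python
from itertools import combinations

def frequent_itemsets(db, min_support):
    """Enumerate every itemset contained in at least min_support transactions.

    Brute force over all candidate itemsets, so only usable on toy data.
    """
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            support = sum(1 for t in db if set(candidate) <= t)
            if support >= min_support:
                result[candidate] = support
    return result

# Hypothetical toy database: 3 transactions holding 20 items in total.
db = [set("abcdefgh"), set("abcdefgh"), set("abcd")]
patterns = frequent_itemsets(db, min_support=2)
print(len(patterns))  # 255: every non-empty subset of the 8 shared items
```

Two identical transactions of eight items are enough to make every one of their non-empty subsets frequent, so the pattern set is more than ten times the size of the database it came from.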

One successful method is the minimum description length (MDL) principle, where the goodness of a pattern is measured by how well it compresses the raw data. Given the data, we usually proceed as follows. We find a model (a dictionary, or a pattern set in the context of pattern mining) that summarizes the original data. To express the data fully we also need an encoding scheme, so that the original data can be losslessly decoded from the model and the encoding. For a good model, the size of the model plus the size of the encoding is much smaller than the size of the raw data. In other words, the model and the encoding together form a compact representation of the original data, and we want to minimize their combined size to get the most compact representation. The method has a theoretical foundation: the size of the model plus the encoding is called the description length of the data, and in the ideal case the shortest possible description length is the Kolmogorov complexity of the data, an intrinsic property of the data in algorithmic information theory. Kolmogorov complexity is not computable, so in practice MDL approximates it by searching within a restricted class of models.
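The two-part cost described above can be sketched in a few lines. This is only a simplified illustration loosely in the spirit of code-table methods for itemsets, not any specific published algorithm: the greedy cover, the way the model cost is counted, and the names `cover` and `description_length` are all assumptions made for the example.

```python
from collections import Counter
from math import log2

def cover(transaction, patterns):
    """Greedily cover a transaction with non-overlapping patterns, then single items."""
    remaining = set(transaction)
    used = []
    for p in patterns:
        if p <= remaining:
            used.append(p)
            remaining -= p
    used.extend(frozenset([i]) for i in sorted(remaining))
    return used

def description_length(db, patterns):
    """Return (model_bits, data_bits) for a database under a pattern set."""
    # Try longer patterns first (an assumed, simple cover order).
    patterns = sorted((frozenset(p) for p in patterns), key=len, reverse=True)
    usage = Counter()
    for t in db:
        usage.update(cover(t, patterns))
    total = sum(usage.values())
    # Ideal (Shannon) code length of each used pattern, from its usage.
    code_len = {p: -log2(u / total) for p, u in usage.items()}
    # Baseline per-item code lengths, used to spell out patterns in the model.
    item_counts = Counter(i for t in db for i in t)
    n_items = sum(item_counts.values())
    item_len = {i: -log2(c / n_items) for i, c in item_counts.items()}
    data_bits = sum(usage[p] * code_len[p] for p in usage)
    model_bits = sum(code_len[p] + sum(item_len[i] for i in p) for p in usage)
    return model_bits, data_bits

# Hypothetical toy database in which a, b and c always occur together.
db = [{"a", "b", "c"}] * 8 + [{"d"}] * 2
print(sum(description_length(db, [])))                 # single items only
print(sum(description_length(db, [{"a", "b", "c"}])))  # with one pattern
```

On this toy data the pattern {a, b, c} lowers the total, because paying for it once in the model makes each of the eight matching transactions much cheaper to encode.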

This method has been applied successfully to itemset databases, sequence databases and even data streams. Sequences are more complex than itemsets, because we have to take the sequential structure into account and allow for gaps within a pattern and for overlapping patterns. For example, the patterns in a sequence can be interleaved with each other. This problem is usually handled with statistical methods such as dependency tests. For streaming data, not only the description length but also the computational complexity is critical, since a large amount of dynamic data arrives continuously at high speed. Even quadratic complexity is not tolerable in this case, and further constraints, such as making only a single pass over the data, have to be met.

That is the big picture of the most recent progress in pattern mining research. In fact, minimum description length deserves more attention than it gets as a pattern mining tool alone. The power of the method lies in the fact that it uses the computer itself as the measuring instrument, so it can discover intrinsic features of the data from the machine's perspective, and it can therefore remove many parameters that would otherwise have to be tuned by hand. Suppose we want to find the regularity in a time series. A person can of course do this by hand, but the result can be unreliable. There is research work that discovers the dimensionality of time series objects using minimum description length; in essence, it lets the computer determine the dimension by measuring how much information has to be stored. Another example comes from change detection, where we usually use a sliding window model to keep track of the most recent events. It is always difficult to set the window size, but with minimum description length the parameter can be dropped by letting the computer make the decision.
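As a rough illustration of letting the computer pick a parameter, the sketch below chooses how many piecewise-constant segments to use for a time series by minimizing a description length. It is not the method of the research work mentioned above: the cost model (a fixed `bits_per_value` per stored segment mean plus the entropy of the residuals) and the name `mdl_segments` are assumptions made for the example.

```python
from collections import Counter
from math import log2

def entropy_bits(values):
    """Total bits to encode values with an ideal code fitted to their frequencies."""
    counts = Counter(values)
    n = len(values)
    return -sum(c * log2(c / n) for c in counts.values())

def mdl_segments(series, candidates, bits_per_value=8):
    """Pick the segment count with the smallest model + residual cost.

    Every candidate must divide len(series) evenly. The cost of describing
    the residual code itself is ignored, which is a simplification.
    """
    best = None
    for d in candidates:
        assert len(series) % d == 0, "candidate must divide the series length"
        seg = len(series) // d
        approx = []
        for k in range(d):
            chunk = series[k * seg:(k + 1) * seg]
            approx.extend([round(sum(chunk) / len(chunk))] * len(chunk))
        residuals = [x - a for x, a in zip(series, approx)]
        total = d * bits_per_value + entropy_bits(residuals)
        if best is None or total < best[1]:
            best = (d, total)
    return best

# Hypothetical series that alternates between two levels every 8 points.
series = ([10] * 8 + [50] * 8) * 2
print(mdl_segments(series, [1, 2, 4, 8, 16, 32]))  # (4, 32.0)
```

Too few segments leave large residuals to encode, too many waste bits on segment means, and the minimum lands on the four segments the data actually has, without anyone setting that number by hand.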
