So for data clustering,
we will be able to cluster P1 and P2 together, using P2 to represent both patterns.
So that means if we do this data clustering,
then all the patterns in the cluster can be represented by one pattern, P.
So the question becomes whether we should mine all the patterns and then compress them,
or directly mine the compressed patterns.
Actually, there is an efficient method that can directly mine those compressed
patterns.
I'm not going to get into the details, but you may refer to this interesting paper.
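To make the idea of representing a cluster by one pattern concrete, here is a minimal Python sketch. It is not the efficient direct-mining method from the paper; it takes the simpler "mine everything, then compress" route. It assumes, purely for illustration, that the distance between two patterns is the Jaccard distance between their supporting transaction sets, and that a representative covers a pattern when the pattern is a subset of it and lies within a threshold `delta`. The names `pattern_distance`, `compress`, and `delta` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set

Pattern = FrozenSet[str]


def pattern_distance(t1: Set[int], t2: Set[int]) -> float:
    """Jaccard distance between the transaction-ID sets of two patterns (assumed measure)."""
    union = t1 | t2
    if not union:
        return 0.0
    return 1.0 - len(t1 & t2) / len(union)


def compress(supports: Dict[Pattern, Set[int]], delta: float) -> Dict[Pattern, List[Pattern]]:
    """Greedily group patterns under representatives.

    A representative covers every not-yet-covered pattern that is a subset
    of it and whose distance to it is at most delta.
    """
    # Try longer patterns first as candidate representatives.
    ordered = sorted(supports, key=lambda p: (-len(p), sorted(p)))
    clusters: Dict[Pattern, List[Pattern]] = {}
    covered: Set[Pattern] = set()
    for rep in ordered:
        if rep in covered:
            continue
        members = [
            p for p in ordered
            if p not in covered
            and p <= rep
            and pattern_distance(supports[p], supports[rep]) <= delta
        ]
        clusters[rep] = members  # includes rep itself (distance 0)
        covered.update(members)
    return clusters


# Toy data: P1 and P2 occur in almost the same transactions, P3 does not.
P1 = frozenset({"a", "b"})
P2 = frozenset({"a", "b", "c"})
P3 = frozenset({"d"})
supports = {P1: {1, 2, 3, 4, 5}, P2: {1, 2, 3, 4}, P3: {6, 7}}

clusters = compress(supports, delta=0.25)
for rep, members in clusters.items():
    print(sorted(rep), "represents", [sorted(m) for m in members])
```

On this toy data, P2 ends up representing both itself and P1, while P3 stays in its own cluster. A direct-mining method would avoid enumerating all patterns first, which is where the efficiency gain comes from.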
Okay, another interesting idea is redundancy-aware top-k patterns.
That means we want a set of patterns similar to the compressed
one, because we want high significance and low redundancy.
That is the kind of pattern set we are after.
Let's look at these four panels, a, b, c, and d, which show different kinds of compression.
Panel a is the set of original patterns.
Their grouping into clusters reflects pattern distance, and the color shows significance:
the darker a pattern is, the more significant; the lighter, the less significant.
Okay, in that case you can probably see that in this bigger cluster
there are three patterns.
They are quite significant.
If you just do top-k pattern mining, that is, you rank by support count
or some other significance measure, you would find only these three patterns.
Suppose we want only the top three; then all the remaining
patterns, like those here in the other cluster, are completely missed.
But suppose you say, I'll just do summarization: try to find the clusters,
and within each cluster try to find its center.
Then you may well end up with less significant patterns, so
this may not be a good balance.
Actually, a better balance is to take care of both significance and redundancy.
Simply put, look here: there are patterns that are very significant
and that are also at the cluster center,
so you may want to show these patterns.
In the meantime, suppose you can only show three;
you may show these, which are significant and less redundant.
This one is significant, and it also represents this cluster.
So the problem becomes how to develop an efficient and
effective method for finding such redundancy-aware top-k patterns.
There's an interesting study that uses maximal marginal
significance to measure the combined significance of a pattern set and
develops efficient methods to mine such patterns.
We are not going to get into the details of this method.
Interested readers may read the paper we pointed out.
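For readers who want a feel for the trade-off, here is a small greedy sketch in Python. It is only an illustration under stated assumptions, not the method from the paper: the redundancy between two patterns is assumed to be their similarity (one minus distance) times the smaller of their two significance values, and each step picks the pattern whose significance minus its largest redundancy with the already chosen patterns is highest. The pattern names, scores, distances, and function names are all made up for the example.

```python
from typing import Callable, Dict, List


def redundancy(p: str, q: str, sig: Dict[str, float],
               distance: Callable[[str, str], float]) -> float:
    """Assumed redundancy: similarity times the smaller significance."""
    return (1.0 - distance(p, q)) * min(sig[p], sig[q])


def redundancy_aware_top_k(sig: Dict[str, float],
                           distance: Callable[[str, str], float],
                           k: int) -> List[str]:
    """Greedily pick k patterns by marginal gain over those already selected."""
    selected: List[str] = []
    candidates = set(sig)
    while candidates and len(selected) < k:
        def gain(p: str) -> float:
            if not selected:
                return sig[p]
            worst = max(redundancy(p, q, sig, distance) for q in selected)
            return sig[p] - worst
        # sorted() makes tie-breaking deterministic
        best = max(sorted(candidates), key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy setup mirroring the figure: A, B, C sit in one big cluster and are
# all highly significant; D and E are less significant and sit elsewhere.
sig = {"A": 0.9, "B": 0.85, "C": 0.8, "D": 0.5, "E": 0.4}
cluster_of = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 2}


def distance(p: str, q: str) -> float:
    # Hypothetical distances: close within a cluster, far across clusters.
    return 0.1 if cluster_of[p] == cluster_of[q] else 0.9


traditional = sorted(sig, key=sig.get, reverse=True)[:3]
aware = redundancy_aware_top_k(sig, distance, k=3)
print(traditional)  # ['A', 'B', 'C']
print(aware)        # ['A', 'D', 'E']
```

Traditional top-3 stays inside the big cluster, while the redundancy-aware selection keeps the most significant pattern there and spends the other two slots on patterns from the other clusters.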
Thank you.