Data repositories in which cases are related to subcases are identified as hierarchical. This course covers the representation schemes of hierarchies and algorithms that enable analysis of hierarchical data, as well as provides opportunities to apply several methods of analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency School of Computing, Informatics & Decision Systems Engineering

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

There are also what are known as a string kernel algorithms,

that are very often used to compare strings to each other.

Note that the difference between the string kernel algorithm and

the common w-gram counting is that in the w-gram counting,

we are given the error bound as an input,

if we're told that can be at most K errors.

In the case of the string kernel,

instead we are basically told to approximately compute how

well string P may

match string Q without actually computing the edit distance between them.

So there is no K errors as a perimeter,

error bound as a perimeter in this case.

So, how do the string kernels work?

So, let's see this in the other window.

So once again, we are given a string P,

for this can be a very long string P. We are also given a string or sequence Q,

this can also be very long.

This is of length N and this is of length M. So what we do is

we consider all w-grams of N,

and we also consider all w-grams of M. So,

we actually compute all w-grams of N and all w-grams of M. Now,

for the sake of the example,

let's assume that basically in this case w is equal to 2.

And let's assume that our symbols are a, b,

and c. So, what are the w-grams that basically we can have if we do that?

So the w-grams that we can have are aa, ab, ba, bb,

ac, ca, cb, bc, and cc.

So, these are essentially all w-grams that we

can extract from these two strings.

So, what we will do is we will basically create two counting vectors,

one for string P and one for string Q.

Essentially, what we will do is that every move,

and as we move and create the w-grams,

in this case two grams of string P,

whenever we have an aa,

we will increment the position aa plus one.

And whenever we see say cc,

we will increment the position cc plus one.

So if you do that, eventually,

we will basically have a counting vector, or a histogram,

also known as a histogram,

of the number of times aa occurs.

Let's assume that in P aa occurs a hundred times,

ab occurs 80 times,

ba occurs twice, bb occurs five times,

ac occurs 50 times,

ca occurs 40 times,

cb occurs 30 times,

bc occurs 35 times,

and cc occurs four times.

We will do the same thing for the string Q.

So essentially here, we will have another account,

say this is occurring 50 times,

this is occurring 10, this occurring 40,

40, 35 times, 20 times,

20 times, and 10 times.

So what we have here, essentially two vectors.

The first vector corresponding to P,

the second vector corresponding to Q,

that counts the number of times each possible w-gram occurs in P and Q.

The idea is that, if these strings are very similar to each other,

then these two vectors must also be very similar to each other.

If P and Q are very different from each other,

then the corresponding vectors,

the counting vectors must also be very different from each other.

So, what we can do is instead of computing the edit distance between P and Q,

we can take the counting vectors of P and Q,

and then we can compute the distance between these two counting vectors,

or we can compute the similarity between these counting vectors.

One way to do that is using that product.

But there are other vectors comparison algorithms,

and we had already covered them in another module of this course.

So I'm not going to go over them here,

but essentially what we have done is that we have transformed

the string comparison problem,

which is expensive, which is possibly order,

N times M cost,

to a vector comparison problem which can be implemented in a much cheaper way.

And how did I do that?

We have done that by converting

the strings or sequence that are given input to counting vectors.

So, we have transformed the string comparison problem to a vector comparison problem,

by using w-grams as the transformation tool.

So, this is one way to simplify the edit distance process.

The last algorithm that we will learn,

once again to see if string P match string Q but do it efficiently,

it's called the min-sampling similarity.

I will try to explain that once again using an example.

So again, we are given a string P of length N,

and we are given a string Q of length M. And what we tried to do is we try

to quickly identify whether P may be similar to Q or not.

The main way of doing it is using edit distance,

but we know that the cost of this is N times M,

which is very expensive.

We don't want to pay that price.

So if you don't want to pay that price,

can we use w-grams to compute this.

Well, the initial part of

this algorithm will be similar to the previous algorithm that we had learned.

So what we will do is we will identify all w-grams of

P. We will also identify

all w-grams of Q.

So at this point,

we have one set of w-grams that correspond to P,

and one set of w-grams that correspond to Q.

If these two sets are similar,

actually I should correct myself,

these are not really sets because say,

w-gram aa occurs twice,

then I will have aa twice in this,

which means that these actually not sets, they are multisets.

So we have one multiset of w-grams that

correspond to P and we have another multisets of w-grams that correspond to Q.

So if this multiset is similar to this multiset,

then I can say that P is similar to Q.

So the question essentially becomes,

how can I compare this multiset to the other multiset?

The counting vector algorithm was one way to do that.

Min-sampling is another way of comparing these multisets.

The idea is simple.

Let's assume that I have a multiset of w-grams that correspond to P,

and I have another multiset of w-grams that correspond to Q.

What I will do is I will use several hash functions.

In this case, I will use r hash functions.

The role of the hash function will be

to hash the w-grams.

So given r hash functions,

I will create r different hashings of these w-grams to each other.

And let's assume that these hash functions will match

the given w-grams to numbers.

So, given that, I have r hash functions,

essentially what I'm doing that I'm creating r different orderings of the W-grams of P,

and W-grams of Q. I create r different orderings of W-grams of P and Q.

So, this basically create this lets say that R equal to three.

So I can basically create essentially,

R equal to one,

R equal to two,

R equal to three,

and for each one of them,

I have an ordering of W-grams.

So this is the ordering of W-grams that correspond r equal to one.

This is an ordering of W-grams that correspond to r equal to to two.

That is the ordering of W-grams that correspond to r equal to three.

Now, depending on the hash function, the same,

W-grams, say aa, can occur at different positions.

So, for example for this pq, this is pq,

this is pq for example in one order,

aa a may occur here,

in the other one aa may occur at this position,

and in the next one aa may occur at this position.

Why? Because I have r random selected hash functions that provide me different rankings.

The next one is, ab and ab may occur here,

and the first hash function ab may occur here,

in the second hash function,

and ab may occur here on third hash function.

So I get different orderings.

So what is the idea?

Why do these orderings matter?

So why the orderings matter is that,

if P is very similar to Q,

independent of these hash functions that I use,

the corresponding orders, order symbols,

sequencers, W-grams, should also be similar,

if they are very similar because there's aa here,

there's also an aa here, there's an ab here,

there's an ab here,

so this aa must also occur here,

ab must also occur here,

ab will occur here,

aa will occur here,

aa will occur here and ab will roughly occur

at the same location because all the other things are similar,

say cc is here.

Let's assume that cc matches here,

the next one is dd.

The strings are similar.

So dd must also occur here,

the next maybe da and because the strings are very similar

to each other and when we order the corresponding W-grams,

they will also be similarly ordered.

Now, there will be mistakes out the errors obviously,

right so because they may not be exactly

matching so it may maybe that basically this has aa,

but the other one doesn't have aa,

so it maybe that is has aa and then say cc arise and this one doesn't have aa,

so it's just has ca.

So, there is no guarantee,

that this entire sequence will match,

because there will be some errors,

but if we focus on some small b,

we know the first b windows here,

for some small, budget capital,

let me use the capital B clean to indicate budget.

Say B equal to five,

the first five W-grams.

One, two, three, four, five.

One, two, three, four, five.

And one, two, three, four, and five.

If I basically focus on the first few W-grams,

if the two strings are very similar to each other,

there's a good chance that this entire sequence,

will match this entire sequence.

I'm not guaranteed but there's a good chance it will match.

Now, it is possible that when I use this hash order,

I will find a match,

when I use this hash order,

the first P that still match,

but when I use this hash order,

it is possible that there will be a mismatch.

Say this is basically correspond to ca here,

but da will be here because this string didn't have ca in it,

so it didn't appear.

So what does that mean?

This means that basically if I have similar strings,

then I will have orders that are more or less matching,

if I have similar strings.

I will have more orders that are matching,

and if I have strings that are dissimilar from each other,

then I will have more hash orders that are mismatching.

So the more similar the strings are,

the more hash orders will match each other.

The more dissimilar the strings are,

the more hash orders will mismatch each other.

So this is essentially the idea.

Given strings p and q,

we create all their W-grams,

we create multi-sets of W-grams,

and then by using r hash functions,

we create different orderings of the W-grams in both multi-sets,

and then we look at the first b meaning of these,

and see if they are aligned given the corresponding random orders.

If they are aligned,

we say that match for the right hash function is equal to one.

So the idea is very simple.

So given p and q,

consider all possible hash orderings.

There are r of them compute with

the corresponding smallest BW-grams

the first BW-grams in the given order are matching or not,

and then count the number of hash orders in which they are realigned,

say R equal to in this example three.

If only two hash orders are aligned,

we say that these two things are matching to each other with two by three probability.

If only one is aligned,

if only one alignment holds then we say that,

"Well, there is less chance that these are similar to each other."

If none of them are aligned,

then we say, "Well,

there is really not much chance that these two strings are aligned with each other."

If all of these sequences they agree then we can't say, "Well, you know what,

there is good evidence that these two strings might be very similar to each other.

So why don't we go and pay

the more expensive distance cost to

actually evaluate how similar these two things are to each other?"

If however, the sim is zero by three,

then we can say, "Well, there's not much chance that these are aligned.

So there is actually no reason to go and pay the price of expensive edit distance to

compete the precise value of the edit cost between things p and q."

So, based on the value mean sampling similar to value that we obtain, we can decide,

whether to not consider the two string or whether to consider

them as potential match and go and compute the more expensive edit distance.

So, it can be used as a filtering tool.

So what did we learn?

In the previous lecture and this lecture,

we've learned that edit distance is a way to compare strings.

But we have also learned that edit distance can be very

costly for matching strings that are long.

We have learned in this lecture to all distance algorithms,

cross-parsing and compressing distance,

that can be used to approximate the edit distance comparison.

So instead of paying order m times m edit distance cost,

we could use some cheaper algorithms to compute edit distance.

Then we have learned in this lecture,

the use of W-grams also known as n-grams,

commonly referred to the n-grams to filter out unpromising candidates,

quickly so that we don't have to pay the price of

more cost and distance competitions such as

edit distance or cross-parsing and compression distance.

These are cheaper than edit distance but they are still,

they can still be expensive and

this W gram based algorithms essentially enable us to filter

out unpromising scenarios so that we can provide the potential matches to

the users in real time or in almost real time for effective exploration of sequence data.

Thanks a lot for your attention.

Explore our Catalog

Join for free and get personalized recommendations, updates and offers.