Data repositories in which cases are related to subcases are identified as hierarchical. This course covers the representation schemes of hierarchies and algorithms that enable analysis of hierarchical data, as well as provides opportunities to apply several methods of analysis.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency School of Computing, Informatics & Decision Systems Engineering

K. Selcuk Candan

Professor of Computer Science and Engineering Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

In this example, this work very well but in practice,

if our query patterns happen to be very long,

this may cause very long lookups, why?

Well, if it's seen here,

the cost of lookup is related to the length of the pattern,

the length of the pattern that we are searching.

But if our pattern is very long,

this may mean that the number of lookups that I

have to do on to trie may also be very long.

So, the question then is,

is there a way to reduce that cost?

Can we reduce that cost and can we make them look up faster for long query patterns?

Well, the answer turns out to be yes and the idea is basically

to take this suffix tree data structures into a new representation,

similar but new representation,

where instead of doing the search based on the length of the pattern,

doing it in a different way.

And then, we will basically call this new data structure as suffix array.

So, the idea is very simple, right?

So, let's take the string that we are given as the data string, we are given that.

And let's create our corresponding trie data structure called suffix tree.

Note that, what we can do is we can focus on the leaves of

this trie data structure.

Just focus on the leaves of the data structure and

then represent the leaves of this data structure,

which are nothing about pointers to the origin of data sequence,

into an array, in the order in which they appear.

So, in this case,

the leftmost pointer is six,

the leftmost array element is six.

The rightmost pointers points to position one and the rightmost array entry is one.

The order in which the pointers are laid out on my trie data structure,

on my suffix tree,

correspond to the order in which pointers are laid out on my array data structure.

So, what's the efficiency of this?

Now, then what happen is that basically we had

a data structure where we could actually search,

now we don't have a data structure,

we only have an array.

So, if you are given the sequence

"the" and we want to find it in our data sequence, how can we do that?

Is there even a way to do that?

Because my tree disappeared,

I removed my tree.

Well, what we can do is we can basically do binary search,

right? What's binary search?

Well, binary search, as you will know from your basic data search and algorithms courses,

a efficient way of doing searches on data that is altered in a particular way.

So, in this case, the particular way is at the right,

we have things that starts with an A and at the left,

this is the left or this is the right?

I think it is the left. Okay.

So, as you know,

the binary search is a efficient way to implement searches on sorted data and here

our data is sorted from left to right in alphabetical order.

At the leftmost position,

we have an, subsequence state start with an.

And at the rightmost position,

we have subsequences that starts with a w. So,

here six corresponds to an,

one implicitly corresponds to w. Now that I have

an implicit sorted data which means that I can implement binary search on it.

Let's try to implement,

how do we do binary search?

Well, we usually jump somewhere in the middle.

Let's assume that basically, we go to four,

it is somewhere in the middle of our array.

Four points to this position,

that's stands the s. S is greater than t,

I couldn't find what I'm looking for but I know that "the",

if they exists, it must be somewhere on the left side of my array.

So, let's basically jump the new middle of the new array,

sub array, the position is seven.

Seven points to arrays,

so start with an a.

Interesting, so "the", if it exists,

must be somewhere in this range.

So let's basically, let me pick a new center.

So, the new center,

let's assume it is eight,

eight points to,

I messed up.

So, we need to basically go to the beginning of this,

I apologize, what did I do? It's kind of silly.

Okay.

So let's basically see whether we can

find the subsequence "the" using the array in the example.

So, how do we do binary search?

Have do we search for "the" using the binary search idea.

Well, the idea is very simple,

we jump somewhere in the middle to the array and see whether,

by luck, we find the subsequence.

In this case, let's basically jump to the position eight.

The position eight pointed to the pointer at this position, points to i.

So, I start with an i, we were unlucky,

we didn't find "the" but we found,

what we found is that "the" if it exist,

must be somewhere in this range because t is larger than i.

Okay, let's begin other half.

Let's assume that maybe we jump at this position,

at that position I have a pointer to three,

three points to t, that's great.

So, we have something that started the t but the next symbol is an e,

it's not a h. So,

what this says is that basically,

the next symbol if "the" exists it must be somewhere in this range of the array.

Great, let's pick a new center.

So, in this case, I have nine, five, one.

So let me pick five,

five points to tr so are the t and r. So,

what this say is that basically here I had te,

here I had tr.

So, th, "the", if it exist it must be at this position.

So, let me follow this pointers t, h, e, great.

So, what happened is that basically I just found what I was looking for.

And to search for this,

I didn't need the tree data structure or the suffix tree data structure.

Instead, what I did is I did binary search.

Note that, the course of this binary search is log two N,

where N is the length of

my array which is nothing but the number of words I have in my string.

So in general, my search essentially become log two N cost.

Instead of basically being related to the length of the pattern,

now the search is related to the length of my query tree.

So, for log patterns,

this might be more efficient.

In fact, we can make this better,

so instead of using binary search with log two N cost,

we can build search trees such as binary search trees or b-trees which can

improve the log two N search cost to log f N search cost,

where the f is the fen-out of the b-tree data structure.

In case we can have large f values,

such as f close to 10 or f close to 100,

this can basically and significantly bring the search cost down.

It's great. So now we know how to do subsequence search.

We either create a suffix tree using all possible suffixes of

my data sequence or I convert

this suffix tree into a suffix array

and I perform binary search on it,

or I create a tree-base data structure on it to further bring the search cost down.

Explore our Catalog

Join for free and get personalized recommendations, updates and offers.