0:37
very easy to understand and we'll, look at some examples. in a little summary from
basic theoretical computer science, there's a very important theorem called
Kleene's theorem that, was, developed actually here at Princeton in the 30's and
it says that, for any discrete finite state automaton, there exists a regular
expression that describes the same set of strings. And equivalently, for any regular
expression, there's a DFA that recognizes the same set of strings. and this is just,
an example. and if you have seen this, this sort, this sort of thing will be
familiar and if we have, if you haven't you know, and you will understand better
as we get into it. So, a discreet finite state machine is a machine that has states
that are labeled and it has transitions from state to state that are labeled with
characters. And, the way that it works is, if it's in a state and it reads like for
every state is well defined is the next character is zero or one it's another
state to go to. So for example at state zero if you see a zero you stay in state
zero, if you see a one you go to state one. In state one if you see a zero you
stay in state one, if you see a one you go to state two, and so forth. So at every
state. read a character and go to the well defined next state. That's a discreet a
deterministic final state automaton. and so, that's the machine you start it up on
a string, and it reads all of the characters in the string. and then, if it,
ends up in a state that's so called terminal state, and started on the start
state the terminal state is just in dicated by this X that are on this
example. It ends up in the right state you say that it recognizes a set of strings.
and so that's a way to determine if a given string is in a specified set. The
machine specifies the set. In an RE the, we specify the set by writing characters
and stars and meta-characters in parenthesis and this regular expression
recognizes these same set of strings. Describes these set of same, same set of
strings that's recognized by this DFA. Hank Leany's theorem showed that it's
always possible, to, construct such a machine. So that gives a, a basic plan for
developing a pattern matching implementation. and this is developed by
Ken Thompson at Bell labs, one of the developers of UNIX, and, and this facility
was an important part of early UNIX and developed this idea based on the
well-known theorem from theoretical computer science. is let's try to build a
machine, the same way as we did for Knuth-Morris-Pratt, where we have no
backup in the text input string. so we just got a finite state machine. With
Knuth-Morris-Pratt we built a, a machine for recognizing the one string how about
building one for multiple strings and that would give us a linear time guarantee. so
the unruling extractions is determine if it's FSA or DFA and the basic plan is to
just go ahead and take your pattern. It describes a set of strings and use that to
build a DFA and then simulate the DFA with the Texas input, the same way we did for
Knuth-Morris-Pratt And if it accepts, we say we have a match. If it rejects, we
don't have a match. this is a, a fine plan but it's got a flaw. the bad news is that
The plan is infeasible because the number of states in the DFA might be exponential
in the length of the RE. So, for it's got too many states Kleene's, the proof of
Kleene's theorem, standard proof of Kleene's theorem , involves the
exponential number of states. and it's not that difficult to prove if you're
interested be sure to look it up. But from a practical, standpoint, too many states,
you can't use that as a basis for algorithm. But there's an easy revision.
And again, this gets back. It will, it will give us a quadratic time guarantee,
and actually, in, in real life, it's usually linear, and all we do is change
the abstraction to use a non-deterministic finite state machine, an NFA, rather than
a DFA. so, in the same basic plan, we're going to build an NFA. It's a, it's a
different kind of machine, but actually, it's also the case that, that, oh yeah.
For any regular expression we can build on NFA so and in vice versa cleaning stand to
this and so we're going to simulate with the NFA with the text as input. And so,
what do we, what do we mean by non-determined as a finite state machine?
That's what we have to talk about next. and we'll just do it with this example
that we used throughout this lecture. So, It's very similar to the DFA that we have
before, now we're going to put the characters in the states and actually, the
kind of NFA that we're going to build, we're going to have one state for every
character in a regular expression. So this is an NFA, corres-, corresponding to this
regular expression. just, to, We, we always enclose the regular expressions in
parentheses. just to, make everything work. and then, Got this regular
expression here. A star B or ACD. then, we're going show how to build, this NFA.
and then, simulate it, to recognize the regular expression. And, how is it
different than a DFA So, there's a character in every state, and if the
character's a text character, it's the same as before. We read that text
character and moved to the next state. But it's a more general kind of machine,
because states also have what's called epsilon transition. And with an epsilon
transition the machine is allowed to change the state without scanning any
text. So, at the beginning the machine go from zero to one, to two, back to one, I'm
sorry, two to three, back to two or it go from zero to one over to six there's lots
of places the machine could go, without scanning any text character. but we do
have the black matc h transitions that scan text characters and so those are the
rules the machine operates by. And, the, final rule is when does it accept, when
does it decide a strings in the pattern. It accepts if there's any sequence of
transitions that scans all the text characters and ends in the accept case.
it's a way of specifying an infinite set of strings. but it's got this
non-determinism. It's not always determined what the machine, will do next.
It's a little bit of a mind blowing concept in theoretical computer science.
But this particular example actually shows how such, such a concept can be made
concrete and actually give us a widely useful algorithm. One way to think of a
non-deterministic machine is a machine that has super powers and can guess the
proper sequence of state transitions to get to the end. Get to the accept state.
another way to think about is the sequences if you provide us particular
sequences that's a proof that the machine accepts the text. and so this is a real
machine. We don't have a real machines that can guess the right answer but it's a
completely well specified abstract machine, and we can write a program to
stimulate it's operation, and that's what we are going to do. So let's just make
sure that everyone's got the concept down. So, say we have the question is 4A is
followed by BD, is that accepted by this NFA down here? and the answer is yes
because there is a sequence of legal transitions that ends up in state eleven.
So in this case, we'll take an epsilon transition from zero to one to two and
then we'll... We've got 4A's so we'll chew up four A's, one, tw-, I'm sorry. One, and
then go back. Two, and then go back. Three, and then go back. Four, and then
we'll be in state three. And then at that point, we'll decide to take this epsilon
transition. We'll assume your machine decides to take this one. then it
10:35
sequences that the machine might guess and go to the wrong state or stall, it doesn't
matter as long as there's some sequence and we are going to assume that the
machine always guesses the right one. So for example, if the machine just
recognizes three A's, one, two, three, and then went to state four, it would get
stuck because there's no way for it to get out of state four because state four is
looking for a B, and it's sitting on an A, and so forth. So, there's definitely,
things that the machine could do that would be wrong, but we don't care, as long
as there's some way for it to get through. and then, what about if it's a string
that's not in a language recognized by the state. the machine, well, so we have to
argue about all possible sequences and, prove that, no sequence of legal
transition, is transitions ends in state eleven. and that, seems to be, a fairly
complicated sort of argument. So in this case the machine could recognize a bunch
of A's, and then go to state four. But again, there's no B, so there's no way
it's going to get out of state four. and, so you can make a general argument like
looking at the machine that's any string that it recognizes, it has to end in D and
this one doesn't end in D. that's a, much more complicated thing, than we're talking
about, is to try to come up with, a, simple machine that will decide whether or
not a string, is in the, language that it specifies. so the question, so question
that we have is, we have this non deterministic machine, how are we going to
decide whether a given string is in the language that it recognizes. So for
deterministic machines, like we used for Knuth-Morris-Pratt it's very easy, because
at every time, there's only one possible transition. But for non-deterministic, if
you're, in some states, there's, multiple ways to go, and you have to, figure out
the right transition. But actually, the situation isn't so bad. and what we can do
is, to simulate the NFA, is just to make sure that we consider all possible,
transition sequences. And that's what our algorithm for a regular expression pattern
matching is going to be based on. that's what we will look at next.