Hi there. In this next series of lectures,
we will be looking at another important distributed computing problem.
This problem is called mutual exclusion.
Just to motivate the problem, here is an example scenario.
Suppose the bank where you store all of your money
runs its servers in the cloud.
Now, say two of your customers are trying to make deposits into your company's account,
and both deposits are $10,000.
The customers make these deposits simultaneously,
each from a separate bank branch or from a separate ATM.
Here is the set of operations executed by the two ATMs, which we'll call the clients.
Each ATM reads the initial amount, say $1,000,
which is the balance of your bank account, from the bank's cloud server.
Then, each ATM adds $10,000,
the amount to be deposited, to this balance, which makes it $11,000.
And then, each of these ATMs writes back the final amount to the server.
So, the server in the end is storing a value of $11,000.
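To make these steps concrete, here is a minimal single-machine sketch in C (not from the lecture; the thread setup, the usleep call, and the variable names are purely illustrative). The two ATMs are modeled as threads that each perform the unprotected read, add, and write-back just described.

    /* Two "ATM" threads each perform an unprotected read-add-write
       on a shared balance; with no mutual exclusion, both can read
       the same initial value and one deposit is lost. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static long balance = 1000;              /* balance stored at the server */

    static void *deposit(void *arg) {
        (void)arg;
        long read_value = balance;           /* 1. read the initial amount     */
        usleep(1000);                        /*    (widen the race window)     */
        long new_value = read_value + 10000; /* 2. add the deposit             */
        balance = new_value;                 /* 3. write the final amount back */
        return NULL;
    }

    int main(void) {
        pthread_t atm1, atm2;
        pthread_create(&atm1, NULL, deposit, NULL);
        pthread_create(&atm2, NULL, deposit, NULL);
        pthread_join(atm1, NULL);
        pthread_join(atm2, NULL);
        printf("final balance = %ld\n", balance); /* often 11000, not 21000 */
        return 0;
    }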
What really went wrong?
Well, you should have been storing $21,000 at the end of it, not $11,000.
So, essentially, you ended up losing one of the $10,000 deposits because your clients,
the ATMs, did not have mutually exclusive access to your account.
If you had allowed only one client at a time to access
the bank account object that represents your bank balance,
then you would have guaranteed mutually exclusive access,
and you would have had a balance of $21,000 at
the end of this particular series of operations.
So, this is an example where mutual exclusion is
important, where you need mutually exclusive access to a given object.
In general, you need mutually exclusive access to a piece of code.
In this case, that piece of code would be the code
that fetches your bank balance,
adds the deposit amount,
and then sends the final bank balance back.
If at most one client is allowed to execute in that piece of code at any point of time,
then correctness would have been maintained in this particular scenario.
There are other situations where mutual exclusion is important and critical.
In distributed file systems,
clients typically need to lock files and/or directories.
And in this sort of scenario,
you need mutual exclusion where at most one client
is allowed to access a file at any given point of time.
In general, accessing objects in any distributed system or distributed object store
is a scenario that requires mutual exclusion,
so that at most one server is granted access to the object at any time.
Coordination across servers is another example of a situation requiring mutual exclusion,
so that you can partition the work across servers,
and the servers can coordinate using locks.
In industry, mutual exclusion is fairly widely used.
Chubby is one of the core systems underlying several of Google's internal systems.
We'll see Chubby in a little bit of detail later on in this lecture series.
Also, many cloud computing stacks that are based on
open source use Apache ZooKeeper for coordination among servers.
So, what is the problem?
Sort of formally, the mutual exclusion problem
is also called the critical section problem.
The critical section is a piece of code which is available at all processes,
but for which we need to ensure that there is at
most one process executing inside that piece of code at any point of time.
So each process has that piece of code in its own code.
But whenever one process is inside that piece of code,
the critical section code, no other process
should be inside its own critical section piece of code.
So essentially, this is the mutual exclusion problem.
Now, the way we define an API for this problem is via essentially two main functions.
They are the enter function and the exit function.
The enter function is called by a process before it enters the critical section.
When the process returns from the enter function,
it can enter the critical section;
for instance, it can call a function like AccessResource
that runs the critical section code.
Then, once the process is done with a critical section,
it needs to signal in some way to the other processes that it is done,
so that the next waiting process can enter the critical section.
And so when a process is done with the critical section,
it calls exit to exit the critical section.
Here's our example from the bank scenario earlier.
Here are the two ATMs.
The ATM on the left is ATM1 and the ATM on the right is ATM2.
And both ATMs are running identical code.
And that's the beauty of mutual exclusion:
the program being run at the different clients can, in fact,
be identical to each other.
So, let's see what one of the ATMs does.
So, ATM2 does the following,
ATM1 does the same thing.
First, it enters the critical section.
Then, it accesses the resource where it obtains a bank balance,
it adds in the deposit,
and updates the bank amount.
Finally, it calls exit on the critical section.
And at this point of time, it's done.
So, if you ensure that at most one ATM
is executing the critical section in between the enter and the exit,
then you're guaranteed that either
ATM1 goes first and makes its deposit,
and then ATM2 goes, or the reverse happens,
where ATM2 goes first and then ATM1 goes next.
And in either case, you're going to end up with
the correct bank balance of $21,000 rather than $11,000.
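As a sketch, the code each ATM runs could be written as follows. Here enter(), AccessResource(), and exit_cs() are hypothetical placeholder names for the API described above (exit_cs stands in for the lecture's exit call, renamed only so it does not clash with the C library's exit()).

    /* Hypothetical prototypes for the mutual exclusion API. */
    void enter(void);           /* blocks until this ATM may enter the critical section */
    void AccessResource(void);  /* reads the balance, adds the deposit, writes it back  */
    void exit_cs(void);         /* the exit call: lets the next waiting ATM proceed     */

    /* Identical code run by both ATM1 and ATM2. */
    void atm_client(void) {
        enter();           /* request the critical section            */
        AccessResource();  /* at most one ATM executes this at a time */
        exit_cs();         /* done: the other ATM can now enter       */
    }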
So, there are several approaches to solving mutual exclusion.
Some of the most basic approaches arise when
all the processes are running inside
one operating system on a machine or inside a virtual machine.
In this case, there are several abstractions that are provided
by operating systems or their libraries.
These include semaphores, mutexes,
condition variables, and monitors.
So, when you're talking of a distributed system, however,
where processes are communicating by exchanging messages,
you can no longer use these abstractions that we talked about on the previous slide.
That's because those abstractions on the previous slide
need shared variables among the processes,
which is not possible in a message-passing distributed system.
So, you need essentially three important properties for any mutual exclusion solution.
They are safety, liveness, and ordering.
Safety, once again just to remind you,
is the property that nothing bad ever happens.
And in this case, the bad thing would be
two or more processes in the critical section simultaneously.
So, safety would say that
at most one process is in the critical section at any point of time.
Liveness, once again informally speaking is
the guarantee that something good happens eventually.
In this mutual exclusion problem,
liveness would be the guarantee that when a client wants to access a critical section,
that access or that request is granted eventually.
These two, safety and liveness, are essential properties.
There is also a desirable third property called ordering,
which says that when multiple clients request access to the same critical section,
these requests are granted in the order that they were made.
And of course, there are many different kinds of ordering that we can talk
about and you'll see those different kinds of
ordering as we discuss the different solutions that exist to mutual exclusion.
So, before we jump to the distributed version of the mutual exclusion problem,
let's look at how the basic solution works in the case where processes are sharing an OS.
And so, they can actually share variables such as semaphores. So, what is a semaphore?
A semaphore, informally speaking, is an integer
that can only be accessed via two special functions.
It's a shared variable, so multiple processes can be sharing this variable.
So, the semaphore S in this particular slide is initialized to one,
and this means that one is the maximum number of
processes allowed inside the critical section at a time.
The two operations possible on the semaphore are wait and signal.
Wait is often called P or down.
Signal is often called V or up.
The P and V come from Dutch words
that were associated with the original semaphore invention.
The word semaphore itself is of course
an English word that's been in the dictionary for many centuries.
Essentially, semaphores were used in the old days to communicate from
one hill top to another by burning a fire or beating a drum.
And here, in computer science essentially,
we are abstracting that out into code and using the same concept,
but for mutual exclusion in our operating system.
So, let's look at what the wait function does.
Essentially, the wait function is what
a process would call before it enters the critical section.
The wait function enters a while loop.
Inside the while loop there is an if statement,
and one entire execution of the while loop body is atomic,
which means that while a process is executing this particular piece of code,
no other process can preempt it partway through.
So for instance, if the process executes the if statement
and is then preempted by the scheduler, so that another process runs
before it executes the S-- statement, that's not allowed.
Once a process starts one execution of the while loop body,
it must complete that execution before it is preempted.
So what does each execution of the while loop do?
Well, it checks if the value of the integer S is greater than zero.
If it is, then it decrements it by one
and breaks out of the while loop, in which case
the process can then enter the critical section.
So, essentially, if there is only one process accessing the critical section,
then it would go through the while loop and immediately break out,
and the value of S would be zero.
If another process came along and wanted to enter the same critical section,
it would call wait(S) and it would essentially get stuck in this while loop.
It would stay stuck until the first process exited the critical section.
The exiting process would call signal,
which would simply increment the value of S by one.
Once again, this increment is an atomic operation,
and the system must ensure that it is executed completely,
and not in a way where the processor fetches the value of S, and then
the process gets preempted before
the incremented value of S can be written back to memory.
So, at that point of time,
the value of S is again one, and on the next execution of
the while loop the second waiting process will break out,
and the second waiting process can now enter the critical section.
So, these atomic operations, the statement in the while loop and also
the S++ operation in signal, are
typically provided via hardware instructions such as compare-and-swap
and test-and-set, and there are several other variants on different architectures.
Most hardware platforms and processors provide such instructions so that
semaphores and other mutual exclusion variables can be implemented easily in the OS.
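As one possible illustration, here is a minimal sketch of wait and signal built on C11 atomic operations, including a compare-and-swap. This is a busy-waiting version written only to show the idea; a real OS semaphore would also put waiting processes to sleep instead of spinning.

    #include <stdatomic.h>

    typedef struct { atomic_int value; } semaphore;   /* e.g. initialized to 1 */

    void sem_wait_spin(semaphore *s) {     /* wait / P / down: the enter side */
        for (;;) {
            int v = atomic_load(&s->value);
            /* proceed only if S > 0, decrementing it in one atomic step */
            if (v > 0 && atomic_compare_exchange_weak(&s->value, &v, v - 1))
                return;                    /* break out: enter the critical section */
        }
    }

    void sem_signal(semaphore *s) {        /* signal / V / up: the exit side */
        atomic_fetch_add(&s->value, 1);    /* S++ performed atomically */
    }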
So, essentially, the wait would be
the enter function and the signal would be the exit function.
So, in the example of the bank,
if ATM1 and ATM2 were in the same operating system as each other,
then the code would look as follows.
They would share the semaphore variable S,
which would be initialized to one.
Then, before entering the critical section,
each ATM would call wait(S).
And then, after being done with the critical section,
each ATM would call signal on S.
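For a concrete single-machine illustration, here is a sketch using POSIX semaphores, with the two ATMs as threads in one operating system; sem_wait plays the role of wait(S) and sem_post plays the role of signal(S).

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t S;                  /* shared semaphore, initialized to 1 */
    static long balance = 1000;      /* shared bank balance                */

    static void *atm(void *arg) {
        (void)arg;
        sem_wait(&S);                /* wait(S): enter the critical section  */
        long b = balance;            /* obtain the bank balance              */
        b += 10000;                  /* add in the deposit                   */
        balance = b;                 /* update the bank amount               */
        sem_post(&S);                /* signal(S): exit the critical section */
        return NULL;
    }

    int main(void) {
        pthread_t atm1, atm2;
        sem_init(&S, 0, 1);          /* at most one ATM inside at a time */
        pthread_create(&atm1, NULL, atm, NULL);
        pthread_create(&atm2, NULL, atm, NULL);
        pthread_join(atm1, NULL);
        pthread_join(atm2, NULL);
        printf("final balance = %ld\n", balance);  /* now always 21000 */
        return 0;
    }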
So that was the case where processes can share variables within a single operating system,
but in a distributed system, where processes communicate by exchanging messages,
these solutions will not work.
So, in the next lecture,
we'll see a solution to the distributed mutual exclusion problem.