We want really good accuracy properties here,
which was lacking in the all to all variant of heartbeating.
So here is how gossip-style failure detection works
in a nutshell.
Here I have an example with, uh, four processes, uh,
marked as 1, 2, 3, and 4.
Every process maintains a table.
Let's look at the table for 1,
and the table for 1 shows, uh, four entries
for each of the four processes in the system.
There are, uh, so, four rows.
There are three columns.
The first column is the address and the process
which is basically the ID.
The second is the heartbeat counter received
from that corresponding process.
Third is the local time at which
that heartbeat counter was last updated.
So, for instance, at process 1,
the row for 2 says that the last heartbeat that was received
from process 2 at process 1 was numbered 10, 10, 3,
and it was received when the local time at process 1 was 62.
The local time is needed essentially
because times may be unsynchronized
across different processes
as we have discussed before,
and this is an asynchronous system after all,
and so we wanna be using the local time
to mark when the heartbeat was last updated.
Periodically, each process sends over this entire table
to a few of its neighbors selected at random.
Okay, this is known as, uh, gossip.
So nodes are processes
periodically gossip their membership list.
So, for instance, if 1 selects 2 at random
and sends it its membership list,
say 2's membership looks like this, right.
What 2 does when it receives 1's membership list
is that it merges it into its own list.
It merges it row by row;
so for instance, it looks at row number 1
and it says that hey, my, uh, latest for heartbeat
for, uh, 1 is 10,118,
but I am receiving a heartbeat for 1 which is 10,120,
which is later,
and so I am going to update that, uh, heartbeat
to be 10,120.
Also I am going to update the, uh, time
component of that particular row
from 64 to my current local time
which happens to be 70 at this node.