The following is a transcript of a talk I gave at the 2016 UTG Lightning Talks at U of T, along with some references and notes. Leave clarifying questions in the comments section!


Unfortunately, “distributed computing” is one of the most buzzword-laden fields in computer science, to the point where even the Wikipedia article is wrong (or, at least, too specific).

Definitions¹

Distributed system: Any collection of two or more computing devices (called processes) that can communicate. Examples include:

  • Multicore laptops

  • LAN parties

  • The internet

  • People? (It’s worth considering…)

Distributed computing is the study of how devices in such a system coordinate their actions.

Be careful not to confuse it with parallel computing, which studies how to break up a job (like matrix multiplication) into smaller ones that can be executed in parallel by processes in a distributed system. In that field, the system tends to be tightly coupled (i.e. highly synchronous and spatially close together), with few or no failures.

Distributed computing focuses on loosely coupled systems of processes operating independently, which need to communicate because they are:

  • Competing for resources, e.g. memory or a printer, or

  • Trying to achieve fault-tolerance or availability, e.g. databases.

A rule of thumb: adding processes to the former makes the problem harder, while adding them to the latter makes it easier.

Models

Because the definition is so broad, we need to be precise about our formal model, i.e. our assumptions about:

  • Synchrony

  • Fault model

  • Communication method

Synchrony

I will be talking about an asynchronous model, which means that “the absolute and even relative times at which events take place cannot be known precisely”². That is, every process can take an arbitrarily long (or short) time to perform its next step. We do assume, however, that no two events occur simultaneously.

Hence if we have two processes, each executing a sequence of N events, we need to verify that every possible way of interleaving those events behaves correctly.³
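As a quick aside (not part of the talk itself): two processes with N totally ordered events each can be interleaved in C(2N, N) ways, since an interleaving just chooses which N of the 2N global positions belong to the first process. A few lines of Python show how fast this grows:

from math import comb

# Number of interleavings of two sequences of N events each.
for N in (2, 5, 10, 20):
    print(N, comb(2 * N, N))
# 2 6
# 5 252
# 10 184756
# 20 137846528820

Exhaustively checking every interleaving quickly becomes infeasible, which is why we reason about correctness under all possible schedules rather than testing them one by one.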

In practice we can sometimes make more assumptions about synchrony, like assuming a delay of at most a minute between events. But this is theory, not practice :)

Fault model

This asks, “how many processes can fail, and how do they behave when they do?”

I’m going to assume that processes fail by crashing, i.e. they can, without warning, stop executing steps entirely. This is a big problem if our model is also asynchronous, since we won’t be able to tell whether any process has crashed or is merely very slow. This is more formally described in the famous FLP paper⁴.

Another case worth mentioning is Byzantine faults, where some processes do not behave according to their instructions. This might be due to malice (people trying to take advantage of the system) or accident. My favorite instance of the latter is when a single corrupted bit in an AWS network packet (used for synchronizing internal state) made their servers unresponsive for 8 hours.

Communication method

Message-passing: Processes communicate only by sending messages.

Shared memory: Processes communicate only by reading and writing to shared memory locations.

We’ll be considering the latter, but there are many cases of hybrid models supporting both types of communication.
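As an aside, here’s a minimal sketch of the two interfaces to pin down the vocabulary; the Register class and the names are illustrative, not from any particular library:

import queue

# Message passing: processes share nothing; they communicate only by
# sending and receiving messages over channels.
channel = queue.Queue()
channel.put("request from A")    # process A sends
msg = channel.get()              # process B receives

# Shared memory: processes communicate only through shared locations
# that support read() and write(new).
class Register:
    def __init__(self, value=0):
        self.value = value
    def read(self):
        return self.value
    def write(self, new):
        self.value = new

X = Register(0)
X.write(X.read() + 1)            # any process may read or write X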

Our model

To reiterate, our model will be:

  • Asynchronous (never know when an event will occur)

  • Crash failures (processes stop executing without warning)

  • Shared memory (can only communicate by reading and writing to memory)

An example

Consider two processes with a shared memory location X initialized to 0.

If we want each process to increment X 20 times, we’d probably write code like this:

for i in range(20):
    X = X + 1

To see whether this is correct in our asynchronous model, we need to consider all the different ways the two executions can be interleaved. This program should be fine because it doesn’t matter what order we increment in, right? The catch is that in our model X is a shared register: a shared memory location on which processes can only perform read() and write(new) operations. Consequently the program actually compiles down to:

for i in range(20):
    tmp = X.read()
    X.write(tmp + 1)

where tmp is a process-local variable that can’t be read by other processes. This is fine in a single-process program, but it means our code won’t work in an asynchronous model, because some of the increment operations can “get lost”:

Process A                   Process B
---------                   ---------
X.read() // tmp_A == 0
                            X.read() // tmp_B == 0 
X.write(1)
                            X.write(1)

If you’re thinking of using a lock here, that won’t work, because our algorithm must be fault-tolerant! For those who don’t know, a lock is a particular way of achieving mutual exclusion, i.e. making sure that only one process can execute a particular piece of code at a time. So if process B is performing these increment steps, then process A cannot proceed until B is finished. The problem, of course, is: what if B crashes between these two steps? Or what if B simply takes a really long time, say because it has to go handle some other task for a while? Then A has to wait on it. Hence even without crash failures, the asynchronous model makes this solution undesirable.

[In practice, for tightly coupled systems like a laptop, locks are significantly faster than alternatives. But they’re simply not fault-tolerant.]
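For contrast, here’s roughly what the lock-based version looks like, sketched with Python’s threading.Lock. It produces the right count, but if one thread stopped forever while holding the lock, the other would wait indefinitely, which is exactly the fault-tolerance problem described above.

import threading

X = 0
lock = threading.Lock()

def worker():
    global X
    for i in range(20):
        with lock:         # only one thread at a time may be between
            X = X + 1      # the read and the write

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(X)                   # 40, but only because neither thread crashed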

One solution (there are more!)

Every modern CPU supports a handful of special atomic instructions which are more expensive to execute, but behave as though they took place instantaneously. Hence all other events take place either before or after the instruction. Revisiting our incorrect code from before, we can replace the two-step increment operation with fetch_and_increment, which atomically increments a shared variable and returns its previous value.

for i in range(20):
    fetch_and_increment(X)

Now our code is fault-tolerant, since a crashed process will not prevent the other one from doing work, and it works in the asynchronous model because the final value of X is the same no matter how the fetch_and_increment operations are interleaved. In fact, this algorithm works for any number of processes.
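Python has no built-in fetch_and_increment, so if you want to play with this, here’s one way to model the abstraction. The internal lock only simulates the atomicity that real hardware provides in a single instruction (for example, lock xadd on x86); it is not part of the algorithm being modelled.

import threading

class AtomicCounter:
    """Models a shared variable with an atomic fetch_and_increment."""
    def __init__(self, value=0):
        self._value = value
        self._sim = threading.Lock()   # stands in for hardware atomicity

    def fetch_and_increment(self):
        """Atomically add 1 and return the previous value."""
        with self._sim:
            old = self._value
            self._value = old + 1
            return old

    def read(self):
        return self._value

X = AtomicCounter()

def worker():
    for i in range(20):
        X.fetch_and_increment()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(X.read())                        # 40: no increments are lost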

Distributed data structures

My research was on manual memory reclamation schemes for dynamic distributed data structures like linked lists and trees, where nodes can be inserted and removed. The problem comes down to figuring out when:

  1. A given node is no longer reachable from the data structure, and

  2. No other processes are currently reading it.

The last process to use a node should be the one to free its memory. If we don’t take care of (2), then process A might free a node’s memory while process B is still trying to access it, for example reading its value or using it to traverse the data structure. That would result in an error at best and a corrupted data structure at worst. It turns out that (1) can be addressed with ordinary reference counting (each node stores the number of nodes pointing to it), but (2) is more difficult.
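To illustrate point (1) only, here’s a toy sketch of ordinary reference counting on list nodes. To be clear, this is not the reclamation scheme from my research, and it says nothing about point (2), i.e. processes that are still reading the node.

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.ref_count = 0        # number of nodes pointing at this one

def insert_after(pred, node):
    node.next = pred.next         # the old successor is now pointed to by
                                  # node instead of pred: its count is unchanged
    node.ref_count += 1           # pred now points at node
    pred.next = node

def remove_after(pred):
    """Unlink pred's successor and report whether point (1) holds."""
    node = pred.next
    pred.next = node.next
    node.ref_count -= 1           # pred no longer points at node
    if node.ref_count == 0:
        return node               # unreachable from the structure (point 1);
                                  # point (2) must still be checked before
                                  # the memory is actually freed
    return None

head = Node("head")
a = Node("a")
insert_after(head, a)             # a.ref_count == 1
victim = remove_after(head)       # victim is a, now with ref_count == 0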

We might try to solve this by giving each node an attribute that counts the number of processes that want to access it; I’ll call this a process count. But here we need to be careful: suppose a process finds the address of a node and wants to access it. It can’t simply increment that node’s process count, because that would be circular: we can’t increment a count on a node that we don’t yet have guaranteed access to. The trick, identified by Hyonho Lee, is to associate a proactive count with each reference to a node in the data structure. If we can read the address and increment that count in a single step (which is possible using an atomic instruction, as before), then the node is safe to access. When the reference is changed (e.g. a linked list deleting a node), we transfer the proactive count onto the node’s process count. This way, once the node has been removed from the data structure, the process count tells us how many processes are still accessing it, and we know it isn’t safe to free the memory until that count falls to 0.

However, most of these data structures already use special atomic operations to change their references; the problem becomes supporting both the read-and-increment operation I just described and the operation used to update the reference itself.

There’s an excellent and terribly interesting way of getting this to work, but unfortunately I’m all out of time. You can find an extended abstract with more details here.

Footnotes

  1. All these definitions can be found in the following textbook:

    Hagit Attiya and Jennifer Welch. Distributed Computing: Fundamentals, Simulations, and Advanced Topics. 2nd Edition.

  2. Attiya, Chapter 1.

  3. By the way, I’m going to assume sequential consistency throughout - if you don’t know what it is then you’re probably already assuming it’s true, so don’t worry about it. But if you want to start writing distributed algorithms yourself, check out Jeff Preshing’s amazing blog after finishing this post.

  4. Fischer, Lynch, and Paterson. Impossibility of Distributed Consensus with One Faulty Process.