Another way to reduce bus traffic is to use the well-known Ethernet binary
exponential backoff algorithm (Anderson, 1990).
Instead of continuously polling,
as in Fig. 2-25, a delay loop can be inserted between polls. Initially the delay is one
instruction. If the lock is still busy, the delay is doubled to two instructions, then
four instructions, and so on up to some maximum.
A low maximum gives a fast
response when the lock is released, but wastes more bus cycles on cache thrashing.
A high maximum reduces cache thrashing at the expense of not noticing that the
lock is free so quickly. Binary exponential backoff can be used with or without the
pure reads preceding the
An even better idea is to give each CPU wishing to acquire the mutex its own
private lock variable to test, as illustrated in Fig. 8-11 (Mellor-Crummey and Scott,
1991). The variable should reside in an otherwise unused cache block to avoid
conflicts. The algorithm works by having a CPU that fails to acquire the lock allo-
cate a lock variable and attach itself to the end of a list of CPUs waiting for the
lock. When the current lock holder exits the critical region, it frees the private lock
that the first CPU on the list is testing (in its own cache).
This CPU then enters the
critical region. When it is done, it frees the lock its successor is using, and so on.
Although the protocol is somewhat complicated (to avoid having two CPUs attach
themselves to the end of the list simultaneously), it is efficient and starvation free.
For all the details, readers should consult the paper.
CPU 3 spins on this (private) lock
CPU 4 spins on this (private) lock
CPU 2 spins on this (private) lock
When CPU 1 is finished with the
real lock, it releases it and also
releases the private lock CPU 2
is spinning on
Use of multiple locks to avoid cache thrashing.
Spinning vs. Switching
So far we have assumed that a CPU needing a locked mutex just waits for it,
by polling continuously, polling intermittently, or attaching itself to a list of wait-
Sometimes, there is no alternative for the requesting CPU to just wait-
ing. For example, suppose that some CPU is idle and needs to access the shared
MULTIPLE PROCESSOR SYSTEMS
ready list to pick a process to run.
If the ready list is locked, the CPU cannot just
decide to suspend what it is doing and run another process, as doing that would re-
quire reading the ready list.
wait until it can acquire the ready list.
However, in other cases, there is a choice. For example, if some thread on a
CPU needs to access the file system buffer cache and it is currently locked, the
CPU can decide to switch to a different thread instead of waiting. The issue of
whether to spin or to do a thread switch has been a matter of much research, some
of which will be discussed below. Note that this issue does not occur on a uniproc-
essor because spinning does not make much sense when there is no other CPU to
release the lock.
If a thread tries to acquire a lock and fails, it is always blocked to
give the lock owner a chance to run and release the lock.
Assuming that spinning and doing a thread switch are both feasible options,
the trade-off is as follows. Spinning wastes CPU cycles directly.
Testing a lock re-
peatedly is not productive work. Switching, however, also wastes CPU cycles,
since the current thread’s state must be saved, the lock on the ready list must be ac-
quired, a thread must be selected, its state must be loaded, and it must be started.
Furthermore, the CPU cache will contain all the wrong blocks, so many expensive
cache misses will occur as the new thread starts running.
TLB faults are also like-
Eventually, a switch back to the original thread must take place, with more
cache misses following it. The cycles spent doing these two context switches plus
all the cache misses are wasted.
If it is known that mutexes are generally held for, say, 50
sec and it takes 1
msec to switch from the current thread and 1 msec to switch back later, it is more
efficient just to spin on the mutex. On the other hand, if the average mutex is held
for 10 msec, it is worth the trouble of making the two context switches. The trouble
is that critical regions can vary considerably in their duration, so which approach is
One design is to always spin.
A second design is to always switch.
But a third
design is to make a separate decision each time a locked mutex is encountered. At
the time the decision has to be made, it is not known whether it is better to spin or
switch, but for any given system, it is possible to make a trace of all activity and
analyze it later offline. Then it can be said in retrospect which decision was the
best one and how much time was wasted in the best case. This hindsight algorithm
then becomes a benchmark against which feasible algorithms can be measured.
This problem has been studied by researchers for decades (Ousterhout, 1982).
Most work uses a model in which a thread failing to acquire a mutex spins for
some period of time.
If this threshold is exceeded, it switches.
In some cases the
threshold is fixed, typically the known overhead for switching to another thread
and then switching back.
In other cases it is dynamic, depending on the observed
history of the mutex being waited on.
The best results are achieved when the system keeps track of the last few
observed spin times and assumes that this one will be similar to the previous ones.
For example, assuming a 1-msec context switch time again, a thread will spin for a