MULTIPLE PROCESSOR SYSTEMS
Figure 8-5. An omega switching network.
All the second-stage switches, including 2D, use the second bit for routing.
This, too, is a 1, so the message is now forwarded via the lower output to 3D.
Here the third bit is tested and found to be a 0.
Consequently, the message goes out on
the upper output and arrives at memory 110, as desired. The path followed by this
message is marked in Fig. 8-5 by the letter a.
As the message moves through the switching network, the bits at the left-hand
end of the module number are no longer needed. They can be put to good use by
recording the incoming line number there, so the reply can find its way back. For
example, the incoming lines are 0 (upper input to 1D), 1 (lower input to 2D), and 1
(lower input to 3D), respectively. The reply is routed back using 011, only reading
it from right to left this time.
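This routing scheme can be sketched in a few lines of code. The following is a minimal simulation of the 8x8 omega network of Fig. 8-5 (the perfect-shuffle wiring and the names here are modeling assumptions, not any particular machine's logic):

```python
def rotl3(x):
    """Rotate a 3-bit line label left by one: the perfect-shuffle
    wiring between stages of an 8x8 omega network."""
    return ((x << 1) | (x >> 2)) & 0b111

def omega_route(cpu, module, stages=3):
    """Follow a request from a CPU to a memory module, consuming one
    destination bit per stage (0 = upper output, 1 = lower output).
    Each consumed bit is replaced by the incoming-port bit, so the
    header that arrives is exactly the reply route, read right to left."""
    line, reply = cpu, 0
    for k in range(stages):
        line = rotl3(line)                 # ride the shuffle into the next stage
        reply = (reply << 1) | (line & 1)  # record upper (0) / lower (1) input port
        bit = (module >> (stages - 1 - k)) & 1
        line = (line & ~1) | bit           # exit on the upper or lower output
    return line, reply

# CPU 011 reaching memory 110, as in the text: it enters 1D on the
# upper input and 2D and 3D on the lower inputs, so the reply header is 011.
arrived, reply = omega_route(0b011, 0b110)
```

Running this for CPU 011 and module 110 yields an arrival line of 110 and a reply header of 011, matching the example in the text.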
At the same time all this is going on, CPU 001 wants to write a word to memo-
ry module 001.
An analogous process happens here, with the message routed via
the upper, upper, and lower outputs, respectively, marked by the letter b. On
arrival, the module number field reads 001, representing the path it took. Since these two
requests do not use any of the same switches, lines, or memory modules, they can
proceed in parallel.
Now consider what would happen if CPU 000 simultaneously wanted to access
memory module 000.
Its request would come into conflict with CPU 001’s request
at switch 3A.
One of them would then have to wait. Unlike the crossbar switch,
the omega network is a blocking network.
Not every set of requests can be proc-
essed simultaneously. Conflicts can occur over the use of a wire or a switch, as
well as between requests to memory and replies from memory.
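Switch conflicts can be detected mechanically by comparing the sets of switches two requests traverse. The sketch below is self-contained and assumes the same 8x8 perfect-shuffle topology as Fig. 8-5:

```python
def switch_path(cpu, module, stages=3):
    """The set of (stage, switch) pairs a request visits in an 8x8
    omega network; two requests conflict iff their paths intersect."""
    line, path = cpu, set()
    for k in range(stages):
        line = ((line << 1) | (line >> 2)) & 0b111   # perfect shuffle
        path.add((k + 1, line >> 1))                 # switch entered at this stage
        bit = (module >> (stages - 1 - k)) & 1
        line = (line & ~1) | bit                     # upper/lower output
    return path

# CPU 000 -> module 000 and CPU 001 -> module 001 collide at the first
# switch of the third stage (3A in Fig. 8-5), so one must wait:
clash = switch_path(0b000, 0b000) & switch_path(0b001, 0b001)
```

The two requests from the earlier example (011 to 110 and 001 to 001) produce disjoint paths, which is why they can proceed in parallel.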
Since it is highly desirable to spread the memory references uniformly across
the modules, one common technique is to use the low-order bits as the module
number. Consider, for example, a byte-oriented address space for a computer that
mostly accesses full 32-bit words. The 2 low-order bits will usually be 00, but the
next 3 bits will be uniformly distributed. By using these 3 bits as the module num-
ber, consecutive words will be in consecutive modules. A memory system in
which consecutive words are in different modules is said to be interleaved. Inter-
leaved memories maximize parallelism because most memory references are to
consecutive addresses. It is also possible to design switching networks that are
nonblocking and offer multiple paths from each CPU to each memory module to
spread the traffic better.
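The bit-selection step above is simple enough to show directly. This is a sketch of low-order interleaving under the stated assumptions (32-bit words, so a 2-bit byte offset, and 8 modules, so a 3-bit module number):

```python
def module_of(addr, word_bits=2, module_bits=3):
    """Low-order interleaving: skip the byte-within-word bits, then use
    the next bits of the address as the memory module number."""
    return (addr >> word_bits) & ((1 << module_bits) - 1)

# Consecutive word addresses 0, 4, 8, ... cycle through all 8 modules,
# so a run of sequential references can proceed in parallel.
spread = [module_of(a) for a in range(0, 32, 4)]   # -> [0, 1, 2, ..., 7]
```

After module 7 the sequence wraps around to module 0, so long sequential runs keep all modules equally busy.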
Single-bus UMA multiprocessors are generally limited to no more than a few
dozen CPUs, and crossbar or switched multiprocessors need a lot of (expensive)
hardware and are not that much bigger.
To get to more than 100 CPUs, something
has to give.
Usually, what gives is the idea that all memory modules have the same
access time. This concession leads to the idea of NUMA (NonUniform Memory Access) multiprocessors, as mentioned earlier.
Like their UMA cousins, they provide a single address space across
all the CPUs, but unlike the UMA machines, access to local memory modules is
faster than access to remote ones. Thus all UMA programs will run without change
on NUMA machines, but the performance will be worse than on a UMA machine.
NUMA machines have three key characteristics that all of them possess and
which together distinguish them from other multiprocessors:
1. There is a single address space visible to all CPUs.
2. Access to remote memory is via LOAD and STORE instructions.
3. Access to remote memory is slower than access to local memory.
When the access time to remote memory is not hidden (because there is no cach-
ing), the system is called NC-NUMA (Non Cache-coherent NUMA). When the
caches are coherent, the system is called CC-NUMA (Cache-Coherent NUMA).
A popular approach for building large CC-NUMA multiprocessors is the
directory-based multiprocessor. The idea is to maintain a database telling where
each cache line is and what its status is. When a cache line is referenced, the data-
base is queried to find out where it is and whether it is clean or dirty. Since the
database is queried on every instruction that touches memory, it must be kept in ex-
tremely fast special-purpose hardware that can respond in a fraction of a bus cycle.
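A toy software model may make the directory idea concrete. The states and layout below are illustrative assumptions, and a dictionary stands in for what a real machine keeps in dedicated hardware:

```python
from enum import Enum

class State(Enum):           # illustrative states; real protocols vary
    UNCACHED = "uncached"
    CLEAN = "clean"          # memory holds an up-to-date copy
    DIRTY = "dirty"          # exactly one cache holds the only valid copy

# cache-line number -> (state, node holding the line, if any)
directory = {}

def lookup(line):
    """Answer the two questions the text poses: where is the line,
    and is it clean or dirty?"""
    return directory.get(line, (State.UNCACHED, None))

directory[7] = (State.DIRTY, 4)   # e.g., node 4 has modified line 7
```

A hardware directory answers the same query in a fraction of a bus cycle; the software model only shows what information it must store.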
To make the idea of a directory-based multiprocessor somewhat more concrete,
let us consider as a simple (hypothetical) example, a 256-node system, each node
consisting of one CPU and 16 MB of RAM connected to the CPU via a local bus.
The total memory is 2³² bytes and it is divided up into 2²⁶ cache lines of 64 bytes
each. The memory is statically allocated among the nodes, with 0–16M in node 0,
16M–32M in node 1, etc. The nodes are connected by an interconnection network,