M. Campbell et al. / Artificial Intelligence 134 (2002) 57–83
chips communicate with their host node via a Micro Channel® bus. This heterogeneous
architecture has a strong influence on the parallel search algorithm used in Deep Blue, as
discussed below.
6.1. Parallel search algorithm
To characterize the parallel search algorithm used in Deep Blue, we will use the
taxonomy given in [5].
1. Processor hierarchy. Deep Blue uses a static processor tree, with one SP node
controlling the other 29 nodes, which in turn control 16 chess chips each. The static
nature of the hierarchy is in part determined by the fact that the chess chips are not
general purpose processors and can only act as slaves. In addition, the chess chips
can only communicate directly with their host node.
2. Control distribution. Deep Blue uses a centralized control of the parallel search.
The control is managed on the SP nodes, since the chess chips do not have this
functionality.
3. Parallelism possible. Deep Blue permits parallelism under the following conditions:^17
(a) Type 1 (PV) nodes. After the first move has been examined at a PV node, all
the alternatives may be examined in parallel (with an offset window to enable
the selective search algorithm). Null move searches, used to generate threat
information for the selective search, can also be carried out in parallel. Finally,
PV nodes at the hardware/software boundary can be searched in parallel. Because
the hardware search allows only null-window searches, a number of searches are
required to determine an exact score within a window.
(b) Good type 2 nodes (nodes where the first move “fails high”, or exceeds
expectations). Most parallel algorithms do not allow or need parallelism in this
case. Deep Blue executes reduced depth offset searches as well as the null move
searches in parallel after the first move fails high. As above, these searches
generate information for use by the selective search.
(c) Bad type 2 nodes (nodes where the fail high move is not searched first). The
moves after the first are searched in parallel. After a fail high move is found, all
the alternatives are searched in parallel with a reduced depth offset search. Null
move searches also can execute in parallel at this point.
(d) Type 3 nodes (nodes where all the moves fail low). These moves can all be
searched in parallel.
4. Synchronization. Type 1 and type 2 nodes are synchronization points for Deep Blue.
The first move must be evaluated before parallelism is allowed. There are also global
synchronization points at the end of iterations.
^17 The terminology of node types used in this section was originally defined in [19].
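The observation under type 1 nodes that the hardware allows only null-window searches can be made concrete: an exact score inside a window can be pinned down by a sequence of fail-high/fail-low probes. The sketch below is a hypothetical bisection over integer scores, not Deep Blue's actual procedure; the probe function `fails_high` is an assumed interface that reports whether the true score exceeds a given bound.

```python
def exact_score(fails_high, alpha, beta):
    """Resolve an exact integer score s with alpha < s <= beta using only
    null-window probes. fails_high(a) must return True iff s > a.
    Illustrative bisection sketch, not the Deep Blue implementation."""
    lo, hi = alpha, beta          # invariant: lo < s <= hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails_high(mid):       # probe with null window (mid, mid + 1)
            lo = mid              # score lies above mid
        else:
            hi = mid              # score lies at or below mid
    return hi
```

In the worst case this takes O(log(beta - alpha)) probes, which is why a "number of searches" is needed per exact score, as noted above.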
6.2. Parallel search implementation
The early iterations of the Deep Blue parallel search are carried out on the master node.
There is not much parallelism in the first few iterations, and the master is fast enough (it
has 16 chess chips) that there is little to be gained by attempting to further parallelize the
search.
As the search gets deeper, jobs get allocated throughout the system. There are three
major issues that need to be addressed:
• Load balancing. The search extensions algorithm used in Deep Blue leads to widely
varying tree sizes for a given search depth. This extends all the way to the hardware,
where the complex quiescence search can cause a search to “blow up”. This can
lead to severe load balancing problems. The solution used in Deep Blue was to
abort long-running hardware searches (more than 8000 nodes) and push more of the
search into software. This gives additional opportunities for parallelism. Similarly,
jobs on the worker nodes can abort and return their job to the master for further
splitting.
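The abort-and-split policy for long-running hardware searches can be sketched as follows. The node budget of 8000 is taken from the text; the traversal and the `expand` interface (a function mapping a position to its child positions) are illustrative assumptions, not the chess-chip protocol.

```python
class SearchAborted(Exception):
    """Raised when a search exceeds its node budget; the caller is then
    expected to split the work and push it back into software."""

def limited_search(expand, root, node_limit=8000):
    """Depth-first walk that counts visited nodes and gives up past
    node_limit, mirroring the abort policy described above.
    Illustrative sketch; `expand` is a hypothetical interface."""
    visited = 0
    stack = [root]
    while stack:
        pos = stack.pop()
        visited += 1
        if visited > node_limit:
            raise SearchAborted   # blow-up detected: return work to the master
        stack.extend(expand(pos))
    return visited
```

The same pattern applies one level up: a worker node that detects a blown-up job returns it to the master for further splitting rather than finishing it alone.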
• Master overload. The performance bottleneck of the Deep Blue system was the master processor. One method of alleviating this was to ensure that each worker always had a job “on deck”, ready to execute when the worker completes its active job. This reduces the effect of the communication latency between master and workers.
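The "on deck" arrangement amounts to a two-slot queue per worker: one active job plus one spare, so finishing a job never stalls on a master round-trip. This is a minimal sketch under that assumption; the class and method names are hypothetical.

```python
from collections import deque

class Worker:
    """Holds the active job plus one spare ('on deck') so the next job
    can start before the master responds. Illustrative sketch only."""
    def __init__(self):
        self.jobs = deque(maxlen=2)   # slot 0: active, slot 1: on deck

    def wants_work(self):
        """True while the master should send another job."""
        return len(self.jobs) < 2

    def assign(self, job):
        self.jobs.append(job)

    def finish_active(self):
        """Complete the active job; the on-deck job becomes active."""
        return self.jobs.popleft()
```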
• Sharing between nodes. Workers do not directly communicate with each other. This
decision was made in order to simplify the implementation. Workers will generally
pass their “expensive” transposition table results up to the master.
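Forwarding only “expensive” transposition-table results keeps the master from being flooded with cheap entries. One plausible filter, sketched below, is a depth cutoff: only results that would be costly to recompute are shipped upward. The entry format and the cutoff value are assumptions for illustration; the paper does not specify the filtering criterion.

```python
def expensive_entries(table, min_depth=6):
    """Select transposition-table entries worth forwarding to the master:
    only deep (expensive-to-recompute) results pass the filter.
    Entry layout and depth cutoff are illustrative assumptions."""
    return {key: entry for key, entry in table.items()
            if entry["depth"] >= min_depth}
```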
As a final point, it should be noted that the Deep Blue parallel search is non-
deterministic. Various factors can influence timing and processor job assignments.
Although this was not a major concern, it makes debugging the system much more difficult.
6.3. Parallel search performance
We have limited experimental results to assess the efficiency of the Deep Blue
parallel search. The most accurate numbers were derived on a single-node version of
Deep Blue with 24 chess chips. We compared the 24 chip system with a single chip
system on a variety of positions. The results varied widely depending on the tactical
complexity of the position searched. For positions with many deep forcing sequences
speedups averaged about 7, for an observed efficiency of about 30%. For quieter positions,
speedups averaged 18, for an observed efficiency of 75%. The non-deterministic nature
of the search, particularly in tactical positions, makes it more difficult to conduct these
measurements.
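The efficiency figures above follow directly from the standard definition, efficiency = speedup / number of processors, applied to the 24-chip single-node system:

```python
def observed_efficiency(speedup, n_chips):
    """Parallel efficiency: achieved speedup divided by processor count."""
    return speedup / n_chips

# Figures quoted above for the 24-chip single-node measurements:
tactical = observed_efficiency(7, 24)   # ~0.29, i.e. about 30%
quiet = observed_efficiency(18, 24)     # 0.75, i.e. 75%
```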
It is difficult to assess the efficiency of the full 30-node Deep Blue system. Indirect
evidence suggests an overall observed efficiency of about 8% in tactical positions and about
12% in quiet positions. It is clear that there is room for improvement here. However it was
a conscious design decision of the Deep Blue team to focus on improving the evaluation
function following the 1996 Kasparov match, and the parallel search code was largely
untouched between the 1996 and 1997 matches.