`
Parallel Algorithms

Guy E. Blelloch
Carnegie Mellon University

Bruce M. Maggs
Duke University and Akamai Technologies

25.1  Introduction
25.2  Modeling Parallel Computations
      Multiprocessor Models • Work-Depth Models • Assigning Costs to Algorithms • Emulations among Models • Model Used in This Chapter
25.3  Parallel Algorithmic Techniques
      Divide-and-Conquer • Randomization • Parallel Pointer Techniques • Other Techniques
25.4  Basic Operations on Sequences, Lists, and Trees
      Sums • Scans • Multiprefix and Fetch-and-Add • Pointer Jumping • List Ranking • Removing Duplicates
25.5  Graphs
      Graphs and Graph Representations • Breadth-First Search • Connected Components
25.6  Sorting
      QuickSort • Radix Sort
25.7  Computational Geometry
      Closest Pair • Planar Convex Hull
25.8  Numerical Algorithms
      Matrix Operations • Fourier Transform
25.9  Research Issues and Summary
25.10 Further Information
Defining Terms
References

`25.1 Introduction
`
`The subject of this chapter is the design and analysis of parallel algorithms. Most of today’s algorithms
`are sequential, that is, they specify a sequence of steps in which each step consists of a single operation.
`These algorithms are well suited to today’s computers, which basically perform operations in a
`sequential fashion. Although the speed at which sequential computers operate has been improving
`at an exponential rate for many years, the improvement is now coming at greater and greater cost.
`As a consequence, researchers have sought more cost-effective improvements by building “parallel”
`computers—computers that perform multiple operations in a single step. In order to solve a problem
`efficiently on a parallel computer, it is usually necessary to design an algorithm that specifies multiple
`operations on each step, i.e., a parallel algorithm.
`As an example, consider the problem of computing the sum of a sequence A of n numbers.
`The standard algorithm computes the sum by making a single pass through the sequence, keeping
a running sum of the numbers seen so far. It is not difficult, however, to devise an algorithm for
`computing the sum that performs many operations in parallel. For example, suppose that, in parallel,
`each element of A with an even index is paired and summed with the next element of A, which has
`an odd index, i.e., A[0] is paired with A[1], A[2] with A[3], and so on. The result is a new sequence
of ⌈n/2⌉ numbers that sum to the same value as the sum that we wish to compute. This pairing and
summing step can be repeated until, after ⌈log2 n⌉ steps, a sequence consisting of a single value is
`produced, and this value is equal to the final sum.
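To make the structure concrete, here is a minimal Python sketch of this pairing-and-summing scheme (the function name and the sequential loop are illustrative; on a parallel machine all additions within a round would execute in a single step):

```python
def pairwise_sum(a):
    """Sum a nonempty sequence by repeated pairwise combination.

    Each round pairs a[2i] with a[2i+1]; the pairs within a round are
    independent of one another, so a parallel machine could compute all
    of them simultaneously. After ceil(log2 n) rounds one value remains.
    """
    while len(a) > 1:
        # Every iteration of this comprehension is independent -- this is
        # the parallelism in the algorithm, even though Python happens to
        # evaluate it sequentially here.
        a = [a[i] + a[i + 1] if i + 1 < len(a) else a[i]
             for i in range(0, len(a), 2)]
    return a[0]

print(pairwise_sum(list(range(1, 17))))  # 136, after 4 rounds
```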
`The parallelism in an algorithm can yield improved performance on many different kinds of com-
`puters. For example, on a parallel computer, the operations in a parallel algorithm can be performed
`simultaneously by different processors. Furthermore, even on a single-processor computer the paral-
`lelism in an algorithm can be exploited by using multiple functional units, pipelined functional units,
`or pipelined memory systems. Thus, it is important to make a distinction between the parallelism in
`an algorithm and the ability of any particular computer to perform multiple operations in parallel.
`Of course, in order for a parallel algorithm to run efficiently on any type of computer, the algorithm
`must contain at least as much parallelism as the computer, for otherwise resources would be left
`idle. Unfortunately, the converse does not always hold: some parallel computers cannot efficiently
`execute all algorithms, even if the algorithms contain a great deal of parallelism. Experience has
`shown that it is more difficult to build a general-purpose parallel computer than a general-purpose
`sequential computer.
`The remainder of this chapter consists of nine sections. We begin in Section 25.2 with a discus-
`sion of how to model parallel computers. Next, in Section 25.3 we cover some general techniques
`that have proven useful in the design of parallel algorithms. Sections 25.4 through 25.8 present
`algorithms for solving problems from different domains. We conclude in Section 25.9 with a dis-
`cussion of current research topics, a collection of defining terms, and finally sources for further
`information.
`Throughout this chapter, we assume that the reader has some familiarity with sequential algorithms
`and asymptotic notation and analysis.
`
`25.2 Modeling Parallel Computations
`
`The designer of a sequential algorithm typically formulates the algorithm using an abstract model
of computation called the random-access machine (RAM) model ([2], Chapter 1). In this model, the
`machine consists of a single processor connected to a memory system. Each basic CPU operation,
`including arithmetic operations, logical operations, and memory accesses, requires one time step.
`The designer’s goal is to develop an algorithm with modest time and memory requirements. The
`RAM model allows the algorithm designer to ignore many of the details of the computer on which
`the algorithm will ultimately be executed, but captures enough detail that the designer can predict
`with reasonable accuracy how the algorithm will perform.
`Modeling parallel computations is more complicated than modeling sequential computations
`because in practice parallel computers tend to vary more in organization than do sequential com-
`puters. As a consequence, a large portion of the research on parallel algorithms has gone into
`the question of modeling, and many debates have raged over what the “right” model is, or about
`how practical various models are. Although there has been no consensus on the right model, this
`research has yielded a better understanding of the relationship between the models. Any discussion
`of parallel algorithms requires some understanding of the various models and the relationships
`among them.
`In this chapter we divide parallel models into two classes: multiprocessor models and work-depth
`models. In the remainder of this section we discuss these two classes and how they are related.
`
`25.2.1 Multiprocessor Models
`A multiprocessor model is a generalization of the sequential RAM model in which there is more
`than one processor. Multiprocessor models can be classified into three basic types: local-memory
`machine models, modular memory machine models, and parallel random-access machine (PRAM)
`models. Figure 25.1 illustrates the structure of these machine models. A local-memory machine
`model consists of a set of n processors each with its own local-memory. These processors are
`attached to a common communication network. A modular memory machine model consists of
`m memory modules and n processors all attached to a common network. An n-processor PRAM
`model consists of a set of n processors all connected to a common shared memory [32,37,38,77].
`The three types of multiprocessors differ in the way that memory can be accessed. In a local-
`memory machine model, each processor can access its own local memory directly, but can access
`the memory in another processor only by sending a memory request through the network. As in the
`RAM model, all local operations, including local memory accesses, take unit time. The time taken
`to access the memory in another processor, however, will depend on both the capabilities of the
`communication network and the pattern of memory accesses made by other processors, since these
`other accesses could congest the network. In a modular memory machine model, a processor accesses
`the memory in a memory module by sending a memory request through the network. Typically
`the processors and memory modules are arranged so that the time for any processor to access any
memory module is roughly uniform. As in a local-memory machine model, the exact amount of
time depends on the communication network and the memory access pattern. In a PRAM model,
a processor can access any word of memory in a single step. Furthermore, these accesses can occur
in parallel, i.e., in a single step, every processor can access the shared memory.

[FIGURE 25.1 The three types of multiprocessor machine models: (a) a local-memory machine model; (b) a modular memory machine model; and (c) a PRAM model.]
`The PRAM models are controversial because no real machine lives up to its ideal of unit-time access
`to shared memory. It is worth noting, however, that the ultimate purpose of an abstract model is not
`to directly model a real machine, but to help the algorithm designer produce efficient algorithms.
`Thus, if an algorithm designed for a PRAM model (or any other model) can be translated to an
`algorithm that runs efficiently on a real computer, then the model has succeeded. In Section 25.2.4
`we show how an algorithm designed for one parallel machine model can be translated so that it
`executes efficiently on another model.
`The three types of multiprocessor models that we have defined are broad and allow for many
`variations. The local-memory machine models and modular memory machine models may differ
`according to their network topologies. Furthermore, in all three types of models, there may be
`differences in the operations that the processors and networks are allowed to perform. In the
`remainder of this section we discuss some of the possibilities.
`
`25.2.1.1 Network Topology
`A network is a collection of switches connected by communication channels. A processor or memory
`module has one or more communication ports that are connected to these switches by communi-
`cation channels. The pattern of interconnection of the switches is called the network topology. The
`topology of a network has a large influence on the performance and also on the cost and difficulty of
`constructing the network. Figure 25.2 illustrates several different topologies.
`The simplest network topology is a bus. This network can be used in both local-memory machine
`models and modular memory machine models. In either case, all processors and memory modules
`are typically connected to a single bus. In each step, at most one piece of data can be written onto the
`bus. This data might be a request from a processor to read or write a memory value, or it might be the
`response from the processor or memory module that holds the value. In practice, the advantage of
using a bus is that it is simple to build and, because all processors and memory modules can observe
the traffic on the bus, it is relatively easy to develop protocols that allow processors to cache memory
values locally. The disadvantage of using a bus is that the processors have to take turns accessing the
bus. Hence, as more processors are added to a bus, the average time to perform a memory access
grows proportionately.

[FIGURE 25.2 (a) Bus, (b) two-dimensional mesh, and (c) hypercube network topologies.]
`A two-dimensional mesh is a network that can be laid out in a rectangular fashion. Each switch
`in a mesh has a distinct label (x, y) where 0 ≤ x ≤ X − 1 and 0 ≤ y ≤ Y − 1. The values X and Y
`determine the length of the sides of the mesh. The number of switches in a mesh is thus X · Y. Every
`switch, except those on the sides of the mesh, is connected to four neighbors: one to the north, one
`to the south, one to the east, and one to the west. Thus, a switch labeled (x, y), where 0 < x < X − 1
`and 0 < y < Y − 1, is connected to switches (x, y + 1), (x, y − 1), (x + 1, y), and (x − 1, y). This
`network typically appears in a local-memory machine model, i.e., a processor along with its local
`memory is connected to each switch, and remote memory accesses are made by routing messages
`through the mesh. Figure 25.2b shows an example of an 8 × 8 mesh.
`Several variations on meshes are also popular, including three-dimensional meshes, toruses, and
`hypercubes. A torus is a mesh in which the switches on the sides have connections to the switches
on the opposite sides. Thus, every switch (x, y) is connected to four other switches: (x, y + 1 mod Y),
`(x, y − 1 mod Y), (x + 1 mod X, y), and (x − 1 mod X, y). A hypercube is a network with 2n switches
`in which each switch has a distinct n-bit label. Two switches are connected by a communication
`channel in a hypercube if and only if the labels of the switches differ in precisely one bit position. A
`hypercube with 16 switches is shown in Figure 25.2c.
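The neighbor relations just described reduce to a few lines of code. A small sketch (the function names are ours; the labels follow the text):

```python
def torus_neighbors(x, y, X, Y):
    """Neighbors of switch (x, y) in an X-by-Y torus: a mesh whose side
    switches wrap around to the opposite side."""
    return [(x, (y + 1) % Y), (x, (y - 1) % Y),
            ((x + 1) % X, y), ((x - 1) % X, y)]

def hypercube_neighbors(label, n):
    """Neighbors of a switch in a hypercube with 2**n switches: the labels
    that differ from `label` in exactly one bit position."""
    return [label ^ (1 << bit) for bit in range(n)]

print(torus_neighbors(0, 0, 8, 8))  # wraps to (0, 7) and (7, 0)
print([format(v, "04b") for v in hypercube_neighbors(0b0000, 4)])
# ['0001', '0010', '0100', '1000']
```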
`A multistage network is used to connect one set of switches called the input switches to another set
`called the output switches through a sequence of stages of switches. Such networks were originally
`designed for telephone networks [15]. The stages of a multistage network are numbered 1 through
`L, where L is the depth of the network. The switches on stage 1 are the input switches, and those on
`stage L are the output switches. In most multistage networks, it is possible to send a message from any
`input switch to any output switch along a path that traverses the stages of the network in order from
`1 to L. Multistage networks are frequently used in modular memory computers; typically processors
`are attached to input switches, and memory modules are attached to output switches. A processor
`accesses a word of memory by injecting a memory access request message into the network. This
`message then travels through the network to the appropriate memory module. If the request is to
`read a word of memory, then the memory module sends the data back through the network to
`the requesting processor. There are many different multistage network topologies. Figure 25.3a, for
`example, shows a depth-2 network that connects 4 processors to 16 memory modules. Each switch in
`this network has two channels at the bottom and four channels at the top. The ratio of processors to
`memory modules in this example is chosen to reflect the fact that, in practice, a processor is capable
`of generating memory access requests faster than a memory module is capable of servicing them.
`A fat-tree is a network structured like a tree [56]. Each edge of the tree, however, may represent
`many communication channels, and each node may represent many network switches (hence the
name “fat”). Figure 25.3b shows a fat-tree with the overall structure of a binary tree. Typically the
capacities of the edges near the root of the tree are much larger than the capacities near the leaves.
For example, in this tree the two edges incident on the root represent 8 channels each, while the
edges incident on the leaves represent only 1 channel each. A natural way to construct a local-
memory machine model is to connect a processor along with its local memory to each leaf of the
fat-tree. In this scheme, a message from one processor to another first travels up the tree to the least
common ancestor of the two processors, and then down the tree.

[FIGURE 25.3 (a) 2-level multistage network and (b) fat-tree network topologies.]
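Under the simplifying assumption that the fat-tree is a complete binary tree with leaves labeled 0, 1, ..., 2^h − 1, the least common ancestor of two leaves is determined by their highest differing bit, so the length of a message's route is easy to compute. A hypothetical sketch (the labeling convention and function name are ours):

```python
def fat_tree_route_length(src, dst):
    """Edges traversed by a message between leaves src and dst: climb to
    the least common ancestor, then descend the same number of levels.
    The LCA's height is the index of the highest differing bit plus one."""
    if src == dst:
        return 0
    up = (src ^ dst).bit_length()  # levels climbed to reach the LCA
    return 2 * up

print(fat_tree_route_length(0b000, 0b001))  # siblings: 2 edges
print(fat_tree_route_length(0b000, 0b111))  # opposite halves: 6 edges (via root)
```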
`Many algorithms have been designed to run efficiently on particular network topologies such as
the mesh or the hypercube. For an extensive treatment of such algorithms, see [55,67,73,80]. Although
`this approach can lead to very fine-tuned algorithms, it has some disadvantages. First, algorithms
`designed for one network may not perform well on other networks. Hence, in order to solve a
`problem on a new machine, it may be necessary to design a new algorithm from scratch. Second,
`algorithms that take advantage of a particular network tend to be more complicated than algorithms
`designed for more abstract models like the PRAM models, because they must incorporate some of
`the details of the network. Nevertheless, there are some operations that are performed so frequently
`by a parallel machine that it makes sense to design a fine-tuned network-specific algorithm. For
`example, the algorithm that routes messages or memory access requests through the network should
`exploit the network topology. Other examples include algorithms for broadcasting a message from
`one processor to many other processors, for collecting the results computed in many processors in
`a single processor, and for synchronizing processors.
`An alternative to modeling the topology of a network is to summarize its routing capabilities in
`terms of two parameters, its latency and bandwidth. The latency, L, of a network is the time it takes
`for a message to traverse the network. In actual networks this will depend on the topology of the
`network, which particular ports the message is passing between, and the congestion of messages in the
`network. The latency is often modeled by considering the worst-case time assuming that the network
`is not heavily congested. The bandwidth at each port of the network is the rate at which a processor
`can inject data into the network. In actual networks this will depend on the topology of the network,
`the bandwidths of the network’s individual communication channels and, again, the congestion
`of messages in the network. The bandwidth often can be usefully modeled as the maximum rate
`at which processors can inject messages into the network without causing it to become heavily
`congested, assuming a uniform distribution of message destinations. In this case, the bandwidth can
`be expressed as the minimum gap g between successive injections of messages into the network.
`Three models that characterize a network in terms of its latency and bandwidth are the Postal
`model [14], the Bulk-Synchronous Parallel (BSP) model [85], and the LogP model [29]. In the
`Postal model, a network is described by a single parameter L, its latency. The Bulk-Synchronous Par-
`allel model adds a second parameter g, the minimum ratio of computation steps to communication
`steps, i.e., the gap. The LogP model includes both of these parameters, and adds a third parameter
`o, the overhead, or wasted time, incurred by a processor upon sending or receiving a message.
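These parameters support quick cost estimates. A sketch of two standard back-of-the-envelope calculations in the LogP style (the formulas are common rules of thumb under the stated assumptions, not equations from this chapter):

```python
def one_message_time(L, o):
    """End-to-end time for a single message: the sender spends o, the
    network adds L, and the receiver spends another o."""
    return o + L + o

def k_message_time(L, o, g, k):
    """Time for one processor to deliver k messages to another, assuming
    successive injections are separated by max(g, o) and the last message
    still needs L + o to arrive and be absorbed."""
    return o + (k - 1) * max(g, o) + L + o

print(one_message_time(L=10, o=2))          # 14
print(k_message_time(L=10, o=2, g=4, k=5))  # 2 + 4*4 + 10 + 2 = 30
```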
`
`25.2.1.2 Primitive Operations
`A machine model must also specify the types of operations that the processors and network are per-
`mitted to perform. We assume that all processors are allowed to perform the same local instructions
`as the single processor in the standard sequential RAM model. In addition, processors may have spe-
`cial instructions for issuing nonlocal memory requests, for sending messages to other processors, and
`for executing various global operations, such as synchronization. There may also be restrictions on
when processors can simultaneously issue instructions involving nonlocal operations. For example, a
`model might not allow two processors to write to the same memory location at the same time. These
`restrictions might make it impossible to execute an algorithm on a particular model, or make the
`cost of executing the algorithm prohibitively expensive. It is therefore important to understand what
`instructions are supported before one can design or analyze a parallel algorithm. In this section we
`consider three classes of instructions that perform nonlocal operations: (1) instructions that perform
`concurrent accesses to the same shared memory location, (2) instructions for synchronization, and
`(3) instructions that perform global operations on data.
`When multiple processors simultaneously make a request to read or write to the same resource—
`such as a processor, memory module, or memory location—there are several possible outcomes.
`Some machine models simply forbid such operations, declaring that it is an error if two or more
`processes try to access a resource simultaneously. In this case we say that the model allows only
`exclusive access to the resource. For example, a PRAM model might only allow exclusive read
`or write access to each memory location. A PRAM model of this type is called an exclusive-read
`exclusive-write (EREW) PRAM model. Other machine models may allow unlimited access to a
`shared resource. In this case we say that the model allows concurrent access to the resource. For
`example, a concurrent-read concurrent-write (CRCW) PRAM model allows both concurrent read
`and write access to memory locations, and a CREW PRAM model allows concurrent reads but only
`exclusive writes. When making a concurrent write to a resource such as a memory location there
`are many ways to resolve the conflict. The possibilities include choosing an arbitrary value from
`those written (arbitrary concurrent write), choosing the value from the processor with lowest index
`(priority concurrent write), and taking the logical or of the values written. A final choice is to allow
`for queued access. In this case concurrent access is permitted but the time for a step is proportional
`to the maximum number of accesses to any resource. A queue-read queue-write (QRQW) PRAM
`model allows for such accesses [36].
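The conflict-resolution rules are easy to state operationally. A sketch (ours, not the chapter's) that resolves a set of simultaneous writes to one location under each rule, along with the QRQW step cost:

```python
def resolve_concurrent_write(writes, rule):
    """Resolve simultaneous writes to one memory location.

    `writes` is a nonempty list of (processor_index, value) pairs.
    """
    if rule == "arbitrary":    # any one written value may survive
        return writes[0][1]
    if rule == "priority":     # the lowest-indexed processor wins
        return min(writes)[1]
    if rule == "logical_or":   # combine the (boolean) values written
        return any(value for _, value in writes)
    raise ValueError("unknown rule: " + rule)

def qrqw_step_time(accesses_per_location):
    """In the QRQW model, a step costs time proportional to the maximum
    number of accesses queued at any single location."""
    return max(accesses_per_location)

print(resolve_concurrent_write([(3, True), (1, False)], "priority"))  # False
print(qrqw_step_time([1, 4, 2]))                                      # 4
```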
`In addition to reads and writes to nonlocal memory or other processors, there are other impor-
`tant primitives that a model might supply. One class of such primitives support synchronization.
`There are a variety of different types of synchronization operations and the costs of these opera-
`tions vary from model to model. In a PRAM model, for example, it is assumed that all processors
`operate in lock step, which provides implicit synchronization. In a local-memory machine model
`the cost of synchronization may be a function of the particular network topology. A related oper-
`ation, broadcast, allows one processor to send a common message to all of the other processors.
`Some machine models supply more powerful primitives that combine arithmetic operations with
`communication. Such operations include the prefix and multiprefix operations, which are defined in
`Sections 25.4.2 and 25.4.3.
`
`25.2.2 Work-Depth Models
`Because there are so many different ways to organize parallel computers, and hence to model them,
`it is difficult to select one multiprocessor model that is appropriate for all machines. The alternative
`to focusing on the machine is to focus on the algorithm. In this section we present a class of
`models called work-depth models. In a work-depth model, the cost of an algorithm is determined
`by examining the total number of operations that it performs, and the dependencies among those
`operations. An algorithm’s work W is the total number of operations that it performs; its depth D is
`the longest chain of dependencies among its operations. We call the ratio P = W/D the parallelism
`of the algorithm.
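Both measures can be read off an algorithm's dependency structure. A small sketch under the assumption that the computation is given as a DAG mapping each operation to the operations whose results it consumes (the representation is ours, chosen for brevity):

```python
from functools import lru_cache

def work_depth_parallelism(deps):
    """Work = number of operations; depth = longest dependency chain;
    parallelism = work / depth.

    `deps` maps each operation to the list of operations it depends on.
    """
    @lru_cache(maxsize=None)
    def depth(op):
        return 1 + max((depth(d) for d in deps[op]), default=0)

    W = len(deps)
    D = max(depth(op) for op in deps)
    return W, D, W / D

# A 4-input summation tree: a = x0+x1 and b = x2+x3 are independent,
# and c = a+b depends on both.
print(work_depth_parallelism({"a": [], "b": [], "c": ["a", "b"]}))  # (3, 2, 1.5)
```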
`The work-depth models are more abstract than the multiprocessor models. As we shall see
`however, algorithms that are efficient in work-depth models can often be translated to algorithms that
`are efficient in the multiprocessor models, and from there to real parallel computers. The advantage
`of using a work-depth model is that there are no machine-dependent details to complicate the design
`and analysis of algorithms. Here we consider three classes of work-depth models: circuit models,
`vector machine models, and language-based models. We will be using a language-based model in
`this chapter, so we will return to these models in Section 25.2.5. The most abstract work-depth
`model is the circuit model. A circuit consists of nodes and directed arcs. A node represents a basic
operation, such as adding two values. Each input value for an operation arrives at the corresponding
node via an incoming arc. The result of the operation is then carried out of the node via one or more
outgoing arcs. These outgoing arcs may provide inputs to other nodes. The number of incoming
arcs to a node is referred to as the fan-in of the node and the number of outgoing arcs is referred
to as the fan-out. There are two special classes of arcs. A set of input arcs provide input values to
the circuit as a whole. These arcs do not originate at nodes. The output arcs return the final output
values produced by the circuit. These arcs do not terminate at nodes. By definition, a circuit is not
permitted to contain a directed cycle. In this model, an algorithm is modeled as a family of directed
acyclic circuits. There is a circuit for each possible size of the input.

Figure 25.4 shows a circuit for adding 16 numbers. In this figure all arcs are directed toward the
bottom. The input arcs are at the top of the figure. Each + node adds the two values that arrive on
its two incoming arcs, and places the result on its outgoing arc. The sum of all of the inputs to the
circuit is returned on the single output arc at the bottom.

[FIGURE 25.4 Summing 16 numbers on a tree. The total depth (longest chain of dependencies) is 4 and the total work (number of operations) is 15.]
`The work and depth of a circuit are measured as follows. The work is the total number of nodes.
`The work in Figure 25.4, for example, is 15. (The work is also called the size of the circuit.) The
depth is the number of nodes on the longest directed path from an input arc to an output arc. In
`Figure 25.4, the depth is 4. For a family of circuits, the work and depth are typically parameterized
`in terms of the number of inputs. For example, the circuit in Figure 25.4 can be easily generalized to
`add n input values for any n that is a power of two. The work and depth for this family of circuits is
W(n) = n − 1 and D(n) = log2 n.
`Circuit models have been used for many years to study various theoretical aspects of parallelism,
`for example to prove that certain problems are difficult to solve in parallel. See [48] for an overview.
`In a vector model an algorithm is expressed as a sequence of steps, each of which performs an
`operation on a vector (i.e., sequence) of input values, and produces a vector result [19,69]. The work
`of each step is equal to the length of its input (or output) vector. The work of an algorithm is the
`sum of the work of its steps. The depth of an algorithm is the number of vector steps.
`In a language model, a work-depth cost is associated with each programming language construct
`[20,22]. For example, the work for calling two functions in parallel is equal to the sum of the work
`of the two calls. The depth, in this case, is equal to the maximum of the depth of the two calls.
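These composition rules can be mechanized directly: sequential composition adds both work and depth, while parallel composition adds work but takes the maximum depth. A sketch using (work, depth) pairs as costs (the helper names are ours):

```python
def seq(*costs):
    """Sequential composition: work and depth both accumulate."""
    return (sum(w for w, _ in costs), sum(d for _, d in costs))

def par(*costs):
    """Parallel composition: work accumulates, depth is the maximum."""
    return (sum(w for w, _ in costs), max(d for _, d in costs))

# Two calls, each of work 5 and depth 3:
print(seq((5, 3), (5, 3)))  # (10, 6): one call after the other
print(par((5, 3), (5, 3)))  # (10, 3): both calls in parallel
```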
`
`25.2.3 Assigning Costs to Algorithms
`In the work-depth models, the cost of an algorithm is determined by its work and by its depth.
`The notions of work and depth can also be defined for the multiprocessor models. The work W
`performed by an algorithm is equal to the number of processors multiplied by the time required for
`the algorithm to complete execution. The depth D is equal to the total time required to execute the
`algorithm.
`
`The depth of an algorithm is important, because there are some applications for which the time
`to perform a computation is crucial. For example, the results of a weather forecasting program are
`useful only if the program completes execution before the weather does!
`Generally, however, the most important measure of the cost of an algorithm is the work. This can
`be argued as follows. The cost of a computer is roughly proportional to the number of processors in
`the computer. The cost for purchasing time on a computer is proportional to the cost of the computer
`multiplied by the amount of time used. The total cost of performing a computation, therefore, is
`roughly proportional to the number of processors in the computer multiplied by the amount of time,
`i.e., the work.
`In many instances, the cost of running a computation on a parallel computer may be slightly larger
`than the cost of running the same computation on a sequential computer. If the time to completion
`is sufficiently improved, however, this extra cost can often be justified. As we shall see, however,
`there is often a tradeoff between time-to-completion and total work performed. To quantify when
`parallel algorithms are efficient in terms of cost, we say that a parallel algorithm is work-efficient if
`asymptotically (as the problem size grows) it requires at most a constant factor more work than the
`best sequential algorithm known.
`
`25.2.4 Emulations among Models
`Although it may appear that a different algorithm must be designed for each of the many parallel
`models, there are often automatic and efficient techniques for translating algorithms designed for
`one model into algorithms designed for another. These translations are work-preserving in the sense
`that the work performed by both algorithms is the same, to within a constant factor. For example,
`the following theorem, known as Brent’s Theorem [24], shows that an algorithm designed for the
`circuit model can be translated in a work-preserving fashion to a PRAM model algorithm.
`
THEOREM 25.1 [Brent’s Theorem] Any algorithm that can be expressed as a circuit of size (i.e., work)
W and depth D and with constant fan-in nodes in the circuit model can be executed in O(W/P + D)
steps on a P-processor CREW PRAM model.
`
PROOF The basic idea is to have the PRAM emulate the computation specified by the circuit
in a level-by-level fashion. The level of a node is defined as follows. A node is on level 1 if all of its
inputs are also inputs to the circuit. Inductively, the level of any other node is one greater than the
maximum of the level of the nodes that provide its inputs. Let l_i denote the number of nodes on
level i. Then, by assigning ⌈l_i/P⌉ operations to each of the P processors in the PRAM, the operations
for level i can be performed in O(⌈l_i/P⌉) steps. Concurrent reads might be required since many
operations on one level might read the same result from a previous level. Summing the time over all
D levels, the total number of steps is

    ∑_{i=1}^{D} O(⌈l_i/P⌉) = O(∑_{i=1}^{D} (l_i/P + 1)) = O(W/P + D),

since the level sizes sum to the work of the circuit, ∑_{i=1}^{D} l_i = W.
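A sketch of the schedule used in the proof: given the level sizes l_1, ..., l_D of a circuit, count the PRAM steps needed when each level is spread across P processors (an illustration of the bound, with names of our choosing):

```python
from math import ceil

def brent_emulation_steps(level_sizes, P):
    """PRAM steps to emulate a circuit level by level on P processors.

    Level i takes ceil(l_i / P) steps, so the total is at most
    sum(l_i) / P + D = W / P + D.
    """
    return sum(ceil(l / P) for l in level_sizes)

levels = [8, 4, 2, 1]  # the 16-input summation circuit: W = 15, D = 4
print(brent_emulation_steps(levels, P=4))  # 2 + 1 + 1 + 1 = 5 <= 15/4 + 4
```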