Parallel Algorithms

Guy E. Blelloch, Carnegie Mellon University
Bruce M. Maggs, Duke University and Akamai Technologies

25.1 Introduction
25.2 Modeling Parallel Computations
     Multiprocessor Models • Work-Depth Models • Assigning Costs to Algorithms • Emulations among Models • Model Used in This Chapter
25.3 Parallel Algorithmic Techniques
     Divide-and-Conquer • Randomization • Parallel Pointer Techniques • Other Techniques
25.4 Basic Operations on Sequences, Lists, and Trees
     Sums • Scans • Multiprefix and Fetch-and-Add • Pointer Jumping • List Ranking • Removing Duplicates
25.5 Graphs
     Graphs and Graph Representations • Breadth-First Search • Connected Components
25.6 Sorting
     QuickSort • Radix Sort
25.7 Computational Geometry
     Closest Pair • Planar Convex Hull
25.8 Numerical Algorithms
     Matrix Operations • Fourier Transform
25.9 Research Issues and Summary
25.10 Further Information
Defining Terms
References

25.1 Introduction

The subject of this chapter is the design and analysis of parallel algorithms. Most of today’s algorithms are sequential, that is, they specify a sequence of steps in which each step consists of a single operation. These algorithms are well suited to today’s computers, which basically perform operations in a sequential fashion. Although the speed at which sequential computers operate has been improving at an exponential rate for many years, the improvement is now coming at greater and greater cost. As a consequence, researchers have sought more cost-effective improvements by building “parallel” computers: computers that perform multiple operations in a single step. In order to solve a problem efficiently on a parallel computer, it is usually necessary to design an algorithm that specifies multiple operations on each step, i.e., a parallel algorithm.

As an example, consider the problem of computing the sum of a sequence A of n numbers. The standard algorithm computes the sum by making a single pass through the sequence, keeping
a running sum of the numbers seen so far. It is not difficult, however, to devise an algorithm for computing the sum that performs many operations in parallel. For example, suppose that, in parallel, each element of A with an even index is paired and summed with the next element of A, which has an odd index, i.e., A[0] is paired with A[1], A[2] with A[3], and so on. The result is a new sequence of ⌈n/2⌉ numbers that sum to the same value as the sum that we wish to compute. This pairing and summing step can be repeated until, after ⌈log₂ n⌉ steps, a sequence consisting of a single value is produced, and this value is equal to the final sum.
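
The idea translates directly into code. The following Python sketch is ours, not the chapter’s (the chapter’s own pseudocode notation is introduced in Section 25.2.5); the passes are written as a sequential loop here, but every addition within a single pass is independent of the others, so a parallel machine could perform them simultaneously.

    # Sum a sequence by repeatedly pairing and summing adjacent elements.
    # Each pass reduces a sequence of n values to one of ceil(n/2) values,
    # so a single value remains after ceil(log2 n) passes.
    def pairwise_sum(a):
        while len(a) > 1:
            # The additions below are mutually independent: on a parallel
            # machine, all of them could be performed in one step.
            a = [a[i] + a[i + 1] if i + 1 < len(a) else a[i]
                 for i in range(0, len(a), 2)]
        return a[0]

    print(pairwise_sum([3, 1, 4, 1, 5, 9, 2, 6]))  # 31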

The parallelism in an algorithm can yield improved performance on many different kinds of computers. For example, on a parallel computer, the operations in a parallel algorithm can be performed simultaneously by different processors. Furthermore, even on a single-processor computer the parallelism in an algorithm can be exploited by using multiple functional units, pipelined functional units, or pipelined memory systems. Thus, it is important to make a distinction between the parallelism in an algorithm and the ability of any particular computer to perform multiple operations in parallel. Of course, in order for a parallel algorithm to run efficiently on any type of computer, the algorithm must contain at least as much parallelism as the computer, for otherwise resources would be left idle. Unfortunately, the converse does not always hold: some parallel computers cannot efficiently execute all algorithms, even if the algorithms contain a great deal of parallelism. Experience has shown that it is more difficult to build a general-purpose parallel computer than a general-purpose sequential computer.

The remainder of this chapter consists of nine sections. We begin in Section 25.2 with a discussion of how to model parallel computers. Next, in Section 25.3 we cover some general techniques that have proven useful in the design of parallel algorithms. Sections 25.4 through 25.8 present algorithms for solving problems from different domains. We conclude in Section 25.9 with a discussion of current research topics, followed by a collection of defining terms and sources for further information.

Throughout this chapter, we assume that the reader has some familiarity with sequential algorithms and with asymptotic notation and analysis.

25.2 Modeling Parallel Computations

The designer of a sequential algorithm typically formulates the algorithm using an abstract model of computation called the random-access machine (RAM) model ([2], Chapter 1). In this model, the machine consists of a single processor connected to a memory system. Each basic CPU operation, including arithmetic operations, logical operations, and memory accesses, requires one time step. The designer’s goal is to develop an algorithm with modest time and memory requirements. The RAM model allows the algorithm designer to ignore many of the details of the computer on which the algorithm will ultimately be executed, but captures enough detail that the designer can predict with reasonable accuracy how the algorithm will perform.

Modeling parallel computations is more complicated than modeling sequential computations because in practice parallel computers tend to vary more in organization than do sequential computers. As a consequence, a large portion of the research on parallel algorithms has gone into the question of modeling, and many debates have raged over what the “right” model is, or about how practical various models are. Although there has been no consensus on the right model, this research has yielded a better understanding of the relationship between the models. Any discussion of parallel algorithms requires some understanding of the various models and the relationships among them.

In this chapter we divide parallel models into two classes: multiprocessor models and work-depth models. In the remainder of this section we discuss these two classes and how they are related.

25.2.1 Multiprocessor Models

A multiprocessor model is a generalization of the sequential RAM model in which there is more than one processor. Multiprocessor models can be classified into three basic types: local-memory machine models, modular memory machine models, and parallel random-access machine (PRAM) models. Figure 25.1 illustrates the structure of these machine models. A local-memory machine model consists of a set of n processors, each with its own local memory. These processors are attached to a common communication network. A modular memory machine model consists of m memory modules and n processors all attached to a common network. An n-processor PRAM model consists of a set of n processors all connected to a common shared memory [32,37,38,77].

FIGURE 25.1 The three types of multiprocessor machine models: (a) a local-memory machine model; (b) a modular memory machine model; and (c) a PRAM model.

The three types of multiprocessors differ in the way that memory can be accessed. In a local-memory machine model, each processor can access its own local memory directly, but can access the memory in another processor only by sending a memory request through the network. As in the RAM model, all local operations, including local memory accesses, take unit time. The time taken to access the memory in another processor, however, will depend on both the capabilities of the communication network and the pattern of memory accesses made by other processors, since these other accesses could congest the network. In a modular memory machine model, a processor accesses the memory in a memory module by sending a memory request through the network. Typically the processors and memory modules are arranged so that the time for any processor to access any memory module is roughly uniform. As in a local-memory machine model, the exact amount of
time depends on the communication network and the memory access pattern. In a PRAM model, a processor can access any word of memory in a single step. Furthermore, these accesses can occur in parallel, i.e., in a single step, every processor can access the shared memory.

The PRAM models are controversial because no real machine lives up to its ideal of unit-time access to shared memory. It is worth noting, however, that the ultimate purpose of an abstract model is not to directly model a real machine, but to help the algorithm designer produce efficient algorithms. Thus, if an algorithm designed for a PRAM model (or any other model) can be translated to an algorithm that runs efficiently on a real computer, then the model has succeeded. In Section 25.2.4 we show how an algorithm designed for one parallel machine model can be translated so that it executes efficiently on another model.

The three types of multiprocessor models that we have defined are broad and allow for many variations. The local-memory machine models and modular memory machine models may differ according to their network topologies. Furthermore, in all three types of models, there may be differences in the operations that the processors and networks are allowed to perform. In the remainder of this section we discuss some of the possibilities.

25.2.1.1 Network Topology

A network is a collection of switches connected by communication channels. A processor or memory module has one or more communication ports that are connected to these switches by communication channels. The pattern of interconnection of the switches is called the network topology. The topology of a network has a large influence on the performance and also on the cost and difficulty of constructing the network. Figure 25.2 illustrates several different topologies.

FIGURE 25.2 (a) Bus, (b) two-dimensional mesh, and (c) hypercube network topologies.

The simplest network topology is a bus. This network can be used in both local-memory machine models and modular memory machine models. In either case, all processors and memory modules are typically connected to a single bus. In each step, at most one piece of data can be written onto the bus. This data might be a request from a processor to read or write a memory value, or it might be the response from the processor or memory module that holds the value. In practice, the advantage of using a bus is that it is simple to build and, because all processors and memory modules can observe
the traffic on the bus, it is relatively easy to develop protocols that allow processors to cache memory values locally. The disadvantage of using a bus is that the processors have to take turns accessing the bus. Hence, as more processors are added to a bus, the average time to perform a memory access grows proportionately.

A two-dimensional mesh is a network that can be laid out in a rectangular fashion. Each switch in a mesh has a distinct label (x, y) where 0 ≤ x ≤ X − 1 and 0 ≤ y ≤ Y − 1. The values X and Y determine the length of the sides of the mesh. The number of switches in a mesh is thus X · Y. Every switch, except those on the sides of the mesh, is connected to four neighbors: one to the north, one to the south, one to the east, and one to the west. Thus, a switch labeled (x, y), where 0 < x < X − 1 and 0 < y < Y − 1, is connected to switches (x, y + 1), (x, y − 1), (x + 1, y), and (x − 1, y). This network typically appears in a local-memory machine model, i.e., a processor along with its local memory is connected to each switch, and remote memory accesses are made by routing messages through the mesh. Figure 25.2b shows an example of an 8 × 8 mesh.

Several variations on meshes are also popular, including three-dimensional meshes, toruses, and hypercubes. A torus is a mesh in which the switches on the sides have connections to the switches on the opposite sides. Thus, every switch (x, y) is connected to four other switches: (x, y + 1 mod Y), (x, y − 1 mod Y), (x + 1 mod X, y), and (x − 1 mod X, y). A hypercube is a network with 2ⁿ switches in which each switch has a distinct n-bit label. Two switches are connected by a communication channel in a hypercube if and only if the labels of the switches differ in precisely one bit position. A hypercube with 16 switches is shown in Figure 25.2c.

A multistage network is used to connect one set of switches called the input switches to another set called the output switches through a sequence of stages of switches. Such networks were originally designed for telephone networks [15]. The stages of a multistage network are numbered 1 through L, where L is the depth of the network. The switches on stage 1 are the input switches, and those on stage L are the output switches. In most multistage networks, it is possible to send a message from any input switch to any output switch along a path that traverses the stages of the network in order from 1 to L. Multistage networks are frequently used in modular memory computers; typically processors are attached to input switches, and memory modules are attached to output switches. A processor accesses a word of memory by injecting a memory access request message into the network. This message then travels through the network to the appropriate memory module. If the request is to read a word of memory, then the memory module sends the data back through the network to the requesting processor. There are many different multistage network topologies. Figure 25.3a, for example, shows a depth-2 network that connects 4 processors to 16 memory modules. Each switch in this network has two channels at the bottom and four channels at the top. The ratio of processors to memory modules in this example is chosen to reflect the fact that, in practice, a processor is capable of generating memory access requests faster than a memory module is capable of servicing them.

FIGURE 25.3 (a) 2-level multistage network and (b) fat-tree network topologies.

A fat-tree is a network structured like a tree [56]. Each edge of the tree, however, may represent many communication channels, and each node may represent many network switches (hence the name “fat”). Figure 25.3b shows a fat-tree with the overall structure of a binary tree. Typically the
capacities of the edges near the root of the tree are much larger than the capacities near the leaves. For example, in this tree the two edges incident on the root represent 8 channels each, while the edges incident on the leaves represent only 1 channel each. A natural way to construct a local-memory machine model is to connect a processor along with its local memory to each leaf of the fat-tree. In this scheme, a message from one processor to another first travels up the tree to the least common ancestor of the two processors, and then down the tree.

Many algorithms have been designed to run efficiently on particular network topologies such as the mesh or the hypercube. For extensive treatment of such algorithms, see [55,67,73,80]. Although this approach can lead to very fine-tuned algorithms, it has some disadvantages. First, algorithms designed for one network may not perform well on other networks. Hence, in order to solve a problem on a new machine, it may be necessary to design a new algorithm from scratch. Second, algorithms that take advantage of a particular network tend to be more complicated than algorithms designed for more abstract models like the PRAM models, because they must incorporate some of the details of the network. Nevertheless, there are some operations that are performed so frequently by a parallel machine that it makes sense to design a fine-tuned network-specific algorithm. For example, the algorithm that routes messages or memory access requests through the network should exploit the network topology. Other examples include algorithms for broadcasting a message from one processor to many other processors, for collecting the results computed in many processors in a single processor, and for synchronizing processors.

An alternative to modeling the topology of a network is to summarize its routing capabilities in terms of two parameters, its latency and bandwidth. The latency, L, of a network is the time it takes for a message to traverse the network. In actual networks this will depend on the topology of the network, which particular ports the message is passing between, and the congestion of messages in the network. The latency is often modeled by considering the worst-case time, assuming that the network is not heavily congested. The bandwidth at each port of the network is the rate at which a processor can inject data into the network. In actual networks this will depend on the topology of the network, the bandwidths of the network’s individual communication channels, and, again, the congestion of messages in the network. The bandwidth often can be usefully modeled as the maximum rate at which processors can inject messages into the network without causing it to become heavily congested, assuming a uniform distribution of message destinations. In this case, the bandwidth can be expressed as the minimum gap g between successive injections of messages into the network.

Three models that characterize a network in terms of its latency and bandwidth are the Postal model [14], the Bulk-Synchronous Parallel (BSP) model [85], and the LogP model [29]. In the Postal model, a network is described by a single parameter L, its latency. The Bulk-Synchronous Parallel model adds a second parameter g, the minimum ratio of computation steps to communication steps, i.e., the gap. The LogP model includes both of these parameters, and adds a third parameter o, the overhead, or wasted time, incurred by a processor upon sending or receiving a message.
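
As a rough illustration of how such parameters are used, the Python sketch below estimates the time for one processor to inject m messages into the network. It follows our own simplified reading of the Postal and LogP accounting, and is a sketch of the style of analysis, not a formal statement of either model.

    # Rough time for one processor to send m back-to-back messages.
    def postal_time(m, L):
        # Postal model: one message injected per step; the last one
        # still needs L time units to cross the network.
        return (m - 1) + L

    def logp_time(m, L, o, g):
        # LogP: each send costs overhead o, successive injections are
        # at least g apart, and the final message pays the latency L
        # plus the receiver's overhead o.
        return (m - 1) * max(o, g) + o + L + o

    print(postal_time(4, L=10))          # 13
    print(logp_time(4, L=10, o=2, g=3))  # 23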

25.2.1.2 Primitive Operations

A machine model must also specify the types of operations that the processors and network are permitted to perform. We assume that all processors are allowed to perform the same local instructions as the single processor in the standard sequential RAM model. In addition, processors may have special instructions for issuing nonlocal memory requests, for sending messages to other processors, and for executing various global operations, such as synchronization. There may also be restrictions on when processors can simultaneously issue instructions involving nonlocal operations. For example, a model might not allow two processors to write to the same memory location at the same time. These restrictions might make it impossible to execute an algorithm on a particular model, or make the cost of executing the algorithm prohibitively expensive. It is therefore important to understand what instructions are supported before one can design or analyze a parallel algorithm. In this section we
consider three classes of instructions that perform nonlocal operations: (1) instructions that perform concurrent accesses to the same shared memory location, (2) instructions for synchronization, and (3) instructions that perform global operations on data.

When multiple processors simultaneously make a request to read or write the same resource, such as a processor, memory module, or memory location, there are several possible outcomes. Some machine models simply forbid such operations, declaring that it is an error if two or more processors try to access a resource simultaneously. In this case we say that the model allows only exclusive access to the resource. For example, a PRAM model might allow only exclusive read or write access to each memory location. A PRAM model of this type is called an exclusive-read exclusive-write (EREW) PRAM model. Other machine models may allow unlimited access to a shared resource. In this case we say that the model allows concurrent access to the resource. For example, a concurrent-read concurrent-write (CRCW) PRAM model allows both concurrent read and write access to memory locations, and a CREW PRAM model allows concurrent reads but only exclusive writes. When making a concurrent write to a resource such as a memory location, there are many ways to resolve the conflict. The possibilities include choosing an arbitrary value from those written (arbitrary concurrent write), choosing the value from the processor with the lowest index (priority concurrent write), and taking the logical OR of the values written. A final choice is to allow for queued access. In this case concurrent access is permitted, but the time for a step is proportional to the maximum number of accesses to any resource. A queue-read queue-write (QRQW) PRAM model allows for such accesses [36].
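
A short simulation makes the conflict-resolution rules concrete. In this Python sketch (ours; the function name and the representation of a step are hypothetical), several processors attempt to write to one memory location in a single step, and the chosen rule determines the value that the location ends up holding.

    import random

    # writes: (processor_index, value) pairs aimed at a single location
    # in one step of a CRCW PRAM.
    def resolve_concurrent_write(writes, rule):
        if rule == "arbitrary":   # an arbitrary written value survives
            return random.choice(writes)[1]
        if rule == "priority":    # the lowest-indexed processor wins
            return min(writes)[1]
        if rule == "logical_or":  # the OR of all written values is stored
            result = 0
            for _, value in writes:
                result |= value
            return result
        raise ValueError("unknown rule: " + rule)

    writes = [(3, 4), (0, 1), (7, 2)]
    print(resolve_concurrent_write(writes, "priority"))    # 1
    print(resolve_concurrent_write(writes, "logical_or"))  # 7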

In addition to reads and writes to nonlocal memory or other processors, there are other important primitives that a model might supply. One class of such primitives supports synchronization. There are a variety of different types of synchronization operations, and the costs of these operations vary from model to model. In a PRAM model, for example, it is assumed that all processors operate in lock step, which provides implicit synchronization. In a local-memory machine model the cost of synchronization may be a function of the particular network topology. A related operation, broadcast, allows one processor to send a common message to all of the other processors. Some machine models supply more powerful primitives that combine arithmetic operations with communication. Such operations include the prefix and multiprefix operations, which are defined in Sections 25.4.2 and 25.4.3.

25.2.2 Work-Depth Models

Because there are so many different ways to organize parallel computers, and hence to model them, it is difficult to select one multiprocessor model that is appropriate for all machines. The alternative to focusing on the machine is to focus on the algorithm. In this section we present a class of models called work-depth models. In a work-depth model, the cost of an algorithm is determined by examining the total number of operations that it performs, and the dependencies among those operations. An algorithm’s work W is the total number of operations that it performs; its depth D is the longest chain of dependencies among its operations. We call the ratio P = W/D the parallelism of the algorithm.

The work-depth models are more abstract than the multiprocessor models. As we shall see, however, algorithms that are efficient in work-depth models can often be translated to algorithms that are efficient in the multiprocessor models, and from there to real parallel computers. The advantage of using a work-depth model is that there are no machine-dependent details to complicate the design and analysis of algorithms. Here we consider three classes of work-depth models: circuit models, vector machine models, and language-based models. We will be using a language-based model in this chapter, so we will return to these models in Section 25.2.5. The most abstract work-depth model is the circuit model. A circuit consists of nodes and directed arcs. A node represents a basic operation, such as adding two values. Each input value for an operation arrives at the corresponding
node via an incoming arc. The result of the operation is then carried out of the node via one or more outgoing arcs. These outgoing arcs may provide inputs to other nodes. The number of incoming arcs to a node is referred to as the fan-in of the node, and the number of outgoing arcs is referred to as the fan-out. There are two special classes of arcs. A set of input arcs provide input values to the circuit as a whole. These arcs do not originate at nodes. The output arcs return the final output values produced by the circuit. These arcs do not terminate at nodes. By definition, a circuit is not permitted to contain a directed cycle. In this model, an algorithm is modeled as a family of directed acyclic circuits. There is a circuit for each possible size of the input.

Figure 25.4 shows a circuit for adding 16 numbers. In this figure all arcs are directed toward the bottom. The input arcs are at the top of the figure. Each + node adds the two values that arrive on its two incoming arcs, and places the result on its outgoing arc. The sum of all of the inputs to the circuit is returned on the single output arc at the bottom.

FIGURE 25.4 Summing 16 numbers on a tree. The total depth (longest chain of dependencies) is 4 and the total work (number of operations) is 15.

The work and depth of a circuit are measured as follows. The work is the total number of nodes. The work in Figure 25.4, for example, is 15. (The work is also called the size of the circuit.) The depth is the number of nodes on the longest directed path from an input arc to an output arc. In Figure 25.4, the depth is 4. For a family of circuits, the work and depth are typically parameterized in terms of the number of inputs. For example, the circuit in Figure 25.4 can be easily generalized to add n input values for any n that is a power of two. The work and depth for this family of circuits are W(n) = n − 1 and D(n) = log₂ n.

Circuit models have been used for many years to study various theoretical aspects of parallelism, for example to prove that certain problems are difficult to solve in parallel. See [48] for an overview.

In a vector model an algorithm is expressed as a sequence of steps, each of which performs an operation on a vector (i.e., sequence) of input values and produces a vector result [19,69]. The work of each step is equal to the length of its input (or output) vector. The work of an algorithm is the sum of the work of its steps. The depth of an algorithm is the number of vector steps.

In a language model, a work-depth cost is associated with each programming language construct [20,22]. For example, the work for calling two functions in parallel is equal to the sum of the work of the two calls. The depth, in this case, is equal to the maximum of the depths of the two calls.
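
These composition rules can be phrased as a tiny cost algebra. In the Python sketch below (our illustration, not the chapter’s formalism), a cost is a (work, depth) pair; sequential composition adds both components, while parallel composition adds work and takes the maximum depth. Applied to the pairwise summation of Section 25.1, it reproduces the costs of the circuit in Figure 25.4.

    # A cost is a (work, depth) pair.
    def seq(c1, c2):
        # sequential composition: work and depth both add
        return (c1[0] + c2[0], c1[1] + c2[1])

    def par(c1, c2):
        # parallel composition: work adds, depth is the maximum
        return (c1[0] + c2[0], max(c1[1], c2[1]))

    def tree_sum_cost(n):
        # summing n numbers (n a power of 2) by pairwise halving
        if n == 1:
            return (0, 0)
        adds = (0, 0)
        for _ in range(n // 2):       # n/2 additions, all in parallel,
            adds = par(adds, (1, 1))  # each costing (work 1, depth 1)
        return seq(adds, tree_sum_cost(n // 2))

    print(tree_sum_cost(16))  # (15, 4): W(n) = n - 1, D(n) = log2 n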

25.2.3 Assigning Costs to Algorithms

In the work-depth models, the cost of an algorithm is determined by its work and by its depth. The notions of work and depth can also be defined for the multiprocessor models. The work W performed by an algorithm is equal to the number of processors multiplied by the time required for the algorithm to complete execution. The depth D is equal to the total time required to execute the algorithm.

The depth of an algorithm is important, because there are some applications for which the time to perform a computation is crucial. For example, the results of a weather forecasting program are useful only if the program completes execution before the weather does!

Generally, however, the most important measure of the cost of an algorithm is the work. This can be argued as follows. The cost of a computer is roughly proportional to the number of processors in the computer. The cost for purchasing time on a computer is proportional to the cost of the computer multiplied by the amount of time used. The total cost of performing a computation, therefore, is roughly proportional to the number of processors in the computer multiplied by the amount of time, i.e., the work.

In many instances, the cost of running a computation on a parallel computer may be slightly larger than the cost of running the same computation on a sequential computer. If the time to completion is sufficiently improved, however, this extra cost can often be justified. As we shall see, there is often a tradeoff between time-to-completion and total work performed. To quantify when parallel algorithms are efficient in terms of cost, we say that a parallel algorithm is work-efficient if asymptotically (as the problem size grows) it requires at most a constant factor more work than the best sequential algorithm known.
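
For example, the pairwise summation algorithm of Section 25.1 is work-efficient: it performs W(n) = n − 1 additions, exactly as many as the sequential algorithm, while its depth is only D(n) = ⌈log₂ n⌉, so its parallelism is W/D = O(n/log n).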

25.2.4 Emulations among Models

Although it may appear that a different algorithm must be designed for each of the many parallel models, there are often automatic and efficient techniques for translating algorithms designed for one model into algorithms designed for another. These translations are work-preserving in the sense that the work performed by both algorithms is the same, to within a constant factor. For example, the following theorem, known as Brent’s Theorem [24], shows that an algorithm designed for the circuit model can be translated in a work-preserving fashion to a PRAM model algorithm.

THEOREM 25.1 [Brent’s Theorem] Any algorithm that can be expressed as a circuit of size (i.e., work) W and depth D and with constant fan-in nodes in the circuit model can be executed in O(W/P + D) steps in the CREW PRAM model with P processors.

PROOF The basic idea is to have the PRAM emulate the computation specified by the circuit in a level-by-level fashion. The level of a node is defined as follows. A node is on level 1 if all of its inputs are also inputs to the circuit. Inductively, the level of any other node is one greater than the maximum of the levels of the nodes that provide its inputs. Let l_i denote the number of nodes on level i. Then, by assigning ⌈l_i/P⌉ operations to each of the P processors in the PRAM, the operations for level i can be performed in O(⌈l_i/P⌉) steps. Concurrent reads might be required since many operations on one level might read the same result from a previous level. Summing the time over all D levels, the total number of steps is O(⌈l_1/P⌉ + · · · + ⌈l_D/P⌉) ≤ O(l_1/P + · · · + l_D/P + D) = O(W/P + D), since l_1 + · · · + l_D = W.
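
The level-by-level emulation is easy to express as a small computation. In the Python sketch below (ours), a circuit is summarized by the list of node counts per level, and the emulation’s step count is the sum of ⌈l_i/P⌉ over the levels; the 16-input summation circuit of Figure 25.4 serves as the example.

    import math

    # levels[i]: number of circuit nodes on level i + 1.
    def brent_steps(levels, P):
        # Each level of l nodes takes ceil(l / P) steps on P processors.
        return sum(math.ceil(l / P) for l in levels)

    # The 16-input summation circuit of Figure 25.4 has levels of
    # 8, 4, 2, and 1 addition nodes (work W = 15, depth D = 4).
    levels = [8, 4, 2, 1]
    print(brent_steps(levels, P=4))  # 5, which is O(W/P + D)
    print(brent_steps(levels, P=1))  # 15, the sequential step count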
