Programming with Threads

Steve Kleiman
Devang Shah
Bart Smaalders

SunSoft Press
A Prentice Hall Title
© 1996 Sun Microsystems, Inc. — Printed in the United States of America.
2550 Garcia Avenue, Mountain View, California 94043-1100 USA.

All rights reserved. This book is protected by copyright and distributed under licenses
restricting its use, copying, distribution, and decompilation. No part of this book may be
reproduced in any form by any means without prior written authorization of Sun and its
licensors, if any.

Portions of the products described in this book may be derived from the UNIX® and Berkeley
4.3 BSD systems, licensed from UNIX System Laboratories, Inc., a wholly owned subsidiary
of Novell, Inc., and the University of California, respectively. Third-party font software in
this product is protected by copyright and licensed from Sun's font suppliers.

RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the United States
government is subject to restrictions as set forth in DFARS 252.227-7013 (c)(1)(ii) and FAR
52.227-19.

The products described in this book may be protected by one or more U.S. patents, foreign
patents, or pending applications.

TRADEMARKS — Sun, Sun Microsystems, the Sun logo, SunSoft, Solaris, Solaris Sunburst
Design, OpenWindows, ONC, ONC+, SunOS, AnswerBook, Sun FORTRAN, Wabi, ToolTalk,
NFS, XView, SunView, and The Network is the Computer are trademarks or registered
trademarks of Sun Microsystems, Inc. UNIX is a registered trademark in the United States
and other countries, exclusively licensed through X/Open Company, Ltd. OPEN LOOK® is a
registered trademark of Novell, Inc. Adobe, PostScript, Display PostScript, and PhotoShop
are trademarks or registered trademarks of Adobe Systems Incorporated. PowerPC is a
trademark of International Business Machines Corporation. Xenix, Microsoft Windows, and
Windows NT are trademarks or registered trademarks of Microsoft Corporation. All other
product names mentioned herein are the trademarks of their respective owners.

SuperSPARC and all SPARC trademarks, including the SCD Compliant Logo, are trademarks
or registered trademarks of SPARC International, Inc. SPARCstation, SPARCserver,
SPARCengine, SPARCworks, SPARCworks iMPact, and SPARCompiler are licensed
exclusively to Sun Microsystems, Inc. Products bearing SPARC trademarks are based upon
an architecture developed by Sun Microsystems, Inc.

The OPEN LOOK® and Sun™ Graphical User Interfaces were developed by Sun
Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of
Xerox in researching and developing the concept of visual or graphical user interfaces for the
computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User
Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and
otherwise comply with Sun's written license agreements.

X Window System is a trademark of X Consortium, Inc.

The publisher offers discounts on this book when ordered in bulk quantities. For more
information, contact: Corporate Sales Department, Prentice Hall PTR, One Lake Street,
Upper Saddle River, NJ 07458. Phone: 800-382-3419 or 201-236-7156, Fax: 201-236-7141,
email: corpsales@prenhall.com

Cover designer: M & K Design, Palo Alto, California
Manufacturing manager: Alexis R. Heydt
Acquisitions editor: Gregory C. Doench
Cover Design Director: Jerry Votta
Production Supervisor: Joanne Anzalone

10 9 8 7 6 5 4 3 2 1

ISBN 0-13-172389-8
16. Parallel Computation
One of the most powerful uses of threads is to speed up the execution of
computationally intensive programs on shared memory multiprocessors.
Unfortunately, effectively applying threads in this environment usually requires a
deep understanding of the structure of the algorithm and how a proposed
parallelized version of the algorithm is affected by the overheads of threads and
the shared memory multiprocessor environment. Fortunately, many different
algorithms have similar structures. This chapter covers in more detail a number
of the thread paradigms introduced in Chapter 7, "Using Threads," that are
useful in many parallel computation situations. Detailed templates and examples
are presented.
Using Threads to Parallelize Algorithms

Threads running on separate processors in a shared-memory multiprocessor
allow you to use "parallel processing algorithms" in your program. Unlike the
other uses of threads described in this book, using threads to implement parallel
algorithms can be frustrating:

• There are lots of techniques for "parallelizing" your program. How do you
  choose one that's not too hard to program and that offers substantial speedups
  compared to uniprocessor execution? Does the performance of the technique
  scale up in proportion to the number of processors you use?

• The overheads involved in synchronizing threads and sharing data among
  multiple processors may actually reduce the performance of your program.
  How can you anticipate and mitigate these problems?

• Like many performance improvements, parallelizing increases the complexity
  of your program. How can you be sure it's still correct?

These are all tough problems: we do not yet know how to solve an arbitrary
problem efficiently on a multiprocessor of arbitrary size. This section does not
offer universal solutions, but tries to demonstrate some simple ways to get
started. By sticking with some common "paradigms" for parallel algorithms and
threads, you can avoid a lot of errors and aggravation.
Though it may seem simplistic, the most important step in writing a parallel
program is to think carefully about the global structure of your program and the
computing structures that threads offer. To speed up the program, we're looking
for a way to divide the work into a set of tasks so that:

• The tasks interact little with each other;

• The data shared by separate tasks is contained in a minimum of simple data
  structures that can be protected by locking mechanisms to prevent
  unsynchronized access;

• The number of tasks can be varied so as to match the number of processors;

• All tasks have equal computing requirements, or, instead, are configured in
  such a way that the set of tasks can keep all the processors fairly busy.
As we've seen, Amdahl's Law (Figure 7-3 on page 103) sets limits on the
scalability of parallelized computations. Scalability is limited due to three kinds
of overheads: synchronization overhead, contention, and balance. The
synchronization operations required for correct multithreaded operation take
time to execute, even when there is no contention. When two or more threads
share data or locks, the system slows down due to contention for shared
resources. And finally, balance refers to the ability of the algorithm to divide work
evenly among the available processors. In the "serial" sections of your program,
where only a single processor is used, balance is worst. Poor balance is often the
result of dividing the work into large, unequal chunks: when the smaller chunks
finish, they leave processors idle. If there are always at least as many runnable
threads as available processors, the thread scheduler can keep the processors
busy. While balance can be improved by dividing work into many small chunks
that can be easily scheduled onto idle processors, making the grain size of the
processing chunks small is usually accompanied by an increase in
synchronization overhead, which hurts scalability.
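To make this tradeoff concrete, the small model below (not from the book) applies Amdahl's
Law in its usual form and adds a fixed synchronization cost per chunk of work. The
estimated_speedup() function and all of its constants are illustrative assumptions, and the
model deliberately ignores the balance benefit of small chunks; it only shows how per-chunk
overhead eats into the Amdahl ceiling as the grain size shrinks.

    #include <stdio.h>

    /*
     * Back-of-the-envelope model: Amdahl's Law plus a fixed
     * synchronization cost paid once per chunk of work. All constants
     * are illustrative, not measurements.
     */
    static double
    estimated_speedup(double serial_frac, int n_procs, int n_chunks,
        double sync_cost_per_chunk)
    {
        double t_serial = serial_frac;                     /* cannot be parallelized */
        double t_parallel = (1.0 - serial_frac) / n_procs; /* ideal even split */
        double t_sync = n_chunks * sync_cost_per_chunk;    /* overhead grows with chunks */

        return (1.0 / (t_serial + t_parallel + t_sync));
    }

    int
    main(void)
    {
        int chunks;

        /* 5% serial work, 8 processors, each chunk costs 0.05% of the job in overhead */
        for (chunks = 8; chunks <= 1024; chunks *= 2)
            printf("%5d chunks: estimated speedup %.2f\n",
                chunks, estimated_speedup(0.05, 8, chunks, 0.0005));
        return (0);
    }

Running this shows the estimated speedup falling steadily as the per-chunk overhead
accumulates, which is one half of the grain-size tradeoff described above.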
Thread Paradigms

There are many ways to make your program parallel. Some techniques are very
complex, depend on complicated data structures and locking strategies, depend
on clever non-obvious algorithms, or require compiler support. We present a
different approach in this section: three simple control structures, or paradigms,
that can be applied to a wide range of applications.

Each paradigm can be characterized by:

• How the work is divided among parallel threads, and whether each thread
  executes the same code;

• How the threads are synchronized;
• What data is shared among the threads and how it is protected by locks or
  other means to avoid data races.

The simplest paradigm is the master-slave. The main master thread launches a set
of slave threads and allocates to each slave a portion of the work to be done; the
amount of work is known in advance by the master and divided evenly among
the slaves. The master starts the slaves and waits for all of them to reach a
synchronization point, or barrier. The master may then compute and release the
slaves again if needed. This pattern is repeated until the work is done. Example: a
loop requiring 1,000 iterations is divided among four threads, each of which does
250 iterations.
Another versatile paradigm is the workpile, in which a set of worker threads
request chunks of work to do from a "workpile," usually some form of queue. As
part of the work, a worker may cause additional entries to be added to the
workpile. The pattern terminates when the workpile is finally emptied;
alternatively, a worker thread may choose to terminate the process. Example: a
search algorithm uses a workpile of promising paths that are extended by worker
threads.
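The book develops workpile code in a later chapter that is not part of this excerpt. Purely as
an illustrative sketch of the paradigm, the fragment below builds a workpile from a mutex, a
condition variable, and a linked list; the work_t type, the workpile_t fields, and the
termination rule (stop when the pile is empty and no worker is busy) are assumptions made
for the sketch, not the book's implementation.

    #include <pthread.h>
    #include <stdlib.h>

    /* Hypothetical work item; a real application would add its own data. */
    typedef struct work {
        struct work *next;
        void (*func)(struct work *);   /* what to do for this item */
    } work_t;

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  more_work;     /* signaled when work is added */
        work_t *head;                  /* the pile itself: a simple stack */
        int n_working;                 /* workers currently processing an item */
        int done;                      /* set when the pile drains for good */
    } workpile_t;

    /* Add an item; a worker may call this to grow the pile. */
    void
    workpile_put(workpile_t *wp, work_t *w)
    {
        pthread_mutex_lock(&wp->lock);
        w->next = wp->head;
        wp->head = w;
        pthread_cond_signal(&wp->more_work);
        pthread_mutex_unlock(&wp->lock);
    }

    /* Worker loop: take an item, process it, repeat until the pile drains. */
    void *
    worker(void *arg)
    {
        workpile_t *wp = arg;

        pthread_mutex_lock(&wp->lock);
        for (;;) {
            while (wp->head == NULL && !wp->done)
                pthread_cond_wait(&wp->more_work, &wp->lock);
            if (wp->head == NULL)      /* done flag set and pile empty */
                break;
            work_t *w = wp->head;
            wp->head = w->next;
            wp->n_working++;
            pthread_mutex_unlock(&wp->lock);

            w->func(w);                /* may call workpile_put() to add work */

            pthread_mutex_lock(&wp->lock);
            wp->n_working--;
            if (wp->head == NULL && wp->n_working == 0) {
                wp->done = 1;          /* empty pile and nobody left to refill it */
                pthread_cond_broadcast(&wp->more_work);
            }
        }
        pthread_mutex_unlock(&wp->lock);
        return (NULL);
    }

A worker that discovers new promising paths, as in the search example above, would simply
call workpile_put() from inside its func.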
The final paradigm is the pipeline, in which a task is passed to a succession of
threads, each of which performs part of the overall work required; the threads are
much like workers on an assembly line. In most cases, the processing in each
pipeline stage is different, but there are applications for homogeneous pipelines in
which the code executed in each stage is the same. Example: a graphics rendering
pipeline in which stages successively transform, clip, and render geometry to
form an image.
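Again as an illustration only (the book's pipeline treatment is outside this excerpt), one
common way to build a pipeline stage is to connect stages with small bounded queues. The
queue_t type, the stage_t record, and the convention that a NULL item marks the end of the
stream are assumptions for this sketch.

    #include <pthread.h>
    #include <stddef.h>

    #define QSIZE 8

    /* A small bounded queue connecting two pipeline stages. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  not_empty, not_full;
        void *items[QSIZE];
        int count, head, tail;
    } queue_t;

    void
    queue_init(queue_t *q)
    {
        pthread_mutex_init(&q->lock, NULL);
        pthread_cond_init(&q->not_empty, NULL);
        pthread_cond_init(&q->not_full, NULL);
        q->count = q->head = q->tail = 0;
    }

    void
    queue_put(queue_t *q, void *item)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == QSIZE)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->items[q->tail] = item;
        q->tail = (q->tail + 1) % QSIZE;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    void *
    queue_get(queue_t *q)
    {
        void *item;

        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        item = q->items[q->head];
        q->head = (q->head + 1) % QSIZE;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return (item);
    }

    /* One pipeline stage: transform items from "in" and pass them to "out".
     * A NULL item is used, by convention, to mean end of stream. */
    typedef struct {
        queue_t *in, *out;
        void *(*transform)(void *);    /* this stage's share of the work */
    } stage_t;

    void *
    pipeline_stage(void *arg)
    {
        stage_t *s = arg;
        void *item;

        while ((item = queue_get(s->in)) != NULL)
            queue_put(s->out, s->transform(item));
        queue_put(s->out, NULL);       /* propagate end of stream downstream */
        return (NULL);
    }

A homogeneous pipeline would simply pass the same transform function to every stage.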
Each of these three paradigms is explained in more detail in the following
sections, and each is illustrated with one or more examples.

Master-Slave Paradigm

The simplest paradigm for parallel algorithms is the "master-slave" arrangement.
A single master thread, usually the main thread of your program, partitions work
evenly among a set of slave threads. The slave threads are launched together, and
each one is assigned to a separate partition of work. The master thread waits for
all slave threads to complete their work and then continues.
Figure 16-1  Master-Slave Thread Execution
(timeline of the master and slave threads, showing blocked, runnable, and active periods)
This pattern is illustrated by the following template:

Master:
    Initialization
    for (i = 0; i < n; i++)
        pthread_create(..., slave, ...);
    /* slaves are now running */
    Master processing, if any
    for (i = 0; i < n; i++)
        pthread_join(...);
    Finalization, if any

Slave:
    slave(void *params);
    Slave processing: index i (in params) tells each slave what work
        to do. Put results into global memory or params.
    pthread_exit();
This simple kind of master-slave paradigm is used when the amount of work to
be done is known in advance and when it's easy to partition the work into n
roughly equal parts that don't depend on each other. The most common
application is to a loop, where m iterations of the loop are partitioned among the
n threads. Usually m is much greater than n, so each thread invocation executes
quite a few iterations of the loop. A key requirement is that each iteration of the
loop must be independent of every other iteration of the loop; thus no
synchronization is required among the slaves. Here are some examples where this
method applies:

• Multiplying two matrices. Each row in the product matrix can be computed by
  a separate slave. The result for each row depends only on the values in the
  matrices being multiplied, and not on the results of other rows.
• Transform-coding an image, for example in computing the discrete cosine
  transform (DCT) of every 8x8 pixel block in an image that is 640 by 480 pixels.
  There are m = 4,800 blocks to be done, and the blocks are independent of each
  other.

• Computing the cross-correlation of two signals, e.g., two images, by iterating
  over all samples in the signal, computing the mean-square distance between
  the sample and its correlate, and summing the distances. This example
  computes a running sum over the iterations of the loop. In the master-slave
  paradigm, each slave will compute a sum for the iterations it is handling, and
  store the sum in the thread's parameter record. After all slave threads have
  finished, the master thread computes the final sum by summing the entries in
  the parameter records of all threads (a minimal sketch of this pattern follows).
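As a sketch of the partial-sum idea only: the slave procedure below splits the samples across
slaves in the same way the matrix example later in this chapter splits rows. The xc_params
record, the sample arrays, and the squared-difference measure are illustrative assumptions;
only the pattern (each slave writes its own partial sum, and the master adds them up after
pthread_join()) comes from the text.

    #include <pthread.h>

    /* Hypothetical parameter record for the cross-correlation slaves. */
    typedef struct {
        int i, n;                /* slave index and number of slaves */
        int n_samples;           /* total number of samples */
        const float *x, *y;      /* the two signals being compared */
        double partial_sum;      /* this slave's share of the running sum */
    } xc_params;

    /* Each slave sums squared differences over its own range of samples;
     * no locks are needed because each slave writes only its own record. */
    void *
    cross_correlate_slave(void *arg)
    {
        xc_params *p = arg;
        int quot = p->n_samples / p->n;
        int rem = p->n_samples % p->n;
        int count = quot + ((p->i < rem) ? 1 : 0);
        int first = quot * p->i + ((p->i < rem) ? p->i : rem);
        double sum = 0.0;
        int k;

        for (k = first; k < first + count; k++) {
            double d = p->x[k] - p->y[k];
            sum += d * d;
        }
        p->partial_sum = sum;    /* master adds these up after pthread_join() */
        return (NULL);
    }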
Master-Slave Example: Matrix Multiply

To illustrate the master-slave paradigm, we present an example of a matrix
multiply routine that also computes the trace of the resulting matrix. The trace is
the sum of the diagonal elements. The outermost loop, which iterates over rows
in the result matrix, is partitioned among the slave threads.

#define MAX_PARALLEL 20     /* maximum number of threads to use */
#define SCHED_FUDGE  1.2    /* Threads/processor ratio (see text) */

#define MREF(mt, row, col)  mt->m[(row) * mt->n_cols + (col)]

typedef struct {            /* Matrix of variable size */
    int n_rows;             /* # rows in matrix: 0 <= row < n_rows */
    int n_cols;             /* # columns in matrix: 0 <= col < n_cols */
    float m[1];             /* MREF(m,row,col) elt in row-major order */
} matrix_t;

/* parameter record for slave procedure */
typedef struct {
    int i;                  /* index of slave thread 0 <= i < n */
    int n;                  /* number of slave threads */
    matrix_t *a, *b, *c;    /* compute c = a*b */
    float trace_ans;        /* trace contribution for this part of matrix */
} mm_params;
/*
 * This is the "slave procedure" that each slave thread will execute.
 * Each thread is passed a structure of type mm_params; these parameter
 * records differ only in the value of index i for each slave.
 */
void *
matrix_multiply_slave(void *param)
{
    mm_params *p = (mm_params *)param;
    matrix_t *a = p->a;
    matrix_t *b = p->b;
    matrix_t *c = p->c;
    int row, col, j;         /* c[row,col] is what is being computed */
    float trace = 0.0;

    /*
     * Calculate which rows the p->i th thread will process. Since
     * the number of threads may not divide m = c->n_rows exactly, we
     * adopt the convention that for index 0 <= i < m%n, the thread
     * will process floor(m/n)+1 rows, while all others will process
     * floor(m/n) rows.
     */
    int quot = c->n_rows / p->n;
    int rem = c->n_rows % p->n;
    int do_rows = quot + ((p->i < rem) ? 1 : 0);
    int first_row = quot * p->i + ((p->i < rem) ? p->i : rem);

    for (row = first_row; row < first_row + do_rows; row++) {
        for (col = 0; col < c->n_cols; col++) {
            /* Compute c[row,col] */
            float sum = 0.0;

            for (j = 0; j < a->n_cols; j++)
                sum += MREF(a, row, j) * MREF(b, j, col);
            MREF(c, row, col) = sum;
            if (row == col)
                trace += sum;
        }
    }
    /* Record partial trace answer in parameter record */
    p->trace_ans = trace;
    return (NULL);
}
/* Compute c = a * b, return trace of c */
float
matrix_multiply_trace(matrix_t *a, matrix_t *b, matrix_t *c)
{
    mm_params params[MAX_PARALLEL];
    pthread_t thread[MAX_PARALLEL];
    int thread_present[MAX_PARALLEL];
    int n_threads = min(MAX_PARALLEL,
        (int)(sysconf(_SC_NPROCESSORS_ONLN) * SCHED_FUDGE));
    int i, res;
    float trace = 0.0;

    for (i = 0; i < n_threads; i++) {
        thread_present[i] = TRUE;
        params[i].i = i;
        params[i].n = n_threads;
        params[i].a = a;
        params[i].b = b;
        params[i].c = c;
        res = pthread_create(&thread[i], NULL, matrix_multiply_slave,
            (void *)&params[i]);
        if (res != 0) {
            thread_present[i] = FALSE;          /* flag no thread created */
            matrix_multiply_slave(&params[i]);  /* just do it */
        }
    }
    for (i = 0; i < n_threads; i++) {
        if (thread_present[i] != FALSE)
            pthread_join(thread[i], NULL);      /* wait for thread to exit */
        trace += params[i].trace_ans;
    }
    return (trace);
}

Code Example 16-1  Master-Slave Matrix Multiply
The matrix_multiply_trace() routine shown above correctly computes the
product of the two matrices (provided no arithmetic exceptions occur). Although
the code does not show the reasoning, it has been constructed carefully to avoid
data races. Let's categorize the data that is shared among the threads, and argue
in each case that there are no data races.
1. Master->slave.
   The matrices a and b, and the parameter records, are all written by the
   master thread and read by the slave threads. In the absence of
   synchronization, this would cause a data race. But all writing by the master
   is separated from reading by a slave with a call to pthread_create(), a
   function that ensures all writes have propagated to main memory (see "Data
   Races" on page 30). So all slaves will see the values written by the master.

2. Slave->slave.
   In this example, no slave writes a memory location that another slave reads.
   Each slave is writing a separate region of the result matrix, c, and the partial
   results of the trace computation are stored by each slave in a separate
   memory location. Thus there is no need for locks or any other mechanism
   for synchronizing the slaves to each other; they are independent of one
   another.

3. Slave->master.
   The slaves write the result matrix, c, and the trace_ans entry in the
   parameter record corresponding to each slave. The master refrains from
   reading this data until after executing pthread_join(), a function that
   ensures that all writes initiated by the exiting thread have propagated to
   main memory (see "Data Races" on page 30). So the master will correctly see
   the values written by the slaves.
Note one other feature of the program: we have been careful to deal with the case
when a thread cannot be created. There's a simple way to ensure a correct result
when this happens: the master thread calls the slave procedure directly.
Performance will suffer, but the result will be correct. Note too that the main
thread is idle while the slaves work; an alternative would be to use the main
thread to do part of the work as well as the slaves.

The program is modular about its use of threads: every thread that is created is
also explicitly destroyed by waiting at pthread_join(). The routine is thus
reentrant, i.e., several separate threads could call matrix_multiply_trace(),
and each call would create a set of slave threads that would be correctly managed
by their respective master.

Note the use of stack allocation for the slave parameter records used by the slave
threads. The example is careful to ensure that all of the threads that reference the
parameter records exit before the procedure that allocated the stack storage
(matrix_multiply_trace()) exits.

How many slave threads should matrix_multiply_trace() create? To
exploit parallelism to the maximum extent, we should use at least as many
threads as there are processors available. But if we use exactly one thread per
processor, a balance problem may reduce performance. To see why, consider what
happens on a four-processor system running four matrix multiply threads when
the kernel scheduler decides to run another job on one of the processors, as
shown in Figure 16-2. One thread now becomes delayed relative to the other
three, causing the entire parallel section, when all slaves are expected to be
running, to be delayed. During some of this time, several processors may be idle.
So the ideal balance, in which all processors are busy, is not achieved.
Figure 16-2  Master-Slave with Scheduling Delay
(timeline of the slave threads, showing blocked, runnable, and active periods)
If the computation is broken up into a larger number of smaller parts, the
scheduler is better able to achieve good balance, as shown in Figure 16-3. Too
many threads, however, incur overheads in creating and destroying them, in
memory for stack space, and in other system resources. You may wish to
experiment with different strategies for choosing the right number of threads;
your experiments need to consider both the number of processors available and
the amount of work required for each thread (sometimes called the grain size).
Allocating a few more threads than the number of processors is a good starting
point. The thread scheduler will time-slice runnable threads, which helps avoid
this problem; in effect, it replaces the grain size of the computation with that of a
scheduling quantum.
How does our algorithm scale? That is, does the program speed up in proportion
to the number of processors available to share the work? Of course, if the
matrices to be multiplied are small, the overheads of creating and joining threads
will exceed the computational work. Indeed, a general-purpose routine would
test the matrix size and use a conventional serial implementation for small
matrices. But for large matrices, we should expect the program to scale well, in
part because there is no slave-to-slave synchronization required and because the
data sharing is modest.

Consider the effects of caching parts of the result matrix, c, which is shared
among all the slaves. The program above allocates to each slave responsibility for
computing a certain number of rows of the result matrix. The matrix is in
row-major order, which tends to put nearby elements of the same row in the
same cache line, and therefore keeps the chances of false sharing to a minimum.
Figure 16-3  Master-Slave with Scheduling Delay and Extra Threads
(timeline of the slave threads, showing blocked, runnable, and active periods)
Compilers sometimes use the master-slave paradigm to automatically parallelize
loops within a program. Typically a parallelizing compiler will generate code
such that the main flow of the computation is controlled by the master. When the
master encounters a parallelizable loop, it divides up the loop's work and gives it
to a set of slaves. When the slaves finish, the master continues with the main
computation. These techniques are illustrated more fully in "Parallelizing Using
Compiler Techniques" on page 311.
Barrier Synchronization

The key element of the master-slave paradigm is that every slave thread must
complete its task before the computation can proceed. Rather than requiring each
thread to exit in order to synchronize, we can allow a pool of threads to remain
ready to work on subsequent computations, as shown in Figure 16-4. Note that
each thread must synchronize to start the computation and to end it. When each
thread resumes execution, it can retain its state and local variables, which is not
possible when threads exit and are recreated.

A synchronization structure that conveniently implements this pattern is called a
barrier. A barrier is initialized by specifying how many threads share the barrier;
in the master-slave paradigm, this number is the number of slaves plus one for
the master. In the example of Figure 16-4, five threads share the barrier, one
master and four slaves.
Figure 16-4  Master-Slave with Slave Pool
(timeline of the master and four slaves alternating between sections A and B, showing blocked, runnable, and active periods)
The main operation on a barrier is barrier_wait(), which will block until the
number of threads specified in the initialization are waiting at the barrier. When
the last thread arrives at the barrier, all the threads are released. The master-slave
paradigm then becomes:
Master:
    Initialization
    bar = barrier_init(n+1);
    for (i = 0; i < n; i++)
        pthread_create(..., slave, ...);
    /* Slaves are now running */
    while (not done) {
        A: Fill in work for slaves to do
        barrier_wait(bar);
        /* Slaves are now working */
        B: Master processing, if any
        /* Wait for slaves to finish */
        barrier_wait(bar);
        A: Process results
    }
    for (i = 0; i < n; i++)
        pthread_join(...);
    Finalization, if any

Slave:
    slave(void *params);
    while (not done) {
        /* A: wait for work */
        barrier_wait(bar);
        B: Slave processing: data in global memory or params tells
            what to do. Put results into global memory or params.
        /* Wait till all are done */
        barrier_wait(bar);
    }
    pthread_exit();
This paradigm is structured so that the slaves are in a loop in which two calls to
barrier_wait() divide the loop into two sections we have labeled A and B.
During section A, the master is running and sets up work for the slaves to do; this
section corresponds to the region where the master thread is active in Figure 16-4.
When the master and all n slaves have entered the barrier by calling
barrier_wait(), the barrier releases all the threads and all return from the
call. Now both master and slaves are in section B, where the slaves perform work,
and the master may be idle. (However, this structure makes it easy for the master
thread to assume a share of the work during section B, in effect participating as
another slave.) At the end of section B, the master and slaves call
barrier_wait() again; now all threads are in section A and the process repeats.

The barriers provide the synchronization necessary to avoid data races between
the master and the slaves. Data written by the master or slaves during one section
can be safely read by the master and all slaves in any subsequent section, because
barrier_wait() not only contains the necessary synchronization function that
forces data writes to memory, but also sequences all of the threads so that a
reader of shared data can know that the data was written during a previous
section. The template shown above uses this synchronization so that the master
can safely pass work parameters to the slaves and the slaves can safely pass
results back to the master. For example, the master could specify a different
matrix multiplication task each time, or could use a more general parameter
record to specify different slave tasks with each invocation. This design allows a
pool of slave threads, waiting in the barrier, to be used by the master in a wide
variety of ways at different times during the execution of the program.
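To make the "more general parameter record" idea concrete, here is one possible shape for it,
offered only as a sketch and not as the book's code: the master fills in a task_t during
section A, and every slave in the pool runs the task's function during section B. The task_t
and pool_slave_params types and the use of a function pointer are assumptions for the
sketch; barrier_init() and barrier_wait() are the routines of Code Example 16-2 below (the
summing argument is simply 0 here).

    #include <pthread.h>

    /* Hypothetical "general parameter record" filled in by the master
     * during section A; all fields are illustrative. barrier_t and
     * barrier_wait() are from Code Example 16-2. */
    typedef struct {
        void (*func)(int slave_index, void *arg);  /* work for this round */
        void *arg;                                 /* shared data for the task */
        int done;                                  /* master sets this to retire the pool */
    } task_t;

    typedef struct {
        barrier_t bar;     /* barrier shared by master and slaves (n+1 clients) */
        task_t *task;      /* current task, written only during section A */
        int index;         /* this slave's index, 0 <= index < n */
    } pool_slave_params;

    void *
    pool_slave(void *param)
    {
        pool_slave_params *p = param;

        for (;;) {
            barrier_wait(p->bar, 0);       /* end of section A: work is posted */
            if (p->task->done)
                break;                     /* master then calls pthread_join()
                                              instead of barrier_wait() again */
            p->task->func(p->index, p->task->arg);  /* section B: do this round */
            barrier_wait(p->bar, 0);       /* end of section B: results are ready */
        }
        return (NULL);
    }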
An implementation of a symmetric summing barrier is shown below. The design is
symmetric in that master and slave threads are not distinguished. An asymmetric
barrier, in which master and slaves are treated differently, is more appropriate in
the special case where the master and slaves truly alternate processing (i.e., the
master does no work in section B and the slaves do no work in section A). In this
case, you will notice that both master and slaves make two immediately adjacent
calls to barrier_wait() with no intervening computation; these can be
collapsed into a single call that requires less synchronization overhead.

The summing nature of the barrier allows all the threads to exchange an important
piece of information as part of the synchronization at the barrier. Each caller of
barrier_wait() specifies an integer. When barrier_wait() returns, it
provides a result that is the sum of all the values specified in the calls to
barrier_wait() by threads entering the barrier. This feature is not used in the
template shown above, but will be important in a more complex example that
follows. It's obvious how to remove the code used for summing if you don't need
it.
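As a small illustration of the summing feature (this particular use is an assumption, not the
example the book develops next): each thread can pass 1 if its own portion of the work has
not yet converged and 0 if it has, and every thread then returns from barrier_wait() knowing
how many portions are still changing.

    /* Sketch only: one step seen from a single thread. bar is a barrier_t
     * shared by the master and all slaves, as in Code Example 16-2;
     * my_portion_changed is 1 if this thread's data failed its convergence
     * test this round, 0 otherwise. */
    int
    portions_still_changing(barrier_t bar, int my_portion_changed)
    {
        /* Every caller contributes 0 or 1; barrier_wait() hands the total
         * back to all of them, so each thread can decide to stop when the
         * total reaches zero, without any additional locking. */
        return (barrier_wait(bar, my_portion_changed ? 1 : 0));
    }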
typedef struct barrier_struct {
    pthread_mutex_t lock;   /* Mutex lock for the entire structure */
    int n_clients;          /* Number of threads to wait for at barrier */
    int n_waiting;          /* Number of threads that have called barrier_wait */
    int phase;              /* Flag to separate waiters from fast workers */
    int sum;                /* Sum of arguments passed to barrier_wait */
    int result;             /* Answer to be returned by barrier_wait */
    pthread_cond_t wait_cv; /* Clients wait on condition variable to proceed */
} *barrier_t;

/* Create & initialize a barrier with the given number of client threads */
barrier_t
barrier_init(int n_clients)
{
    barrier_t barrier =
        (barrier_t) malloc(sizeof (struct barrier_struct));

    if (barrier != NULL) {
        barrier->n_clients = n_clients;
        barrier->n_waiting = 0; barrier->phase = 0; barrier->sum = 0;
        pthread_mutex_init(&barrier->lock, NULL);
        pthread_cond_init(&barrier->wait_cv, NULL);
    }
    return (barrier);
}

/* Destroy a barrier */
void
barrier_destroy(barrier_t barrier)
{
    pthread_mutex_destroy(&barrier->lock);
    pthread_cond_destroy(&barrier->wait_cv);
    free(barrier);
}
/* Wait until the required number of threads enter the barrier */
int
barrier_wait(barrier_t barrier, int increment)
{
    int my_phase;

    pthread_mutex_lock(&barrier->lock);
    my_phase = barrier->phase;
    barrier->sum += increment;
    barrier->n_waiting++;
    if (barrier->n_waiting == barrier->n_clients) {
        barrier->result = barrier->sum;
        barrier->sum = 0;
        barrier->n_waiting = 0;
        barrier->phase = 1 - my_phase;
        pthread_cond_broadcast(&barrier->wait_cv);
    }
    /* Wait for the end of this synchronization phase */
    while (barrier->phase == my_phase) {
        pthread_cond_wait(&barrier->wait_cv, &barrier->lock);
    }
    pthread_mutex_unlock(&barrier->lock);
    return (barrier->result);
}

Code Example 16-2  Symmetric Summing Barrier
There is a subtle point in the design of the barrier: the use of the phase element,
which can have two values, 0 or 1. When the final thread arrives at the barrier, it
flips the value of phase. This tells the other threads waiting at the barrier that the
current phase of waiting has ended and the next has begun. A fast thread may
wake up and hit the next use of the barrier before a slower thread wakes up and
tests the condition. A slow thread can never see the value of phase flip twice
while it is waiting, since only the final thread arriving at the barrier will change
the value of phase.
Master-Slave Example: Relaxation

Barrier synchronization can be used to enforce sequencing among slaves to avoid
data races when one slave must write data that another slave must read. The
matrix multiplication example does not show this form of data dependency: the
work of each thread is independent of that of the other threads. This section
presents a more general example that uses the symmetric summing barrier to
coordinate slave dependencies.
The example is a relaxation algorithm, or "Jacobi iteration," of the form often used
to solve Laplace's equation to determine, for example, the steady-state
temperature over a strangely shaped body when the temperature of its
boundaries is held fixed. The algorithm approaches a solution iteratively; in each
iteration, the temperature of a point is computed to be the average of the
temperatures of its four neighbors, except that temperatures at the boundary are
not changed. If T_i(x, y) is the temperature at location (x, y) after the ith iteration
of the algorithm, assuming that space has been quantized into a two-dimensional
grid, then the algorithm computes the temperature values for the next iteration as
follows:
T_{i+1}(x, y) = ( T_i(x-1, y) + T_i(x+1, y) + T_i(x, y-1) + T_i(x, y+1) ) / 4
The algorithm below uses a matrix t (the matrix_t type introduced in the
matrix multiplication example) to represent the solution matrix. It identifies
boundary points using another matrix, boundary. On each iteration, it computes
a new solution matrix and tests for convergence.
void
relax(matrix_t *boundary, matrix_t *t)
{
    matrix_t *t_old = t;
    matrix_t *t_new = matrix_alloc(t->n_rows, t->n_cols);
    matrix_t *temp;
    int row, col, converged;

    do {
        converged = TRUE;
        for (row = 0; row < boundary->n_rows; row++) {
            for (col = 0; col < boundary->n_cols; col++) {
                if (MREF(boundary, row, col) == 0) {   /* Not on boundary */
                    float v = (MREF(t_old, row - 1, col) +
                        MREF(t_old, row + 1, col) +
                        MREF(t_old, row, col - 1) +
                        MREF(t_old, row, col + 1)) / 4;

(The remainder of this listing is not included in this excerpt.)