`
`introduction to Multiprocessing
`
`B13
`
`Summing the lloating-point requiretnents for all of the geometry stages gives a total of
`2,22ll,{1I'.Il multiplicationsldivisions and l,4'.-'{l,DllE|' additionsisubtractions per frame. Since
`it new frame is calculated every -ll,-second, a total of 22.2 million multiplicationsidivisions
`and 14.7 million additionsisubtractions (36.9 million aggregate lloating—point operations)
`as required per second—a very substantial number.
`
`Easteriratloo calculations and fraIne—bufl'er accesses. Let us now estimate the
`
`number of pixel calculations and frame-buffer memory accesses required in each frame. We
`assume that 2 values and REE triples each occupy one word (32 hits} of frame-bull‘er
`memory {typical in most current high~performance systems}. For each pixel that is initially
`visible {i.e., results in an update to the frame buffer). 2. R. G. and E values are calculated [4
`additions per pixel if forward difl'erences are used}, a r value is read from the frame bulfer [1
`ft‘aJ'tle-buffer cycle}, the 2 values are compared [1 subtraction} and new 2 values and colors
`are written (2 t‘rame—bufler cycles}. For each pixel that is initially not visible, only the 2
`value needs to be calculated (1 addition), and a 2 value is read from the frame buffer (I
`frame-buffer cycle}, and the two 2 values are compared {I subtraction}. Note that initially
`visible pixels may get covered, but initially invisible pixels can never be exposed.
`Since we assume that one-half of the pixels of each triangle are visible in the final
`scene. a reasonable guess is that
`three-quarters of the pixels are initially visible and
`one-quarter of the pixels are initially invisible. Each t:riangle covers till} pixels, so % - I00 -
`lil,flllIil = ’.-'5{i,lIIilll pixels are initially visible and-} - liiii- l[l,i'liltII = 25ll,i)Dll pixels are ini-
`tially invisible. To display an entire frame. therefore, a total of [‘l5tII,flfl[} -
`:5} + {25tl,llil0 -
`2) = 4.25 million additions and [?5ll,DIilJ - 3} + {25iLl.il{l|J - I) = 2.5 million frame-buffer
`accesm is required. To initialize each frame, both color and 2-|:Iuli'ecrs must be cleared, an
`additional [239 - 1024 - 2 = 2.6 million frame-bulfer accesses. The total number of
`frame-buffer accesses per frame, therefore, is 2.5 million + 2.6 million = 5.] million. If
`It} frames are generated per second, 42.5 million additions and 51 million frame-buffer
`accesses are required per second.
`In 1989,
`the fastest floating-point processors available computed approximately 20
`million floating-point operations per second,
`the fastest
`integer processors computed
`approximately 40 million integer operations per mood, and DRAM memory systems had
`cycle times of approximately ltlil nanoseconds. The floating-point and integer requirements
`of our sample application, therefore, are just at the limit of what can be achieved in a single
`CPU. The number of frame-bufleraccesses, however, is much higher than is possible in acou-
`vcntional memory system. As vie mentioned earlier. this database is only modestly sized
`for systems available in 1939. In the following sections, we show how multiprocessing can be
`used to achieve the performance necessary to display databases that are this size and larger.
`
`13.4 INTRODUCTION TD MULTIFHDGEESING
`
`Displaying large databases at high frantic rates clearly requires dramatic system perfoh
`mance, both in terms of computations and of memory of bandwidth. We have seen that the
`geometry portion of a graphics system can require more processing power than a single
`CPU can provide. Likewise, rastetization can require more bandwidth into memory than a
`single memory system can provide. The only way to attain such performance levels is to
`
`Volkswagen 1017 - Part 16 of 21
`
`0935
`0935
`
`Volkswagen 1017 - Part 16 of 21
`
`
`
`BT4
`
`Advanced HIIIEI Graphics Architecture
`
`ifil
`
`
`
`
`
`{'3}
`
`Fig. 13.9 Basic forms of multiprocessing: {a} pipalining. and (bi parallalism.
`
`perform multiple operations concurrently and to perform multiple reads and writes to
`ruentory eoncurrently—we need concurrent procming.
`Concurrent processing, or multiprocessing.
`is the basis of virtually all l'ligh-perlot'-
`mance graphics architectures. Multiprocessing has two basic forms: pipellnlng and
`parallelism (we reserve the term cortcttrrertcy for multiprocessing in general}. A pipeline
`processor contains a number of processing elements [PEs} ananged such that the output of
`one becomes the input of the next. in pipeline fashion {I-iig. lE.9a}. The PEs of a parallel
`processor at'e arranged side by side and operate simultaneously on different portions of the
`data (Fig.
`IE.‘-Lib).
`
`1 3.4.1 Pipalining
`
`To pipeline a computation, we partition it into stages that can be executed sequentially in
`separate PEs. Obviously. a pipeline can run only as fast as its slowest stage, so the
`processing load should be distributed evenly over the PEs. If this is not possible. PEs can be
`sized acuortling to the
`they must
`An important issue in pipeline systems is throughput versus latency. Throughput is the
`overall rate at which data are processed;
`latency is the time required for a single data
`element to pass front the beginning to the end of the pipeline. Some calculations can be
`pipelinetl using a large number of stages to achieve very high throughput. Pipeline latency
`increases with pipeline length. however. and certain oornputations can tolerate only a
`limited amount of latency. For example,
`real«time graphics systems. such as flight
`simulators. must respond quickly to changes in flight controls. It’ more than one or two
`frames are in the rendering pipeline at once, the system's interactivity may be impaired,
`regardless of the frame rate.
`
`0936
`0936
`
`
`
`13.4
`
`Introduction to Hll.lltipl‘t.'IGfl'IIil'lfl
`
`B'tl:'r
`
`'lB.Il.2 Parallelism
`
`Tb par-allelize a computation, we partition the data into portions that can be processed
`independently by different PEs. Frequently, PEs can execute the same pnograrn. Homoge-
`neous parallel processors contain PEs of the same type; heterogeneous parallel processors
`contain FEE of different types. in arty parallel system, the overall computation speed is
`determined by the time required for the slowest PE to finish its task.
`It is important,
`tlterefore, to balance the processing load among the His.
`is further distinction is useful for homogeneous parallel processors: whether the
`processors operate in loci: step or independently’. Processors that operate in lock step
`generally share a single code store and are called single-ittttnrction mrrt'tr'pl'e—doto {SIMDJ
`processors. Processors that operate independently must have a separate code store for each
`PE and are called multiple-instruction multiple-doto {MIME} processors.
`
`SIMD processors. Because all the Plis in a SIMD processor share a single code store.
`SIMD processors are generally less expensive than l'vlIMD processors. However, they do
`not perform well on algorithms that contain conditional branches or that access data using
`pointers or indirection. Since the path taken in a conditional branch depends on data
`specific to a PE, different PEs may follow different branch paths. Because all the His in a
`SIMD processor operate in loci: step, they all must follow every possible branch path. To
`accommodate conditional branches, PEs generally contain an enable regtrter to qualify
`write operations. Clnly P'Es whose enable registers are set write the results of computations.
`By appropriately setting and clearing the enable register, PEs can execute conditional
`branches [see Fig.
`lll.ll]a}.
`rlrlgortithms with few conditional branches execute efiioiently on SIMD processors.
`ttlgoritlnrrs with many conditional branches can be extremely inefiicierrt, however, since
`
`strttentent I;
`if not cottdrltion then
`
`ertrtble = FALSE;
`stotentertt 2;.
`toggle enable;
`statemrrrt 3;
`stoternerrr 4;
`enable -= TRUE:
`statement 5;
`stotetnetrt fit
`
`statement I;
`if condition than
`
`statement 2;
`else
`begin
`statement 3:
`statement 4;
`end:
`stntetnetu 5:
`statement ti;
`
`Total operations:
`to if condition evaluates TRUE,
`lll if condition evalrrates FALSE
`
`Total operations:
`.5 if condition evaluates TRUE,
`6 if condition evaluates FAl.5E
`
`lat
`
`lb}
`
`Fig. 13.10 la} SIMD and [bi MIMI] oltpraaaions of the same algorithm. In a SIMD
`program. conditional branches transform into operations on the enable register. When
`the enable register of a particular PE is FALSE. the PE executes the current instruction.
`but does not write the result.
`
`0937
`0937
`
`
`
`B15
`
`Advanced Heater Graphics Architecture
`
`most FEs may be disabled at any given time. Data structures containing pointers {such as
`linked H515 Dt trees} or indexed arrays cause similar pmbletns. Since a pointer orarray index
`may contain a different value at each PE. all possible values must be enumerated to ensure
`that each PE can make its required memory reference. For large arrays or pointers, this is an
`absurd waste of processing resources. A few SIMD processors provide separate address
`wires licr each PE in order to aiaoid this problem, but this adds size and complexity to l.he
`system.
`
`MIMI! processors. MIME processors are more expensive than SIMD processors. since
`each PE must have its own code store and controller. PEs in a MIMD processor often
`execute the same program. Unlike SIMD FEs, however, they are not constrained to operate
`in lock step. Because of this freedom, MIMD processors suffer no disadvantage when they
`encounter conditional branches; each PE makes an independent control-flow decision,
`skipping instructions that do not need to be executed (see Fig. 13. llllb}. As a result. MIMD
`processors achieve higher efliciency on general types of computations. However. since
`processors may start and end at different times and may process data at different rates.
`sjmchronization and load balancing are n1ore‘l:lil‘l‘icult. frequently requiring FIFD buffers at
`the input or output of each PE.
`
`13.4.3 Multiprocessor Graphics Systems
`
`Pipeline and parallel processors are the basic building bloclts of virtually all current
`high-perfolmattce graphics &T3lt.'.t'I‘1S. Both techniques can be used to accelerate l"mnt—en:d
`and back-end subsystems of a graphics system. as shown in Fig. 13.11.
`In the following sections. we examine each of these strategies. Sections 13.5 and 13.6
`discuss pipeline and parallel front-end architecturm. Sections 13.8 and 18.9 discuss
`pipeline and parallel back-end architectures. Section 13.11] discusses back-end architec~
`tures that use pa.ralleI techniques in combination.
`
`From
`traversal
`5111119
`
`Pipeline
`
`Parallel
`
`Ftp. 13.11 Fipelining and parallelism can be used to accelerate both front-end and
`back-and portions of a graphics system.
`
`0938
`0938
`
`
`
`13.5
`
`Pipeline Front-End Architectures
`
`31'?
`
`18.5 PIPELINE FRONT-END ARCHITECTURES
`
`Recall from Section 13.3 that the front end of a graphics display system has two major
`tasks: traversing the display model and ttansfonning primitives into screen space. fits we
`have seen, to achieve the rendering rates required in current applications. we must use
`concurrency to speed these computations. Both pipelining and parallelism have been used
`for decades to build front ends of high-performance graphics systems. Since the front end is
`intrinsically pipelined. its stages can be assigned to separate hardware units. Also. the large
`numbers of primitives in most graphics databases can be distributed over multiple
`processors and processed in parallel. In this section, we discuss pipeline front-end systems.
`We discuss parallel front—e11d systems in Section l3.ti.
`it
`l8.T. we mentioned that
`In introducing the standard graphics pipeline of Fig.
`provides a useful conceptual model of the rendering process. Because of its linear nature
`and fairly even allocation of processing efl'ort. it also maps well onto a physical pipeline of
`processors. This has been a popular approach to building high-performance graphics
`systems since the lilolls, as described in [lvIY‘ERo8]. a classic paper on the evolution of
`graphics architectures. Each stage of the pipeline can be implemented in several ways: as an
`individual general-purpose processor, as a custom hardware unit. or as a pipeline or parallel
`processor itself. We now discuss implementations for each stage in the front-end pipeline.
`
`13.5.1 Application Program and Display Traversal
`
`Some processor must execute the application program that drives the entire graphics
`system. In addition to feeding the graphics pipeline. this processor generally handles input
`devices. file Ill), and all
`interaction with the user.
`In systems using immediate-mode
`traversal. the display model is generally stored in the CPLl‘s main memory. The CPU must
`therefore traverse the model as well as run the application. In systems using retained mode.
`the model is generally [but not always} stored in the display procfifi memory. with the
`displayproeessorperfoimingu-an-ersal. Eecaasesuch systemsusetwoprocessotsfortheee
`tufts. they are potentially faster. although tlztey are less flexible and have other limitations.
`as discussed in Section 12.2.
`
`Where very high performance is desired. a single processor may not be powerful
`enough to traverse the entire database with sufiicient speed. The only remedy is to partition
`the database and to traverse it
`in parallel. This relatively new technique is discussed in
`Section 18.6.1-
`
`18.5.2 Geometric Transformation
`
`The geometric transformation stages [modeling transformation and viewing transfom1a-
`tion} are highly compute-intensive. Fortunately. vector and matrix multiplications are
`simple calculations that require no branching or looping, and can readily be irnpletnented in
`hardware.
`
`The most common implementation of these stages is a single processor or functional
`unit that sequentially transforms a series of vertices. A pioneering processor of this type
`was the lvlatriit
`lvlultipller [SUTHfi8l. which could multiply a four-element vector by a
`
`0939
`0939
`
`
`
`ET!
`
`Atlvaneetl flaatar Graphics Arehitaetura
`
`hcanogeneous transformation matrix in 20 microseconds. Other special-purpose geometry
`proeasors haveheeodeveloped sineethen, most notably Clarlfsfieometry Engine. which
`can perform clipping as well {see Section l3.5.5}. Recent geometry processors have
`exploited the power and programmability of oomtnercial floating-point chips.
`If pipelining does not provide enough performance. transformation computations can
`be parttllelized in several ways:
`
`I
`
`Individual components ofa vertett may be calculated in parallel. Four parallel
`p|'oemms.eachomuinutgdtecurrmtnansformadm1nmnia.canmraluateme
`ettpreaeionafor.r,y. z. and iv inparellel.
`
`' Multiple vertices can be transformed in parallel. If primititucs are all of a uniliarm
`type—say_ t:riangIs—the three vertices of each triangle can be transformed simttlta-
`neously.
`
`I
`
`Entire primitives can be transformed in parallel. If rt uansforrnation engines are
`available, eaclt processor can transform every nth primitive. This technique has many
`of tlte advantages and disadvantages of parallel front-end systems. which we will
`discuss in Section ISJ5.
`
`13.5.3 Trivial Aoeeptil-‘tale-et Claasificatiort
`
`'I."rivial accept and reject tests are straightliorward to irnplernent. since they require at worst a
`dot product and at best a singie floating—point comparison tor subtract} to determine on
`which side of each clipping platte each vertex lies. Because tlaese tests require little
`computation. they are gertemllyr perfomied by the processor that transforms primitives.
`
`13.5.4 IJfl'I‘IiI'Ifl
`
`liglttingcalcuIationsatestraigl'ttforwardandareIloatirtg-
`point-intensive. A.q:eeializedhudvmeproemoreancaloulateterte:eolorsbasedona
`polygon’: eolorand the light sector. More frequently. lighting calculations are performed
`using a programmable floating-point processor. In lovter-pe:rfot1uanee systems.
`lighting
`calctdadomcanbedoneinthesatrteproeeaaaorthatoansfornuvertiees. l'~lotethatifPhong
`shading is used. lighting calculations are deferred until the raateriaatioo stage.
`
`18.5.5 Clipping
`
`Pblygondipphtgwasmioeooreidnedcumbusonmsitmeflunmnbaofvmtieescan
`change dttring the clipping proem and concave polygons can fragment
`into multiple
`polygons during clipping. Sutherland and Hodgman [5l.lTl-W4] showed that arbitrary
`convert or concave polygons can be clipped to a convex view volume by passing the
`potygon'stetticesdu'onghasirtgle processing unit multiple times. Eachpassthroughthe
`unit clips the polygon to a different piane. In I930, Clark proposed unwrapping this
`prmuehtgbopinmasimplepipdineofidenfiulprmesammeachofnhidtootddbe
`impleunudieasingieVLSIdtip.whidthenmmdfluGmmdntEtgfiw[CLARB2].The
`
`perspectivedivisionaswell.
`
`0940
`0940
`
`
`
`13.5
`
`Pipeline Front-End Architecture:
`
`339
`
`Clipping using a Geometry Engine {or similar processor) can be performed either by a
`single processor that clips each polygon by as many planes as necessary. or by a pipeline of
`clipping processors, one for each clipping plane. The technique chosen alfects the
`worst-case performance of the graphics system: Systems with only one clipping processor
`may bog down during frames in whiclr large numbers of primitives need to be clipped,
`whereas systems with a clipping processor for each clipping plane can run at full speed.
`However, most of the clipping processors are idle for most databases and views in the latter
`approach.
`General-purpose floating-point units recently have begun to replace custom VLSI
`transformation and clipping processors. For example, Silicon Graphics, which for many
`years employed custom front-end processors, in l9‘El9 used the Weitelt 3332 floating-point
`chip for transformations and clipping in their PUWER IRIS system [described in detail in
`Section 13.8.2}. The delicate balance between performance and cost now favors commodi-
`ty processors. This balance may change again in the future if new graphics-specific
`functionality is needed and cannot be incorporated economically into general-purpose
`processors.
`
`1 B.5.E Division by w and Mapping to 3D Viewpoint
`
`Like geometric transformation and lighting, the calculations in this stage are straightfor-
`ward but require substantial
`tloating-point resources. A floating-point divide is time
`consuming even for most floating-point processors {many processors use an iterative
`method to do division]. Again, these stages can be implemented in custom functional units
`or in a commercial floating-point processor. In very high-performance systems,
`these
`calculations can be performed in separate, pipelined processors.
`
`13.5.7 Limitations of Front-End Pipelines
`
`technique for building high+perfotmance
`Even though pipelining is the predominant
`front-end systems.
`it has several limitations that are worth considering. First, a different
`algorithm is needed for each stage of the front-end pipeline. Thus. either a variety of
`hard-wired functional units must be designed or, if programmable processors are used,
`different programs must be written and ioacled into each processor. In eithercasc. processor
`or functional-unit capabilities must be carefully matched to their tasks, or bottlenecks will
`occur.
`
`least to
`Second, since the rendering algorithm is committed to hardware {or at
`firmware, since few systems allow users to reprogram pipeline processors}, it is difficult to
`add new features. Even if users have programming support for the pipeline processors, the
`distribution of hardware resources in the system may not adequately support new features
`such as complex primitives or collision detection between primitives.
`A final shortcoming of pip-elined front ends is that the approach breaks down when
`display traversal can no longer be performed by a single processor, and this inevitably
`occurs at some perfonnance level. For example, if we assume that traversal is pcrfonned by
`a ED-MHZ processor and memory system,
`that the description of each triangle in the
`database requires -it] words of data (for vertex coordinates, normal vectors. colors, etc.},
`and that each word sent to the pipeline requires two mem-otyfprocessor cycles [one to read it
`
`0941
`0941
`
`
`
`BID
`
`Advaneadltaatarfirapltiealtreltitaeuire
`
`front memory. another to load it into the pipeline}. then a maximum of 2fl.iltI}.[llIll I {2 -
`40} == 15l.'l.lIl.'l triangles peraeeond can bedisplayed by the system. no matter how powerful
`the processors in the pipeline are. Current systems are rapidly approaching such limits.
`Whateisecan bedone. themtoachieve higher-pcrfonnsnce'.*1'ltealte:-nativeto
`pipelining front-entl calculatlm is to paralleliee them. The follovving section t2Ifit.“.'t'lE this
`second any to build high-performmoe front-end systems.
`
`13.5 PARALLEL FRONT-END AHGHITECTUHES
`
`Since graphics databases are regular. typically consisting of a large number of primitives
`that receive nearly identical processing. an alternate vvay to add concurrency is to partition
`diedataintosepantestreairasridtopmotasslttemindependentiy. Formoststage-sot'the
`l'mnt—end mbsystem. such partitioning is
`readily done: for example.
`the geometric-
`transformation stages can use any ofthe parallel techniques described in Section l8.5.2.
`Hovaevef. stages in which data streams diverge {display traversal} or converge {between the
`front end and back end} are problematic. since they must handle the full data bandwidth.
`
`13.8.1 Display 'l'raveraal
`
`Almost all application programs mime a single. contiguous display mode! or database. In
`aparaltelfront-end systern.tl:tesintpluttecitniqueisa:tu-aversethedatabasc inasingle
`processor {aerial lrlvufill and than to distribute pritnitives to the parallel processors.
`Unfortunately. this serial traversal can become the bottleneck in a parallel front-end system.
`Severaltechnirnioscanbeusedtoacceleratcserial traversal:
`
`'
`
`"
`
`‘I
`
`Traversal routirtescanhettptimizedttrwritten in assetnhlycode
`
`Thedatahasecanbesturedinfaatei'memory{i.e.,SlUl.hlinsteadofDRfitivIl
`
`A ifastertraversal processor torone optimised for the particular structure format} can be
`used.
`
`Ifthese optimizations are not enough. the only alternative is to traverse the database in
`parallel. Thedatahaseeithercan be stored in asingie memory sy-stemthat allowspnraliei
`acom by multiple processors ta shared-memory model]. orcan bedistributed over multipie
`prooeuors. each with its own memory system is rtisnibuted-memory model}.
`The advantage of the sbaned-memory approach is that the database can remain in one
`place. although traversal must be divided among multiple processors. Presumably. each
`processor is assigned a certain portion oi‘ the datatme to traverse. Unfortunately. inherited
`attributes ina hieruchicaldatabammodelmeandtatprocessorsmustoontendftxaccess to
`thesarriedata. FtnumnpIe.eard|processornnistl|aveaceesstothcctumtnaansfmmation
`ntatnitaitdtoodtn-viewirtgandligl'tIing parameters. Since thedatabandtvidthtoandfroot
`ashared-ntentorysystemmaynotbemuch higherthan thatofaoomenrional l'ltemos'y
`systeltl. the sltated-ttletltory approach may not provide enough perfonnance.
`In the distributed-memory approach. each processor contains a portion of the database
`in its local memory. ltrraversesits portionotdtedatattaseforeachfratneand may also
`perform other front-end computations. Distributing the database presents its own problems.
`hon-ever: Unless the system gives die application prugrammerthe illusion ofa contiguous
`database. it cannot support portable graphics libraries. Also. the load must be balanced
`
`0942
`0942
`
`
`
`13.5
`
`Parallel Front-End Architectural
`
`881
`
`over the traversal processors if system resources are to be utilized fully. Hierarchical
`databases exacerbate both of these problems, since attributes in one level of a hierarchy
`aifect primitives below them. and structures deep in a hierarchy may be referenced by
`multiple higher-level structure calls.
`Tl1.-e following two mtions eztarnitre two Ways to distribute a hierarchical database over
`multiple processors: by structure. where each traversal processor is given a complete branch
`of the structure hierarchy; or by primitive, where each traversal processor is given a fraction
`of the pritnitives at each block in the hierarchy.
`
`Distributing by structure. Distributing by structure is outwardly appealing, since
`state—changing elements in the structure apparently need to be stored only once. This can be
`an illusion, however, since multiple high~|evel structures may refer to the same lower-level
`substructure. For example. a database containing several cars, each described by a separate
`car structure. can be distributed by assigning each car structure to a separate processor.
`However, if each car stmcture refers to a number of wheel structures, vviteel structures must
`
`also be replicated at every processor.
`Load balancing among procemors is also diflicult. Since primitives in a structure are
`likely to be spatially coherent, changing the viewpoint or geometry within a scene may
`cause entire portions of the structure to be clipped or to reappear. Maintaining even loading
`among the multiple processors would require reassigning portions of the database
`dynamically.
`
`Distributing by primitive. Distributing by primitive is costly, since the entire hierarchi-
`cal stntcture of the database and any state-changing commands must be replicated at each
`processor. Structure editing is also expensive, since changes must be broadcast to every
`processor. Load balancing, however, is automatic. Since objects in a hierarchical database
`typically contain a large number of simple primitives {e.g., polygons forming a tiled
`surface), these primitives will be scattered over all the processors, and each processor will
`have a similar processing load.
`the highest-
`In I939.
`is a relatively new technique.
`Parallel display traversal
`performance architectures were just approaching the point where serial traversal becomes
`insuflicient, and only a few systems had experimented with parallel traversal [FUCHSQL
`Neither of the disuibution techniques for hierarchical databases that we have described is
`ideal. Compared to geometry processing, which easily partitions into parallel tasks, display
`traversal
`is much more difiicult.
`lvievertheless, parallel
`traversal
`is
`likely to become
`increasingly important as system performance levels increase.
`
`‘lB.E.2 Flecornblnlng Parallel Streams
`
`The transition betvveen the front-end and back-end portions of the rendering pipeline is
`troublesome as well. In a parallel front-end system. the multiple streams of transformed and
`clipped primitives must be directed to the processor or processors doing rasterization. This
`can require sorting primitives based on spatial
`infonnation if dilierent processors are
`assigned to different screen regions.
`A second difliculty in parallel front-end systems is that the ordering of data may change
`as those data pass through parallel processors. For example, one processor may transform
`two small primitives before another processor transfonns a single, larger one. This does not
`
`0943
`0943
`
`
`
`E32
`
`Aadvaneodllasterflripltieanrelsltaetura
`
`matter for many graphics primitives and rendering techniques. Certain global commands.
`however. sttch as commands to update one window instead of another or to switch between
`double butters. require that data be synchronized before and after the command. If a large
`mmbu*oicommandssudtastheseoecurs.sometypeoftutdueresuppm1for
`synchronization rnayhenecessary. Aflasterficltnologiessystem 1'l'0RBS'i’] incorporatesa
`specidHFDinmeachPEmunoresugoodesforeachcomnmndmulalImvscommmflsw
`be resynchronlsed after they have been processed in separate FE-is.
`
`1a.s.3 Pipalinhtg trtrraua Pttnllaliarn
`
`We have seert that both pipelining and parallelism can be used to build higlt-performance
`front-end subsystems. Although pipelining has been the predominant technique in systems
`of the hm decade. panlleliam oflers several advantages, including reconfigurahiiity for
`difiamtalgrdmnmsiuxasinghprocmmhandlesfllfiom-uflulwlafions.mtdnune
`nnduhrhy.fincePEsmapufl|eisyueorcmthenndelurmgernmmnaneufilymmma
`pipdimsystmhfiemmedupwhnmauofapipdimsyfimthlimitedhydufluwgltpm
`ofits slowest stage. pipelines do not scale up as readily as do parallel systems. Parallel
`systen:s,cotheotltet'hand,teqt.tire nsorecomplicated synchnanizationandload balancing
`uflunnutneqncidizedpsocususaswdlucanpipdinedsyruursflothdesigmue
`Iitdymhemehtlinmehmne:indeed.thehiglest-pufmnuncesysteoauelikdym
`cotnbittedtetwo.
`
`13.7 MULTIPHDCEESDH HAS'I'EIiI|ZA1'lDlll ARCHITECTURES
`
`Recsllthattheoutputofthefront-endstbsystetttistypicallyasetofprimititvesinscteett
`coordinates. The rasterizatiort {back-end} subsystem creates the final
`image by amt
`converting each of these primitives, detertrtinirtg which primitives are visible at each pixel.
`and shading the pittel accttrtiingly. Section IS.2.tl identified two basic reasons why simple
`display-processodfrartte-hulfer systems are inadequate for higlt-performa:nee rasterization
`stlasystemst
`
`l. A single display processor does not have cnottglt processing power for all the pixel
`calculations.
`
`2. Memory bandwidth into the frame butter is insulftcient to handle the pixel traffic-
`even if the tlimlay processor cottld oornpttte pixels rapidly enottglt.
`
`Much ofthe research in graphics an'.'.hitecture over the past decade has concemed ways
`to ovet‘t.'ottte these litrtitatiotts. A great variety of techniqu has been proposed, and marry
`have been irtrplernertted in comrrtercia] artd ettperirnental systerrts.
`In this section. we
`consider low-cost. modes-ate-performance architectures that cast conventional algorithms
`into hardware. In Sections I33 and 18.9. we consider ways to improve performance by
`adding large amounts of parallelism to speed the calculation of the algorithm‘s “inner
`loop." in Section l3. ll]. we consider hybrid architectures that combine multiple techniques
`for improwd etficiency or even higher perfonnance. Figure l3. I2 summarizes the
`cortcurrent approaches we shall discuss here.
`
`0944
`0944
`
`
`
`1 3.?
`
`Multiprocessor Hacterlzafion Architecture:
`
`333
`
`HHSHTZEHDH
`
`
`
`.-architectural technique
`
`Highly parallel
`
`
`
`
` Dblect order
`Pipellnecl object order
`Image parallel
`li'lrlual buflerl
`virtual processor
`
`Partitioned Image
`Pclygoniedge:I'spen-
`1-butler. depth-
`processor plpellne
`memory
`sort. and ESP-tree
`
`algorithms
`Logic-enhanced
`memory
`
`
`
`Plpellnecl lmape order
`Object parallel
`image order
`Scan-line plpellne
`Scanline algorithms
`
`
`
`Parallel virtual buffer
`
`Image composition
`
`Fig. 13.12 Taxonomy of concurrent rasterization approaches.
`
`1B.7.1
`
`Pipelinetl Object-Order Architectures
`
`A direct way to add concunency to rasterization calculations is to cast the various steps of a
`software algorithm into a hardware pipeline. This technique has been used to build a
`number of inexpensive. moderately high-performance systems. This approach can he used
`with either of the two main rasterization approaches: object order [2-buffer. depth-sort. and
`ESP-tree algorithms} and image order (scan-line algorithms}. We consider objecteorder
`rasterieatlon now and image-order rasteaieation in Section lli.'l.2.
`Dbject-ortler rastcrization methods include the z-butler, depth-sort, and ESP-tree
`algorithms (the z-bufier is by far the most common in 3]} systems). The outer loop of these
`algorithms is an enumeration of primitives in the database, and the inner loop is an
`enumeration of pixels within each primitive. For polygon rendering. the heart of each of
`these algorithms is rasterizing a single polygon.
`Figure lli.l3 shows the most common rasterization algorithm for convex polygons.
`This algorithm is an extension of the 21;} polygon scanconversion algorithm presented in
`Section 3.6. using fixed-point aritlnnetic rather than integer arithmetic. Delta values are
`used to calculate the expressions for Jr. 2. R. G. and E incrementally from scan line to scan
`line, and from pixel to pixel. We shall describe each step of the algorithm.
`
`Polygon processing. Computations performed only once per polygon me grouped into
`this stage. The first step is to determine the initial scan line intersected by the polygon {this
`is determined by the vertex with the smallest y value). In most cases. the polygon intersects
`this scan line at a single pixel. with two edges projecting upward. the left and right edges.
`Delta values are calculated for 1. 2. R, G. and H for each edge. ‘These delta values are
`sornetitnes called slopes.
`
`Edge processing. Computations performed once for each scan line are grouped here.
`Scan lines within each primitive are processed one by one. The delta values computed
`previously are used to calculate x, 2, R, G, and E values at the intersection points of the left
`and right edges with the cun'ent scan line (Pm and PW in the figure}. A contiguous
`sequence of pixels