throbber
13.4
`
`introduction to Multiprocessing
`
`B13
`
`Summing the lloating-point requiretnents for all of the geometry stages gives a total of
`2,22ll,{1I'.Il multiplicationsldivisions and l,4'.-'{l,DllE|' additionsisubtractions per frame. Since
`it new frame is calculated every -ll,-second, a total of 22.2 million multiplicationsidivisions
`and 14.7 million additionsisubtractions (36.9 million aggregate lloating—point operations)
`as required per second—a very substantial number.
`
`Easteriratloo calculations and fraIne—bufl'er accesses. Let us now estimate the
`
`number of pixel calculations and frame-buffer memory accesses required in each frame. We
`assume that 2 values and REE triples each occupy one word (32 hits} of frame-bull‘er
`memory {typical in most current high~performance systems}. For each pixel that is initially
`visible {i.e., results in an update to the frame buffer). 2. R. G. and E values are calculated [4
`additions per pixel if forward difl'erences are used}, a r value is read from the frame bulfer [1
`ft‘aJ'tle-buffer cycle}, the 2 values are compared [1 subtraction} and new 2 values and colors
`are written (2 t‘rame—bufler cycles}. For each pixel that is initially not visible, only the 2
`value needs to be calculated (1 addition), and a 2 value is read from the frame buffer (I
`frame-buffer cycle}, and the two 2 values are compared {I subtraction}. Note that initially
`visible pixels may get covered, but initially invisible pixels can never be exposed.
`Since we assume that one-half of the pixels of each triangle are visible in the final
`scene. a reasonable guess is that
`three-quarters of the pixels are initially visible and
`one-quarter of the pixels are initially invisible. Each t:riangle covers till} pixels, so % - I00 -
`lil,flllIil = ’.-'5{i,lIIilll pixels are initially visible and-} - liiii- l[l,i'liltII = 25ll,i)Dll pixels are ini-
`tially invisible. To display an entire frame. therefore, a total of [‘l5tII,flfl[} -
`:5} + {25tl,llil0 -
`2) = 4.25 million additions and [?5ll,DIilJ - 3} + {25iLl.il{l|J - I) = 2.5 million frame-buffer
`accesm is required. To initialize each frame, both color and 2-|:Iuli'ecrs must be cleared, an
`additional [239 - 1024 - 2 = 2.6 million frame-bulfer accesses. The total number of
`frame-buffer accesses per frame, therefore, is 2.5 million + 2.6 million = 5.] million. If
`It} frames are generated per second, 42.5 million additions and 51 million frame-buffer
`accesses are required per second.
`In 1989,
`the fastest floating-point processors available computed approximately 20
`million floating-point operations per second,
`the fastest
`integer processors computed
`approximately 40 million integer operations per mood, and DRAM memory systems had
`cycle times of approximately ltlil nanoseconds. The floating-point and integer requirements
`of our sample application, therefore, are just at the limit of what can be achieved in a single
`CPU. The number of frame-bufleraccesses, however, is much higher than is possible in acou-
`vcntional memory system. As vie mentioned earlier. this database is only modestly sized
`for systems available in 1939. In the following sections, we show how multiprocessing can be
`used to achieve the performance necessary to display databases that are this size and larger.
`
`13.4 INTRODUCTION TD MULTIFHDGEESING
`
`Displaying large databases at high frantic rates clearly requires dramatic system perfoh
`mance, both in terms of computations and of memory of bandwidth. We have seen that the
`geometry portion of a graphics system can require more processing power than a single
`CPU can provide. Likewise, rastetization can require more bandwidth into memory than a
`single memory system can provide. The only way to attain such performance levels is to
`
`Volkswagen 1017 - Part 16 of 21
`
`0935
`0935
`
`Volkswagen 1017 - Part 16 of 21
`
`

`
`BT4
`
`Advanced HIIIEI Graphics Architecture
`
`ifil
`
`
`
`
`
`{'3}
`
`Fig. 13.9 Basic forms of multiprocessing: {a} pipalining. and (bi parallalism.
`
`perform multiple operations concurrently and to perform multiple reads and writes to
`ruentory eoncurrently—we need concurrent procming.
`Concurrent processing, or multiprocessing.
`is the basis of virtually all l'ligh-perlot'-
`mance graphics architectures. Multiprocessing has two basic forms: pipellnlng and
`parallelism (we reserve the term cortcttrrertcy for multiprocessing in general}. A pipeline
`processor contains a number of processing elements [PEs} ananged such that the output of
`one becomes the input of the next. in pipeline fashion {I-iig. lE.9a}. The PEs of a parallel
`processor at'e arranged side by side and operate simultaneously on different portions of the
`data (Fig.
`IE.‘-Lib).
`
`1 3.4.1 Pipalining
`
`To pipeline a computation, we partition it into stages that can be executed sequentially in
`separate PEs. Obviously. a pipeline can run only as fast as its slowest stage, so the
`processing load should be distributed evenly over the PEs. If this is not possible. PEs can be
`sized acuortling to the
`they must
`An important issue in pipeline systems is throughput versus latency. Throughput is the
`overall rate at which data are processed;
`latency is the time required for a single data
`element to pass front the beginning to the end of the pipeline. Some calculations can be
`pipelinetl using a large number of stages to achieve very high throughput. Pipeline latency
`increases with pipeline length. however. and certain oornputations can tolerate only a
`limited amount of latency. For example,
`real«time graphics systems. such as flight
`simulators. must respond quickly to changes in flight controls. It’ more than one or two
`frames are in the rendering pipeline at once, the system's interactivity may be impaired,
`regardless of the frame rate.
`
`0936
`0936
`
`

`
`13.4
`
`Introduction to Hll.lltipl‘t.'IGfl'IIil'lfl
`
`B'tl:'r
`
`'lB.Il.2 Parallelism
`
`Tb par-allelize a computation, we partition the data into portions that can be processed
`independently by different PEs. Frequently, PEs can execute the same pnograrn. Homoge-
`neous parallel processors contain PEs of the same type; heterogeneous parallel processors
`contain FEE of different types. in arty parallel system, the overall computation speed is
`determined by the time required for the slowest PE to finish its task.
`It is important,
`tlterefore, to balance the processing load among the His.
`is further distinction is useful for homogeneous parallel processors: whether the
`processors operate in loci: step or independently’. Processors that operate in lock step
`generally share a single code store and are called single-ittttnrction mrrt'tr'pl'e—doto {SIMDJ
`processors. Processors that operate independently must have a separate code store for each
`PE and are called multiple-instruction multiple-doto {MIME} processors.
`
`SIMD processors. Because all the Plis in a SIMD processor share a single code store.
`SIMD processors are generally less expensive than l'vlIMD processors. However, they do
`not perform well on algorithms that contain conditional branches or that access data using
`pointers or indirection. Since the path taken in a conditional branch depends on data
`specific to a PE, different PEs may follow different branch paths. Because all the His in a
`SIMD processor operate in loci: step, they all must follow every possible branch path. To
`accommodate conditional branches, PEs generally contain an enable regtrter to qualify
`write operations. Clnly P'Es whose enable registers are set write the results of computations.
`By appropriately setting and clearing the enable register, PEs can execute conditional
`branches [see Fig.
`lll.ll]a}.
`rlrlgortithms with few conditional branches execute efiioiently on SIMD processors.
`ttlgoritlnrrs with many conditional branches can be extremely inefiicierrt, however, since
`
`strttentent I;
`if not cottdrltion then
`
`ertrtble = FALSE;
`stotentertt 2;.
`toggle enable;
`statemrrrt 3;
`stoternerrr 4;
`enable -= TRUE:
`statement 5;
`stotetnetrt fit
`
`statement I;
`if condition than
`
`statement 2;
`else
`begin
`statement 3:
`statement 4;
`end:
`stntetnetu 5:
`statement ti;
`
`Total operations:
`to if condition evaluates TRUE,
`lll if condition evalrrates FALSE
`
`Total operations:
`.5 if condition evaluates TRUE,
`6 if condition evaluates FAl.5E
`
`lat
`
`lb}
`
`Fig. 13.10 la} SIMD and [bi MIMI] oltpraaaions of the same algorithm. In a SIMD
`program. conditional branches transform into operations on the enable register. When
`the enable register of a particular PE is FALSE. the PE executes the current instruction.
`but does not write the result.
`
`0937
`0937
`
`

`
`B15
`
`Advanced Heater Graphics Architecture
`
`most FEs may be disabled at any given time. Data structures containing pointers {such as
`linked H515 Dt trees} or indexed arrays cause similar pmbletns. Since a pointer orarray index
`may contain a different value at each PE. all possible values must be enumerated to ensure
`that each PE can make its required memory reference. For large arrays or pointers, this is an
`absurd waste of processing resources. A few SIMD processors provide separate address
`wires licr each PE in order to aiaoid this problem, but this adds size and complexity to l.he
`system.
`
`MIMI! processors. MIME processors are more expensive than SIMD processors. since
`each PE must have its own code store and controller. PEs in a MIMD processor often
`execute the same program. Unlike SIMD FEs, however, they are not constrained to operate
`in lock step. Because of this freedom, MIMD processors suffer no disadvantage when they
`encounter conditional branches; each PE makes an independent control-flow decision,
`skipping instructions that do not need to be executed (see Fig. 13. llllb}. As a result. MIMD
`processors achieve higher efliciency on general types of computations. However. since
`processors may start and end at different times and may process data at different rates.
`sjmchronization and load balancing are n1ore‘l:lil‘l‘icult. frequently requiring FIFD buffers at
`the input or output of each PE.
`
`13.4.3 Multiprocessor Graphics Systems
`
`Pipeline and parallel processors are the basic building bloclts of virtually all current
`high-perfolmattce graphics &T3lt.'.t'I‘1S. Both techniques can be used to accelerate l"mnt—en:d
`and back-end subsystems of a graphics system. as shown in Fig. 13.11.
`In the following sections. we examine each of these strategies. Sections 13.5 and 13.6
`discuss pipeline and parallel front-end architecturm. Sections 13.8 and 18.9 discuss
`pipeline and parallel back-end architectures. Section 13.11] discusses back-end architec~
`tures that use pa.ralleI techniques in combination.
`
`From
`traversal
`5111119
`
`Pipeline
`
`Parallel
`
`Ftp. 13.11 Fipelining and parallelism can be used to accelerate both front-end and
`back-and portions of a graphics system.
`
`0938
`0938
`
`

`
`13.5
`
`Pipeline Front-End Architectures
`
`31'?
`
`18.5 PIPELINE FRONT-END ARCHITECTURES
`
`Recall from Section 13.3 that the front end of a graphics display system has two major
`tasks: traversing the display model and ttansfonning primitives into screen space. fits we
`have seen, to achieve the rendering rates required in current applications. we must use
`concurrency to speed these computations. Both pipelining and parallelism have been used
`for decades to build front ends of high-performance graphics systems. Since the front end is
`intrinsically pipelined. its stages can be assigned to separate hardware units. Also. the large
`numbers of primitives in most graphics databases can be distributed over multiple
`processors and processed in parallel. In this section, we discuss pipeline front-end systems.
`We discuss parallel front—e11d systems in Section l3.ti.
`it
`l8.T. we mentioned that
`In introducing the standard graphics pipeline of Fig.
`provides a useful conceptual model of the rendering process. Because of its linear nature
`and fairly even allocation of processing efl'ort. it also maps well onto a physical pipeline of
`processors. This has been a popular approach to building high-performance graphics
`systems since the lilolls, as described in [lvIY‘ERo8]. a classic paper on the evolution of
`graphics architectures. Each stage of the pipeline can be implemented in several ways: as an
`individual general-purpose processor, as a custom hardware unit. or as a pipeline or parallel
`processor itself. We now discuss implementations for each stage in the front-end pipeline.
`
`13.5.1 Application Program and Display Traversal
`
`Some processor must execute the application program that drives the entire graphics
`system. In addition to feeding the graphics pipeline. this processor generally handles input
`devices. file Ill), and all
`interaction with the user.
`In systems using immediate-mode
`traversal. the display model is generally stored in the CPLl‘s main memory. The CPU must
`therefore traverse the model as well as run the application. In systems using retained mode.
`the model is generally [but not always} stored in the display procfifi memory. with the
`displayproeessorperfoimingu-an-ersal. Eecaasesuch systemsusetwoprocessotsfortheee
`tufts. they are potentially faster. although tlztey are less flexible and have other limitations.
`as discussed in Section 12.2.
`
`Where very high performance is desired. a single processor may not be powerful
`enough to traverse the entire database with sufiicient speed. The only remedy is to partition
`the database and to traverse it
`in parallel. This relatively new technique is discussed in
`Section 18.6.1-
`
`18.5.2 Geometric Transformation
`
`The geometric transformation stages [modeling transformation and viewing transfom1a-
`tion} are highly compute-intensive. Fortunately. vector and matrix multiplications are
`simple calculations that require no branching or looping, and can readily be irnpletnented in
`hardware.
`
`The most common implementation of these stages is a single processor or functional
`unit that sequentially transforms a series of vertices. A pioneering processor of this type
`was the lvlatriit
`lvlultipller [SUTHfi8l. which could multiply a four-element vector by a
`
`0939
`0939
`
`

`
`ET!
`
`Atlvaneetl flaatar Graphics Arehitaetura
`
`hcanogeneous transformation matrix in 20 microseconds. Other special-purpose geometry
`proeasors haveheeodeveloped sineethen, most notably Clarlfsfieometry Engine. which
`can perform clipping as well {see Section l3.5.5}. Recent geometry processors have
`exploited the power and programmability of oomtnercial floating-point chips.
`If pipelining does not provide enough performance. transformation computations can
`be parttllelized in several ways:
`
`I
`
`Individual components ofa vertett may be calculated in parallel. Four parallel
`p|'oemms.eachomuinutgdtecurrmtnansformadm1nmnia.canmraluateme
`ettpreaeionafor.r,y. z. and iv inparellel.
`
`' Multiple vertices can be transformed in parallel. If primititucs are all of a uniliarm
`type—say_ t:riangIs—the three vertices of each triangle can be transformed simttlta-
`neously.
`
`I
`
`Entire primitives can be transformed in parallel. If rt uansforrnation engines are
`available, eaclt processor can transform every nth primitive. This technique has many
`of tlte advantages and disadvantages of parallel front-end systems. which we will
`discuss in Section ISJ5.
`
`13.5.3 Trivial Aoeeptil-‘tale-et Claasificatiort
`
`'I."rivial accept and reject tests are straightliorward to irnplernent. since they require at worst a
`dot product and at best a singie floating—point comparison tor subtract} to determine on
`which side of each clipping platte each vertex lies. Because tlaese tests require little
`computation. they are gertemllyr perfomied by the processor that transforms primitives.
`
`13.5.4 IJfl'I‘IiI'Ifl
`
`liglttingcalcuIationsatestraigl'ttforwardandareIloatirtg-
`point-intensive. A.q:eeializedhudvmeproemoreancaloulateterte:eolorsbasedona
`polygon’: eolorand the light sector. More frequently. lighting calculations are performed
`using a programmable floating-point processor. In lovter-pe:rfot1uanee systems.
`lighting
`calctdadomcanbedoneinthesatrteproeeaaaorthatoansfornuvertiees. l'~lotethatifPhong
`shading is used. lighting calculations are deferred until the raateriaatioo stage.
`
`18.5.5 Clipping
`
`Pblygondipphtgwasmioeooreidnedcumbusonmsitmeflunmnbaofvmtieescan
`change dttring the clipping proem and concave polygons can fragment
`into multiple
`polygons during clipping. Sutherland and Hodgman [5l.lTl-W4] showed that arbitrary
`convert or concave polygons can be clipped to a convex view volume by passing the
`potygon'stetticesdu'onghasirtgle processing unit multiple times. Eachpassthroughthe
`unit clips the polygon to a different piane. In I930, Clark proposed unwrapping this
`prmuehtgbopinmasimplepipdineofidenfiulprmesammeachofnhidtootddbe
`impleunudieasingieVLSIdtip.whidthenmmdfluGmmdntEtgfiw[CLARB2].The
`
`perspectivedivisionaswell.
`
`0940
`0940
`
`

`
`13.5
`
`Pipeline Front-End Architecture:
`
`339
`
`Clipping using a Geometry Engine {or similar processor) can be performed either by a
`single processor that clips each polygon by as many planes as necessary. or by a pipeline of
`clipping processors, one for each clipping plane. The technique chosen alfects the
`worst-case performance of the graphics system: Systems with only one clipping processor
`may bog down during frames in whiclr large numbers of primitives need to be clipped,
`whereas systems with a clipping processor for each clipping plane can run at full speed.
`However, most of the clipping processors are idle for most databases and views in the latter
`approach.
`General-purpose floating-point units recently have begun to replace custom VLSI
`transformation and clipping processors. For example, Silicon Graphics, which for many
`years employed custom front-end processors, in l9‘El9 used the Weitelt 3332 floating-point
`chip for transformations and clipping in their PUWER IRIS system [described in detail in
`Section 13.8.2}. The delicate balance between performance and cost now favors commodi-
`ty processors. This balance may change again in the future if new graphics-specific
`functionality is needed and cannot be incorporated economically into general-purpose
`processors.
`
`1 B.5.E Division by w and Mapping to 3D Viewpoint
`
`Like geometric transformation and lighting, the calculations in this stage are straightfor-
`ward but require substantial
`tloating-point resources. A floating-point divide is time
`consuming even for most floating-point processors {many processors use an iterative
`method to do division]. Again, these stages can be implemented in custom functional units
`or in a commercial floating-point processor. In very high-performance systems,
`these
`calculations can be performed in separate, pipelined processors.
`
`13.5.7 Limitations of Front-End Pipelines
`
`technique for building high+perfotmance
`Even though pipelining is the predominant
`front-end systems.
`it has several limitations that are worth considering. First, a different
`algorithm is needed for each stage of the front-end pipeline. Thus. either a variety of
`hard-wired functional units must be designed or, if programmable processors are used,
`different programs must be written and ioacled into each processor. In eithercasc. processor
`or functional-unit capabilities must be carefully matched to their tasks, or bottlenecks will
`occur.
`
`least to
`Second, since the rendering algorithm is committed to hardware {or at
`firmware, since few systems allow users to reprogram pipeline processors}, it is difficult to
`add new features. Even if users have programming support for the pipeline processors, the
`distribution of hardware resources in the system may not adequately support new features
`such as complex primitives or collision detection between primitives.
`A final shortcoming of pip-elined front ends is that the approach breaks down when
`display traversal can no longer be performed by a single processor, and this inevitably
`occurs at some perfonnance level. For example, if we assume that traversal is pcrfonned by
`a ED-MHZ processor and memory system,
`that the description of each triangle in the
`database requires -it] words of data (for vertex coordinates, normal vectors. colors, etc.},
`and that each word sent to the pipeline requires two mem-otyfprocessor cycles [one to read it
`
`0941
`0941
`
`

`
`BID
`
`Advaneadltaatarfirapltiealtreltitaeuire
`
`front memory. another to load it into the pipeline}. then a maximum of 2fl.iltI}.[llIll I {2 -
`40} == 15l.'l.lIl.'l triangles peraeeond can bedisplayed by the system. no matter how powerful
`the processors in the pipeline are. Current systems are rapidly approaching such limits.
`Whateisecan bedone. themtoachieve higher-pcrfonnsnce'.*1'ltealte:-nativeto
`pipelining front-entl calculatlm is to paralleliee them. The follovving section t2Ifit.“.'t'lE this
`second any to build high-performmoe front-end systems.
`
`13.5 PARALLEL FRONT-END AHGHITECTUHES
`
`Since graphics databases are regular. typically consisting of a large number of primitives
`that receive nearly identical processing. an alternate vvay to add concurrency is to partition
`diedataintosepantestreairasridtopmotasslttemindependentiy. Formoststage-sot'the
`l'mnt—end mbsystem. such partitioning is
`readily done: for example.
`the geometric-
`transformation stages can use any ofthe parallel techniques described in Section l8.5.2.
`Hovaevef. stages in which data streams diverge {display traversal} or converge {between the
`front end and back end} are problematic. since they must handle the full data bandwidth.
`
`13.8.1 Display 'l'raveraal
`
`Almost all application programs mime a single. contiguous display mode! or database. In
`aparaltelfront-end systern.tl:tesintpluttecitniqueisa:tu-aversethedatabasc inasingle
`processor {aerial lrlvufill and than to distribute pritnitives to the parallel processors.
`Unfortunately. this serial traversal can become the bottleneck in a parallel front-end system.
`Severaltechnirnioscanbeusedtoacceleratcserial traversal:
`
`'
`
`"
`
`‘I
`
`Traversal routirtescanhettptimizedttrwritten in assetnhlycode
`
`Thedatahasecanbesturedinfaatei'memory{i.e.,SlUl.hlinsteadofDRfitivIl
`
`A ifastertraversal processor torone optimised for the particular structure format} can be
`used.
`
`Ifthese optimizations are not enough. the only alternative is to traverse the database in
`parallel. Thedatahaseeithercan be stored in asingie memory sy-stemthat allowspnraliei
`acom by multiple processors ta shared-memory model]. orcan bedistributed over multipie
`prooeuors. each with its own memory system is rtisnibuted-memory model}.
`The advantage of the sbaned-memory approach is that the database can remain in one
`place. although traversal must be divided among multiple processors. Presumably. each
`processor is assigned a certain portion oi‘ the datatme to traverse. Unfortunately. inherited
`attributes ina hieruchicaldatabammodelmeandtatprocessorsmustoontendftxaccess to
`thesarriedata. FtnumnpIe.eard|processornnistl|aveaceesstothcctumtnaansfmmation
`ntatnitaitdtoodtn-viewirtgandligl'tIing parameters. Since thedatabandtvidthtoandfroot
`ashared-ntentorysystemmaynotbemuch higherthan thatofaoomenrional l'ltemos'y
`systeltl. the sltated-ttletltory approach may not provide enough perfonnance.
`In the distributed-memory approach. each processor contains a portion of the database
`in its local memory. ltrraversesits portionotdtedatattaseforeachfratneand may also
`perform other front-end computations. Distributing the database presents its own problems.
`hon-ever: Unless the system gives die application prugrammerthe illusion ofa contiguous
`database. it cannot support portable graphics libraries. Also. the load must be balanced
`
`0942
`0942
`
`

`
`13.5
`
`Parallel Front-End Architectural
`
`881
`
`over the traversal processors if system resources are to be utilized fully. Hierarchical
`databases exacerbate both of these problems, since attributes in one level of a hierarchy
`aifect primitives below them. and structures deep in a hierarchy may be referenced by
`multiple higher-level structure calls.
`Tl1.-e following two mtions eztarnitre two Ways to distribute a hierarchical database over
`multiple processors: by structure. where each traversal processor is given a complete branch
`of the structure hierarchy; or by primitive, where each traversal processor is given a fraction
`of the pritnitives at each block in the hierarchy.
`
`Distributing by structure. Distributing by structure is outwardly appealing, since
`state—changing elements in the structure apparently need to be stored only once. This can be
`an illusion, however, since multiple high~|evel structures may refer to the same lower-level
`substructure. For example. a database containing several cars, each described by a separate
`car structure. can be distributed by assigning each car structure to a separate processor.
`However, if each car stmcture refers to a number of wheel structures, vviteel structures must
`
`also be replicated at every processor.
`Load balancing among procemors is also diflicult. Since primitives in a structure are
`likely to be spatially coherent, changing the viewpoint or geometry within a scene may
`cause entire portions of the structure to be clipped or to reappear. Maintaining even loading
`among the multiple processors would require reassigning portions of the database
`dynamically.
`
`Distributing by primitive. Distributing by primitive is costly, since the entire hierarchi-
`cal stntcture of the database and any state-changing commands must be replicated at each
`processor. Structure editing is also expensive, since changes must be broadcast to every
`processor. Load balancing, however, is automatic. Since objects in a hierarchical database
`typically contain a large number of simple primitives {e.g., polygons forming a tiled
`surface), these primitives will be scattered over all the processors, and each processor will
`have a similar processing load.
`the highest-
`In I939.
`is a relatively new technique.
`Parallel display traversal
`performance architectures were just approaching the point where serial traversal becomes
`insuflicient, and only a few systems had experimented with parallel traversal [FUCHSQL
`Neither of the disuibution techniques for hierarchical databases that we have described is
`ideal. Compared to geometry processing, which easily partitions into parallel tasks, display
`traversal
`is much more difiicult.
`lvievertheless, parallel
`traversal
`is
`likely to become
`increasingly important as system performance levels increase.
`
`‘lB.E.2 Flecornblnlng Parallel Streams
`
`The transition betvveen the front-end and back-end portions of the rendering pipeline is
`troublesome as well. In a parallel front-end system. the multiple streams of transformed and
`clipped primitives must be directed to the processor or processors doing rasterization. This
`can require sorting primitives based on spatial
`infonnation if dilierent processors are
`assigned to different screen regions.
`A second difliculty in parallel front-end systems is that the ordering of data may change
`as those data pass through parallel processors. For example, one processor may transform
`two small primitives before another processor transfonns a single, larger one. This does not
`
`0943
`0943
`
`

`
`E32
`
`Aadvaneodllasterflripltieanrelsltaetura
`
`matter for many graphics primitives and rendering techniques. Certain global commands.
`however. sttch as commands to update one window instead of another or to switch between
`double butters. require that data be synchronized before and after the command. If a large
`mmbu*oicommandssudtastheseoecurs.sometypeoftutdueresuppm1for
`synchronization rnayhenecessary. Aflasterficltnologiessystem 1'l'0RBS'i’] incorporatesa
`specidHFDinmeachPEmunoresugoodesforeachcomnmndmulalImvscommmflsw
`be resynchronlsed after they have been processed in separate FE-is.
`
`1a.s.3 Pipalinhtg trtrraua Pttnllaliarn
`
`We have seert that both pipelining and parallelism can be used to build higlt-performance
`front-end subsystems. Although pipelining has been the predominant technique in systems
`of the hm decade. panlleliam oflers several advantages, including reconfigurahiiity for
`difiamtalgrdmnmsiuxasinghprocmmhandlesfllfiom-uflulwlafions.mtdnune
`nnduhrhy.fincePEsmapufl|eisyueorcmthenndelurmgernmmnaneufilymmma
`pipdimsystmhfiemmedupwhnmauofapipdimsyfimthlimitedhydufluwgltpm
`ofits slowest stage. pipelines do not scale up as readily as do parallel systems. Parallel
`systen:s,cotheotltet'hand,teqt.tire nsorecomplicated synchnanizationandload balancing
`uflunnutneqncidizedpsocususaswdlucanpipdinedsyruursflothdesigmue
`Iitdymhemehtlinmehmne:indeed.thehiglest-pufmnuncesysteoauelikdym
`cotnbittedtetwo.
`
`13.7 MULTIPHDCEESDH HAS'I'EIiI|ZA1'lDlll ARCHITECTURES
`
`Recsllthattheoutputofthefront-endstbsystetttistypicallyasetofprimititvesinscteett
`coordinates. The rasterizatiort {back-end} subsystem creates the final
`image by amt
`converting each of these primitives, detertrtinirtg which primitives are visible at each pixel.
`and shading the pittel accttrtiingly. Section IS.2.tl identified two basic reasons why simple
`display-processodfrartte-hulfer systems are inadequate for higlt-performa:nee rasterization
`stlasystemst
`
`l. A single display processor does not have cnottglt processing power for all the pixel
`calculations.
`
`2. Memory bandwidth into the frame butter is insulftcient to handle the pixel traffic-
`even if the tlimlay processor cottld oornpttte pixels rapidly enottglt.
`
`Much ofthe research in graphics an'.'.hitecture over the past decade has concemed ways
`to ovet‘t.'ottte these litrtitatiotts. A great variety of techniqu has been proposed, and marry
`have been irtrplernertted in comrrtercia] artd ettperirnental systerrts.
`In this section. we
`consider low-cost. modes-ate-performance architectures that cast conventional algorithms
`into hardware. In Sections I33 and 18.9. we consider ways to improve performance by
`adding large amounts of parallelism to speed the calculation of the algorithm‘s “inner
`loop." in Section l3. ll]. we consider hybrid architectures that combine multiple techniques
`for improwd etficiency or even higher perfonnance. Figure l3. I2 summarizes the
`cortcurrent approaches we shall discuss here.
`
`0944
`0944
`
`

`
`1 3.?
`
`Multiprocessor Hacterlzafion Architecture:
`
`333
`
`HHSHTZEHDH
`
`
`
`.-architectural technique
`
`Highly parallel
`
`
`
`
` Dblect order
`Pipellnecl object order
`Image parallel
`li'lrlual buflerl
`virtual processor
`
`Partitioned Image
`Pclygoniedge:I'spen-
`1-butler. depth-
`processor plpellne
`memory
`sort. and ESP-tree
`
`algorithms
`Logic-enhanced
`memory
`
`
`
`Plpellnecl lmape order
`Object parallel
`image order
`Scan-line plpellne
`Scanline algorithms
`
`
`
`Parallel virtual buffer
`
`Image composition
`
`Fig. 13.12 Taxonomy of concurrent rasterization approaches.
`
`1B.7.1
`
`Pipelinetl Object-Order Architectures
`
`A direct way to add concunency to rasterization calculations is to cast the various steps of a
`software algorithm into a hardware pipeline. This technique has been used to build a
`number of inexpensive. moderately high-performance systems. This approach can he used
`with either of the two main rasterization approaches: object order [2-buffer. depth-sort. and
`ESP-tree algorithms} and image order (scan-line algorithms}. We consider objecteorder
`rasterieatlon now and image-order rasteaieation in Section lli.'l.2.
`Dbject-ortler rastcrization methods include the z-butler, depth-sort, and ESP-tree
`algorithms (the z-bufier is by far the most common in 3]} systems). The outer loop of these
`algorithms is an enumeration of primitives in the database, and the inner loop is an
`enumeration of pixels within each primitive. For polygon rendering. the heart of each of
`these algorithms is rasterizing a single polygon.
`Figure lli.l3 shows the most common rasterization algorithm for convex polygons.
`This algorithm is an extension of the 21;} polygon scanconversion algorithm presented in
`Section 3.6. using fixed-point aritlnnetic rather than integer arithmetic. Delta values are
`used to calculate the expressions for Jr. 2. R. G. and E incrementally from scan line to scan
`line, and from pixel to pixel. We shall describe each step of the algorithm.
`
`Polygon processing. Computations performed only once per polygon me grouped into
`this stage. The first step is to determine the initial scan line intersected by the polygon {this
`is determined by the vertex with the smallest y value). In most cases. the polygon intersects
`this scan line at a single pixel. with two edges projecting upward. the left and right edges.
`Delta values are calculated for 1. 2. R, G. and H for each edge. ‘These delta values are
`sornetitnes called slopes.
`
`Edge processing. Computations performed once for each scan line are grouped here.
`Scan lines within each primitive are processed one by one. The delta values computed
`previously are used to calculate x, 2, R, G, and E values at the intersection points of the left
`and right edges with the cun'ent scan line (Pm and PW in the figure}. A contiguous
`sequence of pixels

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket