NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE

TO ENABLE FLEXIBLE, PROGRAMMABLE GRAPHICS AND HIGH-PERFORMANCE COMPUTING, NVIDIA HAS DEVELOPED THE TESLA SCALABLE UNIFIED GRAPHICS AND PARALLEL COMPUTING ARCHITECTURE. ITS SCALABLE PARALLEL ARRAY OF PROCESSORS IS MASSIVELY MULTITHREADED AND PROGRAMMABLE IN C OR VIA GRAPHICS APIS.

Erik Lindholm
John Nickolls
Stuart Oberman
John Montrym
NVIDIA

The modern 3D graphics processing unit (GPU) has evolved from a fixed-function graphics pipeline to a programmable parallel processor with computing power exceeding that of multicore CPUs. Traditional graphics pipelines consist of separate programmable stages of vertex processors executing vertex shader programs and pixel-fragment processors executing pixel shader programs. (Montrym and Moreton provide additional background on the traditional graphics processor architecture.1)

NVIDIA's Tesla architecture, introduced in November 2006 in the GeForce 8800 GPU, unifies the vertex and pixel processors and extends them, enabling high-performance parallel computing applications written in the C language using the Compute Unified Device Architecture (CUDA2-4) parallel programming model and development tools. The Tesla unified graphics and computing architecture is available in a scalable family of GeForce 8-series GPUs and Quadro GPUs for laptops, desktops, workstations, and servers. It also provides the processing architecture for the Tesla GPU computing platforms introduced in 2007 for high-performance computing.
In this article, we discuss the requirements that drove the unified graphics and parallel computing processor architecture, describe the Tesla architecture, and explain how it is enabling widespread deployment of parallel computing and graphics applications.

The road to unification
The first GPU was the GeForce 256, introduced in 1999. It contained a fixed-function 32-bit floating-point vertex transform and lighting processor and a fixed-function integer pixel-fragment pipeline, which were programmed with OpenGL and the Microsoft DX7 API.5 In 2001, the GeForce 3 introduced the first programmable vertex processor executing vertex shaders, along with a configurable 32-bit floating-point fragment pipeline, programmed with DX85 and OpenGL.6 The Radeon 9700, introduced in 2002, featured a programmable 24-bit floating-point pixel-fragment processor programmed with DX9 and OpenGL.7,8 The GeForce FX added 32-bit floating-point pixel-fragment processors. The XBox 360 introduced an early unified GPU in 2005, allowing vertices and pixels to execute on the same processor.9
Vertex processors operate on the vertices of primitives such as points, lines, and triangles. Typical operations include transforming coordinates into screen space, which are then fed to the setup unit and the rasterizer, and setting up lighting and texture parameters to be used by the pixel-fragment processors. Pixel-fragment processors operate on rasterizer output, which fills the interior of primitives, along with the interpolated parameters.

Vertex and pixel-fragment processors have evolved at different rates: Vertex processors were designed for low-latency, high-precision math operations, whereas pixel-fragment processors were optimized for high-latency, lower-precision texture filtering. Vertex processors have traditionally supported more-complex processing, so they became programmable first. For the last six years, the two processor types have been functionally converging as the result of a need for greater programming generality. However, the increased generality also increased the design complexity, area, and cost of developing two separate processors.

Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one. However, typical workloads are not well balanced, leading to inefficiency. For example, with large triangles, the vertex processors are mostly idle, while the pixel processors are fully busy. With small triangles, the opposite is true. The addition of more-complex primitive processing in DX10 makes it much harder to select a fixed processor ratio.10 All these factors influenced the decision to design a unified architecture.
A primary design objective for Tesla was to execute vertex and pixel-fragment shader programs on the same unified processor architecture. Unification would enable dynamic load balancing of varying vertex- and pixel-processing workloads and permit the introduction of new graphics shader stages, such as geometry shaders in DX10. It also let a single team focus on designing a fast and efficient processor and allowed the sharing of expensive hardware such as the texture units. The generality required of a unified processor opened the door to a completely new GPU parallel-computing capability. The downside of this generality was the difficulty of efficient load balancing between different shader types.

Other critical hardware design requirements were architectural scalability, performance, power, and area efficiency.

The Tesla architects developed the graphics feature set in coordination with the development of the Microsoft Direct3D DirectX 10 graphics API.10 They developed the GPU's computing feature set in coordination with the development of the CUDA C parallel programming language, compiler, and development tools.

Tesla architecture
The Tesla architecture is based on a scalable processor array. Figure 1 shows a block diagram of a GeForce 8800 GPU with 128 streaming-processor (SP) cores organized as 16 streaming multiprocessors (SMs) in eight independent processing units called texture/processor clusters (TPCs). Work flows from top to bottom, starting at the host interface with the system PCI-Express bus. Because of its unified-processor design, the physical Tesla architecture doesn't resemble the logical order of graphics pipeline stages. However, we will use the logical graphics pipeline flow to explain the architecture.

At the highest level, the GPU's scalable streaming processor array (SPA) performs all the GPU's programmable calculations. The scalable memory system consists of external DRAM control and fixed-function raster operation processors (ROPs) that perform color and depth frame buffer operations directly on memory. An interconnection network carries computed pixel-fragment colors and depth values from the SPA to the ROPs. The network also routes texture memory read requests from the SPA to DRAM and read data from DRAM through a level-2 cache back to the SPA.

The remaining blocks in Figure 1 deliver input work to the SPA. The input assembler collects vertex work as directed by the input command stream.

Figure 1. Tesla unified graphics and computing GPU architecture. TPC: texture/processor cluster; SM: streaming multiprocessor; SP: streaming processor; Tex: texture; ROP: raster operation processor.

The vertex work distribution block distributes vertex work packets to the various TPCs in the SPA. The TPCs execute vertex shader programs and, if enabled, geometry shader programs. The resulting output data is written to on-chip buffers. These buffers then pass their results to the viewport/clip/setup/raster/zcull block to be rasterized into pixel fragments. The pixel work distribution unit distributes pixel fragments to the appropriate TPCs for pixel-fragment processing. Shaded pixel fragments are sent across the interconnection network for processing by depth and color ROP units. The compute work distribution block dispatches compute thread arrays to the TPCs. The SPA accepts and processes work for multiple logical streams simultaneously. Multiple clock domains for GPU units, processors, DRAM, and other units allow independent power and performance optimizations.

Command processing
The GPU host interface unit communicates with the host CPU, responds to commands from the CPU, fetches data from system memory, checks command consistency, and performs context switching.

The input assembler collects geometric primitives (points, lines, triangles, line strips, and triangle strips) and fetches associated vertex input attribute data. It has peak rates of one primitive per clock and eight scalar attributes per clock at the GPU core clock, which is typically 600 MHz.

The work distribution units forward the input assembler's output stream to the array of processors, which execute vertex, geometry, and pixel shader programs, as well as computing programs. The vertex and compute work distribution units deliver work to processors in a round-robin scheme. Pixel work distribution is based on the pixel location.

Figure 2. Texture/processor cluster (TPC).

Streaming processor array
The SPA executes graphics shader thread programs and GPU computing programs and provides thread control and management. Each TPC in the SPA roughly corresponds to a quad-pixel unit in previous architectures.1 The number of TPCs determines a GPU's programmable processing performance and scales from one TPC in a small GPU to eight or more TPCs in high-performance GPUs.

Texture/processor cluster
As Figure 2 shows, each TPC contains a geometry controller, an SM controller (SMC), two streaming multiprocessors (SMs), and a texture unit. Figure 3 expands each SM to show its eight SP cores. To balance the expected ratio of math operations to texture operations, one texture unit serves two SMs. This architectural ratio can vary as needed.

Geometry controller
The geometry controller maps the logical graphics vertex pipeline into recirculation on the physical SMs by directing all primitive and vertex attribute and topology flow in the TPC. It manages dedicated on-chip input and output vertex attribute storage and forwards contents as required.

DX10 has two stages dealing with vertex and primitive processing: the vertex shader and the geometry shader. The vertex shader processes one vertex's attributes independently of other vertices. Typical operations are position space transforms and color and texture coordinate generation. The geometry shader follows the vertex shader and deals with a whole primitive and its vertices. Typical operations are edge extrusion for stencil shadow generation and cube map texture generation. Geometry shader output primitives go to later stages for clipping, viewport transformation, and rasterization into pixel fragments.

Figure 3. Streaming multiprocessor (SM).

Streaming multiprocessor
The SM is a unified graphics and computing multiprocessor that executes vertex, geometry, and pixel-fragment shader programs and parallel computing programs. As Figure 3 shows, the SM consists of eight streaming-processor (SP) cores, two special-function units (SFUs), a multithreaded instruction fetch and issue unit (MT Issue), an instruction cache, a read-only constant cache, and a 16-Kbyte read/write shared memory.

The shared memory holds graphics input buffers or shared data for parallel computing. To pipeline graphics workloads through the SM, vertex, geometry, and pixel threads have independent input and output buffers. Workloads can arrive and depart independently of thread execution. Geometry threads, which generate variable amounts of output per thread, use separate output buffers.

Each SP core contains a scalar multiply-add (MAD) unit, giving the SM eight MAD units. The SM uses its two SFU units for transcendental functions and attribute interpolation (the interpolation of pixel attributes from vertex attributes defining a primitive). Each SFU also contains four floating-point multipliers. The SM uses the TPC texture unit as a third execution unit and uses the SMC and ROP units to implement external memory load, store, and atomic accesses. A low-latency interconnect network between the SPs and the shared-memory banks provides shared-memory access.

The GeForce 8800 Ultra clocks the SPs and SFUs at 1.5 GHz, for a peak of 36 Gflops per SM. To optimize power and area efficiency, some SM non-data-path units operate at half the SP clock rate.

SM multithreading. A graphics vertex or pixel shader is a program for a single thread that describes how to process a vertex or a pixel. Similarly, a CUDA kernel is a C program for a single thread that describes how one thread computes a result. Graphics and computing applications instantiate many parallel threads to render complex images and compute large result arrays. To dynamically balance shifting vertex and pixel shader thread workloads, the unified SM concurrently executes different thread programs and different types of shader programs.

To efficiently execute hundreds of threads in parallel while running several different programs, the SM is hardware multithreaded. It manages and executes up to 768 concurrent threads in hardware with zero scheduling overhead.

To support the independent vertex, primitive, pixel, and thread programming model of graphics shading languages and the CUDA C/C++ language, each SM thread has its own thread execution state and can execute an independent code path. Concurrent threads of computing programs can synchronize at a barrier with a single SM instruction. Lightweight thread creation, zero-overhead thread scheduling, and fast barrier synchronization support very fine-grained parallelism efficiently.

Single-instruction, multiple-thread. To manage and execute hundreds of threads running several different programs efficiently, the Tesla SM uses a new processor architecture we call single-instruction, multiple-thread (SIMT). The SM's SIMT multithreaded instruction unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. The term warp originates from weaving, the first parallel-thread technology. Figure 4 illustrates SIMT scheduling. The SIMT warp size of 32 parallel threads provides efficiency on plentiful fine-grained pixel threads and computing threads.

Each SM manages a pool of 24 warps, with a total of 768 threads. Individual threads composing a SIMT warp are of the same type and start together at the same program address, but they are otherwise free to branch and execute independently. At each instruction issue time, the SIMT multithreaded instruction unit selects a warp that is ready to execute and issues the next instruction to that warp's active threads. A SIMT instruction is broadcast synchronously to a warp's active parallel threads; individual threads can be inactive due to independent branching or predication.

The SM maps the warp threads to the SP cores, and each thread executes independently with its own instruction address and register state. A SIMT processor realizes full efficiency and performance when all 32 threads of a warp take the same execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path; when all paths complete, the threads reconverge to the original execution path. The SM uses a branch synchronization stack to manage independent threads that diverge and converge. Branch divergence only occurs within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths. As a result, Tesla architecture GPUs are dramatically more efficient and flexible on branching code than previous-generation GPUs, as their 32-thread warps are much narrower than the SIMD width of prior GPUs.1
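As an illustration of warp divergence (a sketch of our own, not code from the architecture description), consider a CUDA kernel in which a branch condition depends on the thread index. When the condition differs among the 32 threads of a warp, the hardware serializes the two paths; when it is uniform across each warp, no serialization occurs:

    // Hypothetical kernel contrasting divergent and warp-uniform branches.
    __global__ void divergence_demo(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // Divergent: even and odd threads of the same 32-thread warp take
        // different paths, so the warp executes both paths serially.
        if (threadIdx.x % 2 == 0)
            out[i] = in[i] * 2.0f;
        else
            out[i] = in[i] + 1.0f;

        // Warp-uniform: threads 0-31 agree, threads 32-63 agree, and so on,
        // so each warp follows a single path at full SIMT efficiency.
        if ((threadIdx.x / 32) % 2 == 0)
            out[i] *= 0.5f;
    }
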
Figure 4. Single-instruction, multiple-thread (SIMT) warp scheduling.

SIMT architecture is similar to single-instruction, multiple-data (SIMD) design, which applies one instruction to multiple data lanes. The difference is that SIMT applies one instruction to multiple independent threads in parallel, not just to multiple data lanes. A SIMD instruction controls a vector of multiple data lanes together and exposes the vector width to the software, whereas a SIMT instruction controls the execution and branching behavior of one thread.

In contrast to SIMD vector architectures, SIMT enables programmers to write thread-level parallel code for independent threads as well as data-parallel code for coordinated threads. For program correctness, programmers can essentially ignore SIMT execution attributes such as warps; however, they can achieve substantial performance improvements by writing code that seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional codes: Programmers can safely ignore cache line size when designing for correctness but must consider it in the code structure when designing for peak performance. SIMD vector architectures, on the other hand, require the software to manually coalesce loads into vectors and to manually manage divergence.

SIMT warp scheduling. The SIMT approach of scheduling independent warps is simpler than previous GPU architectures' complex scheduling. A warp consists of up to 32 threads of the same type: vertex, geometry, pixel, or compute. The basic unit of pixel-fragment shader processing is the 2 × 2 pixel quad. The SM controller groups eight pixel quads into a warp of 32 threads. It similarly groups vertices and primitives into warps and packs 32 computing threads into a warp. The SIMT design shares the SM instruction fetch and issue unit efficiently across 32 threads but requires a full warp of active threads for full performance efficiency.

As a unified graphics processor, the SM schedules and executes multiple warp types concurrently, for example, concurrently executing vertex and pixel warps. The SM warp scheduler operates at half the 1.5-GHz processor clock rate. At each cycle, it selects one of the 24 warps to execute a SIMT warp instruction, as Figure 4 shows. An issued warp instruction executes as two sets of 16 threads over four processor cycles. The SP cores and SFU units execute instructions independently, and by issuing instructions between them on alternate cycles, the scheduler can keep both fully occupied.
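These issue rates are consistent with the 36-Gflops-per-SM peak quoted earlier for the GeForce 8800 Ultra. One plausible accounting of that figure (the breakdown is our own back-of-envelope assumption, not a statement from the text) is:

    8 SP cores x 2 flops (multiply-add) x 1.5 GHz = 24 Gflops
    2 SFUs x 4 FP multipliers x 1 flop x 1.5 GHz  = 12 Gflops
    combined                                      = 36 Gflops per SM

Issuing SP and SFU instructions on alternate scheduler cycles is what lets both contributions proceed concurrently.
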
Implementing zero-overhead warp scheduling for a dynamic mix of different warp programs and program types was a challenging design problem. A scoreboard qualifies each warp for issue each cycle. The instruction scheduler prioritizes all ready warps and selects the one with highest priority for issue. Prioritization considers warp type, instruction type, and "fairness" to all warps executing in the SM.

SM instructions. The Tesla SM executes scalar instructions, unlike previous GPU vector instruction architectures. Shader programs are becoming longer and more scalar, and it is increasingly difficult to fully occupy even two components of the prior four-component vector architecture. Previous architectures employed vector packing (combining sub-vectors of work to gain efficiency), but that complicated the scheduling hardware as well as the compiler. Scalar instructions are simpler and compiler friendly. Texture instructions remain vector based, taking a source coordinate vector and returning a filtered color vector.

High-level graphics and computing-language compilers generate intermediate instructions, such as DX10 vector or PTX scalar instructions,10,2 which are then optimized and translated to binary GPU instructions. The optimizer readily expands DX10 vector instructions to multiple Tesla SM scalar instructions. PTX scalar instructions optimize to Tesla SM scalar instructions about one to one. PTX provides a stable target ISA for compilers and provides compatibility over several generations of GPUs with evolving binary instruction set architectures. Because the intermediate languages use virtual registers, the optimizer analyzes data dependencies and allocates real registers. It eliminates dead code, folds instructions together when feasible, and optimizes SIMT branch divergence and convergence points.
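To make the scalar compilation flow concrete, here is a rough sketch of our own (not actual compiler output) of how a single CUDA C statement might lower to scalar PTX-level operations that then map roughly one to one onto SM instructions:

    // CUDA C source: each thread computes one element.
    y[i] = a * x[i] + b;

    // Illustrative scalar PTX-level operations (operand details omitted):
    //   ld.global.f32   load x[i] from global memory into a register
    //   mad.f32         multiply-add: a * x[i] + b
    //   st.global.f32   store the result to y[i]
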
Instruction set architecture. The Tesla SM has a register-based instruction set including floating-point, integer, bit, conversion, transcendental, flow control, memory load/store, and texture operations.

Floating-point and integer operations include add, multiply, multiply-add, minimum, maximum, compare, set predicate, and conversions between integer and floating-point numbers. Floating-point instructions provide source operand modifiers for negation and absolute value. Transcendental function instructions include cosine, sine, binary exponential, binary logarithm, reciprocal, and reciprocal square root. Attribute interpolation instructions provide efficient generation of pixel attributes. Bitwise operators include shift left, shift right, logic operators, and move. Control flow includes branch, call, return, trap, and barrier synchronization.

The floating-point and integer instructions can also set per-thread status flags for zero, negative, carry, and overflow, which the thread program can use for conditional branching.

Memory access instructions. The texture instruction fetches and filters texture samples from memory via the texture unit. The ROP unit writes pixel-fragment output to memory.

To support computing and C/C++ language needs, the Tesla SM implements memory load/store instructions in addition to graphics texture fetch and pixel output. Memory load/store instructions use integer byte addressing with register-plus-offset address arithmetic to facilitate conventional compiler code optimizations.

For computing, the load/store instructions access three read/write memory spaces:

- local memory for per-thread, private, temporary data (implemented in external DRAM);
- shared memory for low-latency access to data shared by cooperating threads in the same SM; and
- global memory for data shared by all threads of a computing application (implemented in external DRAM).

The memory instructions load-global, store-global, load-shared, store-shared, load-local, and store-local access global, shared, and local memory. Computing programs use the fast barrier synchronization instruction to synchronize threads within the SM that communicate with each other via shared and global memory.
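The shared-memory and barrier mechanisms map directly onto CUDA. The following per-block sum reduction is a minimal sketch of our own (the array names and 256-thread block size are illustrative): it stages global data in shared memory and synchronizes with __syncthreads(), which compiles to the SM's barrier instruction.

    // Each block of 256 threads reduces 256 elements of 'in' to one
    // partial sum in 'out[blockIdx.x]'. Launch with 256 threads per block.
    __global__ void block_sum(const float *in, float *out, int n)
    {
        __shared__ float buf[256];                    // low-latency SM shared memory

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;    // global -> shared
        __syncthreads();                              // barrier across the block

        // Tree reduction within shared memory.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                buf[threadIdx.x] += buf[threadIdx.x + stride];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            out[blockIdx.x] = buf[0];                 // shared -> global
    }
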
To improve memory bandwidth and reduce overhead, the local and global load/store instructions coalesce individual parallel thread accesses from the same warp into fewer memory block accesses. The addresses must fall in the same block and meet alignment criteria. Coalescing memory requests boosts performance significantly over separate requests. The large thread count, together with support for many outstanding load requests, helps cover local and global load-to-use latency for memory implemented in external DRAM.
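As a hedged illustration of the coalescing rule (the kernels and names are our own), consecutive threads of a warp that read consecutive, aligned words let the hardware merge the warp's loads into a few block accesses, whereas a strided pattern defeats the merge:

    // Coalesced: thread k of each warp reads word k of an aligned block.
    __global__ void copy_coalesced(const float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Strided: neighboring threads touch addresses far apart, so their
    // accesses fall in different blocks and cannot coalesce.
    __global__ void copy_strided(const float *in, float *out, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * stride];
    }
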
The latest Tesla architecture GPUs provide efficient atomic memory operations, including integer add, minimum, maximum, logic operators, swap, and compare-and-swap operations. Atomic operations facilitate parallel reductions and parallel data structure management.
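For example, a histogram can be accumulated with the atomic integer add that CUDA exposes as atomicAdd(); the bin count and names below are our own illustration:

    // Each thread classifies one value and atomically increments its bin.
    __global__ void histogram256(const int *values, int *bins, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int bin = values[i] & 255;     // 256 illustrative bins
            atomicAdd(&bins[bin], 1);      // atomic read-modify-write in memory
        }
    }
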
Streaming processor. The SP core is the primary thread processor in the SM. It performs the fundamental floating-point operations, including add, multiply, and multiply-add. It also implements a wide variety of integer, comparison, and conversion operations. The floating-point add and multiply operations are compatible with the IEEE 754 standard for single-precision FP numbers, including not-a-number (NaN) and infinity values. The unit is fully pipelined, and latency is optimized to balance delay and area.

The add and multiply operations use IEEE round-to-nearest-even as the default rounding mode. The multiply-add operation performs a multiplication with truncation, followed by an add with round-to-nearest-even. The SP flushes denormal source operands to sign-preserved zero and flushes results that underflow the target output exponent range to sign-preserved zero after rounding.
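When the truncated intermediate product of the hardware multiply-add matters numerically, CUDA provides intrinsics that keep the multiply and the add as two separately rounded IEEE operations; this is a sketch under the assumption that the surrounding code is ordinary single-precision arithmetic:

    // The plain expression may compile to the SP's MAD instruction,
    // whose intermediate product is truncated before the add.
    __device__ float axpb_default(float a, float x, float b)
    {
        return a * x + b;
    }

    // __fmul_rn and __fadd_rn each round to nearest even and are not
    // contracted into a MAD, at some cost in instruction count.
    __device__ float axpb_separate(float a, float x, float b)
    {
        return __fadd_rn(__fmul_rn(a, x), b);
    }
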
Special-function unit. The SFU supports computation of both transcendental functions and planar attribute interpolation.11 A traditional vertex or pixel shader design contains a functional unit to compute transcendental functions. Pixels also need an attribute-interpolating unit to compute the per-pixel attribute values at the pixel's (x, y) location, given the attribute values at the primitive's vertices.

For functional evaluation, we use quadratic interpolation based on enhanced minimax approximations to approximate the reciprocal, reciprocal square root, log2 x, 2^x, and sin/cos functions. Table 1 shows the accuracy of the function estimates. The SFU unit generates one 32-bit floating-point result per cycle.

Table 1. Function approximation statistics.

Function     Input interval   Accuracy (good bits)   ULP* error   % exactly rounded   Monotonic
1/x          [1, 2)           24.02                  0.98         87                  Yes
1/sqrt(x)    [1, 4)           23.40                  1.52         78                  Yes
2^x          [0, 1)           22.51                  1.41         74                  Yes
log2(x)      [1, 2)           22.57                  N/A**        N/A                 Yes
sin/cos      [0, π/2)         22.47                  N/A          N/A                 No

* ULP: unit-in-the-last-place.
** N/A: not applicable.

The SFU also supports attribute interpolation, to enable accurate interpolation of attributes such as color, depth, and texture coordinates. The SFU must interpolate these attributes in the (x, y) screen space to determine the values of the attributes at each pixel location. We express the value of a given attribute U in an (x, y) plane with plane equations of the following form:

    U(x, y) = (A_U * x + B_U * y + C_U) / (A_W * x + B_W * y + C_W)

where A, B, and C are interpolation parameters associated with each attribute U, and W is related to the distance of the pixel from the viewer for perspective projection. The attribute interpolation hardware in the SFU is fully pipelined, and it can interpolate four samples per cycle.

In a shader program, the SFU can generate perspective-corrected attributes as follows:

- Interpolate 1/W, and invert to form W.
- Interpolate U/W.
- Multiply U/W by W to form perspective-correct U.

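A direct transcription of these three steps into code, as a sketch of our own (the plane-coefficient structure and function names are assumptions, not part of the Tesla instruction set):

    // Plane-equation coefficients A, B, C for one interpolated quantity.
    struct Plane { float a, b, c; };

    __device__ float eval_plane(Plane p, float x, float y)
    {
        return p.a * x + p.b * y + p.c;
    }

    // Perspective-correct value of attribute U at pixel (x, y), given
    // plane equations for U/W and 1/W.
    __device__ float interp_attr(Plane u_over_w, Plane one_over_w,
                                 float x, float y)
    {
        float w   = 1.0f / eval_plane(one_over_w, x, y);  // interpolate 1/W, invert
        float uow = eval_plane(u_over_w, x, y);           // interpolate U/W
        return uow * w;                                   // perspective-correct U
    }
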
SM controller. The SMC controls multiple SMs, arbitrating the shared texture unit, load/store path, and I/O path. The SMC serves three graphics workloads simultaneously: vertex, geometry, and pixel. It packs each of these input types into the warp width, initiates shader processing, and unpacks the results.

Each input type has independent I/O paths, but the SMC is responsible for load balancing among them. The SMC supports static and dynamic load balancing based on driver-recommended allocations, current allocations, and relative difficulty of additional resource allocation. Load balancing of the workloads was one of the more challenging design problems due to its impact on overall SPA efficiency.

Texture unit
The texture unit processes one group of four threads (vertex, geometry, pixel, or compute) per cycle. Texture instruction sources are texture coordinates, and the outputs are filtered samples, typically a four-component (RGBA) color. Texture is a separate unit external to the SM, connected via the SMC. The issuing SM thread can continue execution until a data dependency stall.

Each texture unit has four texture address generators and eight filter units, for a peak GeForce 8800 Ultra rate of 38.4 gigabilerps/s (a bilerp is a bilinear interpolation of four samples). Each unit supports full-speed 2:1 anisotropic filtering, as well as high-dynamic-range (HDR) 16-bit and 32-bit floating-point data format filtering.

The texture unit is deeply pipelined. Although it contains a cache to capture filtering locality, it streams hits mixed with misses without stalling.
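On the computing side, the same texture unit is reachable from CUDA through texture fetches. A minimal hedged sketch (the texture reference and kernel are our own illustration) of a filtered 2D read:

    // 2D float texture reference; the host binds it to a CUDA array
    // with cudaBindTextureToArray() before launching the kernel.
    texture<float, 2, cudaReadModeElementType> texRef;

    __global__ void sample_texture(float *out, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            // tex2D issues a texture instruction; the texture unit performs
            // addressing and filtering and returns the sample.
            out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);
        }
    }
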
Rasterization
Geometry primitives output from the SMs go in their original round-robin input order to the viewport/clip/setup/raster/zcull block. The viewport and clip units clip the primitives to the standard view frustum and to any enabled user clip planes. They transform postclipping vertices into screen (pixel) space and reject whole primitives outside the view volume as well as back-facing primitives.

Surviving primitives then go to the setup unit, which generates edge equations for the rasterizer. Attribute plane equations are also generated for linear interpolation of pixel attributes in the pixel shader. A coarse-rasterization stage generates all pixel tiles that are at least partially inside the primitive.

The zcull unit maintains a hierarchical z surface, rejecting pixel tiles if they are conservatively known to be occluded by previously drawn pixels. The rejection rate is up to 256 pixels per clock. The screen is subdivided into tiles; each TPC processes a predetermined subset. The pixel tile address therefore selects the destination TPC. Pixel tiles that survive zcull then go to a fine-rasterization stage that generates detailed coverage information and depth values for the pixels.

OpenGL and Direct3D require that a depth test be performed after the pixel shader has generated final color and depth values. When possible, for certain combinations of API state, the Tesla GPU performs the depth test and update ahead of the fragment shader, possibly saving thousands of cycles of processing time, without violating the API-mandated semantics.

The SMC assembles surviving pixels into warps to be processed by an SM running the current pixel shader. When the pixel shader has finished, the pixels are optionally depth tested if this was not done ahead of the shader. The SMC then sends surviving pixels and associated data to the ROP.

Raster operations processor
Each ROP is paired with a specific memory partition. The TPCs feed data to the ROPs via an interconnection network. ROPs handle depth and stencil testing and updates and color blending and updates. The memory controller uses lossless color (up to 8:1) and depth compression (up to 8:1) to reduce bandwidth. Each ROP has a peak rate of four pixels per clock and supports 16-bit floating-point and 32-bit floating-point HDR formats. ROPs support double-rate-depth processing when color writes are disabled.

Each memory partition is 64 bits wide and supports double-data
