`
`NVIDIA TESLA: A UNIFIED
`GRAPHICS AND
`COMPUTING ARCHITECTURE
`
`........................................................................................................................................................................................................................................................
`TO ENABLE FLEXIBLE, PROGRAMMABLE GRAPHICS AND HIGH-PERFORMANCE COMPUTING,
`
`NVIDIA HAS DEVELOPED THE TESLA SCALABLE UNIFIED GRAPHICS AND PARALLEL
`
`COMPUTING ARCHITECTURE. ITS SCALABLE PARALLEL ARRAY OF PROCESSORS IS
`
`MASSIVELY MULTITHREADED AND PROGRAMMABLE IN C OR VIA GRAPHICS APIS.
`
`Erik Lindholm
`John Nickolls
`Stuart Oberman
`John Montrym
`NVIDIA
`
The modern 3D graphics processing unit (GPU) has evolved from a fixed-
`function graphics pipeline to a programma-
`ble parallel processor with computing power
`exceeding that of multicore CPUs. Tradi-
`tional graphics pipelines consist of separate
`programmable stages of vertex processors
`executing vertex shader programs and pixel
`fragment processors executing pixel shader
`programs. (Montrym and Moreton provide
`additional background on the traditional
`graphics processor architecture.1)
`NVIDIA’s Tesla architecture, introduced
`in November 2006 in the GeForce 8800
`GPU, unifies the vertex and pixel processors
`and extends them, enabling high-perfor-
`mance parallel computing applications writ-
`ten in the C language using the Compute
`Unified Device Architecture (CUDA2–4)
`parallel programming model and develop-
`ment tools. The Tesla unified graphics and
`computing architecture is available in a
`scalable family of GeForce 8-series GPUs
`and Quadro GPUs for laptops, desktops,
`workstations, and servers. It also provides
`the processing architecture for the Tesla
`GPU computing platforms introduced in
`2007 for high-performance computing.
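
For a concrete sense of the CUDA model, the following is a minimal sketch (kernel and variable names are illustrative): each thread computes one element of the result.

```cuda
// Minimal CUDA sketch: one thread computes one element of the result array.
// Names (scaleAdd, n, a, x, y) are illustrative only.
#include <cuda_runtime.h>

__global__ void scaleAdd(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        y[i] = a * x[i] + y[i];                     // one multiply-add per thread
}

// Host-side launch: enough 256-thread blocks to cover n elements.
void launchScaleAdd(int n, float a, const float *d_x, float *d_y)
{
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleAdd<<<blocks, threadsPerBlock>>>(n, a, d_x, d_y);
}
```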
`
In this article, we discuss the requirements that drove the unified graphics and parallel computing processor architecture, describe the Tesla architecture, and explain how it is enabling widespread deployment of parallel computing and graphics applications.
`
`The road to unification
`The first GPU was the GeForce 256,
`introduced in 1999. It contained a fixed-
`function 32-bit floating-point vertex trans-
`form and lighting processor and a fixed-
`function integer pixel-fragment pipeline,
`which were programmed with OpenGL
`and the Microsoft DX7 API.5 In 2001,
the GeForce 3 introduced the first programmable vertex processor executing vertex shaders, along with a configurable 32-bit floating-point fragment pipeline, programmed with DX8 and OpenGL.5,6 The
`Radeon 9700, introduced in 2002, featured
`a programmable 24-bit floating-point pixel-
`fragment processor programmed with DX9
`and OpenGL.7,8 The GeForce FX added 32-
`bit floating-point pixel-fragment processors.
The Xbox 360 introduced an early unified GPU in 2005, allowing vertices and pixels to execute on the same processor.9
`........................................................................
`
`39
`
`
`
`.........................................................................................................................................................................................................................
`HOT CHIPS 19
`
Vertex processors operate on the vertices of primitives such as points, lines, and triangles. Typical operations include transforming coordinates into screen space, which are then fed to the setup unit and the rasterizer, and setting up lighting and texture parameters to be used by the pixel-fragment processors. Pixel-fragment processors operate on rasterizer output, which fills the interior of primitives, along with the interpolated parameters.
Vertex and pixel-fragment processors have evolved at different rates: Vertex processors were designed for low-latency, high-precision math operations, whereas pixel-fragment processors were optimized for high-latency, lower-precision texture filtering. Vertex processors have traditionally supported more-complex processing, so they became programmable first. For the last six years, the two processor types have been functionally converging as the result of a need for greater programming generality. However, the increased generality also increased the design complexity, area, and cost of developing two separate processors.
Because GPUs typically must process more pixels than vertices, pixel-fragment processors traditionally outnumber vertex processors by about three to one. However, typical workloads are not well balanced, leading to inefficiency. For example, with large triangles, the vertex processors are mostly idle, while the pixel processors are fully busy. With small triangles, the opposite is true. The addition of more-complex primitive processing in DX10 makes it much harder to select a fixed processor ratio.10 All these factors influenced the decision to design a unified architecture.
`A primary design objective for Tesla was
`to execute vertex and pixel-fragment shader
`programs on the same unified processor
`architecture. Unification would enable dy-
`namic load balancing of varying vertex- and
`pixel-processing workloads and permit the
`introduction of new graphics shader stages,
`such as geometry shaders in DX10. It also
`let a single team focus on designing a fast
`and efficient processor and allowed the
sharing of expensive hardware such as the texture units. The generality required of a
`unified processor opened the door to a
`completely new GPU parallel-computing
`capability. The downside of this generality
`was the difficulty of efficient load balancing
`between different shader types.
`Other critical hardware design require-
`ments were architectural scalability, perfor-
`mance, power, and area efficiency.
The Tesla architects developed the graphics feature set in coordination with the development of the Microsoft Direct3D DirectX 10 graphics API.10 They developed the GPU’s computing feature set in coordination with the development of the CUDA C parallel programming language, compiler, and development tools.
`
`Tesla architecture
`The Tesla architecture is based on a
`scalable processor array. Figure 1 shows a
`block diagram of a GeForce 8800 GPU
`with 128 streaming-processor (SP) cores
`organized as 16 streaming multiprocessors
`(SMs) in eight independent processing units
called texture/processor clusters (TPCs). Work flows from top to bottom, starting at the host interface with the system PCI-Express bus. Because of its unified-processor design, the physical Tesla architecture doesn’t resemble the logical order of graphics pipeline stages. However, we will use the logical graphics pipeline flow to explain the architecture.
`At the highest level, the GPU’s scalable
`streaming processor array (SPA) performs
`all the GPU’s programmable calculations.
The scalable memory system consists of external DRAM control and fixed-function raster operation processors (ROPs) that perform color and depth frame buffer operations directly on memory. An interconnection network carries computed pixel-fragment colors and depth values from the SPA to the ROPs. The network also routes texture memory read requests from the SPA to DRAM and read data from DRAM through a level-2 cache back to the SPA.
`The remaining blocks in Figure 1 deliver
`input work to the SPA. The input assembler
`collects vertex work as directed by the input
command stream.

Figure 1. Tesla unified graphics and computing GPU architecture. TPC: texture/processor cluster; SM: streaming multiprocessor; SP: streaming processor; Tex: texture; ROP: raster operation processor.

The vertex work distribution block distributes vertex work packets
`to the various TPCs in the SPA. The TPCs
`execute vertex shader programs, and (if
`enabled) geometry shader programs. The
`resulting output data is written to on-chip
`buffers. These buffers then pass their results
`to the viewport/clip/setup/raster/zcull block
`to be rasterized into pixel fragments. The
pixel work distribution unit distributes pixel fragments to the appropriate TPCs for pixel-fragment processing. Shaded pixel fragments are sent across the interconnection network for processing by depth and color ROP units. The compute work distribution block dispatches compute thread arrays to the TPCs. The SPA accepts and processes work for multiple logical streams simultaneously. Multiple clock domains for GPU units, processors, DRAM, and other units allow independent power and performance optimizations.
`
`Command processing
The GPU host interface unit communicates with the host CPU, responds to commands from the CPU, fetches data from system memory, checks command consistency, and performs context switching.
The input assembler collects geometric primitives (points, lines, triangles, line strips, and triangle strips) and fetches associated vertex input attribute data. It has peak rates of one primitive per clock and eight scalar attributes per clock at the GPU core clock, which is typically 600 MHz.
`The work distribution units forward the
`input assembler’s output stream to the array
`of processors, which execute vertex, geom-
`etry, and pixel shader programs, as well as
`computing programs. The vertex and com-
`pute work distribution units deliver work to
processors in a round-robin scheme. Pixel work distribution is based on the pixel location.

Figure 2. Texture/processor cluster (TPC).
`
`Streaming processor array
`The SPA executes graphics shader thread
`programs and GPU computing programs
`and provides thread control and manage-
`ment. Each TPC in the SPA roughly
`corresponds to a quad-pixel unit in previous
`architectures.1 The number of TPCs deter-
`mines a GPU’s programmable processing
`performance and scales from one TPC in a
`small GPU to eight or more TPCs in high-
`performance GPUs.
`
`Texture/processor cluster
As Figure 2 shows, each TPC contains a geometry controller, an SM controller (SMC), two streaming multiprocessors (SMs), and a texture unit. Figure 3 expands each SM to show its eight SP cores. To balance the expected ratio of math operations to texture operations, one texture unit
`serves two SMs. This architectural ratio can
`vary as needed.
`
`Geometry controller
`The geometry controller maps the logical
`graphics vertex pipeline into recirculation
`on the physical SMs by directing all
`primitive and vertex attribute and topology
`flow in the TPC. It manages dedicated on-
`chip input and output vertex attribute
`storage and forwards contents as required.
`DX10 has two stages dealing with vertex
`and primitive processing: the vertex shader
`and the geometry shader. The vertex shader
`processes one vertex’s attributes indepen-
`dently of other vertices. Typical operations
`are position space transforms and color and
`texture coordinate generation. The geome-
`try shader follows the vertex shader and
`deals with a whole primitive and its vertices.
Typical operations are edge extrusion for stencil shadow generation and cube map texture generation. Geometry shader output primitives go to later stages for clipping, viewport transformation, and rasterization into pixel fragments.

Figure 3. Streaming multiprocessor (SM).
`
`Streaming multiprocessor
`The SM is a unified graphics and
computing multiprocessor that executes
`vertex, geometry, and pixel-fragment shader
`programs and parallel computing programs.
`As Figure 3 shows, the SM consists of eight
`streaming processor (SP) cores, two special-
`function units
`(SFUs), a multithreaded
`instruction fetch and issue unit (MT Issue),
`an instruction cache, a read-only constant
`cache, and a 16-Kbyte read/write shared
`memory.
`The shared memory holds graphics input
`buffers or shared data for parallel comput-
ing. To pipeline graphics workloads
`through the SM, vertex, geometry, and
`pixel threads have independent input and
`output buffers. Workloads can arrive and
`depart independently of thread execution.
`Geometry threads, which generate variable
`amounts of output per thread, use separate
`output buffers.
`Each SP core contains a scalar multiply-
`add (MAD) unit, giving the SM eight
MAD units. The SM uses its two SFU units for transcendental functions and attribute interpolation—the interpolation of pixel attributes from vertex attributes defining a primitive. Each SFU also contains four floating-point multipliers. The SM uses the TPC texture unit as a third execution unit and uses the SMC and ROP units to implement external memory load, store, and atomic accesses. A low-latency interconnect network between the SPs and the shared-memory banks provides shared-memory access.

The GeForce 8800 Ultra clocks the SPs and SFU units at 1.5 GHz, for a peak of 36 Gflops per SM. To optimize power and area efficiency, some SM non-data-path units operate at half the SP clock rate.

SM multithreading. A graphics vertex or pixel shader is a program for a single thread that describes how to process a vertex or a pixel. Similarly, a CUDA kernel is a C program for a single thread that describes how one thread computes a result. Graphics and computing applications instantiate many parallel threads to render complex images and compute large result arrays. To dynamically balance shifting vertex and pixel shader thread workloads, the unified SM concurrently executes different thread programs and different types of shader programs.

To efficiently execute hundreds of threads in parallel while running several different programs, the SM is hardware multithreaded. It manages and executes up to 768 concurrent threads in hardware with zero scheduling overhead.

To support the independent vertex, primitive, pixel, and thread programming model of graphics shading languages and the CUDA C/C++ language, each SM thread has its own thread execution state and can execute an independent code path. Concurrent threads of computing programs can synchronize at a barrier with a single SM instruction. Lightweight thread creation, zero-overhead thread scheduling, and fast barrier synchronization support very fine-grained parallelism efficiently.
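
As a rough sketch of what these limits mean in practice (the 256-thread block size is an arbitrary example, and the warp grouping is described in the next section):

```cuda
// Residency arithmetic implied by the limits above: up to 768 concurrent
// threads per SM, scheduled in 32-thread groups (warps).
static const int kMaxThreadsPerSM = 768;  // stated above
static const int kWarpWidth       = 32;   // SIMT warp size
static const int kBlockSize       = 256;  // example block size chosen by the programmer

// 768 / 256 = 3 blocks can be resident on one SM at a time, which fills its
// thread pool exactly: 3 blocks x 8 warps per block = 24 warps = 768 threads.
static const int kBlocksPerSM   = kMaxThreadsPerSM / kBlockSize;  // 3
static const int kWarpsPerBlock = kBlockSize / kWarpWidth;        // 8
static const int kResidentWarps = kBlocksPerSM * kWarpsPerBlock;  // 24
```
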
Single-instruction, multiple-thread. To manage and execute hundreds of threads running several different programs efficiently, the Tesla SM uses a new processor architecture we call single-instruction, multiple-thread (SIMT). The SM’s SIMT multithreaded instruction unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. The term warp
`originates from weaving, the first parallel-
`thread technology. Figure 4 illustrates SIMT
`scheduling. The SIMT warp size of 32
`parallel threads provides efficiency on plen-
`tiful fine-grained pixel threads and comput-
`ing threads.
`Each SM manages a pool of 24 warps,
`with a total of 768 threads. Individual
`threads composing a SIMT warp are of the
`same type and start together at the same
`program address, but they are otherwise free
`to branch and execute independently. At
each instruction issue time, the SIMT multithreaded instruction unit selects a
`warp that is ready to execute and issues
`the next instruction to that warp’s active
`threads. A SIMT instruction is broadcast
`synchronously to a warp’s active parallel
`threads; individual threads can be inactive
`due to independent branching or predica-
`tion.
`The SM maps the warp threads to the SP
`cores, and each thread executes indepen-
`dently with its own instruction address and
`register state. A SIMT processor realizes full
`efficiency and performance when all 32
`threads of a warp take the same execution
`path. If threads of a warp diverge via a data-
`dependent conditional branch,
`the warp
`serially executes each branch path taken,
`disabling threads that are not on that path,
`and when all paths complete, the threads
`reconverge to the original execution path.
`The SM uses a branch synchronization stack
`to manage independent threads that diverge
`and converge. Branch divergence only
`occurs within a warp; different warps
`execute independently regardless of whether
`they are executing common or disjoint code
`paths. As a result, Tesla architecture GPUs
`are dramatically more efficient and flexible
`on branching code than previous generation
`GPUs, as their 32-thread warps are much
`narrower than the SIMD width of prior
`GPUs.1
`
Figure 4. Single-instruction, multiple-thread (SIMT) warp scheduling.
`
`SIMT architecture is similar to single-
`instruction, multiple-data (SIMD) design,
`which applies one instruction to multiple
`data lanes. The difference is that SIMT
`applies one instruction to multiple inde-
`pendent threads in parallel, not just multi-
`ple data lanes. A SIMD instruction controls
`a vector of multiple data lanes together and
`exposes the vector width to the software,
`whereas a SIMT instruction controls the
`execution and branching behavior of one
`thread.
`In contrast to SIMD vector architectures,
`SIMT enables programmers to write thread-
`level parallel code for independent threads
`as well as data-parallel code for coordinated
`threads. For program correctness, program-
`mers can essentially ignore SIMT execution
`attributes such as warps; however, they can
`achieve substantial performance improve-
`ments by writing code that seldom requires
`threads in a warp to diverge. In practice, this
is analogous to the role of cache lines in
`traditional codes: Programmers can safely
`ignore cache line size when designing for
`correctness but must consider it in the code
`structure when designing for peak perfor-
`mance. SIMD vector architectures, on the
`other hand, require the software to manu-
`ally coalesce loads
`into vectors and to
`manually manage divergence.
`
`SIMT warp scheduling. The SIMT ap-
`proach of scheduling independent warps is
`simpler than previous GPU architectures’
`complex scheduling. A warp consists of up
`to 32 threads of the same type—vertex,
`geometry, pixel, or compute. The basic unit
of pixel-fragment shader processing is the 2 × 2 pixel quad. The SM controller groups
`eight pixel quads into a warp of 32 threads.
`It similarly groups vertices and primitives
`into warps and packs 32 computing threads
`into a warp. The SIMT design shares the
`SM instruction fetch and issue unit effi-
`ciently across 32 threads but requires a full
`warp of active threads for full performance
`efficiency.
`As a unified graphics processor, the SM
`schedules and executes multiple warp types
concurrently—for example, concurrently
`executing vertex and pixel warps. The SM
`warp scheduler operates at half the 1.5-GHz
`processor clock rate. At each cycle, it selects
`one of the 24 warps to execute a SIMT warp
`instruction, as Figure 4 shows. An issued
`warp instruction executes as two sets of 16
`threads over four processor cycles. The SP
`cores and SFU units execute instructions
`independently, and by issuing instructions
between them on alternate cycles, the scheduler can keep both fully occupied.
`Implementing zero-overhead warp sched-
`uling for a dynamic mix of different warp
`programs and program types was a chal-
`lenging design problem. A scoreboard
`qualifies each warp for issue each cycle.
`The instruction scheduler prioritizes all
`ready warps and selects the one with highest
`priority for issue. Prioritization considers
warp type, instruction type, and “fairness”
`to all warps executing in the SM.
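
A small sketch of how a block's threads map onto warps under this grouping (the output arrays are hypothetical):

```cuda
// Each group of 32 consecutive threads in a block forms one warp.
// warpSize is a CUDA built-in constant (32 on Tesla-architecture GPUs).
__global__ void warpInfo(int *warpIdOut, int *laneIdOut)
{
    int tid  = threadIdx.x;        // thread index within its block
    int warp = tid / warpSize;     // which warp of the block the thread belongs to
    int lane = tid % warpSize;     // the thread's lane within that warp
    int i = blockIdx.x * blockDim.x + tid;
    warpIdOut[i] = warp;
    laneIdOut[i] = lane;
}
```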
`
`SM instructions. The Tesla SM executes
`scalar instructions, unlike previous GPU
`vector
`instruction architectures. Shader
`
`programs are becoming longer and more
`scalar, and it is increasingly difficult to fully
`occupy even two components of the prior
`four-component vector architecture. Previ-
`ous architectures employed vector pack-
`ing—combining sub-vectors of work to
`gain efficiency—but that complicated the
`scheduling hardware as well as the compiler.
`Scalar instructions are simpler and compiler
`friendly. Texture instructions remain vector
`based, taking a source coordinate vector and
`returning a filtered color vector.
`High-level graphics and computing-lan-
`guage compilers generate intermediate in-
`structions, such as DX10 vector or PTX
`scalar instructions,10,2 which are then opti-
`mized and translated to binary GPU
`instructions. The optimizer readily expands
`DX10 vector instructions to multiple Tesla
`SM scalar instructions. PTX scalar instruc-
`tions optimize to Tesla SM scalar instruc-
`tions about one to one. PTX provides a
`stable target ISA for compilers and provides
`compatibility over several generations of
`GPUs with evolving binary instruction set
`architectures. Because the intermediate lan-
`guages use virtual registers, the optimizer
`analyzes data dependencies and allocates
`real registers. It eliminates dead code, folds
`instructions
`together when feasible, and
`optimizes SIMT branch divergence and
`convergence points.
`
`Instruction set architecture. The Tesla SM
`has a register-based instruction set including
`floating-point, integer, bit, conversion, tran-
`scendental, flow control, memory load/store,
`and texture operations.
`Floating-point and integer operations
`include add, multiply, multiply-add, mini-
`mum, maximum, compare, set predicate,
`and conversions between integer and float-
`ing-point numbers. Floating-point instruc-
`tions provide source operand modifiers for
negation and absolute value. Transcendental function instructions include cosine, sine, binary exponential, binary logarithm, reciprocal, and reciprocal square root. Attribute interpolation instructions provide efficient generation of pixel attributes.
`Bitwise operators include shift left, shift
right, logic operators, and move. Control flow includes branch, call, return, trap, and
`barrier synchronization.
`The floating-point and integer instruc-
`tions can also set per-thread status flags for
`zero, negative, carry, and overflow, which
`the thread program can use for conditional
`branching.
`
Memory access instructions. The texture instruction fetches and filters texture samples from memory via the texture unit. The ROP unit writes pixel-fragment output to memory.
To support computing and C/C++ language needs, the Tesla SM implements memory load/store instructions in addition to graphics texture fetch and pixel output. Memory load/store instructions use integer byte addressing with register-plus-offset address arithmetic to facilitate conventional compiler code optimizations.
For computing, the load/store instructions access three read/write memory spaces:

- local memory for per-thread, private, temporary data (implemented in external DRAM);
- shared memory for low-latency access to data shared by cooperating threads in the same SM; and
- global memory for data shared by all threads of a computing application (implemented in external DRAM).
`
The memory instructions load-global, store-global, load-shared, store-shared, load-local, and store-local access global, shared, and local memory. Computing programs use the fast barrier synchronization instruction to synchronize threads within the SM that communicate with each other via shared and global memory.
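
A compact sketch of how the three spaces and the barrier appear in CUDA C (the tile size and the reverse-a-tile computation are arbitrary examples):

```cuda
// Illustrative use of global memory, shared memory, per-thread locals,
// and the barrier synchronization described above.
#define TILE 256

__global__ void reverseTiles(const int *in, int *out)  // in, out: global memory
{
    __shared__ int tile[TILE];     // shared memory, visible to the whole thread block
    int lo = threadIdx.x;          // per-thread locals (registers or local memory)
    int hi = TILE - 1 - lo;
    int base = blockIdx.x * TILE;

    tile[lo] = in[base + lo];      // load-global, then store-shared
    __syncthreads();               // barrier: every store to the tile is now visible
    out[base + lo] = tile[hi];     // load-shared, then store-global
}
```
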
`To improve memory bandwidth and
`reduce overhead, the local and global load/
`store instructions coalesce individual paral-
`lel thread accesses from the same warp into
fewer memory block accesses. The addresses must fall in the same block and meet alignment criteria. Coalescing memory requests boosts performance significantly over separate requests. The large thread count, together with support for many outstanding load requests, helps cover local and global load-to-use latency for memory implemented in external DRAM.
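
For illustration, the two kernels below differ only in indexing: in the first, a warp's 32 accesses are contiguous and can be coalesced into a few block transfers; in the second, they scatter into many separate transactions.

```cuda
// Coalescing sketch: contiguous per-warp addresses versus strided addresses.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                 // neighboring threads read neighboring elements
}

__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[(i * stride) % n];  // widely separated addresses defeat coalescing
}
```
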
The latest Tesla architecture GPUs provide efficient atomic memory operations, including integer add, minimum, maximum, logic operators, swap, and compare-and-swap operations. Atomic operations facilitate parallel reductions and parallel data structure management.
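
A typical use of the integer atomic add, sketched here with hypothetical names, is a histogram built directly in global memory:

```cuda
// Histogram sketch: many threads may increment the same bin concurrently;
// atomicAdd makes each read-modify-write indivisible, so no count is lost.
__global__ void histogram(const unsigned char *values, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[values[i]], 1u);   // atomic add on a global-memory word
}
```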
`
Streaming processor. The SP core is the primary thread processor in the SM. It performs the fundamental floating-point operations, including add, multiply, and multiply-add. It also implements a wide variety of integer, comparison, and conversion operations. The floating-point add and multiply operations are compatible with the IEEE 754 standard for single-precision FP numbers, including not-a-number (NaN) and infinity values. The unit is fully pipelined, and latency is optimized to balance delay and area.
`The add and multiply operations use
`IEEE round-to-nearest-even as the default
`rounding mode. The multiply-add opera-
`tion performs a multiplication with trunca-
`tion, followed by an add with round-to-
`nearest-even. The SP flushes denormal
`source operands to sign-preserved zero and
`flushes results that underflow the target
`output exponent range to sign-preserved
`zero after rounding.
`
`Special-function unit. The SFU supports
`computation of both transcendental func-
`tions and planar attribute interpolation.11 A
`traditional vertex or pixel shader design
contains a functional unit to compute transcendental functions. Pixels also need
`an attribute-interpolating unit to compute
`the per-pixel attribute values at the pixel’s x,
`y location, given the attribute values at the
`primitive’s vertices.
For functional evaluation, we use quadratic interpolation based on enhanced minimax approximations to approximate the reciprocal, reciprocal square root, log2(x), 2^x, and sin/cos functions. Table 1 shows the accuracy of the function estimates. The SFU unit generates one 32-bit floating-point result per cycle.
`
`
`
Table 1. Function approximation statistics.

Function     Input interval   Accuracy (good bits)   ULP* error   % exactly rounded   Monotonic
1/x          [1, 2)           24.02                  0.98         87                  Yes
1/sqrt(x)    [1, 4)           23.40                  1.52         78                  Yes
2^x          [0, 1)           22.51                  1.41         74                  Yes
log2(x)      [1, 2)           22.57                  N/A**        N/A                 Yes
sin/cos      [0, π/2)         22.47                  N/A          N/A                 No

* ULP: unit-in-the-last-place.
** N/A: not applicable.
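
In CUDA C, approximate function paths of this kind are exposed through fast-math device intrinsics; the mapping sketched below is an observation about the CUDA toolkit rather than a claim about which hardware path each call takes.

```cuda
// Fast-math intrinsics with reduced, Table 1-like accuracy rather than
// IEEE-rounded results.
__global__ void fastMathDemo(const float *x, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        out[i] = __fdividef(1.0f, v)   // fast reciprocal-based division
               + rsqrtf(v)             // approximate reciprocal square root
               + __log2f(v)            // approximate log2(x)
               + __expf(v)             // approximate exponential
               + __sinf(v);            // approximate sine
    }
}
```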
`
`The SFU also supports attribute interpo-
`lation, to enable accurate interpolation of
`attributes such as color, depth, and texture
`coordinates. The SFU must
`interpolate
`these attributes in the (x, y) screen space
`to determine the values of the attributes at
`each pixel location. We express the value of
`a given attribute U in an (x, y) plane in
`plane equations of the following form:
U(x, y) = (AU × x + BU × y + CU) / (AW × x + BW × y + CW)
`
where A, B, and C are interpolation parameters associated with each attribute U, and W is related to the distance of the pixel from the viewer for perspective projection. The attribute interpolation hardware in the SFU is fully pipelined, and it can interpolate four samples per cycle.
In a shader program, the SFU can generate perspective-corrected attributes as follows:

- Interpolate 1/W, and invert to form W.
- Interpolate U/W.
- Multiply U/W by W to form perspective-correct U.
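
A direct transcription of those three steps (coefficient names are illustrative; the plane (AU, BU, CU) interpolates U/W and the plane (AW, BW, CW) interpolates 1/W):

```cuda
// Perspective-correct interpolation of one attribute U at pixel (x, y),
// following the three steps above.
__device__ float interpolateAttribute(float x, float y,
                                      float AU, float BU, float CU,
                                      float AW, float BW, float CW)
{
    float oneOverW = AW * x + BW * y + CW;   // step 1: interpolate 1/W ...
    float w        = 1.0f / oneOverW;        //         ... and invert to form W
    float uOverW   = AU * x + BU * y + CU;   // step 2: interpolate U/W
    return uOverW * w;                       // step 3: multiply to recover U
}
```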
`
SM controller. The SMC controls multiple SMs, arbitrating the shared texture unit, load/store path, and I/O path. The SMC serves three graphics workloads simultaneously: vertex, geometry, and pixel. It packs each of these input types into the warp width, initiates shader processing, and unpacks the results.
`Each input type has independent I/O
`paths, but the SMC is responsible for load
`balancing among them. The SMC supports
`static and dynamic load balancing based on
`driver-recommended allocations,
`current
`allocations, and relative difficulty of addi-
`tional resource allocation. Load balancing of
`the workloads was one of
`the more
`challenging design problems due to its
`impact on overall SPA efficiency.
`
`Texture unit
`The texture unit processes one group of
`four threads (vertex, geometry, pixel, or
`compute) per cycle. Texture instruction
`sources are texture coordinates, and the
`outputs are filtered samples,
`typically a
`four-component (RGBA) color. Texture is
`a separate unit external to the SM connect-
`ed via the SMC. The issuing SM thread can
`continue execution until a data dependency
`stall.
`Each texture unit has four texture address
`generators and eight filter units, for a peak
`GeForce 8800 Ultra rate of 38.4 gigabi-
`lerps/s (a bilerp is a bilinear interpolation of
`four samples). Each unit supports full-speed
`2:1 anisotropic filtering, as well as high-
`dynamic-range (HDR) 16-bit and 32-bit
`floating-point data format filtering.
`The texture unit
`is deeply pipelined.
`Although it contains a cache to capture
`filtering locality, it streams hits mixed with
`misses without stalling.
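
One accounting that reproduces this peak (an inference from the figures above, assuming one bilerp per filter unit per core clock): eight TPCs, each with one texture unit containing eight filter units, at the roughly 600-MHz core clock.

```cuda
// Consistency check for the 38.4-Gbilerp/s figure, assuming one bilerp per
// filter unit per core clock (an assumption, not stated explicitly above).
static const int    kTPCs              = 8;    // GeForce 8800: eight TPCs
static const int    kFilterUnitsPerTex = 8;    // filter units per texture unit
static const double kCoreClockGHz      = 0.6;  // GPU core clock, about 600 MHz
static const double kPeakGbilerps =
    kTPCs * kFilterUnitsPerTex * kCoreClockGHz;  // 8 * 8 * 0.6 = 38.4
```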
`
`........................................................................
`
`MARCH–APRIL 2008
`
`47
`
`
`
`.........................................................................................................................................................................................................................
`HOT CHIPS 19
`
`Rasterization
Geometry primitives output from the SMs go in their original round-robin input order to the viewport/clip/setup/raster/zcull
`block. The viewport and clip units clip the
`primitives to the standard view frustum and
`to any enabled user clip planes. They
`transform postclipping vertices into screen
`(pixel) space and reject whole primitives
`outside the view volume as well as back-
`facing primitives.
`Surviving primitives then go to the setup
`unit, which generates edge equations for the
`rasterizer. Attribute plane equations are also
`generated for linear interpolation of pixel
`attributes in the pixel shader. A coarse-
`rasterization stage generates all pixel tiles
`that are at least partially inside the primi-
`tive.
The zcull unit maintains a hierarchical z surface, rejecting pixel tiles if they are
`conservatively known to be occluded by
`previously drawn pixels. The rejection rate
`is up to 256 pixels per clock. The screen is
`subdivided into tiles; each TPC processes a
`predetermined subset. The pixel tile address
`therefore selects the destination TPC. Pixel
`tiles that survive zcull then go to a fine-
`rasterization stage that generates detailed
`coverage information and depth values for
`the pixels.
`OpenGL and Direct3D require that a
`depth test be performed after the pixel
`shader has generated final color and depth
values. When possible, for certain combinations of API state, the Tesla GPU performs the depth test and update ahead of the fragment shader, possibly saving thousands of cycles of processing time, without violating the API-mandated semantics.
`The SMC assembles surviving pixels into
warps to be processed by an SM running the
`current pixel shader. When the pixel shader
`has finished, the pixels are optionally depth
`tested if this was not done ahead of the
shader. The SMC then sends surviving pixels and associated data to the ROP.
`
`Raster operations processor
`Each ROP is paired with a specific
`memory partition. The TPCs feed data to
`the ROPs via an interconnection network.
`
`ROPs handle depth and stencil testing and
`updates and color blending and updates.
The memory controller uses lossless color compression (up to 8:1) and depth compression (up to 8:1) to reduce bandwidth. Each ROP has a
`peak rate of
`four pixels per clock and
`supports 16-bit floating-point and 32-bit
`floating-point HDR formats. ROPs support
`double-rate-depth processing when color
`writes are disabled.
`Each memory partition is 64 bits wide
`and supports double-data