To appear in ISCA-27 (2000)

Memory Access Scheduling

Scott Rixner¹, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens

Computer Systems Laboratory
Stanford University
Stanford, CA 94305
{rixner, billd, ujk, pmattson, jowens}@cva.stanford.edu

¹ Scott Rixner is an Electrical Engineering graduate student at the Massachusetts Institute of Technology.

Abstract

The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the “3-D” structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors, where it enables the processor to make the most efficient use of scarce memory bandwidth.

1 Introduction

Modern computer systems are becoming increasingly limited by memory performance. While processor performance increases at a rate of 60% per year, the bandwidth of a memory chip increases by only 10% per year, making it costly to provide the memory bandwidth required to match processor performance [14] [17]. The memory bandwidth bottleneck is even more acute for media processors with streaming memory reference patterns that do not cache well. Without an effective cache to reduce the bandwidth demands on main memory, these media processors are more often limited by memory system bandwidth than other computer systems.

To maximize memory bandwidth, modern DRAM components allow pipelining of memory accesses, provide several independent memory banks, and cache the most recently accessed row of each bank. While these features increase the peak supplied memory bandwidth, they also make the performance of the DRAM highly dependent on the access pattern. Modern DRAMs are not truly random access devices (equal access time to all locations) but rather are three-dimensional memory devices with dimensions of bank, row, and column. Sequential accesses to different rows within one bank have high latency and cannot be pipelined, while accesses to different banks or different words within a single row have low latency and can be pipelined.

The three-dimensional nature of modern memory devices makes it advantageous to reorder memory operations to exploit the non-uniform access times of the DRAM. This optimization is similar to how a superscalar processor schedules arithmetic operations out of order. As with a superscalar processor, the semantics of sequential execution are preserved by reordering the results.

This paper introduces memory access scheduling, in which DRAM operations are scheduled, possibly completing memory references out of order, to optimize memory system performance. The several memory access scheduling strategies introduced in this paper increase the sustained memory bandwidth of a system by up to 144% over a system with no access scheduling when applied to realistic synthetic benchmarks. Media processing applications exhibit a 30% improvement in sustained memory bandwidth with memory access scheduling, and the traces of these applications offer a potential bandwidth improvement of up to 93%.

To see the advantage of memory access scheduling, consider the sequence of eight memory operations shown in Figure 1A. Each reference is represented by the triple (bank, row, column). Suppose we have a memory system utilizing a DRAM that requires 3 cycles to precharge a bank, 3 cycles to access a row of a bank, and 1 cycle to access a column of a row. Once a row has been accessed, a new column access can issue each cycle until the bank is precharged. If these eight references are performed in order, each requires a precharge, a row access, and a column access, for a total of seven cycles per reference, or 56 cycles for all eight references. If we reschedule these operations as shown in Figure 1B, they can be performed in 19 cycles.

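To make the arithmetic concrete, the following sketch, written for this discussion rather than taken from the paper, models the in-order case of Figure 1A: every reference in this trace targets a different row than the one open in its bank, so each pays the full precharge-activate-column sequence. (Reproducing the 19-cycle schedule of Figure 1B would additionally require overlapping precharge and activate operations across banks and grouping column accesses to open rows, which this simple model does not attempt.)

```c
#include <stdio.h>

#define NBANKS      2
#define T_PRECHARGE 3   /* cycles to precharge a bank      */
#define T_ACTIVATE  3   /* cycles to activate (open) a row */
#define T_COLUMN    1   /* cycles per column access        */

struct ref { int bank, row, col; };

/* Cycles to serve references strictly in order, with no overlap
 * between banks (as in Figure 1A).  Each bank starts with an
 * unrelated row open, so every row switch pays a full precharge
 * followed by an activate before its 1-cycle column access.     */
static int in_order_cycles(const struct ref *r, int n)
{
    int open_row[NBANKS];
    int cycles = 0;

    for (int b = 0; b < NBANKS; b++)
        open_row[b] = -1;               /* -1: some unrelated row */

    for (int i = 0; i < n; i++) {
        if (open_row[r[i].bank] != r[i].row) {
            cycles += T_PRECHARGE + T_ACTIVATE;
            open_row[r[i].bank] = r[i].row;
        }
        cycles += T_COLUMN;
    }
    return cycles;
}

int main(void)
{
    /* The eight references of Figure 1, as (bank, row, column). */
    struct ref refs[] = {
        {0,0,0}, {0,1,0}, {0,0,1}, {0,1,3},
        {1,0,0}, {1,1,1}, {1,0,1}, {1,1,2},
    };
    /* Prints 56: every reference switches rows within its bank. */
    printf("in-order: %d cycles\n", in_order_cycles(refs, 8));
    return 0;
}
```
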
[Figure 1 appears here. Panel (A), “Without access scheduling,” completes the eight references in 56 DRAM cycles; panel (B), “With access scheduling,” completes them in 19 DRAM cycles. DRAM operations: P = bank precharge (3 cycle occupancy); A = row activation (3 cycle occupancy); C = column access (1 cycle occupancy). The references, as (bank, row, column) triples, are (0,0,0), (0,1,0), (0,0,1), (0,1,3), (1,0,0), (1,1,1), (1,0,1), (1,1,2).]

Figure 1. Time to complete a series of memory references without (A) and with (B) access reordering.

The following section discusses the characteristics of modern DRAM architecture. Section 3 introduces the concept of memory access scheduling and the possible algorithms that can be used to reorder DRAM operations. Section 4 describes the streaming media processor and benchmarks that will be used to evaluate memory access scheduling. Section 5 presents a performance comparison of the various memory access scheduling algorithms. Finally, Section 6 presents work related to memory access scheduling.

2 Modern DRAM Architecture

As illustrated by the example in the Introduction, the order in which DRAM accesses are scheduled can have a dramatic impact on memory throughput and latency. To improve memory performance, a memory controller must take advantage of the characteristics of modern DRAM.

Figure 2 shows the internal organization of modern DRAMs. These DRAMs are three-dimensional memories with the dimensions of bank, row, and column. Each bank operates independently of the other banks and contains an array of memory cells that are accessed an entire row at a time. When a row of this memory array is accessed (row activation), the entire row of the memory array is transferred into the bank’s row buffer. The row buffer serves as a cache to reduce the latency of subsequent accesses to that row. While a row is active in the row buffer, any number of reads or writes (column accesses) may be performed, typically with a throughput of one per cycle. After completing the available column accesses, the cached row must be written back to the memory array by an explicit operation (bank precharge), which prepares the bank for a subsequent row activation. An overview of several different modern DRAM types and organizations, along with a performance comparison for in-order access, can be found in [4].

For example, the 128Mb NEC µPD45128163 [13], a typical SDRAM, includes four internal memory banks, each composed of 4096 rows and 512 columns. This SDRAM may be operated at 125MHz, with a precharge latency of 3 cycles (24ns) and a row access latency of 3 cycles (24ns). Pipelined column accesses that transfer 16 bits may issue at the rate of one per cycle (8ns), yielding a peak transfer rate of 250MB/s. However, it is difficult to achieve this rate on non-sequential access patterns for several reasons. A bank cannot be accessed during the precharge/activate latency, a single cycle of high impedance is required on the data pins when switching between read and write column accesses, and a single set of address lines is shared by all DRAM operations (bank precharge, row activation, and column access). The amount of bank parallelism that is exploited and the number of column accesses that are made per row access dictate the sustainable memory bandwidth out of such a DRAM, as illustrated in Figure 1 of the Introduction.

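The quoted rates follow directly from these parameters. As a quick check, here is a small calculation, mine rather than the paper's, assuming the 125MHz clock, the 16-bit data pins, and the 3/3/1-cycle precharge/activate/column timing:

```c
#include <stdio.h>

int main(void)
{
    const double clock_hz      = 125e6; /* SDRAM clock          */
    const double bytes_per_col = 2;     /* 16-bit data pins     */
    const int    t_pre = 3, t_act = 3, t_col = 1; /* cycles     */

    /* Peak: one 16-bit column access every cycle. */
    double peak = clock_hz * bytes_per_col;

    /* Worst case: every access opens a new row in the same bank,
     * costing precharge + activate + column = 7 cycles per word. */
    double worst = clock_hz * bytes_per_col / (t_pre + t_act + t_col);

    printf("peak  = %.0f MB/s\n", peak  / 1e6); /* 250 MB/s   */
    printf("worst = %.1f MB/s\n", worst / 1e6); /* ~35.7 MB/s */
    return 0;
}
```

The worst-case figure reappears in Section 5 as the random access throughput of one word every seven DRAM cycles.
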
A memory access scheduler must generate a schedule that conforms to the timing and resource constraints of these modern DRAMs. Figure 3 illustrates these constraints for the NEC SDRAM with a simplified bank state diagram and a table of operation resource utilization. Each DRAM operation makes different demands on the three DRAM resources: the internal banks, a single set of address lines, and a single set of data lines. The scheduler must ensure that the required resources are available for each DRAM operation it schedules.

[Figure 3 appears here. Panel (A) shows the simplified bank state diagram: Row Activation takes a bank from the IDLE state to the ACTIVE state, Column Access leaves it in the ACTIVE state, and Bank Precharge returns it to IDLE. Panel (B) tabulates, over cycles 1 through 4, which of the bank, address, and data resources each operation (Precharge, Activate, Read, Write) occupies.]

Figure 3. Simplified state diagram and resource utilization governing access to an internal DRAM bank.

[Figure 2 appears here. It shows internal banks 0 through N; each bank’s memory array is addressed by a row decoder, its sense amplifiers serve as the row buffer, and a column decoder connects the row buffer to the shared data pins, with a single set of address pins serving the whole device.]

Figure 2. Modern DRAM organization.

Each DRAM bank has two stable states: IDLE and ACTIVE, as shown in Figure 3A. In the IDLE state, the DRAM is precharged and ready for a row access. It will remain in this state until a row activate operation is issued to the bank. To issue a row activation, the address lines must be used to select the bank and the row being activated, as shown in Figure 3B. Row activation requires 3 cycles, during which no other operations may be issued to that bank, as indicated by the utilization of the bank resource for the duration of the operation. During that time, however, operations may be issued to other banks of the DRAM. Once the DRAM’s row activation latency has passed, the bank enters the ACTIVE state, during which the contents of the selected row are held in the bank’s row buffer. Any number of pipelined column accesses may be performed while the bank is in the ACTIVE state. To issue either a read or write column access, the address lines are required to indicate the bank and the column of the active row in that bank. A write column access requires the data to be transferred to the DRAM at the time of issue, whereas a read column access returns the requested data three cycles later. Additional timing constraints not shown in Figure 3, such as a required cycle of high impedance between reads and writes, may further restrict the use of the data pins.

The bank will remain in the ACTIVE state until a precharge operation is issued to return it to the IDLE state. The precharge operation requires the use of the address lines to indicate the bank which is to be precharged. Like row activation, the precharge operation utilizes the bank resource for 3 cycles, during which no new operations may be issued to that bank. Again, operations may be issued to other banks during this time. After the DRAM’s precharge latency, the bank is returned to the IDLE state and is ready for a new row activation operation. Frequently, there are also timing constraints that govern the minimum latency between a column access and a subsequent precharge operation. DRAMs typically also support column accesses with automatic precharge, which implicitly precharges the DRAM bank as soon as possible after the column access.

The shared address and data resources serialize access to the different DRAM banks. While the state machines for the individual banks are independent, only a single bank can perform a transition requiring a particular shared resource each cycle. For many DRAMs, the bank, row, and column addresses share a single set of lines. Hence, the scheduler must arbitrate between precharge, row, and column operations that all need to use this single resource. Other DRAMs, such as Direct Rambus DRAMs (DRDRAMs) [3], provide separate row and column address lines (each with their own associated bank address) so that column and row accesses can be initiated simultaneously. To approach the peak data rate with serialized resources, there must be enough column accesses to each row to hide the precharge/activate latencies of other banks. Whether or not this can be achieved depends on the data reference patterns and the order in which the DRAM is accessed to satisfy those references. The need to hide the precharge/activate latency of the banks in order to sustain high bandwidth cannot be eliminated by any DRAM architecture without reducing that latency, which would likely come at the cost of decreased bandwidth or capacity, both of which are undesirable.

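The per-bank state machine and occupancy rules just described can be summarized in a few lines of code. The sketch below is an illustration written for this text, not the paper's controller logic; it models only the bank resource and the 3-cycle NEC SDRAM latencies, omitting the shared address and data line constraints (column accesses are treated as single-cycle and pipelined):

```c
#include <stdio.h>
#include <stdbool.h>

#define T_PRECHARGE 3
#define T_ACTIVATE  3

enum bank_state { BANK_IDLE, BANK_ACTIVE };

struct bank {
    enum bank_state state;
    int row;         /* active row; valid only when BANK_ACTIVE   */
    int busy_until;  /* cycle at which the bank resource frees up */
};

/* A row may be activated only in an idle bank that is not busy. */
static bool try_activate(struct bank *b, int row, int now)
{
    if (b->state != BANK_IDLE || now < b->busy_until) return false;
    b->state = BANK_ACTIVE;
    b->row = row;
    b->busy_until = now + T_ACTIVATE;  /* occupied for 3 cycles */
    return true;
}

/* A column access requires the target row to be the active row. */
static bool try_column(const struct bank *b, int row, int now)
{
    return b->state == BANK_ACTIVE && b->row == row
        && now >= b->busy_until;
}

/* A precharge returns an active bank to IDLE (3-cycle occupancy). */
static bool try_precharge(struct bank *b, int now)
{
    if (b->state != BANK_ACTIVE || now < b->busy_until) return false;
    b->state = BANK_IDLE;
    b->busy_until = now + T_PRECHARGE;
    return true;
}

int main(void)
{
    struct bank b = { BANK_IDLE, -1, 0 };

    printf("activate row 5 @0: %d\n", try_activate(&b, 5, 0)); /* 1 */
    printf("column   row 5 @1: %d\n", try_column(&b, 5, 1));   /* 0: still activating */
    printf("column   row 5 @3: %d\n", try_column(&b, 5, 3));   /* 1 */
    printf("precharge      @3: %d\n", try_precharge(&b, 3));   /* 1 */
    printf("activate row 7 @4: %d\n", try_activate(&b, 7, 4)); /* 0: precharging */
    return 0;
}
```
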
[Figure 4 appears here. Incoming memory references are sorted into per-bank pending reference storage (Bank 0 through Bank N); each entry holds V, L/S, Row, Col, Data, and State fields. Each bank has its own precharge manager (Precharge0 ... PrechargeN) and row arbiter (Row Arbiter0 ... Row ArbiterN), while a single column arbiter is shared by all banks. The selected operations feed a single address arbiter, which issues DRAM operations.]

Figure 4. Memory access scheduler architecture.

3 Memory Access Scheduling

Memory access scheduling is the process of ordering the DRAM operations (bank precharge, row activation, and column access) necessary to complete the set of currently pending memory references. Throughout the paper, the term operation denotes a command, such as a row activation or a column access, issued by the memory controller to the DRAM. Similarly, the term reference denotes a memory reference generated by the processor, such as a load or store to a memory location. A single reference generates one or more memory operations depending on the schedule.

Given a set of pending memory references, a memory access scheduler may choose one or more row, column, or precharge operations each cycle, subject to resource constraints, to advance one or more of the pending references. The simplest, and most common, scheduling algorithm only considers the oldest pending reference, so that references are satisfied in the order that they arrive. If it is currently possible to make progress on that reference by performing some DRAM operation, then the memory controller makes the appropriate access. While this does not require a complicated access scheduler in the memory controller, it is clearly inefficient, as illustrated in Figure 1 of the Introduction.

If the DRAM is not ready for the operation required by the oldest pending reference, or if that operation would leave available resources idle, it makes sense to consider operations for other pending references. Figure 4 shows the structure of a more sophisticated access scheduler. As memory references arrive, they are allocated storage space while they await service from the memory access scheduler. In the figure, references are initially sorted by DRAM bank. Each pending reference is represented by six fields: valid (V), load/store (L/S), address (Row and Col), data, and whatever additional state is necessary for the scheduling algorithm. Examples of state that can be accessed and modified by the scheduler are the age of the reference and whether or not that reference targets the currently active row. In practice, the pending reference storage could be shared by all the banks (with the addition of a bank address field) to allow dynamic allocation of that storage at the cost of increased logic complexity in the scheduler.

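A pending-reference entry of this kind might be declared as follows. This is a hypothetical layout for illustration: the field widths and the two scheduler-state fields are assumptions based on the examples in the text, and the bank field implements the shared-storage variant just mentioned.

```c
#include <stdint.h>
#include <stdbool.h>

/* One pending-reference entry, mirroring the fields of Figure 4. */
struct pending_ref {
    bool     valid;     /* V: entry holds a live reference         */
    bool     is_store;  /* L/S: load (0) or store (1)              */
    uint8_t  bank;      /* only for shared, dynamically allocated
                           storage across banks                    */
    uint16_t row;       /* row address within the bank             */
    uint16_t col;       /* column address within the row           */
    uint32_t data;      /* store data, or a tag for load returns   */
    /* Scheduler state; the text names these two as examples:      */
    uint16_t age;       /* cycles waiting, for aging/priority      */
    bool     row_hit;   /* targets the bank's currently active row */
};
```
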
As shown in Figure 4, each bank has a precharge manager and a row arbiter. The precharge manager simply decides when its associated bank should be precharged. Similarly, the row arbiter for each bank decides which row, if any, should be activated when that bank is idle. A single column arbiter is shared by all the banks. The column arbiter grants the shared data line resources to a single column access out of all the pending references to all of the banks. Finally, the precharge managers, row arbiters, and column arbiter send their selected operations to a single address arbiter, which grants the shared address resources to one or more of those operations.

The precharge managers, row arbiters, and column arbiter can use several different policies to select DRAM operations, as enumerated in Table 1. The combination of policies used by these units, along with the address arbiter’s policy, determines the memory access scheduling algorithm. The address arbiter must decide which of the selected precharge, activate, and column operations to perform, subject to the constraints of the address line resources. As with all of the other scheduling decisions, the in-order or priority policies can be used by the address arbiter to make this selection. Additional policies that can be used are those that select precharge operations first, row operations first, or column operations first. A column-first scheduling policy would reduce the latency of references to active rows, whereas a precharge-first or row-first scheduling policy would increase the amount of bank parallelism.

If the address resources are not shared, it is possible for both a precharge operation and a column access to the same bank to be selected. This is likely to violate the timing constraints of the DRAM. Ideally, this conflict can be handled by having the column access automatically precharge the bank upon completion, which is supported by most modern DRAMs.

Table 1. Scheduling policies for the precharge managers, row arbiters, and column arbiter. Each entry lists the policy, the arbiters to which it applies, and its description.

in-order (precharge, row, and column): A DRAM operation will only be performed if it is required by the oldest pending reference. While used by almost all memory controllers today, this policy yields poor performance compared to policies that look ahead in the reference stream to better utilize DRAM resources.

priority (precharge, row, and column): The operation(s) required by the highest priority ready reference(s) are performed. Three possible priority schemes include: ordered, older references are given higher priority; age-threshold, references older than some threshold age gain increased priority; and load-over-store, load references are given higher priority. Age-threshold prevents starvation while allowing greater reordering flexibility than ordered. Load-over-store decreases load latency to minimize processor stalling on stream loads.

open (precharge): A bank is only precharged if there are pending references to other rows in the bank and there are no pending references to the active row. The open policy should be employed if there is significant row locality, making it likely that future references will target the same row as previous references did.

closed (precharge): A bank is precharged as soon as there are no more pending references to the active row. The closed policy should be employed if it is unlikely that future references will target the same row as the previous set of references.

most pending (row and column): The row or column access to the row with the most pending references is selected. This allows rows to be activated that will have the highest ratio of column to row accesses, while waiting for other rows to accumulate more pending references. By selecting the column access to the most demanded row, that bank will be freed up as soon as possible to allow other references to make progress. This policy can be augmented by one of the priority schemes described above to prevent starvation.

fewest pending (column): The fewest pending policy selects the column access to the row targeted by the fewest pending references. This minimizes the time that rows with little demand remain active, allowing references to other rows in that bank to make progress sooner. A weighted combination of the fewest pending and most pending policies could also be used to select a column access. This policy can also be augmented by one of the priority schemes described above to prevent starvation.

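To illustrate how two of these policies diverge in practice, the sketch below, an invented example rather than the paper's implementation, realizes the open and closed precharge policies over a trimmed-down pending-reference queue:

```c
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

struct pref { bool valid; int bank, row; };  /* trimmed-down entry */

/* Should the precharge manager for `bank` (active row `active_row`)
 * issue a precharge?  closed: yes, once the active row has no pending
 * references.  open: additionally require a pending reference to some
 * other row in the bank.                                              */
static bool should_precharge(const struct pref *q, size_t n,
                             int bank, int active_row, bool open_policy)
{
    bool row_live = false, other_wanted = false;
    for (size_t i = 0; i < n; i++) {
        if (!q[i].valid || q[i].bank != bank) continue;
        if (q[i].row == active_row) row_live = true;
        else                        other_wanted = true;
    }
    if (row_live) return false;     /* never close a still-useful row */
    return open_policy ? other_wanted : true;
}

int main(void)
{
    struct pref q[] = { {true, 0, 7} };  /* one reference: bank 0, row 7 */

    /* Bank 0 holds row 3 open and row 7 is wanted: both policies close. */
    printf("bank0 open:   %d\n", should_precharge(q, 1, 0, 3, true));  /* 1 */
    printf("bank0 closed: %d\n", should_precharge(q, 1, 0, 3, false)); /* 1 */

    /* Bank 1 holds row 5 open and nothing is pending: only the closed
     * policy precharges; open waits in case row 5 is referenced again. */
    printf("bank1 open:   %d\n", should_precharge(q, 1, 1, 5, true));  /* 0 */
    printf("bank1 closed: %d\n", should_precharge(q, 1, 1, 5, false)); /* 1 */
    return 0;
}
```
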
4 Experimental Setup

Streaming media data types do not cache well, so they require other types of support to improve memory performance. In a stream (or vector) processor, the stream transfer bandwidth, rather than the latency of any individual memory reference, drives processor performance. A streaming media processing system, therefore, is a prime candidate for memory access scheduling. To evaluate the performance impact of memory access scheduling on media processing, a streaming media processor was simulated running typical media processing applications.

4.1 Stream Processor Architecture

Media processing systems typically do not cache streaming media data types, because modern cache hierarchies cannot handle them efficiently [10]. In a media computation on long streams of data, the same operations are performed repeatedly on consecutive stream elements, and the stream elements are discarded after the operations are performed. These streams do not cache well because they lack temporal locality (stream elements are usually only referenced once) and they have a large cache footprint, which makes it likely that they will interfere with other data in the cache. In many media processing systems, stream accesses bypass the cache so as not to interfere with other data that does cache well. Many streams are accessed sequentially, so prefetching streams into the cache can sometimes be effective at improving processor performance [15]. However, this is an inefficient way to provide storage for streaming data because address translation is required on every reference, accesses are made with long addresses, tag overhead is incurred in the cache, and conflicts may evict previously fetched data.

The Imagine stream processor [16] employs a 64KB stream register file (SRF), rather than a cache, to capture the reference locality of streams. Entire streams are transferred between the DRAMs and the SRF. This is more efficient than a cache because a single instruction, rather than many explicit instructions, can be used to transfer a stream of data to or from memory.

Stream memory transfers (similar to vector memory transfers) are independent operations that are isolated from computation. Therefore, the memory system can be loading streams for the next set of computations and storing streams for the previous set of computations while the current set of computations is occurring. A computation cannot commence until all of the streams it requires are present in the stream register file. The Imagine streaming memory system consists of a pair of address generators, four interleaved memory bank controllers, and a pair of reorder buffers that place stream data in the SRF in the correct order. All of these units are on the same chip as the Imagine processor core.

[Figure 5 appears here. Within a memory bank controller, references arriving from Address Generators 0 and 1 enter holding buffers; an interface feeds the MSHRs, which forward new references to the bank buffer; the memory controller’s memory access scheduler issues operations to the off-chip DRAM; completed loads return through a reply buffer toward Reorder Buffers 0 and 1. All units except the DRAM are on the Imagine processor chip.]

Figure 5. Memory bank controller architecture.

The address generators support three addressing modes: constant stride, indirect, and bit-reversed. The address generators may generate memory reference streams of any length, as long as the data fits in the SRF. For constant stride references, the address generator takes a base, stride, and length, and computes successive addresses by incrementing the base address by the stride. For indirect references, the address generator takes a base address and an index stream from the SRF and calculates addresses by adding each index to the base address. Bit-reversed addressing is used for FFT memory references and is similar to constant stride addressing, except that bit-reversed addition is used to calculate addresses.

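The three modes can be sketched as address formulas. The code below is illustrative rather than Imagine's actual generator logic; in particular, it renders bit-reversed addressing by bit-reversing the element index for a power-of-two-length stream, a common equivalent of the bit-reversed addition the text describes:

```c
#include <stdio.h>
#include <stdint.h>

/* Constant stride: addr[i] = base + i * stride. */
static uint32_t stride_addr(uint32_t base, uint32_t stride, uint32_t i)
{
    return base + i * stride;
}

/* Indirect: addr[i] = base + index stream element (from the SRF). */
static uint32_t indirect_addr(uint32_t base, const uint32_t *idx, uint32_t i)
{
    return base + idx[i];
}

/* Reverse the low `bits` bits of i, for a 2^bits-point FFT stream. */
static uint32_t bitrev(uint32_t i, unsigned bits)
{
    uint32_t r = 0;
    for (unsigned b = 0; b < bits; b++)
        r = (r << 1) | ((i >> b) & 1);
    return r;
}

static uint32_t bitrev_addr(uint32_t base, uint32_t word, uint32_t i,
                            unsigned bits)
{
    return base + bitrev(i, bits) * word;
}

int main(void)
{
    uint32_t idx[] = { 12, 0, 8 };  /* an index stream from the SRF */
    for (uint32_t i = 0; i < 3; i++)
        printf("stride 0x%x  indirect 0x%x  bitrev 0x%x\n",
               stride_addr(0x1000, 4, i),
               indirect_addr(0x2000, idx, i),
               bitrev_addr(0x3000, 4, i, 3)); /* 8-point example */
    return 0;
}
```
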
Figure 5 shows the architecture of the memory bank controllers.² References arriving from the address generators are stored in a small holding buffer until they can be processed. Despite the fact that there is no cache, a set of registers similar in function to the miss status holding registers (MSHRs) of a non-blocking cache [9] exists to keep track of in-flight references and to perform read and write coalescing. When a reference arrives for a location that is already the target of another in-flight reference, the MSHR entry for that reference is updated to reflect that this reference will be satisfied by the same DRAM access. When a reference to a location that is not already the target of another in-flight reference arrives, a new MSHR is allocated and the reference is sent to the bank buffer. The bank buffer corresponds directly to the pending reference storage in Figure 4, although the storage for all of the internal DRAM banks is combined into one 32-entry buffer. The memory controller schedules DRAM accesses to satisfy the pending references in the bank buffer and returns completed accesses to the MSHRs. The MSHRs send completed loads to the reply buffer, where they are held until they can be sent back to the reorder buffers. As the name implies, the reorder buffers receive out of order references and transfer the data to the SRF in order.

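The coalescing behavior of the MSHRs can be captured by a small lookup-or-allocate routine. The following is a schematic illustration whose names and sizes are assumptions, not Imagine's: a reference to an address already in flight attaches to the existing entry, while a new address claims a free MSHR and would then be forwarded to the bank buffer:

```c
#include <stdio.h>
#include <stdbool.h>

#define NMSHR 8

struct mshr {
    bool     valid;
    unsigned addr;      /* target word address                   */
    unsigned waiters;   /* references coalesced onto this access */
};

/* Allocate or coalesce: returns the MSHR index serving `addr`,
 * or -1 if all MSHRs are in use (the reference must stall).     */
static int mshr_lookup(struct mshr *m, unsigned addr)
{
    int free_slot = -1;
    for (int i = 0; i < NMSHR; i++) {
        if (m[i].valid && m[i].addr == addr) {
            m[i].waiters++;          /* coalesce: same DRAM access */
            return i;
        }
        if (!m[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {
        m[free_slot].valid   = true; /* new in-flight reference;
                                        forward to the bank buffer */
        m[free_slot].addr    = addr;
        m[free_slot].waiters = 1;
        return free_slot;
    }
    return -1;
}

int main(void)
{
    struct mshr m[NMSHR] = { 0 };
    printf("%d\n", mshr_lookup(m, 0x40));          /* 0: new access */
    printf("%d\n", mshr_lookup(m, 0x80));          /* 1: new access */
    printf("%d\n", mshr_lookup(m, 0x40));          /* 0: coalesced  */
    printf("waiters on 0x40: %u\n", m[0].waiters); /* 2             */
    return 0;
}
```
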
In this streaming memory system, memory consistency is maintained in two ways: conflicting memory stream references are issued in dependency order, and the MSHRs ensure that references to the same address complete in the order that they arrive. This means that a stream load that follows a stream store to overlapping locations may be issued as soon as the address generators have sent all of the store’s references to the memory banks.

For the simulations, it was assumed that the processor frequency was 500 MHz and that the DRAM frequency was 125 MHz.³ At this frequency, Imagine has a peak computation rate of 20GFLOPS on single precision floating point computations and 20GOPS on 32-bit integer computations. Each memory bank controller has two external NEC µPD45128163 SDRAM chips attached to it to provide a column access width of 32 bits, which is the word size of the Imagine processor. These SDRAM chips were briefly described earlier, and a complete specification can be found in [13]. The peak bandwidth of the SDRAMs connected to each memory bank controller is 500MB/s, yielding a total peak memory bandwidth of 2GB/s in the system.

4.2 Benchmarks

The experiments were run on a set of microbenchmarks and five media processing applications. Table 2 describes the microbenchmarks above the double line, and the applications below the double line. For the microbenchmarks, no computations are performed outside of the address generators. This allows memory references to be issued at their maximum throughput, constrained only by the buffer storage in the memory banks. For the applications, the simulations were run both with the applications’ computations and without. When running just the memory traces, dependencies were maintained by assuming the computation occurred at the appropriate times but was instantaneous. The application results show the performance improvements that can be gained by using memory access scheduling with a modern media processor. The application traces, with instantaneous computation, show the potential of these scheduling methods as processing power increases and the applications become entirely limited by memory bandwidth.

2. Note that these are external memory banks, each composed of separate DRAM chips, in contrast to the internal memory banks within each DRAM chip.

3. This corresponds to the expected clock frequency of the Imagine stream processor and the clock frequency of existing SDRAM parts.

Table 2. Benchmarks.

Unit Load: Unit stride load stream accesses with parallel streams to different rows in different internal DRAM banks.
Unit: Unit stride load and store stream accesses with parallel streams to different rows in different internal DRAM banks.
Unit Conflict: Unit stride load and store stream accesses with parallel streams to different rows in the same internal DRAM banks.
Constrained Random: Random access load and store streams constrained to a 64KB range.
Random: Random access load and store streams to the entire address space.
==========
FFT: Ten consecutive 1024-point real Fast Fourier Transforms.
Depth: Stereo depth extraction from a pair of 320x240 8-bit grayscale images.(a)
QRD: QR matrix decomposition of a 192x96 element matrix.(b)
MPEG: MPEG2 encoding of three frames of 360x288 24-bit color video.
Tex: Triangle rendering of a 720x720 24-bit color image with texture mapping.(c)

a. Depth performs depth extraction using Kanade’s algorithm [8]. Only two stereo images are used in the benchmark, as opposed to the multiple cameras of the video-rate stereo machine.
b. QRD uses blocked Householder transformations to generate an orthogonal Q matrix and an upper triangular R matrix such that Q·R is equal to the input matrix.
c. Tex applies modelview, projection, and viewport transformations on its unmeshed input triangle stream and performs perspective-correct bilinear interpolated texture mapping on its generated fragments. A single frame of the SPECviewperf 6.1.1 Advanced Visualizer benchmark image was rendered.

5 Experimental Results

A memory controller that performs no access reordering will serve as a basis for comparison. This controller performs no access scheduling, as it uses an in-order policy, described in Table 1, for all decisions: a column access will only be performed for the oldest pending reference, a bank will only be precharged if necessary for the oldest pending reference, and a row will only be activated if it is needed by the oldest pending reference. No other references are considered in the scheduling decision. This algorithm, or slight variations such as automatically precharging the bank when a cache line fetch is completed, can commonly be found in systems today.

The gray bars of Figure 6 show the performance of the benchmarks using the baseline in-order access scheduler. Unsurprisingly, unit load performs very well with no access scheduling, achieving 97% of the peak bandwidth (2GB/s) of the DRAMs. The 3% overhead is the combined result of infrequent precharge/activate cycles and the start-up/shut-down delays of the streaming memory system.

The 14% drop in sustained bandwidth from the unit load benchmark to the unit benchmark shows the performance degradation imposed by forcing intermixed load and store references to complete in order. Each time the references switch between loads and stores, a cycle of high impedance must be left on the data pins, decreasing the sustainable bandwidth. The unit conflict benchmark further shows the penalty of swapping back and forth between rows in the DRAM banks, which drops the sustainable bandwidth down to 51% of the peak. The random benchmarks sustain about 15% of the bandwidth of the unit load benchmark. This loss roughly corresponds to the degradation incurred by performing accesses with a throughput of one word every seven DRAM cycles (the random access throughput of the SDRAM) compared to a throughput of one word every DRAM cycle (the column access throughput of the SDRAM), since 1/7 is roughly 14%.

The applications’ behavior closely mimics that of their associated microbenchmarks. The QRD and MPEG traces include many unit and small constant stride accesses, leading to a sustained bandwidth that approaches that of the unit benchmark. The Depth trace consists almost exclusively of constant stride accesses, but dependencies limit the number of simultaneous stream accesses that can occur. The FFT trace is composed of constant stride loads and bit-reversed stores. The bit-reversed accesses sustain less bandwidth than constant stride accesses because they generate a sequence of references that target a single memory bank, then a sequence of references that target the next memory bank, and so on. This results in lower bandwidth than access patterns that more evenly distribute the references across the four memory banks. Finally, the Tex trace includes constant stride accesses, but is dominated by texture accesses, which are essentially random within the texture memory space. These texture accesses lead to the lowest sustained bandwidth of the applications. Note that for the applications, memory bandwidth corresponds directly to performance because the applications make the same number of memory references regardless of how they are scheduled.
