A Performance Comparison of Contemporary DRAM Architectures

Vinodh Cuppu, Bruce Jacob
Dept. of Electrical & Computer Engineering
University of Maryland, College Park
{ramvinod,blj}@eng.umd.edu

Brian Davis, Trevor Mudge
Dept. of Electrical Engineering & Computer Science
University of Michigan, Ann Arbor
{btdavis,tnm}@eecs.umich.edu
ABSTRACT

In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, each evaluated in a small system organization. These small-system organizations correspond to workstation-class computers and use on the order of 10 DRAM chips. The study covers Fast Page Mode, Extended Data Out, Synchronous, Enhanced Synchronous, Synchronous Link, Rambus, and Direct Rambus designs. Our simulations reveal several things: (a) current advanced DRAM technologies are attacking the memory bandwidth problem but not the latency problem; (b) bus transmission speed will soon become a primary factor limiting memory-system performance; (c) the post-L2 address stream still contains significant locality, though it varies from application to application; and (d) as we move to wider buses, row access time becomes more prominent, making it important to investigate techniques to exploit the available locality to decrease access time.
`
1  INTRODUCTION
`
`In response to the growing gap between memory access time and
`processor speed, DRAM manufacturers have created several new
`DRAM architectures. This paper presents a simulation-based perfor-
`mance study of a representative group, evaluating each in terms of
`its effect on total execution time. We simulate the performance of
`seven DRAM architectures: Fast Page Mode [35], Extended Data
`Out [16], Synchronous [17], Enhanced Synchronous [10], Synchro-
`nous Link [38], Rambus [31], and Direct Rambus [32]. While there
`are a number of academic proposals for new DRAM designs, space
`limits us to covering only existent commercial parts. To obtain accu-
`rate memory-request timing for an aggressive out-of-order proces-
`sor, we integrate our code into the SimpleScalar tool set [4].
`This paper presents a baseline study of a small-system DRAM
`organization: these are systems with only a handful of DRAM chips
`(0.1–1GB). We do not consider large-system DRAM organizations
`with many gigabytes of storage that are highly interleaved. The
`study asks and answers the following questions:
`• What is the effect of improvements in DRAM technology on the
`memory latency and bandwidth problems?
`Contemporary techniques for improving processor performance
`and tolerating memory latency are exacerbating the memory
`bandwidth problem [5]. Our results show that current DRAM
`architectures are attacking exactly this problem: the most recent
`technologies (SDRAM, ESDRAM, and Rambus) have reduced
`the stall time due to limited bandwidth by a factor of three
`compared to earlier DRAM architectures. However, the
`memory-latency component of overhead has not improved.
`
`• Where is time spent in the primary memory system (the memory
`system beyond the cache hierarchy, but not including secondary
`[disk] or tertiary [backup] storage)? What is the performance
`benefit of exploiting the page mode of contemporary DRAMs?
`For the newer DRAM designs, the time to extract the required
`data from the sense amps/row caches for transmission on the
`memory bus is the largest component in the average access time,
`though page mode allows this to be overlapped with column
`access and the time to transmit the data over the memory bus.
`• How much locality is there in the address stream that reaches the
`primary memory system?
`The stream of addresses that miss the L2 cache contains a
`significant amount of locality, as measured by the hit-rates in the
DRAM row buffers. The hit rates for the applications studied range from 8% to 95%, with a mean hit rate of 40% for a 1MB L2 cache.
`(This does not include hits to the row buffers when making
`multiple DRAM requests to read one cache-line.)
`We also make several observations. First, there is a one-time trade-
`off between cost, bandwidth, and latency: to a point, latency can be
`decreased by ganging together multiple DRAMs into a wide struc-
`ture. This trades dollars for bandwidth that reduces latency because
`a request size is typically much larger than the DRAM transfer
`width. Page mode and interleaving are similar optimizations that
`work because a request size is typically larger than the bus width.
`However, the latency benefits are limited by bus and DRAM speeds:
`to get further improvements, one must run the DRAM core and bus
`at faster speeds. Current memory busses are adequate for small sys-
`tems but are likely inadequate for large ones. Embedded DRAM [5,
`19, 37] is not a near-term solution, as its performance is poor on
`high-end workloads [3]. Faster buses are more likely solutions—wit-
`ness the elimination of the slow intermediate memory bus in future
`systems [12]. Another solution is to internally bank the memory
`array into many small arrays so that each can be accessed very
`quickly, as in the MoSys Multibank DRAM architecture [39].
`Second, widening buses will present new optimization opportu-
`nities. Each application exhibits a different degree of locality and
`therefore benefits from page mode to a different degree. As buses
`widen, this effect becomes more pronounced, to the extent that dif-
`ferent applications can have average access times that differ by 50%.
`This is a minor issue considering current bus technology. However,
`future bus technologies will expose the row access as the primary
`performance bottleneck, justifying the exploration of mechanisms to
`exploit locality to guarantee hits in the DRAM row buffers: e.g. row-
`buffer victim caches, prediction mechanisms, etc.
`Third, while buses as wide as the L2 cache yield the best mem-
`ory latency, they cannot halve the latency of a bus half as wide. Page
`mode overlaps the components of DRAM access when making mul-
`tiple requests to the same row. If the bus is as wide as a request, one
`
Figure 2: FPM Read Timing. Fast page mode allows the DRAM controller to hold a row constant and receive multiple columns in rapid succession.
`
`varying several CPU-level parameters such as issue width, cache
`size & organization, number of processors, etc. This study focuses
`on the performance behavior of different DRAM architectures.
`
3  BACKGROUND
`
`A Random Access Memory (RAM) that uses a single transistor-
`capacitor pair for each binary value (bit) is referred to as a Dynamic
`Random Access Memory or DRAM. This circuit is dynamic
`because leakage requires that the capacitor be periodically refreshed
`for information retention. Initially, DRAMs had minimal I/O pin
`counts because the manufacturing cost was dominated by the num-
`ber of I/O pins in the package. Due largely to a desire to use stan-
`dardized parts, the initial constraints limiting the I/O pins have had a
`long-term effect on DRAM architecture: the address pins for most
`DRAMs are still multiplexed, potentially limiting performance. As
`the standard DRAM interface has become a performance bottleneck,
`a number of “revolutionary” proposals [26] have been made. In most
`cases, the revolutionary portion is the interface or access mecha-
`nism, while the DRAM core remains essentially unchanged.
`
3.1  The Conventional DRAM
`
`The addressing mechanism of early DRAM architectures is still uti-
`lized, with minor changes, in many of the DRAMs produced today.
`In this interface, shown in Figure 1, the address bus is multiplexed
`between row and column components. The multiplexed address bus
`uses two control signals—the row and column address strobe sig-
`nals, RAS and CAS respectively—which cause the DRAM to latch
`the address components. The row address causes a complete row in
`the memory array to propagate down the bit lines to the sense amps.
`The column address selects the appropriate data subset from the
`sense amps and causes it to be driven to the output pins.
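As a concrete, purely illustrative sketch of this split addressing, the fragment below divides a flat address into row and column halves and issues them in sequence under RAS and CAS. The geometry is taken from the 64-Mbit FPM entry of Table 1; the function names and the print-based "signals" are our own simplifications and are not part of any simulator described in this paper.

```python
# Illustrative sketch (not from the paper): splitting a byte address into the
# row and column halves that are driven over a multiplexed address bus.
# Geometry matches the 64-Mbit FPM part of Table 1 (4096 rows x 1024 columns,
# 16-bit transfer width); other parts differ.

ROWS, COLS = 4096, 1024          # row/column counts (Table 1, FPM DRAM)
TRANSFER_BYTES = 2               # 16-bit data interface

def split_address(byte_addr: int) -> tuple[int, int]:
    """Return (row, column) for a byte address within one DRAM chip."""
    word = byte_addr // TRANSFER_BYTES
    column = word % COLS
    row = (word // COLS) % ROWS
    return row, column

def conventional_read(byte_addr: int) -> None:
    row, col = split_address(byte_addr)
    # RAS: latch the row address; the whole row propagates to the sense amps.
    print(f"assert RAS, row address = {row:#06x}")
    # CAS: latch the column address; the selected subset is driven to the pins.
    print(f"assert CAS, column address = {col:#05x}")
    print("data valid on DQ pins")

conventional_read(0x2A5F3C)
```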
`
3.2  Fast Page Mode DRAM (FPM DRAM)
`
`Fast-Page Mode DRAM implements page mode, an improvement
`on conventional DRAM in which the row-address is held constant
`and data from multiple columns is read from the sense amplifiers.
`The data held in the sense amps form an “open page” that can be
`accessed relatively quickly. This speeds up successive accesses to
`
Figure 1: Conventional DRAM block diagram. The conventional DRAM uses a split addressing mechanism still found in most DRAMs today.
`
cannot exploit this overlap. For cost considerations, having at most an N/2-bit bus, N being the L2 cache width, might be a good choice.
Fourth, critical-word-first does not mix well with burst mode. Critical-word-first is a strategy that requests a block of data potentially out of address order; burst mode delivers data in a fixed but redefinable order. A burst-mode DRAM can thus have longer latencies in real systems, even if its end-to-end latency is low.
`Finally, the choice of refresh mechanism can significantly alter
`the average memory access time. For some benchmarks and some
`refresh organizations, the amount of time spent waiting for a DRAM
`in refresh mode accounted for 50% of the total latency.
`As one might expect, our results and conclusions are dependent
`on our system specifications, which we chose to be representative of
`mid- to high-end workstations: a 100MHz 128-bit memory bus, an
`eight-way superscalar out-of-order CPU, lockup-free caches, and a
`small-system DRAM organization with ~10 DRAM chips.
`
2  RELATED WORK
`
`Burger, Goodman, and Kagi quantified the effect on memory behav-
`ior of high-performance latency-reducing or latency-tolerating tech-
`niques such as
`lockup-free caches, out-of-order execution,
`prefetching, speculative loads, etc. [5]. They concluded that to hide
`memory latency, these techniques often increase demands on mem-
`ory bandwidth. They classify memory stall cycles into two types:
`those due to lack of available memory bandwidth, and those due
`purely to latency. This is a useful classification, and we use it in our
`study. This study differs from theirs in that we focus on the access
`time of only the primary memory system, while their study com-
`bines all memory access time, including the L1 and L2 caches. Their
`study focuses on the behavior of latency-hiding techniques, while
`this study focuses on the behavior of different DRAM architectures.
`Several marketing studies compare the memory latency and
`bandwidth available from different DRAM architectures [7, 29, 30].
`This paper builds on these studies by looking at a larger assortment
`of DRAM architectures, measuring DRAM impact on total applica-
`tion performance, decomposing the memory access time into differ-
`ent components, and measuring the hit rates in the row buffers.
`Finally, there are many studies that measure system-wide perfor-
`mance, including that of the primary memory system [1, 2, 9, 18, 23,
`24, 33, 34]. Our results resemble theirs, in that we obtain similar fig-
`ures for the fraction of time spent in the primary memory system.
`However, these studies have different goals from ours, in that they
`are concerned with measuring the effects on total execution time of
`

`

Figure 3: Extended Data Out (EDO) DRAM block diagram. EDO adds a latch on the output that allows CAS to cycle more quickly than in FPM.

Figure 4: EDO Read Timing. The output latch in EDO DRAM allows more overlap between column access and data transfer than in FPM.
`
`the same row of the DRAM core. Figure 2 gives the timing for FPM
`reads. The labels show the categories to which the portions of time
`are assigned in our simulations. Note that page mode is supported in
`all the DRAM architectures in this study.
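The benefit of an open page can be illustrated with a minimal timing model, sketched below using the FPM numbers from Table 1: an access that hits the open row pays only column-access and transfer time, while a miss also pays precharge and row-access time. This is only an illustration; the simulator described in Section 4 additionally models the overlap between column access and data transfer as well as bus effects.

```python
# Minimal sketch of a page-mode timing model: an access to the currently open
# row pays only column + transfer time, while a miss pays precharge + row +
# column + transfer. Numbers are the FPM entries from Table 1; the paper's
# simulator additionally overlaps column access with the previous transfer.

PRECHARGE, ROW_ACC, COL_ACC, XFER = 40, 15, 30, 15   # ns

class FPMBank:
    def __init__(self):
        self.open_row = None

    def access(self, row: int) -> int:
        if row == self.open_row:
            return COL_ACC + XFER                       # row-buffer hit
        self.open_row = row
        return PRECHARGE + ROW_ACC + COL_ACC + XFER     # row-buffer miss

bank = FPMBank()
print(bank.access(7))   # 100 ns: first access misses
print(bank.access(7))   # 45 ns: same row, page-mode hit
print(bank.access(9))   # 100 ns: different row, miss again
```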
`
3.3  Extended Data Out DRAM (EDO DRAM)
`
`Extended Data Out DRAM, sometimes referred to as hyper-page
`mode DRAM, adds a latch between the sense-amps and the output
`pins of the DRAM, shown in Figure 3. This latch holds output pin
`state and permits the CAS to rapidly de-assert, allowing the memory
`array to begin precharging sooner. In addition, the latch in the output
`path also implies that the data on the outputs of the DRAM circuit
`remain valid longer into the next clock phase. Figure 4 gives the tim-
`ing for an EDO read.
`
3.4  Synchronous DRAM (SDRAM)
`
`Conventional, FPM, and EDO DRAM are controlled asynchro-
`nously by the processor or the memory controller; the memory
`latency is thus some fractional number of CPU clock cycles. An
`alternative is to make the DRAM interface synchronous such that
`the DRAM latches information to and from the controller based on a
`clock signal. A timing diagram is shown in Figure 5. SDRAM
`devices typically have a programmable register that holds a bytes-
`per-request value. SDRAM may therefore return many bytes over
`several cycles per request. The advantages include the elimination of
`the timing strobes and the availability of data from the DRAM each
`clock cycle. The underlying architecture of the SDRAM core is the
`same as in a conventional DRAM.
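A minimal sketch of the bytes-per-request idea follows: a programmable burst length causes a single column command to return several consecutive words, one per clock of the synchronous interface. The class and field names below are hypothetical and the widths are illustrative only.

```python
# Sketch of the SDRAM idea described above: a mode register holds the burst
# length, so one column command returns several consecutive data beats, one per
# clock of the synchronous interface. (The paper describes a bytes-per-request
# register; words are used here for simplicity.)

class SDRAMSketch:
    def __init__(self, burst_length: int = 8):
        self.burst_length = burst_length     # programmable beats per request

    def read_burst(self, row: int, start_col: int) -> list[str]:
        # One ACTIVATE + one READ command, then burst_length data beats,
        # delivered on successive clock cycles without further strobes.
        return [f"data[{row}][{start_col + i}]" for i in range(self.burst_length)]

dram = SDRAMSketch(burst_length=4)
print(dram.read_burst(row=12, start_col=96))
```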
`
Figure 5: SDRAM Read Operation Clock Diagram. SDRAM contains a writable register for the request length, allowing high-speed column access.

Figure 6: Rambus DRAM Read Operation. Rambus DRAMs transfer on both edges of a fast clock and can handle multiple simultaneous requests.
`
3.5  Enhanced Synchronous DRAM (ESDRAM)
`
`Enhanced Synchronous DRAM is an incremental modification to
`Synchronous DRAM that parallels the differences between FPM
`and EDO DRAM. First, the internal timing parameters of the
`ESDRAM core are faster than SDRAM. Second, SRAM row-caches
`have been added at the sense-amps of each bank. These caches pro-
`vide the kind of improved intra-row performance observed with
`EDO DRAM, allowing requests to the last accessed row to be satis-
`fied even when subsequent refreshes, precharges, or activates are
`taking place.
`
3.6  Synchronous Link DRAM (SLDRAM)
`
`RamLink is the IEEE standard (P1596.4) for a bus architecture for
`devices. Synchronous Link (SLDRAM) is an adaptation of Ram-
`Link for DRAM, and is another IEEE standard (P1596.7). Both are
`adaptations of the Scalable Coherent Interface (SCI). The SLDRAM
`specification is therefore an open standard allowing for use by ven-
`dors without licensing fees. SLDRAM uses a packet-based split
`request/response protocol. Its bus interface is designed to run at
`clock speeds of 200-600 MHz and has a two-byte-wide datapath.
`SLDRAM supports multiple concurrent transactions, provided all
`transactions reference unique internal banks. The 64Mbit SLDRAM
`devices contain 8 banks per device.
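The split request/response, multi-bank concurrency model can be sketched as below. The packet fields and the rule that at most one transaction may be outstanding per bank are simplifications we assume for illustration; the actual SLDRAM protocol defines its own packet formats and scheduling rules.

```python
# Hypothetical sketch of a packet-based split request/response protocol in the
# spirit of SLDRAM: requests and responses are separate packets, and several
# transactions may be in flight as long as they target distinct internal banks.

from collections import deque
from dataclasses import dataclass

@dataclass
class RequestPacket:
    tag: int      # lets the controller match responses to requests
    bank: int
    row: int
    col: int

class SplitTransactionChannel:
    def __init__(self) -> None:
        self.busy_banks: set[int] = set()   # banks with an outstanding transaction
        self.in_flight: deque = deque()

    def issue(self, req: RequestPacket) -> bool:
        """Send a request packet; refuse it if the target bank is busy."""
        if req.bank in self.busy_banks:
            return False
        self.busy_banks.add(req.bank)
        self.in_flight.append(req)
        return True

    def respond(self) -> RequestPacket | None:
        """Return the next response packet (oldest outstanding request)."""
        if not self.in_flight:
            return None
        req = self.in_flight.popleft()
        self.busy_banks.discard(req.bank)
        return req

chan = SplitTransactionChannel()
print(chan.issue(RequestPacket(tag=1, bank=0, row=5, col=3)))   # True
print(chan.issue(RequestPacket(tag=2, bank=0, row=9, col=0)))   # False: bank 0 busy
print(chan.issue(RequestPacket(tag=3, bank=4, row=2, col=7)))   # True
```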
`
3.7  Rambus DRAMs (RDRAM)
`
Rambus DRAMs use a one-byte-wide multiplexed address/data bus to connect the memory controller to the RDRAM devices. The bus runs at 300 MHz and transfers on both clock edges to achieve a theoretical peak of 600 Mbytes/s. Physically, each 64-Mbit RDRAM is
`
Table 1: DRAM Specifications used in simulations

DRAM type | Size   | Rows | Columns | Transfer Width | Row Buffer | Internal Banks | Speed  | Precharge | Row Access | Column Access | Data Transfer
FPM DRAM  | 64Mbit | 4096 | 1024    | 16 bits        | 16K bits   | 1              | –      | 40ns      | 15ns       | 30ns          | 15ns
EDO DRAM  | 64Mbit | 4096 | 1024    | 16 bits        | 16K bits   | 1              | –      | 40ns      | 12ns       | 30ns          | 15ns
SDRAM     | 64Mbit | 4096 | 256     | 16 bits        | 4K bits    | 4              | 100MHz | 20ns      | 30ns       | 30ns          | 10ns
ESDRAM    | 64Mbit | 4096 | 256     | 16 bits        | 4K bits    | 4              | 100MHz | 20ns      | 20ns       | 20ns          | 10ns
SLDRAM    | 64Mbit | 1024 | 128     | 64 bits        | 8K bits    | 8              | 200MHz | 30ns      | 40ns       | 40ns          | 10ns
RDRAM     | 64Mbit | 1024 | 256     | 64 bits        | 16K bits   | 4              | 300MHz | 26.66ns   | 40ns       | 23.33ns       | 13.33ns
DRDRAM    | 64Mbit | 512  | 64      | 128 bits       | 4K bits    | 16             | 400MHz | 20/40ns   | 17.5ns     | 30ns          | 10ns
`
Table 2: Time components in primary memory system

Component                  | Description
Row Access Time            | The time to (possibly) precharge the row buffers, present the row address, latch the row address, and read the data from the memory array into the sense amps
Column Access Time         | The time to present the column address at the address pins and latch the value
Data Transfer Time         | The time to transfer the data from the sense amps through the column muxes to the data-out pins
Data Transfer Time Overlap | The amount of time spent performing both column access and data transfer simultaneously (when using page mode, a column access can overlap with the previous data transfer for the same row). Note that, since determining the amount of overlap between column access and data transfer can be tricky in the interleaved examples, for those cases we simply call all time between the start of the first data transfer and the termination of the last column access Data Transfer Time Overlap (see Figure 8).
Refresh Time               | Amount of time spent waiting for a refresh cycle to finish
Bus Wait Time              | Amount of time spent waiting to synchronize with the 100MHz memory bus
Bus Transmission Time      | The portion of time to transmit a request over the memory bus to & from the DRAM system that is not overlapped with Column Access Time or Data Transfer Time
`
divided into 4 banks, each with its own row buffer, and hence up to 4 rows remain active or open [1]. Transactions occur on the bus using a split request/response protocol. Because the bus is multiplexed between address and data, only one transaction may use the bus during any 4-clock-cycle period, referred to as an octcycle. The protocol uses packet transactions; first an address packet is driven, then the data. Different transactions can require different numbers of octcycles, depending on the transaction type, location of the data within the device, number of devices on the channel, etc. Figure 6 gives a timing diagram for a read transaction.
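The quoted peak bandwidth and the octcycle granularity can be checked with simple arithmetic, as in the sketch below. Address-packet overhead and device effects are deliberately ignored, so this is a peak-rate estimate only.

```python
# Back-of-the-envelope check of the RDRAM numbers quoted above. The byte-wide
# bus at 300 MHz transfers on both clock edges, and an "octcycle" is a 4-clock
# window. Packet overheads (address packet, device/channel effects) are ignored,
# so this is a peak-rate sketch only.

BUS_MHZ = 300
BUS_BYTES = 1
EDGES_PER_CLOCK = 2

peak_mb_per_s = BUS_MHZ * EDGES_PER_CLOCK * BUS_BYTES        # 600 MB/s
bytes_per_octcycle = 4 * EDGES_PER_CLOCK * BUS_BYTES          # 8 bytes

L2_LINE = 128                                                 # bytes (Section 4.1)
data_octcycles = L2_LINE // bytes_per_octcycle                # 16 octcycles of data

print(peak_mb_per_s, bytes_per_octcycle, data_octcycles)
```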
`
3.8  Direct Rambus (DRDRAM)
`
Direct Rambus DRAMs use a 400 MHz 3-byte-wide channel (2 for data, 1 for addresses/commands). Like the Rambus parts, Direct Rambus parts transfer at both clock edges, implying a maximum bandwidth of 1.6 Gbytes/s. DRDRAMs are divided into 16 banks with 17 half-row buffers [2]. Each half-row buffer is shared between adjacent banks, which implies that adjacent banks cannot be active simultaneously. This organization has the result of increasing the row-buffer miss rate as compared to having one open row per bank, but it reduces the cost by reducing the die area occupied by the row buffers, compared to 16 full row buffers. A critical difference between RDRAM and DRDRAM is that because DRDRAM partitions the bus into different components, three transactions can simultaneously utilize the different portions of the DRDRAM interface.

[1] In this study, we model 64-Mbit Rambus parts, which have 4 banks and 4 open rows. Earlier 16-Mbit Rambus organizations had 2 banks and 2 open pages, and future 256-Mbit organizations may have even more.
[2] As with the previous part, we model 64-Mbit Direct Rambus, which has this organization. Future (256-Mbit) organizations may look different.
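The shared half-row-buffer constraint can be captured in a few lines: in the simplified model below, activating bank i invalidates any open row in banks i-1 and i+1, because those banks share half-row buffers with bank i. Real devices have further constraints; this sketch models only the adjacency conflict described above.

```python
# Simplified model of the Direct Rambus row-buffer organization described
# above: 16 banks share 17 half-row buffers, bank i using buffers i and i+1,
# so opening a row in bank i forces its neighbors to give up their open rows.
# This captures only the adjacency conflict, nothing else.

NUM_BANKS = 16

class DRDRAMSketch:
    def __init__(self):
        self.open_row = [None] * NUM_BANKS       # open row per bank, if any

    def activate(self, bank: int, row: int) -> None:
        for neighbor in (bank - 1, bank + 1):    # banks sharing a half-row buffer
            if 0 <= neighbor < NUM_BANKS:
                self.open_row[neighbor] = None   # neighbor loses its open row
        self.open_row[bank] = row

d = DRDRAMSketch()
d.activate(3, 100)
d.activate(4, 200)                    # closes bank 3's row: buffers are shared
print(d.open_row[3], d.open_row[4])   # None 200
```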
`
4  EXPERIMENTAL METHODOLOGY
`
For accurate timing of memory requests in a dynamically reordered instruction stream, we integrated our code into SimpleScalar, an execution-driven simulator of an aggressive out-of-order processor [4]. We calculate the DRAM access time, much of which is overlapped with instruction execution. To determine the degree of overlap, and to separate out memory stalls due to bandwidth limitations vs. latency limitations, we run two other simulations: one with perfect primary memory (zero access time) and one with a perfect bus (as wide as an L2 cache line). Following the methodology in [5], we partition the total application execution time into three components: TP, TL, and TB, which correspond to time spent processing, time spent stalling for memory due to latency, and time spent stalling for memory due to limited bandwidth. In this paper, time spent "processing" includes all activity above the primary memory system, i.e. it contains all processor execution time and L1 and L2 cache activity. Let T be the total execution time for the realistic simulation; let TU be the execution time assuming unlimited bandwidth (the results from the simulation that models cacheline-wide buses). Then TP is the time given by the simulation that models a perfect primary memory system, and we calculate TL and TB: TL = TU – TP and TB = T – TU. In addition, we consider one more component: the degree to which the processor is able to overlap memory access time with processing
`
`

`

`nization fails to take advantage of some of the newer DRAM parts
`that can handle multiple concurrent requests. 100MHz 128-bit buses
`are common for high-end machines, so this is the bus configuration
`that we model. We assume that the communication overhead is only
`one 10ns cycle in each direction.
`The DRAM/bus configurations simulated are shown in Figure 7.
`For DRAMs other than Rambus and SLDRAM, eight DRAMs are
`arranged in parallel in a DIMM-like organization to obtain a 128-bit
`bus. We assume that the memory controller has no overhead delay.
`SLDRAM, RDRAM, and DRDRAM utilize narrower, but higher
`speed busses. These DRAM architectures can be arranged in parallel
`channels, but we study them here in the context of a single-width
`DRAM bus, which is the simplest configuration. This yields some
`latency penalties for these architectures, as our simulations require
`that the controller coalesce bus packets into 128-bit chunks to be
`transmitted over the 100MHz 128-bit memory bus. To put the
`designs on even footing, we ignore the transmission time over the
`narrow DRAM channel. Because of this organization, transfer rate
`comparisons may also be deceptive, as we are transferring data from
`eight conventional DRAM (FPM, EDO, SDRAM, ESDRAM) con-
`currently, versus only a single device in the case of the narrow bus
`architectures (SLDRAM, RDRAM, DRDRAM).
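To make the coalescing step concrete, the sketch below accumulates beats arriving over a narrow DRAM channel (the 2-byte SLDRAM datapath is used as the example) into 128-bit chunks for the 100MHz 128-bit memory bus. The buffering policy shown is an assumption for illustration, not a description of the simulated controller.

```python
# Hypothetical sketch of the coalescing step described above: data arriving
# over a narrow DRAM channel is accumulated by the memory controller into
# 128-bit (16-byte) chunks before crossing the 100 MHz 128-bit memory bus.
# Widths come from the text; the buffering policy is ours, for illustration.

MEMORY_BUS_BYTES = 16         # 128-bit memory bus
CHANNEL_BYTES = 2             # 2-byte SLDRAM datapath

def coalesce(narrow_beats: list) -> list:
    """Group narrow-channel beats into full memory-bus-width chunks."""
    stream = b"".join(narrow_beats)
    return [stream[i:i + MEMORY_BUS_BYTES]
            for i in range(0, len(stream), MEMORY_BUS_BYTES)]

beats = [bytes([i] * CHANNEL_BYTES) for i in range(64)]   # one 128-byte L2 line
chunks = coalesce(beats)
print(len(chunks))            # 8 memory-bus transfers for one 128-byte line
```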
`The simulator models a synchronous memory interface: the pro-
`cessor’s interface to the memory controller has a clock signal. This
`is typically simpler to implement and debug than a fully asynchro-
`nous interface. If the processor executes at a faster clock rate than
`the memory bus (as is likely), the processor may have to stall for
`several cycles to synchronize with the bus before transmitting the
`request. We account for the number of stall cycles in Bus Wait Time.
`The simulator models several different refresh organizations, as
`described in Section 5. The amount of time (on average) spent stall-
`ing due to a memory reference arriving during a refresh cycle is
`accounted for in the time component labeled Refresh Time.
`
4.2  Interleaving
`
For the 100MHz 128-bit bus configuration, the request size (a 128-byte L2 line) is eight times the bus transfer width; therefore each DRAM access is a pipelined operation that takes advantage of page mode. For the faster DRAM parts, this pipeline keeps the memory bus completely occupied. However, for the slower DRAM parts (FPM and EDO), the timing looks like that shown in Figure 8(a). While the address bus may be fully occupied, the memory bus is not, which puts the slower DRAMs at a disadvantage compared to the faster parts. For comparison, we model the FPM and EDO parts as interleaved as well (shown in Figure 8(b)). The degree of interleaving is that required to occupy the memory data bus as fully as possible. This may actually over-occupy the address bus, in which case we assume that there is more than one address bus between the controller and the DRAM parts. FPM DRAM specifies a 40ns CAS period and is four-way interleaved; EDO DRAM specifies a 25ns CAS period and is two-way interleaved. Both are interleaved at a bus-width granularity.
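The following sketch illustrates the interleaving arithmetic for the FPM case: with a 40ns column cycle and a 10ns bus cycle, four banks started 10ns apart can deliver a new transfer to the data bus every bus cycle. This is a scheduling illustration under idealized assumptions, not the timing model used in the simulator.

```python
# Illustration of the interleaving idea in Figure 8(b): with a 40 ns column
# (CAS) cycle and a 10 ns bus cycle, four FPM banks started 10 ns apart can
# hand the data bus a new transfer every bus cycle. Idealized sketch only.

CAS_CYCLE = 40                     # ns, FPM column cycle time (Section 4.2)
BUS_CYCLE = 10                     # ns, 100 MHz memory bus
WAYS = CAS_CYCLE // BUS_CYCLE      # 4-way interleaving

def transfer_times(num_transfers: int) -> list:
    """Return (bank, time data appears on the bus) for each transfer."""
    schedule = []
    for i in range(num_transfers):
        bank = i % WAYS
        start = (i // WAYS) * CAS_CYCLE + bank * BUS_CYCLE
        schedule.append((bank, start + CAS_CYCLE))   # data ready one CAS cycle later
    return schedule

for bank, t in transfer_times(8):
    print(f"bank {bank}: data on bus at {t} ns")     # every 10 ns, bus stays busy
```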
`
5  EXPERIMENTAL RESULTS
`
`For most graphs, the performance of several DRAM organizations is
`given: FPM1, FPM2, FPM3, EDO1, EDO2, SDRAM, ESDRAM,
`SLDRAM, RDRAM, and DRDRAM. The first two configurations
`(FPM1 and FPM2) show the difference between always keeping the
`row buffer open (thereby avoiding a precharge overhead if the next
`access is to the same row) and never keeping the row buffer open.
`FPM1 is the pessimistic strategy of closing the row buffer after every
`access and precharging immediately; FPM2 is the optimistic strat-
`egy of keeping the row buffer open and delaying precharge. The dif-
`
Figure 7: DRAM bus configurations. The DRAM/bus organizations used in (a) the non-interleaved FPM, EDO, SDRAM, and ESDRAM simulations; (b) the SLDRAM and Rambus simulations; and (c) the parallel-channel (strawman) SLDRAM and Rambus performance numbers in Figure 11. Due to differences in bus design, the only bus overhead included in the simulations is that of the bus that is common to all organizations: the 100MHz 128-bit memory bus.
`
`time. We call this overlapped component TO, and if TM is the total
`time spent in the primary memory system (the time returned by our
`DRAM simulator), then TO = TP – (T – TM). This is the portion of
`TP that is overlapped with memory access.
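For clarity, the decomposition above can be written as a small routine that combines the outputs of the three simulation runs. The variable names are ours, but the formulas are exactly those given in this section.

```python
# Direct transcription of the execution-time decomposition used above.
# T   : execution time of the realistic simulation
# T_U : execution time with unlimited bandwidth (cacheline-wide bus)
# T_P : execution time with a perfect (zero-latency) primary memory
# T_M : total time spent in the primary memory system (DRAM simulator output)

def decompose(T: float, T_U: float, T_P: float, T_M: float) -> dict:
    T_L = T_U - T_P          # stalls due purely to latency
    T_B = T - T_U            # stalls due to limited bandwidth
    T_O = T_P - (T - T_M)    # processing time overlapped with memory access
    return {"processing": T_P, "latency_stall": T_L,
            "bandwidth_stall": T_B, "overlap": T_O}

# Example with made-up numbers (seconds), for illustration only:
print(decompose(T=12.0, T_U=10.5, T_P=9.0, T_M=4.0))
```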
`
4.1  DRAM Simulator Overview
`
`The DRAM simulator models the internal state of the following
`DRAM architectures: Fast Page Mode [35], Extended Data Out
`[16], Synchronous [17], Enhanced Synchronous [10, 17], Synchro-
`nous Link [38], Rambus [31], and Direct Rambus [32].
`The timing parameters for the different DRAM architectures are
`given in Table 1. Since we could not find a 64Mbit part specification
`for ESDRAM, we extrapolated based on the most recent SDRAM
`and ESDRAM datasheets. To measure DRAM behavior in systems
`of differing performance, we varied the speed at which requests
`arrive at the DRAM. We ran the L2 cache at speeds of 100ns, 10ns,
`and 1ns, and for each L2 access-time we scaled the main processor’s
`speed accordingly (the CPU runs at 10x the L2 cache speed).
`We wanted a model of a typical workstation, so the processor is
`eight-way superscalar, out-of-order, with lockup-free L1 caches. L1
`caches are split 64KB/64KB, 2-way set associative, with 64-byte
`linesizes. The L2 cache is unified 1MB, 4-way set associative, write-
`back, and has a 128-byte linesize. The L2 cache is lockup-free but
`only allows one outstanding DRAM request at a time; note this orga-
`
`

`

Figure 9: The penalty for choosing the wrong refresh organization (compress benchmark, 100ns L2 cache; y-axis is Time per Access in ns, x-axis is the DRAM configurations). In some instances, time waiting for refresh can account for more than 50%.
`
`up to two orders of magnitude worse than the time-interspersed
`scheme. Particularly hard-hit was the compress benchmark, shown
`in Figure 9. Because such high overheads are easily avoided with an
`appropriate refresh organization, we only present results for the
`time-interspersed refresh approach.
`
5.2  Total Execution Time
`
`Figure 10(a) shows the total execution time for several benchmarks
`of SPECint ’95 using SDRAM for the primary memory system. The
`time is divided into processor computation, which includes accesses
`to the L1 and L2 caches, and time spent in the primary memory sys-
`tem. The graphs also show the overlap between processor computa-
`tion and DRAM access time. For each architecture, there are three
`vertical bars, representing L2 cache cycle times of 100ns, 10ns, and
`1ns (left, middle, and rightmost bars, respectively). For each DRAM
`architecture and L2 cache access time, the figure shows a bar repre-
`senting execution time, partitioned into four components:
`• Memory stall cycles due to limited bandwidth
`• Memory stall cycles due to latency
`• Processor time (includes L1 and L2 activity) that is overlapped
`with memory access
`• Processor time (includes L1 and L2 activity) that is not
`overlapped with memory access
`SimpleScalar schedules instructions extremely aggressively and
`hides much of the memory latency with other work—though this
`“other work” is not all useful work, as it includes all L1 and L2
`cache activity. For the 100ns L2 (corresponding to a 100MHz pro-
`cessor), between 50% and 99% of the memory access-time is hid-
`den, depending on the type of DRAM the CPU is attached to (the
`faster DRAM parts allow a processor to exploit greater degrees of
`concurrency). For 10ns (corresponding to a 1GHz processor),
`between 5% and 90% of the latency is hidden. As expected, the
`slower systems hide more of the DRAM access time than the faster
`systems.
`Figure 10(b) shows that the more advanced DRAM designs have
`reduced the proportion of overhead attributed to limited bandwidth
`by roughly a factor of three: from 3 CPI in FPMDRAM to 1 CPI in
`SDRAM, ESDRAM, and DRDRAM.
`Summary: The graphs demonstrate the degree to which con-
`temporary DRAM designs are addressing the memory bandwidth
`problem. Popular high-performance techniques such as lockup-free
`
Figure 8: Interleaving in DRAM simulator: (a) non-interleaved timing for access to DRAM; (b) interleaved timing for access to DRAM. Time in Data Transfer Overlap accounts for much activity in interleaved organizations; Bus Transmission is the remainder of time that is not overlapped with anything else.
`
`ference is seen in Row Access Time, which, as the graphs show, is
`not large for present-day organizations. For all other DRAM simula-
`tions but ESDRAM, we keep the row buffer open, as the timing of
`the pessimistic strategy can be calculated without simulation. The
`FPM3 and EDO2 labels represent the interleaved organizations of
`FPM and EDO DRAM. The remaining labels are self-explanatory.
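The difference between the two policies can be illustrated with a back-of-the-envelope cost model using the FPM timings of Table 1, sketched below. Under the close-page policy (FPM1) every access pays the row access, with the precharge assumed to complete in the background; under the open-page policy (FPM2) an access pays precharge plus row access only on a row-buffer miss. The crossover behavior in this toy model is illustrative only and ignores the overlap and bus effects that the simulator accounts for.

```python
# Sketch of the two row-buffer policies compared above, using the FPM timings
# from Table 1. "close" (FPM1) precharges immediately after every access, so an
# access pays row + column + transfer (precharge assumed hidden); "open" (FPM2)
# leaves the row in the sense amps and pays precharge + row only on a miss.

PRECHARGE, ROW_ACC, COL_ACC, XFER = 40, 15, 30, 15   # ns (FPM, Table 1)

def access_time(policy: str, hit: bool) -> int:
    if policy == "close":                    # FPM1: pessimistic
        return ROW_ACC + COL_ACC + XFER
    if hit:                                  # FPM2: optimistic, row still open
        return COL_ACC + XFER
    return PRECHARGE + ROW_ACC + COL_ACC + XFER

for hit_rate in (0.2, 0.4, 0.8):             # row-buffer hit rates seen in Section 1
    open_avg = hit_rate * access_time("open", True) + \
               (1 - hit_rate) * access_time("open", False)
    print(f"hit rate {hit_rate:.0%}: open-page avg {open_avg:.0f} ns, "
          f"close-page {access_time('close', False)} ns")
```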
`
5.1  Handling Refresh
`
`Surprisingly, DRAM refresh organization can affect performance
`dramatically. Where the refresh organization is not specified for an
`architecture, we simulate a model in which the DRAM allocates
`bandwidth to either memory references or refresh operations, at the
`expens
