`
`Vinodh Cuppu, Bruce Jacob
`Dept. of Electrical & Computer Engineering
`University of Maryland, College Park
`{ramvinod,blj}@eng.umd.edu
`
`Brian Davis, Trevor Mudge
` Dept. of Electrical Engineering & Computer Science
`University of Michigan, Ann Arbor
`{btdavis,tnm}@eecs.umich.edu
`
`ABSTRACT
`
`In response to the growing gap between memory access time and
`processor speed, DRAM manufacturers have created several new
`DRAM architectures. This paper presents a simulation-based per-
`formance study of a representative group, each evaluated in a small
`system organization. These small-system organizations correspond
`to workstation-class computers and use on the order of 10 DRAM
`chips. The study covers Fast Page Mode, Extended Data Out, Syn-
`chronous, Enhanced Synchronous, Synchronous Link, Rambus, and
`Direct Rambus designs. Our simulations reveal several things: (a)
`current advanced DRAM technologies are attacking the memory
`bandwidth problem but not the latency problem; (b) bus transmis-
`sion speed will soon become a primary factor limiting memory-sys-
`tem performance; (c) the post-L2 address stream still contains
`significant locality, though it varies from application to application;
`and (d) as we move to wider buses, row access time becomes more
`prominent, making it important to investigate techniques to exploit
`the available locality to decrease access time.
`
1 INTRODUCTION
`
`In response to the growing gap between memory access time and
`processor speed, DRAM manufacturers have created several new
`DRAM architectures. This paper presents a simulation-based perfor-
`mance study of a representative group, evaluating each in terms of
`its effect on total execution time. We simulate the performance of
`seven DRAM architectures: Fast Page Mode [35], Extended Data
`Out [16], Synchronous [17], Enhanced Synchronous [10], Synchro-
`nous Link [38], Rambus [31], and Direct Rambus [32]. While there
`are a number of academic proposals for new DRAM designs, space
limits us to covering only existing commercial parts. To obtain accu-
`rate memory-request timing for an aggressive out-of-order proces-
`sor, we integrate our code into the SimpleScalar tool set [4].
`This paper presents a baseline study of a small-system DRAM
`organization: these are systems with only a handful of DRAM chips
`(0.1–1GB). We do not consider large-system DRAM organizations
`with many gigabytes of storage that are highly interleaved. The
`study asks and answers the following questions:
`• What is the effect of improvements in DRAM technology on the
`memory latency and bandwidth problems?
`Contemporary techniques for improving processor performance
`and tolerating memory latency are exacerbating the memory
`bandwidth problem [5]. Our results show that current DRAM
`architectures are attacking exactly this problem: the most recent
`technologies (SDRAM, ESDRAM, and Rambus) have reduced
`the stall time due to limited bandwidth by a factor of three
`compared to earlier DRAM architectures. However, the
`memory-latency component of overhead has not improved.
`
`• Where is time spent in the primary memory system (the memory
`system beyond the cache hierarchy, but not including secondary
`[disk] or tertiary [backup] storage)? What is the performance
`benefit of exploiting the page mode of contemporary DRAMs?
`For the newer DRAM designs, the time to extract the required
`data from the sense amps/row caches for transmission on the
`memory bus is the largest component in the average access time,
`though page mode allows this to be overlapped with column
`access and the time to transmit the data over the memory bus.
`• How much locality is there in the address stream that reaches the
`primary memory system?
`The stream of addresses that miss the L2 cache contains a
`significant amount of locality, as measured by the hit-rates in the
DRAM row buffers. The hit rates for the applications studied range from 8% to 95%, with a mean hit rate of 40% for a 1MB L2 cache.
`(This does not include hits to the row buffers when making
`multiple DRAM requests to read one cache-line.)
`We also make several observations. First, there is a one-time trade-
`off between cost, bandwidth, and latency: to a point, latency can be
`decreased by ganging together multiple DRAMs into a wide struc-
`ture. This trades dollars for bandwidth that reduces latency because
`a request size is typically much larger than the DRAM transfer
`width. Page mode and interleaving are similar optimizations that
`work because a request size is typically larger than the bus width.
`However, the latency benefits are limited by bus and DRAM speeds:
`to get further improvements, one must run the DRAM core and bus
at faster speeds. Current memory buses are adequate for small sys-
`tems but are likely inadequate for large ones. Embedded DRAM [5,
`19, 37] is not a near-term solution, as its performance is poor on
`high-end workloads [3]. Faster buses are more likely solutions—wit-
`ness the elimination of the slow intermediate memory bus in future
`systems [12]. Another solution is to internally bank the memory
`array into many small arrays so that each can be accessed very
`quickly, as in the MoSys Multibank DRAM architecture [39].
`Second, widening buses will present new optimization opportu-
`nities. Each application exhibits a different degree of locality and
`therefore benefits from page mode to a different degree. As buses
`widen, this effect becomes more pronounced, to the extent that dif-
`ferent applications can have average access times that differ by 50%.
`This is a minor issue considering current bus technology. However,
`future bus technologies will expose the row access as the primary
`performance bottleneck, justifying the exploration of mechanisms to
`exploit locality to guarantee hits in the DRAM row buffers: e.g. row-
`buffer victim caches, prediction mechanisms, etc.
Third, while buses as wide as the L2 cache yield the best memory latency, they do not cut the latency to half that of a bus half as wide. Page
`mode overlaps the components of DRAM access when making mul-
`tiple requests to the same row. If the bus is as wide as a request, one
`
`
`
`
`
Figure 2: FPM Read Timing. Fast page mode allows the DRAM controller to hold a row constant and receive multiple columns in rapid succession.
`
`varying several CPU-level parameters such as issue width, cache
`size & organization, number of processors, etc. This study focuses
`on the performance behavior of different DRAM architectures.
`
3 BACKGROUND
`
`A Random Access Memory (RAM) that uses a single transistor-
`capacitor pair for each binary value (bit) is referred to as a Dynamic
`Random Access Memory or DRAM. This circuit is dynamic
`because leakage requires that the capacitor be periodically refreshed
`for information retention. Initially, DRAMs had minimal I/O pin
`counts because the manufacturing cost was dominated by the num-
`ber of I/O pins in the package. Due largely to a desire to use stan-
`dardized parts, the initial constraints limiting the I/O pins have had a
`long-term effect on DRAM architecture: the address pins for most
`DRAMs are still multiplexed, potentially limiting performance. As
`the standard DRAM interface has become a performance bottleneck,
`a number of “revolutionary” proposals [26] have been made. In most
`cases, the revolutionary portion is the interface or access mecha-
`nism, while the DRAM core remains essentially unchanged.
`
3.1 The Conventional DRAM
`
`The addressing mechanism of early DRAM architectures is still uti-
`lized, with minor changes, in many of the DRAMs produced today.
`In this interface, shown in Figure 1, the address bus is multiplexed
`between row and column components. The multiplexed address bus
`uses two control signals—the row and column address strobe sig-
`nals, RAS and CAS respectively—which cause the DRAM to latch
`the address components. The row address causes a complete row in
`the memory array to propagate down the bit lines to the sense amps.
`The column address selects the appropriate data subset from the
`sense amps and causes it to be driven to the output pins.
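As a rough illustration of this split addressing (a sketch only; the field widths below are assumptions chosen to match the 4096-row by 1024-column FPM organization in Table 1, and bank or chip-select bits are ignored), a controller might derive the two address components as follows:

```c
#include <stdint.h>

/* Illustrative sketch only: field widths assume a 4096-row x 1024-column
 * part (12-bit row, 10-bit column), with the address expressed in units of
 * the DRAM transfer width; bank and chip-select bits are ignored. */
typedef struct {
    uint32_t row;   /* driven on the address pins while RAS is asserted */
    uint32_t col;   /* driven on the same pins while CAS is asserted    */
} dram_addr_t;

static dram_addr_t split_address(uint32_t addr)
{
    dram_addr_t a;
    a.col = addr & 0x3FF;           /* low 10 bits select the column */
    a.row = (addr >> 10) & 0xFFF;   /* next 12 bits select the row   */
    return a;
}
```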
`
3.2 Fast Page Mode DRAM (FPM DRAM)
`
`Fast-Page Mode DRAM implements page mode, an improvement
`on conventional DRAM in which the row-address is held constant
`and data from multiple columns is read from the sense amplifiers.
`The data held in the sense amps form an “open page” that can be
`accessed relatively quickly. This speeds up successive accesses to
`
Figure 1: Conventional DRAM block diagram. The conventional DRAM uses a split addressing mechanism still found in most DRAMs today.
`
`cannot exploit this overlap. For cost considerations, having at most
`an N/2-bit bus, N being the L2 cache width, might be a good choice.
`Fourth, critical-word-first does not mix well with burst mode.
`Critical-word-first is a strategy that requests a block of data poten-
`tially out of address-order; burst mode delivers data in a fixed but
redefinable order. A burst-mode DRAM can thus have longer
`latencies in real systems, even if its end-to-end latency is low.
`Finally, the choice of refresh mechanism can significantly alter
`the average memory access time. For some benchmarks and some
`refresh organizations, the amount of time spent waiting for a DRAM
`in refresh mode accounted for 50% of the total latency.
`As one might expect, our results and conclusions are dependent
`on our system specifications, which we chose to be representative of
`mid- to high-end workstations: a 100MHz 128-bit memory bus, an
`eight-way superscalar out-of-order CPU, lockup-free caches, and a
`small-system DRAM organization with ~10 DRAM chips.
`
2 RELATED WORK
`
`Burger, Goodman, and Kagi quantified the effect on memory behav-
`ior of high-performance latency-reducing or latency-tolerating tech-
`niques such as
`lockup-free caches, out-of-order execution,
`prefetching, speculative loads, etc. [5]. They concluded that to hide
`memory latency, these techniques often increase demands on mem-
`ory bandwidth. They classify memory stall cycles into two types:
`those due to lack of available memory bandwidth, and those due
`purely to latency. This is a useful classification, and we use it in our
`study. This study differs from theirs in that we focus on the access
`time of only the primary memory system, while their study com-
`bines all memory access time, including the L1 and L2 caches. Their
`study focuses on the behavior of latency-hiding techniques, while
`this study focuses on the behavior of different DRAM architectures.
`Several marketing studies compare the memory latency and
`bandwidth available from different DRAM architectures [7, 29, 30].
`This paper builds on these studies by looking at a larger assortment
`of DRAM architectures, measuring DRAM impact on total applica-
`tion performance, decomposing the memory access time into differ-
`ent components, and measuring the hit rates in the row buffers.
`Finally, there are many studies that measure system-wide perfor-
`mance, including that of the primary memory system [1, 2, 9, 18, 23,
`24, 33, 34]. Our results resemble theirs, in that we obtain similar fig-
`ures for the fraction of time spent in the primary memory system.
`However, these studies have different goals from ours, in that they
`are concerned with measuring the effects on total execution time of
`
`
`
`
Figure 3: Extended Data Out (EDO) DRAM block diagram. EDO adds a latch on the output that allows CAS to cycle more quickly than in FPM.
`
Figure 4: EDO Read Timing. The output latch in EDO DRAM allows more overlap between column access and data transfer than in FPM.
`
`the same row of the DRAM core. Figure 2 gives the timing for FPM
`reads. The labels show the categories to which the portions of time
`are assigned in our simulations. Note that page mode is supported in
`all the DRAM architectures in this study.
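The effect of page mode on access time can be captured in a few lines. The sketch below is a minimal open-page latency model, not the simulator used in this study; the timing constants are the FPM DRAM values from Table 1, and the data transfer component is omitted.

```c
/* Minimal open-page latency model (illustration only, not the study's
 * simulator). Timings are the FPM DRAM values from Table 1; the data
 * transfer component is omitted for brevity. */
#define T_PRECHARGE_NS 40
#define T_ROW_NS       15
#define T_COL_NS       30

typedef struct { int open_row; int valid; } bank_state_t;

static int access_latency_ns(bank_state_t *b, int row)
{
    if (b->valid && b->open_row == row)
        return T_COL_NS;                       /* page hit: column access only  */
    int t = (b->valid ? T_PRECHARGE_NS : 0)    /* close the currently open page */
          + T_ROW_NS + T_COL_NS;               /* then row and column access    */
    b->open_row = row;
    b->valid = 1;
    return t;
}
```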
`
3.3 Extended Data Out DRAM (EDO DRAM)
`
`Extended Data Out DRAM, sometimes referred to as hyper-page
`mode DRAM, adds a latch between the sense-amps and the output
`pins of the DRAM, shown in Figure 3. This latch holds output pin
`state and permits the CAS to rapidly de-assert, allowing the memory
`array to begin precharging sooner. In addition, the latch in the output
`path also implies that the data on the outputs of the DRAM circuit
`remain valid longer into the next clock phase. Figure 4 gives the tim-
`ing for an EDO read.
`
3.4 Synchronous DRAM (SDRAM)
`
`Conventional, FPM, and EDO DRAM are controlled asynchro-
`nously by the processor or the memory controller; the memory
`latency is thus some fractional number of CPU clock cycles. An
`alternative is to make the DRAM interface synchronous such that
`the DRAM latches information to and from the controller based on a
`clock signal. A timing diagram is shown in Figure 5. SDRAM
`devices typically have a programmable register that holds a bytes-
`per-request value. SDRAM may therefore return many bytes over
`several cycles per request. The advantages include the elimination of
`the timing strobes and the availability of data from the DRAM each
`clock cycle. The underlying architecture of the SDRAM core is the
`same as in a conventional DRAM.
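Because the bytes-per-request value is programmable, the number of cycles a burst occupies on the data pins follows directly from the request size and the part's transfer width. A small sketch using the SDRAM parameters from Table 1 (16-bit transfer width, 100MHz clock); the helper is ours, for illustration only.

```c
/* Sketch: data cycles for one SDRAM burst, assuming the programmed
 * bytes-per-request register matches the request size. Transfer width and
 * clock period are the Table 1 SDRAM values (16 bits at 100MHz). */
#define SDRAM_XFER_BYTES 2      /* 16-bit data path */
#define SDRAM_CYCLE_NS   10     /* 100MHz clock     */

static int burst_cycles(int request_bytes)
{
    return (request_bytes + SDRAM_XFER_BYTES - 1) / SDRAM_XFER_BYTES;
}
/* A 128-byte L2 line from a single x16 device takes burst_cycles(128) = 64
 * cycles (640ns) of data transfer; the eight-device arrangement of
 * Figure 7(a) supplies the same line in 8 cycles on the 128-bit bus. */
```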
`
`Figure 5: SDRAM Read Operation Clock Diagram. SDRAM contains a
`writable register for the request length, allowing high-speed column access.
`
Figure 6: Rambus DRAM Read Operation. Rambus DRAMs transfer on both edges of a fast clock and can handle multiple simultaneous requests.
`
3.5 Enhanced Synchronous DRAM (ESDRAM)
`
`Enhanced Synchronous DRAM is an incremental modification to
`Synchronous DRAM that parallels the differences between FPM
`and EDO DRAM. First, the internal timing parameters of the
`ESDRAM core are faster than SDRAM. Second, SRAM row-caches
`have been added at the sense-amps of each bank. These caches pro-
`vide the kind of improved intra-row performance observed with
`EDO DRAM, allowing requests to the last accessed row to be satis-
`fied even when subsequent refreshes, precharges, or activates are
`taking place.
`
3.6 Synchronous Link DRAM (SLDRAM)
`
`RamLink is the IEEE standard (P1596.4) for a bus architecture for
`devices. Synchronous Link (SLDRAM) is an adaptation of Ram-
`Link for DRAM, and is another IEEE standard (P1596.7). Both are
`adaptations of the Scalable Coherent Interface (SCI). The SLDRAM
`specification is therefore an open standard allowing for use by ven-
`dors without licensing fees. SLDRAM uses a packet-based split
`request/response protocol. Its bus interface is designed to run at
`clock speeds of 200-600 MHz and has a two-byte-wide datapath.
`SLDRAM supports multiple concurrent transactions, provided all
`transactions reference unique internal banks. The 64Mbit SLDRAM
`devices contain 8 banks per device.
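Since concurrent SLDRAM transactions must target distinct internal banks, a controller needs a conflict check before issuing a new request packet. The sketch below is our own illustration of that rule; the data structures are not taken from the P1596.7 specification.

```c
#include <stdbool.h>

/* Illustrative bank-conflict check for SLDRAM-style concurrency: a new
 * transaction may issue only if no outstanding transaction targets the same
 * internal bank. Structures are ours, not from the P1596.7 specification. */
#define SLDRAM_BANKS 8          /* 64Mbit SLDRAM devices contain 8 banks */

typedef struct { int bank; bool busy; } txn_t;

static bool can_issue(const txn_t *outstanding, int n, int new_bank)
{
    for (int i = 0; i < n; i++)
        if (outstanding[i].busy && outstanding[i].bank == new_bank)
            return false;       /* bank already owned by an in-flight request */
    return true;
}
```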
`
3.7 Rambus DRAMs (RDRAM)
`
`Rambus DRAMs use a one-byte-wide multiplexed address/data bus
`to connect the memory controller to the RDRAM devices. The bus
runs at 300 MHz and transfers on both clock edges to achieve a theo-
`retical peak of 600 Mbytes/s. Physically, each 64-Mbit RDRAM is
`
`
`
`
Table 1: DRAM Specifications used in simulations

| DRAM type | Size   | Rows | Columns | Transfer Width | Row Buffer | Internal Banks | Speed   | Pre-charge | Row Access | Column Access | Data Transfer |
|-----------|--------|------|---------|----------------|------------|----------------|---------|------------|------------|---------------|---------------|
| FPMDRAM   | 64Mbit | 4096 | 1024    | 16 bits        | 16K bits   | 1              | –       | 40ns       | 15ns       | 30ns          | 15ns          |
| EDODRAM   | 64Mbit | 4096 | 1024    | 16 bits        | 16K bits   | 1              | –       | 40ns       | 12ns       | 30ns          | 15ns          |
| SDRAM     | 64Mbit | 4096 | 256     | 16 bits        | 4K bits    | 4              | 100MHz  | 20ns       | 30ns       | 30ns          | 10ns          |
| ESDRAM    | 64Mbit | 4096 | 256     | 16 bits        | 4K bits    | 4              | 100MHz  | 20ns       | 20ns       | 20ns          | 10ns          |
| SLDRAM    | 64Mbit | 1024 | 128     | 64 bits        | 8K bits    | 8              | 200MHz  | 30ns       | 40ns       | 40ns          | 10ns          |
| RDRAM     | 64Mbit | 1024 | 256     | 64 bits        | 16K bits   | 4              | 300MHz  | 26.66ns    | 40ns       | 23.33ns       | 13.33ns       |
| DRDRAM    | 64Mbit | 512  | 64      | 128 bits       | 4K bits    | 16             | 400MHz  | 20/40ns    | 17.5ns     | 30ns          | 10ns          |
`
Table 2: Time components in primary memory system

Row Access Time: The time to (possibly) precharge the row buffers, present the row address, latch the row address, and read the data from the memory array into the sense amps.

Column Access Time: The time to present the column address at the address pins and latch the value.

Data Transfer Time: The time to transfer the data from the sense amps through the column muxes to the data-out pins.

Data Transfer Time Overlap: The amount of time spent performing both column access and data transfer simultaneously (when using page mode, a column access can overlap with the previous data transfer for the same row). Note that, since determining the amount of overlap between column access and data transfer can be tricky in the interleaved examples, for those cases we simply call all time between the start of the first data transfer and the termination of the last column access Data Transfer Time Overlap (see Figure 8).

Refresh Time: Amount of time spent waiting for a refresh cycle to finish.

Bus Wait Time: Amount of time spent waiting to synchronize with the 100MHz memory bus.

Bus Transmission Time: The portion of time to transmit a request over the memory bus to & from the DRAM system that is not overlapped with Column Access Time or Data Transfer Time.
`
`divided into 4 banks, each with its own row buffer, and hence up to 4
`rows remain active or open1. Transactions occur on the bus using a
`split request/response protocol. Because the bus is multiplexed
`between address and data, only one transaction may use the bus dur-
`ing any 4 clock cycle period, referred to as an octcycle. The protocol
`uses packet transactions; first an address packet is driven, then the
`data. Different transactions can require different numbers of octcy-
`cles, depending on the transaction type, location of the data within
`the device, number of devices on the channel, etc. Figure 6 gives a
`timing diagram for a read transaction.
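The peak-bandwidth figure and the octcycle granularity quoted above follow from simple arithmetic, sketched here for reference:

```c
#include <stdio.h>

/* Worked numbers from the text: a one-byte bus at 300MHz transferring on
 * both clock edges peaks at 600 Mbytes/s; an octcycle (4 clock cycles) is
 * the minimum bus occupancy of one transaction and lasts roughly 13.3ns. */
int main(void)
{
    double clock_mhz   = 300.0;
    double bus_bytes   = 1.0;
    double peak_mb_s   = 2.0 * clock_mhz * bus_bytes;      /* 600 MB/s  */
    double octcycle_ns = 4.0 * (1000.0 / clock_mhz);       /* ~13.33 ns */
    printf("peak = %.0f Mbytes/s, octcycle = %.2f ns\n", peak_mb_s, octcycle_ns);
    return 0;
}
```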
`
3.8 Direct Rambus (DRDRAM)
`
Direct Rambus DRAMs use a 400 MHz 3-byte-wide channel (2 for
`data, 1 for addresses/commands). Like the Rambus parts, Direct
`Rambus parts transfer at both clock edges, implying a maximum
`bandwidth of 1.6 Gbytes/s. DRDRAMs are divided into 16 banks
`with 17 half-row buffers2. Each half-row buffer is shared between
`adjacent banks, which implies that adjacent banks cannot be active
`simultaneously. This organization has the result of increasing the
`row-buffer miss rate as compared to having one open row per bank,
`but it reduces the cost by reducing the die area occupied by the row
`
`1. In this study, we model 64-Mbit Rambus parts, which have 4 banks and
`4 open rows. Earlier 16-Mbit Rambus organizations had 2 banks and 2
`open pages, and future 256-Mbit organizations may have even more.
`2. As with the previous part, we model 64-Mbit Direct Rambus, which has
`this organization. Future (256-Mbit) organizations may look different.
`
`buffers, compared to 16 full row buffers. A critical difference
`between RDRAM and DRDRAM is that because DRDRAM parti-
`tions the bus into different components, three transactions can simul-
`taneously utilize the different portions of the DRDRAM interface.
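The shared half-row buffers impose an adjacency constraint that any DRDRAM controller model must respect: activating a bank implicitly closes its neighbors. The encoding below is our own sketch of that constraint for the 16-bank organization modeled here.

```c
#include <stdbool.h>

/* Sketch of the DRDRAM adjacency constraint: each half-row buffer is shared
 * between neighboring banks, so activating bank b forces banks b-1 and b+1
 * closed. The encoding is ours; the 16-bank count is the 64-Mbit organization. */
#define DRD_BANKS 16

typedef struct { bool active[DRD_BANKS]; } drd_device_t;

static void activate_bank(drd_device_t *d, int b)
{
    if (b > 0)             d->active[b - 1] = false;   /* left neighbor closes  */
    if (b < DRD_BANKS - 1) d->active[b + 1] = false;   /* right neighbor closes */
    d->active[b] = true;
}
```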
`
4 EXPERIMENTAL METHODOLOGY
`
`For accurate timing of memory requests in a dynamically reordered
`instruction stream, we integrated our code into SimpleScalar, an exe-
`cution-driven simulator of an aggressive out-of-order processor [4].
`We calculate the DRAM access time, much of which is overlapped
`with instruction execution. To determine the degree of overlap, and
`to separate out memory stalls due to bandwidth limitations vs.
`latency limitations, we run two other simulations—one with perfect
`primary memory (zero access time) and one with a perfect bus (as
`wide as an L2 cache line). Following the methodology in [5], we
`partition the total application execution time into three components:
TP, TL, and TB, which correspond to time spent processing, time spent
`stalling for memory due to latency, and time spent stalling for mem-
`ory due to limited bandwidth. In this paper, time spent “processing”
`includes all activity above the primary memory system, i.e. it con-
`tains all processor execution time and L1 and L2 cache activity. Let
`T be the total execution time for the realistic simulation; let TU be
`the execution time assuming unlimited bandwidth—the results from
`the simulation that models cacheline-wide buses. Then TP is the
`time given by the simulation that models a perfect primary memory
`system, and we calculate TL and TB: TL = TU – TP and TB = T – TU.
`In addition, we consider one more component: the degree to which
`the processor is able to overlap memory access time with processing
`
`
`
`
`nization fails to take advantage of some of the newer DRAM parts
`that can handle multiple concurrent requests. 100MHz 128-bit buses
`are common for high-end machines, so this is the bus configuration
`that we model. We assume that the communication overhead is only
`one 10ns cycle in each direction.
`The DRAM/bus configurations simulated are shown in Figure 7.
`For DRAMs other than Rambus and SLDRAM, eight DRAMs are
`arranged in parallel in a DIMM-like organization to obtain a 128-bit
`bus. We assume that the memory controller has no overhead delay.
SLDRAM, RDRAM, and DRDRAM utilize narrower but higher-speed buses. These DRAM architectures can be arranged in parallel
`channels, but we study them here in the context of a single-width
`DRAM bus, which is the simplest configuration. This yields some
`latency penalties for these architectures, as our simulations require
`that the controller coalesce bus packets into 128-bit chunks to be
`transmitted over the 100MHz 128-bit memory bus. To put the
`designs on even footing, we ignore the transmission time over the
`narrow DRAM channel. Because of this organization, transfer rate
`comparisons may also be deceptive, as we are transferring data from
eight conventional DRAMs (FPM, EDO, SDRAM, ESDRAM) con-
`currently, versus only a single device in the case of the narrow bus
`architectures (SLDRAM, RDRAM, DRDRAM).
`The simulator models a synchronous memory interface: the pro-
`cessor’s interface to the memory controller has a clock signal. This
`is typically simpler to implement and debug than a fully asynchro-
`nous interface. If the processor executes at a faster clock rate than
`the memory bus (as is likely), the processor may have to stall for
`several cycles to synchronize with the bus before transmitting the
`request. We account for the number of stall cycles in Bus Wait Time.
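The synchronization cost is simply the wait until the next edge of the slower memory-bus clock. The sketch below shows one way such stalls might be counted; the alignment rule is an assumption for illustration, not the simulator's actual policy.

```c
/* Sketch: CPU cycles spent waiting to align a request with the next edge of
 * the 100MHz memory-bus clock. The alignment rule here is an assumption for
 * illustration, not the policy of the simulator described in this paper. */
static long bus_wait_cycles(long cpu_cycle, long cpu_cycles_per_bus_cycle)
{
    long phase = cpu_cycle % cpu_cycles_per_bus_cycle;
    return (phase == 0) ? 0 : cpu_cycles_per_bus_cycle - phase;
}
/* Example: with a 1GHz CPU and the 100MHz bus, cpu_cycles_per_bus_cycle is
 * 10, so a request ready at CPU cycle 1003 waits 7 cycles for the bus edge. */
```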
`The simulator models several different refresh organizations, as
`described in Section 5. The amount of time (on average) spent stall-
`ing due to a memory reference arriving during a refresh cycle is
`accounted for in the time component labeled Refresh Time.
`
4.2 Interleaving
`
For the 100MHz 128-bit bus configuration, the request size is eight times the transfer size; therefore each DRAM access is a pipelined
`operation that takes advantage of page mode. For the faster DRAM
`parts, this pipeline keeps the memory bus completely occupied.
`However, for the slower DRAM parts (FPM and EDO), the timing
`looks like that shown in Figure 8(a). While the address bus may be
`fully occupied, the memory bus is not, which puts the slower
`DRAMs at a disadvantage compared to the faster parts. For compar-
`ison, we model the FPM and EDO parts as interleaved as well
`(shown in Figure 8(b)). The degree of interleaving is that required to
`occupy the memory data bus as fully as possible. This may actually
over-occupy the address bus, in which case we assume that there is more than one address bus between the controller and the DRAM
`parts. FPM DRAM specifies a 40ns CAS period and is four-way
`interleaved; EDO DRAM specifies a 25ns CAS period and is two-
`way interleaved. Both are interleaved at a bus-width granularity.
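Under a simple model in which each way supplies one bus-width transfer per CAS period, the data-bus utilization achieved by a given interleave degree is easy to estimate; the sketch below reproduces the four-way FPM and two-way EDO cases quoted above (the utilization model itself is an assumption, not a result from the simulator).

```c
#include <stdio.h>

/* Assumed model: each interleave way supplies one bus-width transfer per CAS
 * period, so the data bus is busy for (ways * bus_cycle) out of each CAS
 * period, capped at 100%. This is an illustration, not a simulator result. */
static double bus_utilization(int ways, double cas_period_ns, double bus_cycle_ns)
{
    double u = ways * bus_cycle_ns / cas_period_ns;
    return (u > 1.0) ? 1.0 : u;
}

int main(void)
{
    printf("FPM, 4-way: %.0f%%\n", 100.0 * bus_utilization(4, 40.0, 10.0)); /* 100% */
    printf("EDO, 2-way: %.0f%%\n", 100.0 * bus_utilization(2, 25.0, 10.0)); /*  80% */
    return 0;
}
```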
`
5 EXPERIMENTAL RESULTS
`
`For most graphs, the performance of several DRAM organizations is
`given: FPM1, FPM2, FPM3, EDO1, EDO2, SDRAM, ESDRAM,
`SLDRAM, RDRAM, and DRDRAM. The first two configurations
`(FPM1 and FPM2) show the difference between always keeping the
`row buffer open (thereby avoiding a precharge overhead if the next
`access is to the same row) and never keeping the row buffer open.
`FPM1 is the pessimistic strategy of closing the row buffer after every
`access and precharging immediately; FPM2 is the optimistic strat-
`egy of keeping the row buffer open and delaying precharge. The dif-
`
`
`Figure 7: DRAM bus configurations. The DRAM/bus organizations used
`in (a) the non-interleaved FPM, EDO, SDRAM, and ESDRAM simulations; (b)
`the SLDRAM and Rambus simulations; and (c) the parallel-channel SLDRAM
`and Rambus performance numbers in Figure 11. Due to differences in bus
`design, the only bus overhead included in the simulations is that of the bus
`that is common to all organizations: the 100MHz 128-bit memory bus.
`
`time. We call this overlapped component TO, and if TM is the total
`time spent in the primary memory system (the time returned by our
`DRAM simulator), then TO = TP – (T – TM). This is the portion of
`TP that is overlapped with memory access.
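For reference, the decomposition above reduces to a few lines of arithmetic over the outputs of the three simulation runs (realistic, cacheline-wide bus, and perfect primary memory) plus the DRAM simulator's total memory time; the sketch below simply restates the definitions.

```c
/* Restates the execution-time decomposition: T from the realistic run, T_U
 * from the cacheline-wide-bus run, T_P from the perfect-memory run, and T_M
 * the total time in the primary memory system from the DRAM simulator. */
typedef struct { double t_proc, t_latency, t_bandwidth, t_overlap; } decomp_t;

static decomp_t decompose(double T, double T_U, double T_P, double T_M)
{
    decomp_t d;
    d.t_proc      = T_P;             /* processing, including L1/L2 activity   */
    d.t_latency   = T_U - T_P;       /* stalls due purely to latency           */
    d.t_bandwidth = T - T_U;         /* stalls due to limited bandwidth        */
    d.t_overlap   = T_P - (T - T_M); /* processing overlapped with DRAM access */
    return d;
}
```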
`
4.1 DRAM Simulator Overview
`
`The DRAM simulator models the internal state of the following
`DRAM architectures: Fast Page Mode [35], Extended Data Out
`[16], Synchronous [17], Enhanced Synchronous [10, 17], Synchro-
`nous Link [38], Rambus [31], and Direct Rambus [32].
`The timing parameters for the different DRAM architectures are
`given in Table 1. Since we could not find a 64Mbit part specification
`for ESDRAM, we extrapolated based on the most recent SDRAM
`and ESDRAM datasheets. To measure DRAM behavior in systems
`of differing performance, we varied the speed at which requests
`arrive at the DRAM. We ran the L2 cache at speeds of 100ns, 10ns,
`and 1ns, and for each L2 access-time we scaled the main processor’s
`speed accordingly (the CPU runs at 10x the L2 cache speed).
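For convenience, the Table 1 entries translate directly into a per-architecture parameter record of the kind a timing model consumes; two rows are restated below as data (the layout is illustrative, not the simulator's internal representation).

```c
/* Two rows of Table 1 restated as data. The field layout is illustrative,
 * not the simulator's internal representation. Times in ns; width in bits;
 * speed 0 denotes the asynchronous parts (FPM, EDO). */
typedef struct {
    const char *name;
    int rows, cols, xfer_bits, row_buffer_bits, banks, speed_mhz;
    double precharge_ns, row_ns, col_ns, xfer_ns;
} dram_params_t;

static const dram_params_t dram_table[] = {
    { "FPM DRAM", 4096, 1024, 16, 16 * 1024, 1,   0, 40.0, 15.0, 30.0, 15.0 },
    { "SDRAM",    4096,  256, 16,  4 * 1024, 4, 100, 20.0, 30.0, 30.0, 10.0 },
};
```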
`We wanted a model of a typical workstation, so the processor is
`eight-way superscalar, out-of-order, with lockup-free L1 caches. L1
`caches are split 64KB/64KB, 2-way set associative, with 64-byte
`linesizes. The L2 cache is unified 1MB, 4-way set associative, write-
`back, and has a 128-byte linesize. The L2 cache is lockup-free but
`only allows one outstanding DRAM request at a time; note this orga-
`
`
`
`
Figure 9: The penalty for choosing the wrong refresh organization. In some instances, time waiting for refresh can account for more than 50%. (The figure plots time per access in ns for compress with a 100ns L2 cache, broken into the Table 2 components, across the DRAM configurations.)
`
`up to two orders of magnitude worse than the time-interspersed
`scheme. Particularly hard-hit was the compress benchmark, shown
`in Figure 9. Because such high overheads are easily avoided with an
`appropriate refresh organization, we only present results for the
`time-interspersed refresh approach.
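The gap between the two organizations is visible with a back-of-the-envelope model: refreshing every row in one burst makes the part unavailable for one long window, while interspersing one row at a time spreads the same work thinly. The sketch below is illustrative only; the 64ms retention interval and 100ns per-row refresh cost are assumptions, not parameters from our simulations.

```c
#include <stdio.h>

/* Back-of-the-envelope comparison of refresh organizations. The 64ms
 * retention interval and 100ns per-row refresh cost are assumptions for
 * illustration, not parameters taken from the simulations in this paper. */
int main(void)
{
    double rows = 4096.0, retention_ms = 64.0, per_row_ns = 100.0;

    /* Bursty: all rows refreshed back to back once per retention interval. */
    double burst_window_us = rows * per_row_ns / 1000.0;        /* ~409.6us */

    /* Time-interspersed: one row refreshed every retention/rows interval.  */
    double interval_us = retention_ms * 1000.0 / rows;          /* ~15.6us  */

    printf("bursty: part unavailable for %.1f us at a time\n", burst_window_us);
    printf("interspersed: %.0f ns of refresh every %.1f us\n", per_row_ns, interval_us);
    return 0;
}
```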
`
5.2 Total Execution Time
`
`Figure 10(a) shows the total execution time for several benchmarks
`of SPECint ’95 using SDRAM for the primary memory system. The
`time is divided into processor computation, which includes accesses
`to the L1 and L2 caches, and time spent in the primary memory sys-
`tem. The graphs also show the overlap between processor computa-
`tion and DRAM access time. For each architecture, there are three
`vertical bars, representing L2 cache cycle times of 100ns, 10ns, and
`1ns (left, middle, and rightmost bars, respectively). For each DRAM
`architecture and L2 cache access time, the figure shows a bar repre-
`senting execution time, partitioned into four components:
`• Memory stall cycles due to limited bandwidth
`• Memory stall cycles due to latency
`• Processor time (includes L1 and L2 activity) that is overlapped
`with memory access
`• Processor time (includes L1 and L2 activity) that is not
`overlapped with memory access
`SimpleScalar schedules instructions extremely aggressively and
`hides much of the memory latency with other work—though this
`“other work” is not all useful work, as it includes all L1 and L2
`cache activity. For the 100ns L2 (corresponding to a 100MHz pro-
`cessor), between 50% and 99% of the memory access-time is hid-
`den, depending on the type of DRAM the CPU is attached to (the
`faster DRAM parts allow a processor to exploit greater degrees of
`concurrency). For 10ns (corresponding to a 1GHz processor),
`between 5% and 90% of the latency is hidden. As expected, the
`slower systems hide more of the DRAM access time than the faster
`systems.
`Figure 10(b) shows that the more advanced DRAM designs have
`reduced the proportion of overhead attributed to limited bandwidth
`by roughly a factor of three: from 3 CPI in FPMDRAM to 1 CPI in
`SDRAM, ESDRAM, and DRDRAM.
`Summary: The graphs demonstrate the degree to which con-
`temporary DRAM designs are addressing the memory bandwidth
`problem. Popular high-performance techniques such as lockup-free
`
Figure 8: Interleaving in DRAM simulator. (a) Non-interleaved timing for access to DRAM; (b) interleaved timing for access to DRAM. Time in Data Transfer Overlap accounts for much activity in interleaved organizations; Bus Transmission is the remainder of time that is not overlapped with anything else.
`
`ference is seen in Row Access Time, which, as the graphs show, is
`not large for present-day organizations. For all other DRAM simula-
`tions but ESDRAM, we keep the row buffer open, as the timing of
`the pessimistic strategy can be calculated without simulation. The
`FPM3 and EDO2 labels represent the interleaved organizations of
`FPM and EDO DRAM. The remaining labels are self-explanatory.
`
5.1 Handling Refresh
`
`Surprisingly, DRAM refresh organization can affect performance
`dramatically. Where the refresh organization is not specified for an
`architecture, we simulate a model in which the DRAM allocates
`bandwidth to either memory references or refresh operations, at the
`expens