`
`
`
COMPUTER
ARCHITECTURE
A
QUANTITATIVE
APPROACH

JOHN L. HENNESSY
&
DAVID A. PATTERSON
`
`Qualcomm, Ex. 1009, Page 1
`
`
`
`Sponsoring Editor Bruce Spatz
`Production Manager Shirley Jowell
`Technical Writer Walker Cunningham
`Text Design Gary Head
`Cover Design David Lance Goines
`Copy Editor Linda Medoff
`Proofreader Paul Medoff
`Computer Typesetting and Graphics Fifth Street Computer Services
`
Library of Congress Cataloging-in-Publication Data
Patterson, David A.
Computer architecture : a quantitative approach / David A.
Patterson, John L. Hennessy
p. cm.
Includes bibliographical references
ISBN 1-55860-069-8
1. Computer architecture. 2. Electronic digital computers
--Design and construction.
I. Hennessy, John L.
II. Title.
QA76.9.A73P377 1990
004.2'2--dc20
`
`Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
`Order from: P.O. Box 50490, Palo Alto, CA 94303-9953
`
©1990 by Morgan Kaufmann Publishers, Inc.
All rights reserved.
`
`89-85227
`CIP
`
No part of this publication may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means—electronic, mechanical, recording, or otherwise—without
the prior permission of the publisher.
`
All instruction sets and other design information of the DLX computer system contained
herein are copyrighted by the publisher and may not be incorporated in other publications
or distributed by media without formal acknowledgement and written consent from the
publisher. Use of the DLX in other publications for educational purposes is encouraged
and application for permission is welcomed.
ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or
intended for the authors should be addressed to the editorial offices of Morgan Kaufmann
Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged.
Any error sightings that are accepted for correction in subsequent printings will be
rewarded by the authors with a payment of $1.00 (U.S.) per correction. Electronic mail
can be sent to bugs@vsop.stanford.edu.
INSTRUCTOR SUPPORT: For information on classroom software and other instructor
materials available to adopters, please contact the editorial offices of Morgan Kaufmann
Publishers, Inc.
`
`
`
`
`
`
`408
`
`8.3 Caches
`
`Cache: a safe place for hiding or storing things.
`
`
`8.3 Caches
`
`
`
`
`Websier’s New World Dictionaryof the American La
`Secand Coilege Edition ~
`
`
`
`
`
`
Cache is the name first chosen to represent the level of the memory hierarchy
between the CPU and main memory, and that is the dominant use of the term.
While the concept of caches is younger than the IBM 360 architecture, caches
appear today in every class of computer and in some computers more than once.
In fact, the word has become so popular that it has replaced "buffer" in many
computer-science circles.

The general terms defined in the prior section can be used for caches,
although the word line is often used instead of block. Figure 8.5 shows the
typical range of memory-hierarchy parameters for caches.
`
`
Block (line) size     4–128 bytes
Hit time              1–4 clock cycles (normally 1)
Miss penalty          8–32 clock cycles
  (Access time)       6–10 clock cycles
  (Transfer time)     2–22 clock cycles
Miss rate             1%–20%
Cache size            1 KB–256 KB

FIGURE 8.5 Typical values of key memory-hierarchy parameters for caches in
workstations and minicomputers.

Now let's examine caches in more detail by answering the four memory-
hierarchy questions.

Q1: Where Can a Block Be Placed in a Cache?

Restrictions on where a block is placed create three categories of cache
organization:

■ If each block has only one place it can appear in the cache, the cache is said
to be direct mapped. The mapping is usually (block-frame address) modulo
(number of blocks in cache).

■ If a block can be placed anywhere in the cache, the cache is said to be fully
associative.
`
`
`
`Qualcomm, Ex. 1009, Page 3
`
`Qualcomm, Ex. 1009, Page 3
`
`
`
` Mamery-Hierarchy Design
`
`
`
`
FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The set-
associative organization has 4 sets with 2 blocks per set, called two-way set associative.
(Real caches contain hundreds of blocks and real memories contain hundreds of thousands
of blocks.) Assume that there is nothing in the cache and that the block-frame address in
question identifies lower-level block 12. The three options for caches are shown left to right.
In fully associative, block 12 from the lower level can go into any of the 8 blocks of the
cache. With direct mapped, block 12 can only be placed into block 4 (12 modulo 8). Set
associative, which has some of both features, allows the block to be placed anywhere in set
0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block
0 or block 1 of the cache.
`
`
`
`
`409
`
`a
`
■ If a block can be placed in a restricted set of places in the cache, the cache is
said to be set associative. A set is a group of two or more blocks in the cache.
A block is first mapped onto a set, and then the block can be placed anywhere
within the set. The set is usually chosen by bit selection; that is, (block-frame
address) modulo (number of sets in cache). If there are n blocks in a set, the
cache placement is called n-way set associative.

The range of caches from direct mapped to fully associative is really a
continuum of levels of set associativity: Direct mapped is simply one-way set
associative and a fully associative cache with m blocks could be called m-way
set associative. Figure 8.6 shows where block 12 can be placed in a cache
according to the block-placement policy.
`
`
`
`
`
`
`410
`
`8.3 Caches
`
`Q2: How Is a Block Found If It Is in the Cache?
`
`
`
`
`
`
Caches include an address tag on each block that gives the block-frame address.
The tag of every cache block that might contain the desired information is
checked to see if it matches the block-frame address from the CPU. Figure 8.7
gives an example. Because speed is of the essence, all possible tags are searched
in parallel; serial search would make set associativity counterproductive.
`
`
`
FIGURE 8.7 In fully associative placement, the block for block-frame address 12 can
appear in any of the 8 blocks; thus, all 8 tags must be searched. The desired data is
found in cache block 6 in this example. In direct-mapped placement there is only one cache
block where memory block 12 can be found. In set-associative placement, with 4 sets,
memory block 12 must be in set 0 (12 mod 4); thus, the tags of cache blocks 0 and 1 are
checked. In this case the data is found in cache block 1. Speed of cache access dictates
that searching must be performed in parallel for fully associative and set-associative
mappings.
`
`
`
There must be a way to know that a cache block does not have valid
information. The most common procedure is to add a valid bit to the tag to say
whether or not this entry contains a valid address. If the bit is not set, there
cannot be a match on this address.

A common omission in finding the cost of caches is to forget the cost of the
tag memory. One tag is required for each block. An advantage of larger
block sizes is that the tag overhead per cache entry becomes a smaller fraction of
the total cost of the cache.

Before proceeding to the next question, let's explore the relationship of the
CPU address to the cache. Figure 8.8 shows how an address is divided into three
fields to find data in a set-associative cache: the block-offset field used to select
the desired data from the block, the index field used to select the set, and the tag
field used for the comparison. While the comparison could be made on more of
the address than the tag, there is no need:
`
`
`Qualcomm, Ex. 1009, Page 5
`
`Qualcomm, Ex. 1009, Page 5
`
`
`
`
`
`
■ Checking the index would be redundant, since it was used to select the set to
be checked (an address stored in set 0, for example, must have 0 in the index
field or it couldn't be stored in set 0).

■ The offset is unnecessary in the comparison because all block offsets match
and the entire block is present or not.

If the total size is kept the same, increasing associativity increases the number of
blocks per set, thereby decreasing the size of the index and increasing the size of
the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with
increasing associativity.
`
`
`
`
FIGURE 8.8 The 3 portions of an address in a set-associative or direct-mapped cache.
The tag is used to check all the blocks in the set and the index is used to select the set. The
block offset is the address of the desired data within the block.
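The three-field division of Figure 8.8 can be sketched as a small Python helper (hypothetical, not from the book), assuming block size and number of sets are powers of two:

```python
def split_address(addr, block_size, num_sets):
    """Split a byte address into the (tag, index, block offset) fields of
    Figure 8.8. block_size and num_sets are assumed to be powers of two."""
    offset = addr % block_size                  # position within the block
    index = (addr // block_size) % num_sets     # selects the set
    tag = addr // (block_size * num_sets)       # compared against stored tags
    return tag, index, offset

# Using the VAX-11/780 parameters from the text: 8-byte blocks, 512 sets.
tag, index, offset = split_address(74565, 8, 512)
print(tag, index, offset)
```

Reassembling the fields, (tag × num_sets + index) × block_size + offset, recovers the original address, which is why comparing the tag alone suffices.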
`
`
Q3: Which Block Should Be Replaced on a Cache Miss?

If the choice were between a block that has valid data and a block that doesn't,
then it would be easy to select which block to replace. Alas, the high hit rate of
caches means that the overwhelming decision is between blocks that have valid
data.

A benefit of direct-mapped placement is that hardware decisions are
simplified. In fact, so simple that there is no choice: Only one block is checked
for a hit, and only that block can be replaced. With fully associative or set-
associative placement, there are several blocks to choose from on a miss. There
are two primary strategies employed for selecting which block to replace:

■ Random—To spread allocation uniformly, candidate blocks are randomly
selected. Some systems use a scheme for spreading data across a set of blocks
in a pseudorandomized manner to get reproducible behavior, which is
particularly useful during hardware debugging.

■ Least-recently used (LRU)—To reduce the chance of throwing out informa-
tion that will be needed soon, accesses to blocks are recorded. The block
replaced is the one that has been unused for the longest time. This makes use
of a corollary of temporal locality: If recently used blocks are likely to be
used again, then the best candidate for disposal is the least recently used.
Figure 8.9 (page 412) shows which block is the least-recently used for a
sequence of block-frame addresses in a fully associative memory hierarchy.
`
`
`
`
`412
`
`|
`
`|
`
`
`
`Block-frame addresses
`LRUblocknumber | 0
`
`;
`
`
`
`
`
`oe Ee
`
`3
`| 2
`] To lo
`
`| 0 | 0“lathe |3
`
`i
`
`8.3 Gaches
`
`
A virtue of random is that it is simple to build in hardware. As the number of
blocks to keep track of increases, LRU becomes increasingly expensive and is
frequently only approximated. Figure 8.10 shows the difference in miss rates
between LRU and random replacement. Replacement policy plays a greater role
in smaller caches than in larger caches where there are more choices of what to
replace.
`
`
`
FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses in
a fully associative memory hierarchy. This assumes that there are 4 blocks and that in
the beginning the LRU block is number 0. The LRU block number is shown below each new
block reference. Another policy, first-in–first-out (FIFO), simply discards the block that was
used N unique accesses before, independent of its reference pattern in the last N − 1
references. Random replacement generally outperforms FIFO and it is easier to implement.
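LRU tracking of the kind Figure 8.9 tabulates can be simulated in a few lines of Python (an illustrative sketch, not from the book), for a fully associative cache:

```python
def lru_trace(refs, num_blocks=4):
    """Simulate LRU replacement in a fully associative cache of num_blocks
    blocks; return the block index allocated on each miss (cold fill or
    eviction of the least-recently used block)."""
    cache = []        # cache[i] = block-frame address currently in block i
    order = []        # block indices, least-recently used first
    placements = []
    for addr in refs:
        if addr in cache:                       # hit: just renew recency
            i = cache.index(addr)
        elif len(cache) < num_blocks:           # cold miss: fill next block
            i = len(cache)
            cache.append(addr)
            placements.append(i)
        else:                                   # miss: evict the LRU block
            i = order[0]
            cache[i] = addr
            placements.append(i)
        if i in order:
            order.remove(i)
        order.append(i)                         # i is now most recently used
    return placements

print(lru_trace([0, 1, 2, 3, 4]))       # block 0 is LRU when 4 arrives
print(lru_trace([0, 1, 2, 3, 0, 4]))    # re-touching 0 makes block 1 the victim
```

The second trace shows the corollary of temporal locality at work: re-referencing address 0 protects its block, so the least-recently used block 1 is evicted instead.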
`
`
Associativity:    2-way               4-way               8-way
Size              LRU      Random     LRU      Random     LRU      Random
16 KB             5.18%    5.69%      4.67%    5.29%      4.39%    4.96%
64 KB             1.88%    2.01%      1.54%    1.66%      1.39%    1.53%
256 KB            1.15%    1.17%      1.13%    1.13%      1.12%    1.12%

FIGURE 8.10 Miss rates comparing least-recently used versus random replacement
for several sizes and associativities. This data was collected for a block size of 16 bytes
using one of the VAX traces containing user and operating system code (SAVE0). The
trace is included in the software supplement for course use. There is little difference
between LRU and random for larger size caches in this trace.
`
Q4: What Happens on a Write?
`
`
`
Reads dominate cache accesses. All instruction accesses are reads, and most
instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of 9%
stores and 17% loads for four DLX programs, making writes less than 10% of
the memory traffic. Making the common case fast means optimizing caches for
reads, but Amdahl's Law reminds us that high-performance designs cannot
neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block
can be read at the same time that the tag is read and compared, so the block read
begins as soon as the block-frame address is available. If the read is a hit, the
block is passed on to the CPU immediately. If it is a miss, there is no benefit—
but also no harm.
`
`
`
`
Such is not the case for writes. The processor specifies the size of the write,
usually between 1 and 8 bytes; only that portion of a block can be changed. In
general this means a read-modify-write sequence of operations on the block:
read the original block, modify one portion, and write the new block value.
Moreover, modifying a block cannot begin until the tag is checked to see if it is
a hit. Because tag checking cannot occur in parallel, then, writes normally take
longer than reads.

Thus, it is the write policies that distinguish many cache designs. There are
two basic options when writing to the cache:

■ Write through (or store through)—The information is written to both the
block in the cache and to the block in the lower-level memory.

■ Write back (also called copy back or store in)—The information is written
only to the block in the cache. The modified cache block is written to main
memory only when it is replaced.
`
`
Write-back cache blocks are called clean or dirty, depending on whether the
information in the cache differs from that in lower-level memory. To reduce the
frequency of writing back blocks on replacement, a feature called the dirty bit is
commonly used. This status bit indicates whether or not the block was modified
while in the cache. If it wasn't, the block is not written, since the lower level has
the same information as the cache.

Both write back and write through have their advantages. With write back,
writes occur at the speed of the cache memory, and multiple writes within a
block require only one write to the lower-level memory. Since every write
doesn't go to memory, write back uses less memory bandwidth, making write
back attractive in multiprocessors. With write through, read misses don't result
in writes to the lower level, and write through is easier to implement than write
back. Write through also has the advantage that main memory has the most
current copy of the data. This is important in multiprocessors and for I/O, which
we shall examine in Section 8.8. Hence, multiprocessors want write back to
reduce the memory traffic per processor and write through to keep the cache and
memory consistent.

When the CPU must wait for writes to complete during write throughs, the
CPU is said to write stall. A common optimization to reduce write stalls is a
write buffer, which allows the processor to continue while the memory is
updated. As we shall see in Section 8.8, write stalls can occur even with write
buffers.
`
There are two options on a write miss:

■ Write allocate (also called fetch on write)—The block is loaded, followed by
the write-hit actions above. This is similar to a read miss.

■ No write allocate (also called write around)—The block is modified in the
lower level and not loaded into the cache.
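The bandwidth argument for write back can be made concrete with a small sketch (illustrative only, not from the book): for a run of write hits, write through sends every write to the lower level, while write back sends at most one write-back per dirty block when it is eventually replaced.

```python
def lower_level_writes(block_addrs, policy):
    """Count the writes that reach lower-level memory for a sequence of
    write hits to the given cache blocks, under each write policy."""
    if policy == "write through":
        return len(block_addrs)        # every write also goes to memory
    assert policy == "write back"
    dirty = set(block_addrs)           # blocks the writes leave dirty
    return len(dirty)                  # one write back per dirty block

# Four writes, three of them to the same block:
print(lower_level_writes([5, 5, 5, 9], "write through"))   # 4 memory writes
print(lower_level_writes([5, 5, 5, 9], "write back"))      # 2 write-backs
```

The gap widens with locality: the more writes cluster within a block, the more memory traffic write back saves.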
`
`
`
`
`
`
`414
`
`8.3 Caches
`
While either write-miss policy could be used with write through or write back,
generally write-back caches use write allocate (hoping that subsequent writes to
that block will be captured by the cache) and write-through caches often use no
write allocate (since subsequent writes to that block will still have to go to
memory).
`
An Example Cache: The VAX-11/780 Cache

To give substance to these ideas, Figure 8.11 shows the organization of the
cache on the VAX-11/780. The cache contains 8192 bytes of data in 8-byte
blocks with two-way-set-associative placement, random replacement, write
through with a one-word write buffer, and no write allocate on a write miss.

Let's trace a cache hit through the steps of a hit as labeled in Figure 8.11.
(The five steps are shown as circled numbers.) The address coming into the
cache is divided into two fields: the 29-bit block-frame address and 3-bit block
offset. The block-frame address is further divided into an address tag and cache
index. Step 1 shows this division.

The cache index selects the set to be tested to see if the block is in the cache.
(A set is one block from each bank in Figure 8.11.) The size of the index
depends on cache size, block size, and set associativity. In this case, a 9-bit
index results:

    Blocks/Bank = Cache size / (Block size × Set associativity) = 8192 / (8 × 2) = 512 = 2^9

In a two-way-set-associative cache, the index is sent to both banks. This is
step 2.

After reading an address tag from each bank, the tag portion of the block-
frame address is compared to the tags. This is step 3 in the figure. To be sure the
tag contains valid information, the valid bit must be set, or the results of the
comparison are ignored.

Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to
select the block from the matching set. Why can't both tags match? It is the job
of the replacement algorithm to make sure that an address appears in only one
block. To reduce the hit time, the data is read at the same time as the address
tags; thus, by the time the block multiplexer is ready, the data is also ready.

This step is needed in set-associative caches, but it can be omitted from
direct-mapped caches since there is no selection to be made. The multiplexer
used in this step can be on the critical timing path, endangering the clock cycle
time of the CPU. (The example on pages 418–419 and the fallacy on page 481
explore the trade-off of lower miss rates and higher clock cycle time.)

In the final step the word is sent to the CPU. All five steps occur within a
single CPU clock cycle.

What happens on a miss? The cache sends a stall signal to the CPU telling it
to wait, and two words (eight bytes) are read from memory. That takes 6 clock
cycles on the VAX-11/780 (ignoring bus interference). When the data arrive,
`
`
`
`
`
`
`
`
`
the cache must pick a block to replace; the VAX-11/780 selects one of the two
sets at random. Replacing a block means updating the data, the address tag, and
the valid bit. Once this is done, the cache goes through a regular hit cycle and
returns the data to the CPU.

Writes are more complicated in the VAX-11/780, as they are in any cache. If
the word to be written is in the cache, the first four steps are the same. The next
step is to write the data in the block, then write the changed-data portion into the
`
`[
`
`Block-trama Block
`oes bona
`|__Ta
`index!
`|
`
`Qo
`patho
`(512
`lacks)
`
`—
`
`Bank 1
`{512
`blocks)
`
`
`
`
`
`
`write back,
`'eNl writes to
`often use no
`ive to go to
`
`tion ofthe
`a
`in 8-byte
`Tent, write
`miss.
`Bure 8.1].
`2 into the
`\-bit block
`and cache
`
`he cache.
`he index
`a 9-be
`
`
`
`
`
`
`
`aS ae
`oa
`
`:
`
`
`| cpu
`
`address
`em Gn
`
`
`
`
`
`
`
`
FIGURE 8.11 The organization of the VAX-11/780 cache. The 8-KB cache is two-way
set associative with 8-byte blocks. It has 512 sets with two blocks per set; the set is
selected by the 9-bit index. The five steps of a read hit, shown as circled numbers in order
of occurrence, label this organization. The line from memory to the cache is used on a miss
to load the cache. Such multiplexing is not needed in a direct-mapped cache. Note that the
offset is connected to chip select of the data SRAMs to allow the proper words to be sent to
the 2:1 multiplexer.
`
`
`
`
cache. The VAX-11/780 uses no write allocate. Consequently, on a write miss
the CPU writes "around" the cache to lower-level memory and does not affect
the cache.

Since this is a write-through cache, the process isn't yet over. The word is
also sent to a one-word write buffer. If the write buffer is empty, the word and
full address are written in the buffer, and we are finished. The CPU continues
working while the write buffer writes the word to memory. If the buffer is full,
the cache (and CPU) must wait until the buffer is empty.
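The 9-bit index derived for the VAX-11/780 can be checked with a short sketch (a hypothetical helper, not from the book):

```python
import math

def index_bits(cache_size, block_size, associativity):
    """Width of the cache index in bits: log2 of the number of sets,
    where sets = cache size / (block size * associativity)."""
    sets = cache_size // (block_size * associativity)
    return int(math.log2(sets))

# VAX-11/780: 8192-byte cache, 8-byte blocks, two-way set associative.
print(index_bits(8192, 8, 2))   # 9 -> 512 sets, one block per bank
```

Doubling associativity at the same total size halves the number of sets, removing one index bit and adding one tag bit, which is the rightward shift of the tag/index boundary described earlier.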
`
Cache Performance

CPU time can be divided into the clock cycles the CPU spends executing the
program and the clock cycles the CPU spends waiting for the memory system.
Thus,

CPU time = (CPU-execution clock cycles + Memory-stall clock cycles) × Clock cycle time

To simplify evaluation of cache alternatives, sometimes designers assume
that all memory stalls are due to the cache. This is true for many machines; on
machines where this is not true, the cache still dominates stalls that are not
exclusively due to the cache. We use this simplifying assumption here, but it is
important to account for all memory stalls when calculating final performance.

The formula above raises the question whether the clock cycles for a cache
access should be considered part of CPU-execution clock cycles or part of mem-
ory-stall clock cycles. While either convention is defensible, the most widely
accepted is to include hit clock cycles in CPU-execution clock cycles.

Memory-stall clock cycles can then be defined in terms of the number of
memory accesses per program, miss penalty (in clock cycles), and miss rate for
reads and writes:

Memory-stall clock cycles = Reads/Program × Read miss rate × Read miss penalty
                          + Writes/Program × Write miss rate × Write miss penalty

We simplify the complete formula by combining the reads and writes together:

Memory-stall clock cycles = Memory accesses/Program × Miss rate × Miss penalty

Factoring instruction count (IC) from execution time and memory-stall clock
cycles, we now get a CPU-time formula that includes memory accesses per
instruction, miss rate, and miss penalty:

CPU time = IC × (CPI_Execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
`
`
`
`
`
`
`| write mise
`5 not affecs
`
`he word jx
`2 Word ane
`' continue:
`ffer is full.
`
`culing the
`¥ SVstem.
`
`yele time
`+ D53500e
`
`lines; on
`are mo
`
`but it is
`Rance"
`a Cache
`f mem
`widely
`
`iber of
`ate for
`
`Memory-Hierarchy Design
`
`417
`
Some designers prefer measuring miss rate as misses per instruction rather
than misses per memory reference:

Misses/Instruction = Miss rate × Memory accesses/Instruction

The advantage of this measure is that it is independent of the hardware
implementation. For example, the VAX-11/780 instruction unit can make
repeated references to a single byte (see Section 8.7), which can artificially
reduce the miss rate if measured as misses per memory reference rather than per
instruction executed. The drawback is that this measure is architecture
dependent, thus it is most popular with architects working with a single
computer family. They then use this version of the CPU-time formula:

CPU time = IC × (CPI_Execution + Misses/Instruction × Miss penalty) × Clock cycle time

We can now explore the consequences of caches on performance.
`
Example

Let's use the VAX-11/780 as a first example. The cache miss penalty is 6 clock
cycles, and all instructions normally take 8.5 clock cycles (ignoring memory
stalls). Assume the miss rate is 11%, and there is an average of 3.0 memory
references per instruction. What is the impact on performance when behavior of
the cache is included?

Answer

CPU time = IC × (CPI_Execution + Memory-stall clock cycles/Instruction) × Clock cycle time

The performance, including cache misses, is

CPU time_with cache = IC × (8.5 + 3.0 × 11% × 6) × Clock cycle time
                    = Instruction count × 10.5 × Clock cycle time

The clock cycle time and instruction count are the same, with or without a
cache, so CPU time increases with CPI from 8.5 to 10.5. Hence, the impact of
the memory hierarchy is to stretch the CPU time by 24%.
`
`
`Example
`
`Let's now calculate the impact on performance when behavior of the cache is
`included on a machine with a lower CPI. Assume that the cache miss penalty is
`10 eciock cycles and. on average, instructions take 1.5 clock cycles; the miss rate
`is 11%, and there is an average of |.4 memory references per instruction.
`
`Answer
`
`CPU time =IC * (crt
`
`sot
`
`Execution
`
`Memory-stall clock cycles
`Instruction
`
`;
`
`;
`
`) * Clock cycle time
`
`Qualcomm, Ex. 1009, Page 12
`
`Qualcomm, Ex. 1009, Page 12
`
`
`
`Making the same assumptions as in the previous example on cache hits, = =
`formance, including cache misses, 1s
`
`CPL UME ich ac IC # (1.5 + 1.4#11%#10) * Clock cycle ume
`
`= Instruction count*3.0#Clock cycle time
`
`The clock cycle time and instruction count are the same, with or with
`cache, so CPU time increases with CPI from 1.5 to 3.0. Including ©
`behavior doubles execution time.
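Both examples follow from the same formula, which can be replayed with a small helper (hypothetical, not from the book); the exact values are 10.48 and 3.04, which the text rounds to 10.5 and 3.0:

```python
def effective_cpi(cpi_exec, refs_per_instr, miss_rate, miss_penalty):
    """CPI including memory stalls:
    CPI_execution + memory accesses per instruction * miss rate * miss penalty."""
    return cpi_exec + refs_per_instr * miss_rate * miss_penalty

# First example: 8.5 base CPI, 3.0 refs/instruction, 11% miss rate, 6-cycle penalty.
print(effective_cpi(8.5, 3.0, 0.11, 6))    # ~10.48, rounded to 10.5 in the text
# Second example: 1.5 base CPI, 1.4 refs/instruction, 11% miss rate, 10-cycle penalty.
print(effective_cpi(1.5, 1.4, 0.11, 10))   # ~3.04, rounded to 3.0 in the text
```

The ratio of effective to base CPI shows the double-barreled effect: 10.48/8.5 is a 1.23× stretch, while 3.04/1.5 is a 2.0× stretch from the very same cache.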
`
As these examples illustrate, cache-behavior penalties range from significant
to enormous. Furthermore, cache misses have a double-barreled impact on a
CPU with a low CPI and a fast clock:

1. The lower the CPI, the more pronounced the impact is.

2. Independent of the CPU, main memories have similar memory-access times,
since they are built from the same memory chips. When calculating CPI, the
cache miss penalty is measured in CPU clock cycles needed for a miss.
Therefore, a higher CPU clock rate leads to a larger miss penalty, even if
main memories are the same speed.

The importance of the cache for CPUs with low CPI and high clock rates is thus
greater; and, consequently, greater is the danger of neglecting cache behavior in
assessing performance of such machines.

While minimizing average memory-access time is a reasonable goal and we
will use it in much of this chapter, keep in mind that the final goal is to reduce
CPU execution time.
`
Average memory-access time = Hit time + Miss rate × Miss penalty
`
Example

What is the impact of two different cache organizations on the performance of a
CPU? Assume that the CPI is normally 1.5 with a clock cycle time of 20 ns, that
there are 1.3 memory references per instruction, and that the size of both caches
is 64 KB. One cache is direct mapped and the other is two-way set associative.
Since the speed of the CPU is tied directly to the speed of the caches, assume the
CPU clock cycle time must be stretched 8.5% to accommodate the selection
multiplexer of the set-associative cache (step 4 in Figure 8.11 on page 415). To
the first approximation, the cache miss penalty is 200 ns for either cache
organization. (In practice it must be rounded up or down to an integer number of
clock cycles.) First, calculate the average memory-access time, and then CPU
performance.

Answer

Figure 8.12 on page 421 shows that the miss rate of a direct-mapped 64-KB
cache is 3.9% and the miss rate for a two-way-set-associative cache of the same
size is 3.0%. Average memory-access time is
`
`
`
`
`
`
`
Average memory-access time = Hit time + Miss rate × Miss penalty

Thus, the time for each organization is

Average memory-access time_1-way = 20 + .039 × 200 = 27.8 ns
Average memory-access time_2-way = 20 × 1.085 + .030 × 200 = 27.7 ns

The average memory-access time is better for the two-way-set-associative
cache.

CPU performance is

CPU time = IC × (CPI_Execution + Misses/Instruction × Miss penalty) × Clock cycle time
         = IC × (CPI_Execution × Clock cycle time
                 + Memory accesses/Instruction × Miss rate × Miss penalty × Clock cycle time)

Substituting 200 ns for (Miss penalty × Clock cycle time), the performance of
each cache organization is

CPU time_1-way = IC × (1.5 × 20 + 1.3 × 0.039 × 200) = 40.1 × IC
CPU time_2-way = IC × (1.5 × 20 × 1.085 + 1.3 × 0.030 × 200) = 40.4 × IC

and relative performance is

CPU time_2-way / CPU time_1-way = 40.4 × Instruction count / 40.1 × Instruction count = 1.01

In contrast to the results of average access-time comparison, the direct-mapped
cache leads to slightly better performance. Since CPU time is our bottom-line
evaluation (and direct mapped is simpler to build), the preferred cache is direct
mapped in this example. (See the fallacy on page 481 for more on this kind of
trade-off.)

The Three Sources of Cache Misses: Compulsory,
Capacity, and Conflicts

An intuitive model of cache behavior attributes all misses to one of three
sources:

■ Compulsory—The first access to a block is not in the cache, so the block
must be brought into the cache. These are also called cold-start misses or first-
reference misses.

■ Capacity—If the cache cannot contain all the blocks needed during execution
of a program, capacity misses will occur due to blocks being discarded and
later retrieved.

■ Conflict—If the block-placement strategy is set associative or direct mapped,
conflict misses (in addition to compulsory and capacity misses) will occur
because a block can be discarded and later retrieved if too many blocks map
to its set. These are also called collision misses.

Figure 8.12 shows the relative frequency of cache misses, broken down by
the "three Cs." To show the benefit of associativity, conflict misses are divided
into misses caused by each decrease in associativity. The categories are labeled
n-way, meaning the misses caused by going to the lower level of associativity
from the next one above. Here are the four categories:

8-way: from fully associative (no conflicts) to 8-way associative
4-way: from 8-way associative to 4-way associative
2-way: from 4-way associative to 2-way associative
1-way: from 2-way associative to 1-way associative (direct mapped)

Figure 8.13 (page 422) presents the same data graphically. The top graph shows
absolute miss rates; the bottom graph plots percentage of all the misses by cache
size.

Having identified the three Cs, what can a computer designer do about them?
Conceptually, conflicts are the easiest: Fully associative placement avoids all
conflict misses. Associativity is expensive in hardware, however, and may slow
access time (see the example above or the second fallacy in Section 8.10),
leading to lower overall performance. There is little to be done about capacity
except to buy larger memory chips. If the upper-level memory is much smaller
than what is needed for a program, and a significant percentage of the time is
spent moving data between two levels in the hierarchy, the memory hierarchy is
said to thrash. Because so many replacements are required, thrashing means the
machine runs close to the speed of the lower-level memory, or maybe even
slower due to the miss overhead. Making blocks larger reduces the number of
compulsory misses, but it can increase conflict misses.

The three Cs give insight into the cause of misses, but this simple model has
its limits. For example, increasing cache size reduces conflict misses as well as
capacity misses, since a larger cache spreads out references. Thus, a miss might
move from one category to the other as parameters change. The three Cs ignore
replacement policy, since it is difficult to model and since, in general, it is of less
significance. In specific circumstances the replacement policy can actually lead
to anomalous behavior, such as poorer miss rates for larger associativity, which
is directly contradictory to the three Cs model.
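The arithmetic of the direct-mapped versus two-way comparison can be replayed with two small helpers (hypothetical, not from the book); the exact figures are 40.14 and 40.35 ns per instruction, which the text rounds to 40.1 and 40.4:

```python
def avg_access_time(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory-access time = hit time + miss rate * miss penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns

def cpu_time_per_instr_ns(cpi, cycle_ns, refs_per_instr, miss_rate, penalty_ns):
    """CPU time per instruction in ns; the 200-ns penalty already folds in
    the clock cycle time, as in the example."""
    return cpi * cycle_ns + refs_per_instr * miss_rate * penalty_ns

# Direct mapped: 20-ns cycle; two-way: cycle stretched by 8.5%.
print(avg_access_time(20, 0.039, 200))                           # ~27.8 ns
print(avg_access_time(20 * 1.085, 0.030, 200))                   # ~27.7 ns
print(cpu_time_per_instr_ns(1.5, 20, 1.3, 0.039, 200))           # ~40.14 ns
print(cpu_time_per_instr_ns(1.5, 20 * 1.085, 1.3, 0.030, 200))   # ~40.35 ns
```

The sketch makes the reversal easy to see: the stretched clock cycle is paid on every instruction (the CPI × cycle term), while the lower miss rate only trims the stall term, so the direct-mapped cache wins on total CPU time despite losing on average access time.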
`
`
`
`
`
Cache size   Degree associative   Total miss rate   Compulsory   Capacity   Conflict
1 KB         1-way                0.191             0.009        0.141      0.042
1 KB         2-way                0.161             0.009        0.141      0.012
1 KB         4-way                0.152             0.009        0.141      0.003
1 KB         8-way                0.149             0.009        0.141      0.000
2 KB         1-way                0.148             0.009        0.103      0.036
2 KB         2-way                0.122             0.009        0.103      0.010
2 KB         4-way                0.115             0.009        0.103      0.003
2 KB         8-way                0.113             0.009        0.103      0.001
4 KB         1-way                0.109             0.009        0.073      0.027
4 KB         2-way                0.095             0.009        0.073      0.013
4 KB         4-way                0.087             0.009        0.073      0.005
4 KB         8-way                0.084             0.009        0.073      0.002
8 KB         1-way                0.087             0.009        0.052      0.026
8 KB         2-way                0.069             0.009        0.052      0.008
8 KB         4-way                0.065             0.009        0.052      0.004
8 KB         8-way                0.063             0.009        0.052      0.002
16 KB        1-way                0.066             0.009        0.038      0.019
16 KB        2-way                0.054             0.009        0.038      0.007