COMPUTER ARCHITECTURE
A QUANTITATIVE APPROACH

DAVID A. PATTERSON
JOHN L. HENNESSY
`
`
`
`
`Sponsoring Editor Bruce Spatz
`Production Manager Shirley Jowell
`Technical Writer Walker Cunningham
`Text Design Gary Head
Cover Design David Lance Goines
`Copy Editor Linda Medoff
`Proofreader Paul Medoff
`Computer Typesetting and Graphics Fifth Street Computer Services
`
Library of Congress Cataloging-in-Publication Data
Patterson, David A.
  Computer architecture : a quantitative approach / David A.
Patterson, John L. Hennessy.
    p. cm.
  Includes bibliographical references.
  ISBN 1-55860-069-8
  1. Computer architecture. 2. Electronic digital computers
--Design and construction. I. Hennessy, John L. II. Title.
QA76.9.A73P377 1990
004.2'2--dc20                                             89-85227
                                                               CIP

Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
Order from: P.O. Box 50490, Palo Alto, CA 94303-9953

©1990 by Morgan Kaufmann Publishers, Inc.
All rights reserved.
`
No part of this publication may be reproduced, stored in a retrieval system, or transmitted
in any form or by any means—electronic, mechanical, recording, or otherwise—without
the prior permission of the publisher.

All instruction sets and other design information of the DLX computer system contained
herein is copyrighted by the publisher and may not be incorporated in other publications
or distributed by media without formal acknowledgement and written consent from the
publisher. Use of the DLX in other publications for educational purposes is encouraged
and application for permission is welcomed.

ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or
intended for the authors should be addressed to the editorial offices of Morgan Kaufmann
Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged.
Any error sightings that are accepted for correction in subsequent printings will be
rewarded by the authors with a payment of $1.00 (U.S.) per correction. Electronic mail
can be sent to bugs@vsop.stanford.edu.

INSTRUCTOR SUPPORT: For information on classroom software and other instructor
materials available to adopters, please contact the editorial offices of Morgan Kaufmann
Publishers, Inc.
`
`
`
`
`8.3
`
`8.3 Caches
`
`Caches
`
`Cache: a safe placefar hiding or storing things.
`
`1
`
`
`ERIK Gin-Ema
`4 A 128 bytes
`I Hit time
`i — 4 clock cycles (normally 1)
`14in penalty
`8 — 32 clock cycles
`
`i
`(Access time)
`(6 — 10 clock cycles)
`
`
`i
`(Transfer timet — t2 — 12 clock :y—cles)
`Miss rate
`1% — 20%
`'— — _ __
`
`‘
`‘4'
`
`—
`
`
`
`
`
`Webster’s New World Dictionary of the American :-
`
`Secmid College Edition-rt r
`
`Cache is the name first choscn to represent the level of the memory hi
`
`between the CPE and main memory, and that is the dominant use of it:
`While the concept of caches is younger than the IBM 360 architecture, 1-
`-
`
`appear today in every class of computer and in some computers more than -
`In fact, the word has become so popular that it has replaced “buffer" in
`
`computcrrscience circles.
`The general
`terms defined in the prior section can be used for
`although the word line is often used instead of block. Figure 8.5 shut-1
`
`typical range of memory-hierarchy parameters for caches.
`
`
`
`
`
`- t.
`
`! Cache size
`
`1 KB — 256 KB
`
`FIGURE 8.5 Typical values of key memory-hierarchy parameters tor caches in
`workstations and minlcomputers.
`
Now let's examine caches in more detail by answering the four memory-
hierarchy questions.
`
`
`
`
`
`
`
Q1: Where Can a Block Be Placed in a Cache?

Restrictions on where a block is placed create three categories of cache
organization:

- If each block has only one place it can appear in the cache, the cache is said
  to be direct mapped. The mapping is usually (block-frame address) modulo
  (number of blocks in cache).

- If a block can be placed anywhere in the cache, the cache is said to be fully
  associative.
`
`
- If a block can be placed in a restricted set of places in the cache, the cache is
  said to be set associative. A set is a group of two or more blocks in the cache.
  A block is first mapped onto a set, and then the block can be placed anywhere
  within the set. The set is usually chosen by bit selection; that is, (block-frame
  address) modulo (number of sets in cache). If there are n blocks in a set, the
  cache placement is called n-way set associative.
`
The range of caches from direct mapped to fully associative is really a
continuum of levels of set associativity: Direct mapped is simply one-way set
associative and a fully associative cache with m blocks could be called m-way
set associative. Figure 8.6 shows where block 12 can be placed in a cache
according to the block-placement policy.
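The placement arithmetic is easy to state in code. The following C sketch is
ours rather than the text's; the helper names and the 8-block cache of Figure
8.6 are illustrative:

    #include <stdio.h>

    /* A sketch of the placement arithmetic for a cache of NBLOCKS blocks.
       Names and sizes are illustrative (8 blocks, as in Figure 8.6). */
    #define NBLOCKS 8

    /* Direct mapped: exactly one candidate block. */
    int direct_mapped_block(int block_frame_addr)
    {
        return block_frame_addr % NBLOCKS;
    }

    /* n-way set associative: one candidate set; the block may then go
       anywhere within that set. */
    int set_associative_set(int block_frame_addr, int nway)
    {
        int nsets = NBLOCKS / nway;
        return block_frame_addr % nsets;  /* bit selection when nsets is a power of 2 */
    }

    int main(void)
    {
        printf("direct mapped: block %d\n", direct_mapped_block(12)); /* 12 mod 8 = 4 */
        printf("two-way: set %d\n", set_associative_set(12, 2));      /* 12 mod 4 = 0 */
        printf("fully associative: set %d\n",
               set_associative_set(12, NBLOCKS));  /* one set: any of the 8 blocks */
        return 0;
    }

Note that calling the set computation with NBLOCKS ways collapses to a single
set, which is exactly the fully associative end of the continuum.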
`
`
`
FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The set-
associative organization has 4 sets with 2 blocks per set, called two-way set associative.
(Real caches contain hundreds of blocks and real memories contain hundreds of thousands
of blocks.) Assume that there is nothing in the cache and that the block-frame address in
question identifies lower-level block 12. The three options for caches are shown left to right.
In fully associative, block 12 from the lower level can go into any of the 8 blocks of the
cache. With direct mapped, block 12 can only be placed into block 4 (12 modulo 8). Set
associative, which has some of both features, allows the block to be placed anywhere in set
0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block
0 or block 1 of the cache.
`
`
`
`
Q2: How Is a Block Found If It Is in the Cache?

Caches include an address tag on each block that gives the block-frame address.
The tag of every cache block that might contain the desired information is
checked to see if it matches the block-frame address from the CPU. Figure 8.7
gives an example. Because speed is of the essence, all possible tags are searched
in parallel; serial search would make set associativity counterproductive.
`
`
`
`
`
`
FIGURE 8.7 In fully associative placement, the block for block-frame address 12 can
appear in any of the 8 blocks; thus, all 8 tags must be searched. The desired data is
found in cache block 6 in this example. In direct-mapped placement there is only one
block where memory block 12 can be found. In set-associative placement with 4 sets,
memory block 12 must be in set 0 (12 mod 4); thus, the tags of cache blocks 0 and 1 are
checked. In this case the data is found in cache block 1. Speed of cache access dictates
that searching must be performed in parallel for fully associative and set-associative
mappings.
`
`
`
`
`There must be a way to know that a cache block does not
`.._
`information. The most common procedure is to add a valid bit to the
`
`whether or not this entry contains a valid address. If the bit is not
`
`cannot be a match on this address.
`A common omission in finding the cost of caches is to forget the
`
`tag memory. One tag is required for each block. An advantage of -
`-
`
`block sizes is that the tag overhead per cache entry becomes a smaller +-
`
`the total cost of the cache.
`
Before proceeding to the next question, let's explore the relationship of the
CPU address to the cache. Figure 8.8 shows how an address is divided into three
fields to find data in a set-associative cache: the block-offset field used to select
the desired data from the block, the index field used to select the set, and the tag
field used for the comparison. While the comparison could be made on more of
the address than the tag, there is no need:
`
`
`
`
`
`
`
- Checking the index would be redundant, since it was used to select the set to
  be checked (an address stored in set 0, for example, must have 0 in the index
  field or it couldn't be stored in set 0).

- The offset is unnecessary in the comparison because all block offsets match
  and the entire block is present or not.

If the total size is kept the same, increasing associativity increases the number of
blocks per set, thereby decreasing the size of the index and increasing the size of
the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with
increasing associativity.
`
`
`
`
FIGURE 8.8 The 3 portions of an address in a set-associative or direct-mapped cache.
The tag is used to check all the blocks in the set and the index is used to select the set. The
block offset is the address of the desired data within the block.
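The field widths follow directly from the cache parameters: the offset is
log2(block size) bits and the index is log2(number of sets) bits. A small C
sketch (the parameters and names here are illustrative, not from the text)
makes the division concrete:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative parameters: a two-way 64-KB cache with 32-byte blocks,
       so 64 KB / (32 bytes * 2 ways) = 1024 sets. */
    #define BLOCK_SIZE  32
    #define NSETS       1024
    #define OFFSET_BITS 5    /* log2(BLOCK_SIZE) */
    #define INDEX_BITS  10   /* log2(NSETS) */

    int main(void)
    {
        uint32_t addr   = 0x12345678;
        uint32_t offset = addr & (BLOCK_SIZE - 1);             /* data within the block */
        uint32_t index  = (addr >> OFFSET_BITS) & (NSETS - 1); /* selects the set */
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);  /* compared with stored tags */

        printf("tag=0x%x index=%u offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        return 0;
    }

Increasing associativity at a fixed total size shrinks NSETS, which moves bits
from the index into the tag, just as the figure's tag/index boundary moves right.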
`
Q3: Which Block Should Be Replaced on a Cache Miss?

If the choice were between a block that has valid data and a block that doesn't,
then it would be easy to select which block to replace. Alas, the high hit rate of
caches means that the overwhelming decision is between blocks that have valid
data.
`
`
A benefit of direct-mapped placement is that hardware decisions are
simplified. In fact, so simple that there is no choice: Only one block is checked
for a hit, and only that block can be replaced. With fully associative or set-
associative placement, there are several blocks to choose from on a miss. There
are two primary strategies employed for selecting which block to replace:
`
- Random—To spread allocation uniformly, candidate blocks are randomly
  selected. Some systems use a scheme for spreading data across a set of blocks
  in a pseudorandomized manner to get reproducible behavior, which is
  particularly useful during hardware debugging.

- Least-recently used (LRU)—To reduce the chance of throwing out informa-
  tion that will be needed soon, accesses to blocks are recorded. The block
  replaced is the one that has been unused for the longest time. This makes use
  of a corollary of temporal locality: If recently used blocks are likely to be
  used again, then the best candidate for disposal is the least recently used.
  Figure 8.9 (page 412) shows which block is the least-recently used for a
  sequence of block-frame addresses in a fully associative memory hierarchy.
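To make the LRU bookkeeping concrete, here is a C sketch for one set. Real
caches only approximate LRU in hardware; the timestamp scheme and the names
below are our illustration, not the text's:

    /* LRU bookkeeping for one set of NWAY blocks, using access timestamps.
       (Counter wraparound is ignored for brevity.) */
    #define NWAY 4

    typedef struct {
        unsigned last_use[NWAY];  /* timestamp of each block's most recent access */
        unsigned clock;           /* incremented on every access to the set */
    } lru_state;

    /* Record an access (a hit) to block 'way' of the set. */
    void lru_touch(lru_state *s, int way)
    {
        s->last_use[way] = ++s->clock;
    }

    /* On a miss, the victim is the block unused for the longest time. */
    int lru_victim(const lru_state *s)
    {
        int victim = 0;
        for (int w = 1; w < NWAY; w++)
            if (s->last_use[w] < s->last_use[victim])
                victim = w;
        return victim;
    }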
`
`
`
`
`
A virtue of random is that it is simple to build in hardware. As the number of
blocks to keep track of increases, LRU becomes increasingly expensive and is
frequently only approximated. Figure 8.10 shows the difference in miss rates
between LRU and random replacement. Replacement policy plays a greater role
in smaller caches than in larger caches where there are more choices of what to
replace.
`
`
`
`
`
`
`
`
FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses in
a fully associative memory hierarchy. This assumes that there are 4 blocks and that in
the beginning the LRU block is number 0. The LRU block number is shown below each new
block reference. Another policy, first-in—first-out (FIFO), simply discards the block that was
used N unique accesses before, independent of its reference pattern in the last N - 1
references. Random replacement generally outperforms FIFO and it is easier to implement.
`
`
`
`
`
`
`
Associativity:        2-way              4-way              8-way
Size            LRU      Random     LRU      Random     LRU      Random
16 KB           5.18%    5.69%      4.67%    5.29%      4.39%    4.96%
64 KB           1.88%    2.01%      1.54%    1.66%      1.39%    1.53%
256 KB          1.15%    1.17%      1.13%    1.13%      1.12%    1.12%

FIGURE 8.10 Miss rates comparing least-recently used versus random replacement
for several sizes and associativities. This data was collected for a block size of 16 bytes
using one of the VAX traces containing user and operating system code (SAVED). This
trace is included in the software supplement for course use. There is little difference
between LRU and random for larger-size caches in this trace.
`
Q4: What Happens on a Write?

Reads dominate cache accesses. All instruction accesses are reads, and most
instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of 9%
stores and 17% loads for four DLX programs, making writes less than 10% of
the memory traffic. Making the common case fast means optimizing caches for
reads, but Amdahl's Law reminds us that high-performance designs cannot
neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block
can be read at the same time that the tag is read and compared, so the block read
begins as soon as the block-frame address is available. If the read is a hit, the
block is passed on to the CPU immediately. If it is a miss, there is no benefit—
but also no harm.
`
`
`
Such is not the case for writes. The processor specifies the size of the write,
usually between 1 and 8 bytes; only that portion of a block can be changed. In
general this means a read-modify-write sequence of operations on the block:
read the original block, modify one portion, and write the new block value.
Moreover, modifying a block cannot begin until the tag is checked to see if it is
a hit. Because tag checking cannot occur in parallel, writes normally take
longer than reads.

Thus, it is the write policies that distinguish many cache designs. There are
two basic options when writing to the cache:
`
- Write through (or store through)—The information is written to both the
  block in the cache and to the block in the lower-level memory.

- Write back (also called copy back or store in)—The information is written
  only to the block in the cache. The modified cache block is written to main
  memory only when it is replaced.
`
Write-back cache blocks are called clean or dirty, depending on whether the
information in the cache differs from that in lower-level memory. To reduce the
frequency of writing back blocks on replacement, a feature called the dirty bit is
commonly used. This status bit indicates whether or not the block was modified
while in the cache. If it wasn't, the block is not written, since the lower level has
the same information as the cache.
`
Both write back and write through have their advantages. With write back,
writes occur at the speed of the cache memory, and multiple writes within a
block require only one write to the lower-level memory. Since every write
doesn't go to memory, write back uses less memory bandwidth, making write
back attractive in multiprocessors. With write through, read misses don't result
in writes to the lower level, and write through is easier to implement than write
back. Write through also has the advantage that main memory has the most
current copy of the data. This is important in multiprocessors and for I/O, which
we shall examine in Section 8.8. Hence, multiprocessors want write back to
reduce the memory traffic per processor and write through to keep the cache and
memory consistent.

When the CPU must wait for writes to complete during write throughs, the
CPU is said to write stall. A common optimization to reduce write stalls is a
write buffer, which allows the processor to continue while the memory is
updated. As we shall see in Section 8.8, write stalls can occur even with write
buffers.
`
There are two options on a write miss:

- Write allocate (also called fetch on write)—The block is loaded, followed by
  the write-hit actions above. This is similar to a read miss.

- No write allocate (also called write around)—The block is modified in the
  lower level and not loaded into the cache.
`
`
While either write-miss policy could be used with write through or write back,
generally write-back caches use write allocate (hoping that subsequent writes to
that block will be captured by the cache) and write-through caches often use no
write allocate (since subsequent writes to that block will still have to go to
memory).
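The interplay of the two write-hit policies and the two write-miss policies can
be summarized in a short C sketch. The structure and function names below are
our own illustration, not from the text:

    #include <stdbool.h>

    /* One cache block; data storage is omitted for brevity. */
    typedef struct { bool valid, dirty; unsigned tag; } cache_block;

    /* write_back selects write back vs. write through on a hit;
       write_allocate selects the action on a miss. */
    void handle_write(cache_block *blk, unsigned tag,
                      bool write_back, bool write_allocate)
    {
        bool hit = blk->valid && blk->tag == tag;

        if (!hit && write_allocate) {
            /* Write allocate (fetch on write): load the block as on a read miss, */
            /* ...fetch the block from the lower level here...                    */
            blk->valid = true;
            blk->tag   = tag;
            hit = true;          /* then perform the write-hit actions below */
        }

        if (hit) {
            /* ...modify the addressed portion of the block... */
            if (write_back)
                blk->dirty = true;  /* lower level updated only on replacement */
            /* else write through: also send the word to the lower level,
               typically via a write buffer so the CPU need not stall */
        }
        /* Miss with no write allocate: write "around" the cache directly
           to the lower level, leaving the cache unchanged. */
    }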
`
An Example Cache: The VAX-11/780 Cache

To give substance to these ideas, Figure 8.11 shows the organization of the
cache on the VAX-11/780. The cache contains 8192 bytes of data in 8-byte
blocks with two-way-set-associative placement, random replacement, write
through with a one-word write buffer, and no write allocate on a write miss.

Let's trace a cache hit through the steps of a hit as labeled in Figure 8.11.
(The five steps are shown as circled numbers.) The address coming into the
cache is divided into two fields: the 29-bit block-frame address and 3-bit block
offset. The block-frame address is further divided into an address tag and cache
index. Step 1 shows this division.

The cache index selects the set to be tested to see if the block is in the cache.
(A set is one block from each bank in Figure 8.11.) The size of the index
depends on cache size, block size, and set associativity. In this case, a 9-bit
index results:
`
Blocks/Bank = Cache size / (Block size * Set associativity) = 8192 / (8 * 2) = 512 = 2^9
In a two-way-set-associative cache, the index is sent to both banks. This is
step 2.

After reading an address tag from each bank, the tag portion of the block-
frame address is compared to the tags. This is step 3 in the figure. To be sure the
tag contains valid information, the valid bit must be set, or the results of the
comparison are ignored.

Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to
select the block from the matching set. Why can't both tags match? It is the job
of the replacement algorithm to make sure that an address appears in only one
block. To reduce the hit time, the data is read at the same time as the address
tags; thus, by the time the block multiplexer is ready, the data is also ready.
This step is needed in set-associative caches, but it can be omitted from
direct-mapped caches since there is no selection to be made. The multiplexer
used in this step can be on the critical timing path, endangering the clock cycle
time of the CPU. (The example on pages 418-419 and the fallacy on page 481
explore the trade-off of lower miss rates and higher clock cycle time.)

In the final step the word is sent to the CPU. All five steps occur within a
single CPU clock cycle.

What happens on a miss? The cache sends a stall signal to the CPU telling it
to wait, and two words (eight bytes) are read from memory. That takes 6 clock
cycles on the VAX-11/780 (ignoring bus interference). When the data arrives,
`
`
`
`
the cache must pick a block to replace; the VAX-11/780 selects one of the two
`sets at random. Replacing a block means updating the data, the address tag, and
`the valid bit. Once this is done, the cache goes through a regular hit cycle and
`returns the data to the CPU.
`
Writes are more complicated in the VAX-11/780, as they are in any cache. If
the word to be written is in the cache, the first four steps are the same. The next
step is to write the data in the block, then write the changed-data portion into the
cache.
`
`
`
`
`
FIGURE 8.11 The organization of the VAX-11/780 cache. The 8-KB cache is two-way
set associative with 8-byte blocks. It has 512 sets with two blocks per set; the set is
selected by the 9-bit index. The five steps of a read hit, shown as circled numbers in order
of occurrence, label this organization. The line from memory to the cache is used on a miss
to load the cache. Such multiplexing is not needed in a direct-mapped cache. Note that the
offset is connected to chip select of the data SRAMs to allow the proper words to be sent to
the 2:1 multiplexer.
`
`
`
`
The VAX-11/780 uses no write allocate. Consequently, on a write miss
the CPU writes "around" the cache to lower-level memory and does not affect
the cache.

Since this is a write-through cache, the process isn't yet over. The word is
also sent to a one-word write buffer. If the write buffer is empty, the word and
full address are written in the buffer, and we are finished. The CPU continues
working while the write buffer writes the word to memory. If the buffer is full,
the cache (and CPU) must wait until the buffer is empty.
`
Cache Performance

CPU time can be divided into the clock cycles the CPU spends executing the
program and the clock cycles the CPU spends waiting for the memory system.
Thus,

CPU time = (CPU-execution clock cycles + Memory-stall clock cycles) * Clock cycle time

To simplify evaluation of cache alternatives, sometimes designers assume
that all memory stalls are due to the cache. This is true for many machines; on
machines where this is not true, the cache still dominates stalls that aren't
exclusively due to the cache. We use this simplifying assumption here, but it is
important to account for all memory stalls when calculating final performance.

The formula above raises the question whether the clock cycles for a cache
access should be considered part of CPU-execution clock cycles or part of mem-
ory-stall clock cycles. While either convention is defensible, the most widely
accepted is to include hit clock cycles in CPU-execution clock cycles.

Memory-stall clock cycles can then be defined in terms of the number of
memory accesses per program, miss penalty (in clock cycles), and miss rate for
reads and writes:
`
Memory-stall clock cycles = Reads/Program * Read miss rate * Read miss penalty
                          + Writes/Program * Write miss rate * Write miss penalty

We simplify the complete formula by combining the reads and writes together:

Memory-stall clock cycles = Memory accesses/Program * Miss rate * Miss penalty
`
Factoring instruction count (IC) from execution time and memory-stall clock
cycles, we now get a CPU-time formula that includes memory accesses per
instruction, miss rate, and miss penalty:

CPU time = IC * (CPI_Execution + Memory accesses/Instruction * Miss rate * Miss penalty)
              * Clock cycle time
`
`
`
`
`
`
`
Some designers prefer measuring miss rate as misses per instruction rather
than misses per memory reference:

Misses/Instruction = Memory accesses/Instruction * Miss rate

The advantage of this measure is that it is independent of the hardware
implementation. For example, the VAX-11/780 instruction unit can make
repeated references to a single byte (see Section 8.9), which can artificially
reduce the miss rate if measured as misses per memory reference rather than per
instruction executed. The drawback is that this measure is architecture
dependent, thus it is most popular with architects working with a single
computer family. They then use this version of the CPU-time formula:
`
CPU time = IC * (CPI_Execution + Misses/Instruction * Miss penalty) * Clock cycle time

We can now explore the consequences of caches on performance.
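The conversion between the two measures is a single multiplication; as a
one-function C sketch (the name is ours):

    /* Misses per instruction from memory accesses per instruction and the
       miss rate per memory reference. */
    double misses_per_instruction(double accesses_per_instr, double miss_rate)
    {
        return accesses_per_instr * miss_rate;  /* e.g., 3.0 * 0.11 = 0.33 */
    }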
`
Example

Let's use the VAX-11/780 as a first example. The cache miss penalty is 6 clock
cycles, and all instructions normally take 8.5 clock cycles (ignoring memory
stalls). Assume the miss rate is 11%, and there is an average of 3.0 memory
references per instruction. What is the impact on performance when behavior of
the cache is included?
`
Answer

CPU time = IC * (CPI_Execution + Memory-stall clock cycles/Instruction) * Clock cycle time

The performance, including cache misses, is

CPU time with cache = IC * (8.5 + 3.0 * 11% * 6) * Clock cycle time
                    = Instruction count * 10.5 * Clock cycle time

The clock cycle time and instruction count are the same, with or without a
cache, so CPU time increases with CPI from 8.5 to 10.5. Hence, the impact of
the memory hierarchy is to stretch the CPU time by 24%.
`
`
Example

Let's now calculate the impact on performance when behavior of the cache is
included on a machine with a lower CPI. Assume that the cache miss penalty is
10 clock cycles and, on average, instructions take 1.5 clock cycles; the miss rate
is 11%, and there is an average of 1.4 memory references per instruction.
`
Answer

CPU time = IC * (CPI_Execution + Memory-stall clock cycles/Instruction) * Clock cycle time

Making the same assumptions as in the previous example on cache hits, the per-
formance, including cache misses, is

CPU time with cache = IC * (1.5 + 1.4 * 11% * 10) * Clock cycle time
                    = Instruction count * 3.0 * Clock cycle time

The clock cycle time and instruction count are the same, with or without a
cache, so CPU time increases with CPI from 1.5 to 3.0. Including cache
behavior doubles execution time.
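Both examples reduce to evaluating the CPI term in the formula above. A C
sketch checking the arithmetic (the numbers are the examples'; the code is our
illustration):

    #include <stdio.h>

    int main(void)
    {
        /* First example: VAX-11/780 with CPI 8.5, 3.0 references/instruction,
           11% miss rate, 6-cycle miss penalty. */
        double cpi1 = 8.5 + 3.0 * 0.11 * 6;   /* = 10.48, i.e., ~10.5 */
        /* Second example: CPI 1.5, 1.4 references/instruction,
           11% miss rate, 10-cycle miss penalty. */
        double cpi2 = 1.5 + 1.4 * 0.11 * 10;  /* = 3.04, i.e., ~3.0 */

        printf("effective CPI: %.1f and %.1f\n", cpi1, cpi2);
        printf("slowdowns: %.0f%% and %.0f%%\n",
               (cpi1 / 8.5 - 1) * 100,  /* prints 23; the text's 24% uses the rounded 10.5 */
               (cpi2 / 1.5 - 1) * 100); /* prints 103: execution time roughly doubles */
        return 0;
    }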
`
As these examples illustrate, cache-behavior penalties range from significant
to enormous. Furthermore, cache misses have a double-barreled impact on a
CPU with a low CPI and a fast clock:

1. The lower the CPI, the more pronounced the impact is.

2. Independent of the CPU, main memories have similar memory-access times,
   since they are built from the same memory chips. When calculating CPI, the
   cache miss penalty is measured in CPU clock cycles needed for a miss.
   Therefore, a higher CPU clock rate leads to a larger miss penalty, even if
   main memories are the same speed.

The importance of the cache for CPUs with low CPI and high clock rates is thus
greater; and, consequently, greater is the danger of neglecting cache behavior in
assessing performance of such machines.

Average memory-access time = Hit time + Miss rate * Miss penalty

While minimizing average memory-access time is a reasonable goal and we
will use it in much of this chapter, keep in mind that the final goal is to reduce
CPU execution time.
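The measure is simple enough to state as a one-line C helper (the name is
ours; all three arguments must be in the same time unit, or all in clock
cycles):

    /* Average memory-access time = Hit time + Miss rate * Miss penalty. */
    double average_memory_access_time(double hit_time,
                                      double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }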
`
Example

What is the impact of two different cache organizations on the performance of a
CPU? Assume that the CPI is normally 1.5 with a clock cycle time of 20 ns, that
there are 1.3 memory references per instruction, and that the size of both caches
is 64 KB. One cache is direct mapped and the other is two-way set associative.
Since the speed of the CPU is tied directly to the speed of the caches, assume the
CPU clock cycle time must be stretched 8.5% to accommodate the selection
multiplexer of the set-associative cache (step 4 in Figure 8.11 on page 415). To
the first approximation, the cache miss penalty is 200 ns for either cache
organization. (In practice it must be rounded up or down to an integer number of
clock cycles.) First, calculate the average memory-access time, and then CPU
performance.
`
`-
`
`_
`
`Figure 8.12 on page 421 ShtWJs that the miss rate of a direct-mapped 6-1- -_
`cache is 3.9% and the miss rate for a two-wayfiset-assoeiative cache of the
`size is 3.0%. Average memory-access time is
`
`Qualcomm, Ex. 1009, Page 13
`
`Qualcomm, Ex. 1009, Page 13
`
`
`
`
`
`
Average memory-access time = Hit time + Miss rate * Miss penalty

Thus, the time for each organization is

Average memory-access time 1-way = 20 + .039*200 = 27.8 ns
Average memory-access time 2-way = 20*1.085 + .030*200 = 27.7 ns

The average memory-access time is better for the two-way-set-associative
cache.

CPU performance is

CPU time = IC * (CPI_Execution + Misses/Instruction * Miss penalty) * Clock cycle time

         = IC * (CPI_Execution * Clock cycle time
                + Memory accesses/Instruction * Miss rate * Miss penalty * Clock cycle time)

Substituting 200 ns for (Miss penalty * Clock cycle time), the performance of
each cache organization is

CPU time 1-way = IC * (1.5*20 + 1.3*0.039*200) = 40.1 * IC
CPU time 2-way = IC * (1.5*20*1.085 + 1.3*0.030*200) = 40.4 * IC

and relative performance is

CPU time 2-way / CPU time 1-way = 40.4 * Instruction count / (40.1 * Instruction count) = 1.01

In contrast to the results of average access-time comparison, the direct-mapped
cache leads to slightly better performance. Since CPU time is our bottom-line
evaluation (and direct mapped is simpler to build), the preferred cache is direct
mapped in this example. (See the fallacy on page 481 for more on this kind of
trade-off.)
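The entire comparison can be checked with a few lines of C; this sketch uses
the example's numbers and illustrative names:

    #include <stdio.h>

    static double amat(double hit, double miss_rate, double penalty)
    {
        return hit + miss_rate * penalty;
    }

    int main(void)
    {
        /* Direct mapped: 20-ns cycle; two-way: cycle stretched by 8.5%. */
        double amat1 = amat(20.0,         0.039, 200.0);  /* 27.8 ns */
        double amat2 = amat(20.0 * 1.085, 0.030, 200.0);  /* 27.7 ns */

        /* CPU time per instruction: CPI 1.5, 1.3 memory references/instr. */
        double t1 = 1.5 * 20.0         + 1.3 * 0.039 * 200.0;  /* 40.14 ns */
        double t2 = 1.5 * 20.0 * 1.085 + 1.3 * 0.030 * 200.0;  /* 40.35 ns; the text rounds to 40.4 */

        printf("AMAT: 1-way %.1f ns, 2-way %.1f ns\n", amat1, amat2);
        printf("CPU time/instr: 1-way %.1f ns, 2-way %.1f ns (ratio %.2f)\n",
               t1, t2, t2 / t1);  /* ratio ~1.01: two-way is slightly slower */
        return 0;
    }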
`
`1
`I?
`i
`'-
`
`E .
`
`r
`
`
`
`
`
`
`
`
`
`5] Ck
`-1:r_'
`..
`-
`-
`'-
`'
`“
`
`
`(3
`1
`-
`(.mnprrlwry—The first access to .r block is lit-)1 in Pb: .11 -
`must be brought into the cache. These are Malia-.1
`_'
`"Hint Y GUM”
`
`
`rqfi't'rm‘e mines.
`("rtnat'l't'ft'rlf the each-I: Cannot cunt-b1 a!” -.:-_- run-ts ncedect dugngnexecunon
`
`
`
`of a FmgTam. capacity misses Mi! "LU—7 due I“ 1310““ being '
`
`later retrieved.
`
`
`The Three Sources of Cache Misses: cmlluwl
`Capacity, and Conflicts
`
`.
`
`An intuitive model of Cache behavior attributes all misses teed!
`sources;
`
`a! .i__.
`_..
`
`-
`
`.
`
`
`
`
`
`
`
`
- Conflict—If the block-placement strategy is set associative or direct mapped,
  conflict misses (in addition to compulsory and capacity misses) will occur
  because a block can be discarded and later retrieved if too many blocks map
  to its set. These are also called collision misses.
`
Figure 8.12 shows the relative frequency of cache misses, broken down by
the "three Cs." To show the benefit of associativity, conflict misses are divided
into misses caused by each decrease in associativity. The categories are labeled
n-way, meaning the misses caused by going to the lower level of associativity
from the next one above. Here are the four categories:

8-way: from fully associative (no conflicts) to 8-way associative

4-way: from 8-way associative to 4-way associative

2-way: from 4-way associative to 2-way associative

1-way: from 2-way associative to 1-way associative (direct mapped)

Figure 8.13 (page 422) presents the same data graphically. The top graph shows
absolute miss rates; the bottom graph plots percentage of all the misses by cache
size.
`
Having identified the three Cs, what can a computer designer do about them?
Conceptually, conflicts are the easiest: Fully associative placement avoids all
conflict misses. Associativity is expensive in hardware, however, and may slow
access time (see the example above or the second fallacy in Section 8.10),
leading to lower overall performance. There is little to be done about capacity
except to buy larger memory chips. If the upper-level memory is much smaller
than what is needed for a program, and a significant percentage of the time is
spent moving data between two levels in the hierarchy, the memory hierarchy is
said to thrash. Because so many replacements are required, thrashing means the
machine runs close to the speed of the lower-level memory, or maybe even
slower due to the miss overhead. Making blocks larger reduces the number of
compulsory misses, but it can increase conflict misses.

The three Cs give insight into the cause of misses, but this simple model has
its limits. For example, increasing cache size reduces conflict misses as well as
capacity misses, since a larger cache spreads out references. Thus, a miss might
move from one category to the other as parameters change. Three Cs ignore
replacement policy, since it is difficult to model and since, in general, it is of less
significance. In specific circumstances the replacement policy can actually lead
to anomalous behavior, such as poorer miss rates for larger associativity, which
is directly contradictory to the three Cs model.
`
`
`
Cache size  Degree       Total      Miss-rate components (relative percent)
            associative  miss rate  (Sum = 100% of total miss rate)
                                    Compulsory      Capacity        Conflict
1 KB        1-way        0.191      0.009    5%     0.141   73%     0.041   22%
1 KB        2-way        0.161      0.009    6%     0.141   87%     0.011    7%
1 KB        4-way        0.152      0.009    6%     0.141   92%     0.002    2%
1 KB        8-way        0.149      0.009    6%     0.141   94%     0.000    0%
2 KB        1-way        0.148      0.009    6%     0.103   70%     0.036   24%
2 KB        2-way        0.122      0.009    7%     0.103   84%     0.010    8%
2 KB        4-way        0.115      0.009    8%     0.103   90%     0.003    2%
...