COMPUTER ARCHITECTURE
A QUANTITATIVE APPROACH

DAVID A. PATTERSON
JOHN L. HENNESSY
Sponsoring Editor Bruce Spatz
Production Manager Shirley Jowell
Technical Writer Walker Cunningham
Text Design Gary Head
Cover Design David Lance Goines
Copy Editor Linda Medoff
Proofreader Paul Medoff
Computer Typesetting and Graphics Fifth Street Computer Services
Library of Congress Cataloging-in-Publication Data

Patterson, David A.
    Computer architecture : a quantitative approach / David A.
  Patterson, John L. Hennessy.
        p.   cm.
    Includes bibliographical references.
    ISBN 1-55860-069-8
    1. Computer architecture.  2. Electronic digital computers—Design
  and construction.  I. Hennessy, John L.  II. Title.
  QA76.9.A73P377  1990                                    89-85227
  004.2'2—dc20                                                 CIP
Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
Order from: P.O. Box 50490, Palo Alto, CA 94303-9953
©1990 by Morgan Kaufmann Publishers, Inc.
All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, recording, or otherwise—without the prior permission of the publisher.

All instruction sets and other design information of the DLX computer system contained herein is copyrighted by the publisher and may not be incorporated in other publications or distributed by media without formal acknowledgement and written consent from the publisher. Use of the DLX in other publications for educational purposes is encouraged and application for permission is welcomed.

ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or intended for the authors should be addressed to the editorial offices of Morgan Kaufmann Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged. Any error sightings that are accepted for correction in subsequent printings will be rewarded by the authors with a payment of $1.00 (U.S.) per correction. Electronic mail can be sent to bugs@vsop.stanford.edu.

INSTRUCTOR SUPPORT: For information on classroom software and other instructor materials available to adopters, please contact the editorial offices of Morgan Kaufmann Publishers, Inc.
8.3 Caches
Cache: a safe place for hiding or storing things.

Webster's New World Dictionary of the American Language, Second College Edition
Cache is the name first chosen to represent the level of the memory hierarchy between the CPU and main memory, and that is the dominant use of the word. While the concept of caches is younger than the IBM 360 architecture, caches appear today in every class of computer and in some computers more than once. In fact, the word has become so popular that it has replaced "buffer" in many computer-science circles. The general terms defined in the prior section can be used for caches, although the word line is often used instead of block. Figure 8.5 shows the typical range of memory-hierarchy parameters for caches.
Block (line) size     4 – 128 bytes
Hit time              1 – 4 clock cycles (normally 1)
Miss penalty          8 – 32 clock cycles
  (Access time)       (6 – 10 clock cycles)
  (Transfer time)     (2 – 22 clock cycles)
Miss rate             1% – 20%
Cache size            1 KB – 256 KB

FIGURE 8.5 Typical values of key memory-hierarchy parameters for caches in workstations and minicomputers.
Now let's examine caches in more detail by answering the four memory-hierarchy questions.
Q1: Where Can a Block Be Placed in a Cache?
Restrictions on where a block is placed create three categories of cache organization:

- If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The mapping is usually (block-frame address) modulo (number of blocks in cache).

- If a block can be placed anywhere in the cache, the cache is said to be fully associative.
- If a block can be placed in a restricted set of places in the cache, the cache is said to be set associative. A set is a group of two or more blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within the set. The set is usually chosen by bit selection; that is, (block-frame address) modulo (number of sets in cache). If there are n blocks in a set, the cache placement is called n-way set associative.
The range of caches from direct mapped to fully associative is really a continuum of levels of set associativity: Direct mapped is simply one-way set associative, and a fully associative cache with m blocks could be called m-way set associative. Figure 8.6 shows where block 12 can be placed in a cache according to the block-placement policy.
FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The set-associative organization has 4 sets with 2 blocks per set, called two-way set associative. (Real caches contain hundreds of blocks and real memories contain hundreds of thousands of blocks.) Assume that there is nothing in the cache and that the block-frame address in question identifies lower-level block 12. The three options for caches are shown left to right. In fully associative, block 12 from the lower level can go into any of the 8 blocks of the cache. With direct mapped, block 12 can only be placed into block 4 (12 modulo 8). Set associative, which has some of both features, allows the block to be placed anywhere in set 0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block 0 or block 1 of the cache.
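The bit-selection mappings of Figure 8.6 are simple enough to state in a few lines of C. The following sketch is not from the text; it simply plugs block-frame address 12 into the 8-block cache of the figure:

    #include <stdio.h>

    /* A minimal sketch of bit-selection placement for the cache of
       Figure 8.6: direct mapped uses (block-frame address) modulo
       (number of blocks); two-way set associative uses modulo
       (number of sets). */
    int main(void) {
        unsigned block_frame_address = 12;
        unsigned cache_blocks = 8;
        unsigned sets_two_way = 4;   /* 8 blocks / 2 blocks per set */

        printf("direct mapped: block %u\n",
               block_frame_address % cache_blocks);   /* 12 mod 8 = 4 */
        printf("two-way set associative: set %u\n",
               block_frame_address % sets_two_way);   /* 12 mod 4 = 0 */
        return 0;
    }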
Q2: How Is a Block Found If It Is in the Cache?
Caches include an address tag on each block that gives the block-frame address. The tag of every cache block that might contain the desired information is checked to see if it matches the block-frame address from the CPU. Figure 8.7 gives an example. Because speed is of the essence, all possible tags are searched in parallel; serial search would make set associativity counterproductive.
FIGURE 8.7 In fully associative placement, the block for block-frame address 12 can appear in any of the 8 blocks; thus, all 8 tags must be searched. The desired data is found in cache block 6 in this example. In direct-mapped placement there is only one cache block where memory block 12 can be found. In set-associative placement, with 4 sets, memory block 12 must be in set 0 (12 mod 4); thus, the tags of cache blocks 0 and 1 are checked. In this case the data is found in cache block 1. Speed of cache access dictates that searching must be performed in parallel for fully associative and set-associative mappings.
There must be a way to know that a cache block does not have valid information. The most common procedure is to add a valid bit to the tag to say whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address.
A common omission in finding the cost of caches is to forget the cost of the tag memory. One tag is required for each block. An advantage of larger block sizes is that the tag overhead per cache entry becomes a smaller fraction of the total cost of the cache.
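As a rough illustration of this point, the following C sketch estimates tag overhead for a hypothetical 64-KB direct-mapped cache with 32-bit block-frame addressing. The parameters are assumptions chosen for illustration, not figures from the text:

    #include <stdio.h>

    /* Tag overhead versus block size for an assumed 64-KB direct-mapped
       cache and 32-bit addresses: bigger blocks mean fewer tags, so the
       tag storage shrinks relative to the data storage. */
    int main(void) {
        unsigned cache_size = 64 * 1024;   /* bytes of data */
        for (unsigned block = 4; block <= 128; block *= 2) {
            unsigned blocks = cache_size / block;
            unsigned offset_bits = 0, index_bits = 0;
            while ((1u << offset_bits) < block)  offset_bits++;
            while ((1u << index_bits)  < blocks) index_bits++;
            unsigned tag_bits = 32 - offset_bits - index_bits;
            double overhead = (double)(tag_bits * blocks) / (8.0 * cache_size);
            printf("block %3u bytes: %2u-bit tag, tags are %.1f%% of data bits\n",
                   block, tag_bits, overhead * 100);
        }
        return 0;
    }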
Before proceeding to the next question, let's explore the relationship of a CPU address to the cache. Figure 8.8 shows how an address is divided into three fields to find data in a set-associative cache: the block-offset field used to select the desired data from the block, the index field used to select the set, and the tag field used for the comparison. While the comparison could be made on more of the address than the tag, there is no need:
- Checking the index would be redundant, since it was used to select the set to be checked (an address stored in set 0, for example, must have 0 in the index field or it couldn't be stored in set 0).

- The offset is unnecessary in the comparison because the entire block is present or not, and hence all block offsets match.
If the total size is kept the same, increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with increasing associativity.
FIGURE 8.8 The 3 portions of an address in a set-associative or direct-mapped cache. The tag is used to check all the blocks in the set, and the index is used to select the set. The block offset is the address of the desired data within the block.
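The division in Figure 8.8 is a matter of shifting and masking. Here is a small C sketch, assuming the VAX-11/780 parameters used later in this section (3 offset bits for 8-byte blocks, 9 index bits for 512 sets); the sample address is made up:

    #include <stdint.h>
    #include <stdio.h>

    /* Splitting an address into tag, index, and block offset, assuming
       8-byte blocks (3 offset bits) and 512 sets (9 index bits). */
    #define OFFSET_BITS 3
    #define INDEX_BITS  9

    int main(void) {
        uint32_t addr   = 0x12345678;                          /* hypothetical address */
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }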
Q3: Which Block Should Be Replaced on a Cache Miss?
If the choice were between a block that has valid data and a block that doesn't, then it would be easy to select which block to replace. Alas, the high hit rate of caches means that the overwhelming decision is between blocks that have valid data.
A benefit of direct-mapped placement is that hardware decisions are simplified. In fact, so simple that there is no choice: Only one block is checked for a hit, and only that block can be replaced. With fully associative or set-associative placement, there are several blocks to choose from on a miss. There are two primary strategies employed for selecting which block to replace:
- Random—To spread allocation uniformly, candidate blocks are randomly selected. Some systems use a scheme for spreading data across a set of blocks in a pseudorandomized manner to get reproducible behavior, which is particularly useful during hardware debugging.
- Least-recently used (LRU)—To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. The block replaced is the one that has been unused for the longest time. This makes use of a corollary of temporal locality: If recently used blocks are likely to be used again, then the best candidate for disposal is the least recently used. Figure 8.9 (page 412) shows which block is the least-recently used for a sequence of block-frame addresses in a fully associative memory hierarchy.
A virtue of random is that it is simple to build in hardware. As the number of blocks to keep track of increases, LRU becomes increasingly expensive and is frequently only approximated. Figure 8.10 shows the difference in miss rates between LRU and random replacement. Replacement policy plays a greater role in smaller caches than in larger caches where there are more choices of what to replace.
FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses in a fully associative memory hierarchy. This assumes that there are 4 blocks and that in the beginning the LRU block is number 0. The LRU block number is shown below each new block reference. Another policy, first-in-first-out (FIFO), simply discards the block that was used N unique accesses before, independent of its reference pattern in the last N − 1 references. Random replacement generally outperforms FIFO and it is easier to implement.
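LRU bookkeeping of the kind tracked in Figure 8.9 can be sketched in software as follows. The cache here is fully associative with 4 blocks, as in the figure; the reference sequence is made up, and real hardware would approximate LRU rather than search timestamps:

    #include <stdio.h>

    /* LRU replacement for a 4-block fully associative cache: each
       reference records a timestamp; on a miss with no empty block,
       the victim is the block with the oldest timestamp. */
    #define NUM_BLOCKS 4

    int main(void) {
        unsigned tag[NUM_BLOCKS] = {0}, stamp[NUM_BLOCKS] = {0};
        int valid[NUM_BLOCKS] = {0};
        unsigned refs[] = {3, 1, 2, 4, 1, 5, 2};   /* hypothetical block-frame addresses */
        int n = sizeof refs / sizeof refs[0];

        for (int t = 0; t < n; t++) {
            int hit = -1;
            for (int b = 0; b < NUM_BLOCKS; b++)
                if (valid[b] && tag[b] == refs[t]) hit = b;
            if (hit < 0) {                           /* miss: pick a victim */
                int victim = -1;
                for (int b = 0; b < NUM_BLOCKS && victim < 0; b++)
                    if (!valid[b]) victim = b;       /* an empty block first */
                if (victim < 0) {                    /* else the least-recently used */
                    victim = 0;
                    for (int b = 1; b < NUM_BLOCKS; b++)
                        if (stamp[b] < stamp[victim]) victim = b;
                }
                valid[victim] = 1;
                tag[victim] = refs[t];
                hit = victim;
            }
            stamp[hit] = t;                          /* record this access */
            printf("ref %u -> block %d\n", refs[t], hit);
        }
        return 0;
    }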
Associativity:      2-way              4-way              8-way
Size        LRU      Random     LRU      Random     LRU      Random
16 KB       5.18%    5.69%      4.67%    5.29%      4.39%    4.96%
64 KB       1.88%    2.01%      1.54%    1.66%      1.39%    1.53%
256 KB      1.15%    1.17%      1.13%    1.13%      1.12%    1.12%

FIGURE 8.10 Miss rates comparing least-recently used versus random replacement for several sizes and associativities. This data was collected for a block size of 16 bytes using one of the VAX traces containing user and operating system code (SAVED). This trace is included in the software supplement for course use. There is little difference between LRU and random for larger-size caches in this trace.
Q4: What Happens on a Write?

Reads dominate cache accesses. All instruction accesses are reads, and most instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of 9% stores and 17% loads for four DLX programs, making writes less than 10% of the memory traffic. Making the common case fast means optimizing caches for reads, but Amdahl's Law reminds us that high-performance designs cannot neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block can be read at the same time that the tag is read and compared, so the block read begins as soon as the block-frame address is available. If the read is a hit, the block is passed on to the CPU immediately. If it is a miss, there is no benefit, but also no harm.
Such is not the case for writes. The processor specifies the size of the write, usually between 1 and 8 bytes; only that portion of a block can be changed. In general this means a read-modify-write sequence of operations on the block: read the original block, modify one portion, and write the new block value. Moreover, modifying a block cannot begin until the tag is checked to see if it is a hit. Because tag checking cannot occur in parallel, then, writes normally take longer than reads.

Thus, it is the write policies that distinguish many cache designs. There are two basic options when writing to the cache:

- Write through (or store through)—The information is written to both the block in the cache and to the block in the lower-level memory.

- Write back (also called copy back or store in)—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.

Write-back cache blocks are called clean or dirty, depending on whether the information in the cache differs from that in lower-level memory. To reduce the frequency of writing back blocks on replacement, a feature called the dirty bit is commonly used. This status bit indicates whether or not the block was modified while in the cache. If it wasn't, the block is not written, since the lower level has the same information as the cache.
Both write back and write through have their advantages. With write back, writes occur at the speed of the cache memory, and multiple writes within a block require only one write to the lower-level memory. Since every write doesn't go to memory, write back uses less memory bandwidth, making write back attractive in multiprocessors. With write through, read misses don't result in writes to the lower level, and write through is easier to implement than write back. Write through also has the advantage that main memory has the most current copy of the data. This is important in multiprocessors and for I/O, which we shall examine in Section 8.8. Hence, multiprocessors want write back to reduce the memory traffic per processor and write through to keep the cache and memory consistent.

When the CPU must wait for writes to complete during write throughs, the CPU is said to write stall. A common optimization to reduce write stalls is a write buffer, which allows the processor to continue while the memory is updated. As we shall see in Section 8.8, write stalls can occur even with write buffers.
There are two options on a write miss:
- Write allocate (also called fetch on write)—The block is loaded, followed by the write-hit actions above. This is similar to a read miss.

- No write allocate (also called write around)—The block is modified in the lower level and not loaded into the cache.
While either write-miss policy could be used with write through or write back, generally write-back caches use write allocate (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use no write allocate (since subsequent writes to that block will still have to go to memory).
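The following C sketch puts the two write policies and the write-allocate choice side by side. It models a single cache block and a tiny lower-level memory; all sizes and names are invented for illustration and are not from the text:

    #include <stdio.h>
    #include <string.h>

    /* One cache block plus a small "lower-level memory", to contrast
       write through vs. write back and write allocate vs. write around. */
    #define BLOCK 8

    static unsigned char memory[64];        /* lower-level memory */
    static unsigned char block[BLOCK];      /* the one cache block */
    static int block_addr = -1, dirty = 0;

    static void write_byte(int addr, unsigned char v,
                           int write_through, int write_allocate) {
        int base = addr - addr % BLOCK;
        if (base == block_addr) {                    /* write hit */
            block[addr % BLOCK] = v;
            if (write_through) memory[addr] = v;     /* update lower level too */
            else dirty = 1;                          /* write back: mark dirty */
        } else if (write_allocate) {                 /* miss: fetch on write */
            if (dirty && block_addr >= 0)            /* write back the old block */
                memcpy(&memory[block_addr], block, BLOCK);
            memcpy(block, &memory[base], BLOCK);     /* load the new block */
            block_addr = base;
            dirty = 0;
            write_byte(addr, v, write_through, write_allocate);  /* now a hit */
        } else {
            memory[addr] = v;                        /* write around the cache */
        }
    }

    int main(void) {
        /* write through + no write allocate, as in the VAX-11/780 below */
        write_byte(17, 0xAB, 1, 0);
        printf("memory[17] = 0x%X\n", memory[17]);
        return 0;
    }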
An Example Cache: The VAX-11/780 Cache
To give substance to these ideas, Figure 8.11 shows the organization of the cache on the VAX-11/780. The cache contains 8192 bytes of data in 8-byte blocks with two-way-set-associative placement, random replacement, write through with a one-word write buffer, and no write allocate on a write miss.

Let's trace a cache hit through the steps of a hit as labeled in Figure 8.11. (The five steps are shown as circled numbers.) The address coming into the cache is divided into two fields: the 29-bit block-frame address and 3-bit block offset. The block-frame address is further divided into an address tag and cache index. Step 1 shows this division.

The cache index selects the set to be tested to see if the block is in the cache. (A set is one block from each bank in Figure 8.11.) The size of the index depends on cache size, block size, and set associativity. In this case, a 9-bit index results:
Blocks per bank = Cache size / (Block size × Set associativity) = 8192 / (8 × 2) = 512 = 2^9
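The same index calculation, written out in C for the VAX-11/780 parameters:

    #include <stdio.h>

    /* 8192-byte cache, 8-byte blocks, 2-way set associative:
       512 sets, hence a 9-bit index. */
    int main(void) {
        unsigned cache_size = 8192, block_size = 8, assoc = 2;
        unsigned sets = cache_size / (block_size * assoc);
        unsigned index_bits = 0;
        while ((1u << index_bits) < sets) index_bits++;
        printf("%u sets -> %u-bit index\n", sets, index_bits);  /* 512 sets -> 9 bits */
        return 0;
    }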
In a two-way-set-associative cache, the index is sent to both banks. This is step 2.

After reading an address tag from each bank, the tag portion of the block-frame address is compared to the tags. This is step 3 in the figure. To be sure the tag contains valid information, the valid bit must be set, or the results of the comparison are ignored.

Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to select the block from the matching set. Why can't both tags match? It is the job of the replacement algorithm to make sure that an address appears in only one block. To reduce the hit time, the data is read at the same time as the address tags; thus, by the time the block multiplexer is ready, the data is also ready.

This step is needed in set-associative caches, but it can be omitted from direct-mapped caches since there is no selection to be made. The multiplexer used in this step can be on the critical timing path, endangering the clock cycle time of the CPU. (The example on pages 418–419 and the fallacy on page 481 explore the trade-off of lower miss rates and higher clock cycle time.)

In the final step the word is sent to the CPU. All five steps occur within a single CPU clock cycle.

What happens on a miss? The cache sends a stall signal to the CPU telling it to wait, and two words (eight bytes) are read from memory. That takes 6 clock cycles on the VAX-11/780 (ignoring bus interference). When the data arrives,
the cache must pick a block to replace; the VAX-11/780 selects one of the two sets at random. Replacing a block means updating the data, the address tag, and the valid bit. Once this is done, the cache goes through a regular hit cycle and returns the data to the CPU.

Writes are more complicated in the VAX-11/780, as they are in any cache. If the word to be written is in the cache, the first four steps are the same. The next step is to write the data in the block, then write the changed-data portion into the cache.
FIGURE 8.11 The organization of the VAX-11/780 cache. The 8-KB cache is two-way set associative with 8-byte blocks. It has 512 sets with two blocks per set; the set is selected by the 9-bit index. The five steps of a read hit, shown as circled numbers in order of occurrence, label this organization. The line from memory to the cache is used on a miss to load the cache. Such multiplexing is not needed in a direct-mapped cache. Note that the offset is connected to chip select of the data SRAMs to allow the proper words to be sent to the 2:1 multiplexer.
The VAX-11/780 uses no write allocate. Consequently, on a write miss the CPU writes "around" the cache to lower-level memory and does not affect the cache.

Since this is a write-through cache, the process isn't yet over. The word is also sent to a one-word write buffer. If the write buffer is empty, the word and full address are written in the buffer, and we are finished. The CPU continues working while the write buffer writes the word to memory. If the buffer is full, the cache (and CPU) must wait until the buffer is empty.
Cache Performance
CPU time can be divided into the clock cycles the CPU spends executing the program and the clock cycles the CPU spends waiting for the memory system. Thus,

CPU time = (CPU-execution clock cycles + Memory-stall clock cycles) × Clock cycle time
`-
`To simplify evaluation of cache alternatives, sometimes designers 33‘s
`that all memory stalls are due to the cache. This is true for many machines -
`machines where this is not true, the cache still dominates stalls that an-
`exclusively due to the cache. We use this simplifying assumption here. but i:
`important to account for all memory stalls when calculating final peri'onnatn-t -
`The formula above raises the question whether the clock cycles for a '
`-
`access should be considered part of CPU—execution clock cycles or part of
`-
`ory-stall clock cycles. While either Convention is defensible, the most is -- .
`accepted is to include hit clock cycles in CPU—execution clock cycles.
`Memory-stall clocl. cycles can then be defined in terms of the number
`memory accesses per program, miss penalty (in clock cycles), and miss rat:
`reads and writes:
`
Memory-stall clock cycles = (Reads / Program) × Read miss rate × Read miss penalty
                          + (Writes / Program) × Write miss rate × Write miss penalty

We simplify the complete formula by combining the reads and writes together:

Memory-stall clock cycles = (Memory accesses / Program) × Miss rate × Miss penalty
Factoring instruction count (IC) from execution time and memory-stall cycles, we now get a CPU-time formula that includes memory accesses per instruction, miss rate, and miss penalty:

CPU time = IC × (CPI_execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time
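The stall formulas translate directly into code. In this C sketch, all traffic counts, miss rates, and penalties are hypothetical numbers chosen only to show the read/write split and the combined simplification:

    #include <stdio.h>

    /* The two memory-stall formulas above: the read/write split and the
       combined form with a single miss rate and penalty. */
    int main(void) {
        double reads = 80e6, writes = 20e6;      /* accesses per program (made up) */
        double read_miss_rate = 0.02, write_miss_rate = 0.05;
        double read_penalty = 10, write_penalty = 12;   /* clock cycles */

        double split = reads  * read_miss_rate  * read_penalty
                     + writes * write_miss_rate * write_penalty;

        /* combined form: one overall miss rate, one average penalty */
        double accesses  = reads + writes;
        double miss_rate = (reads * read_miss_rate + writes * write_miss_rate) / accesses;
        double penalty   = 10.5;                 /* a single average penalty, for illustration */
        double combined  = accesses * miss_rate * penalty;

        printf("stall cycles (split):    %.0f\n", split);
        printf("stall cycles (combined): %.0f\n", combined);
        return 0;
    }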
Some designers prefer measuring miss rate as misses per instruction rather than misses per memory reference:

Misses / Instruction = (Memory accesses / Instruction) × Miss rate
The advantage of this measure is that it is independent of the hardware implementation. For example, the VAX-11/780 instruction unit can make repeated references to a single byte (see Section 8.9), which can artificially reduce the miss rate if measured as misses per memory reference rather than per instruction executed. The drawback is that this measure is architecture dependent, thus it is most popular with architects working with a single computer family. They then use this version of the CPU-time formula:
CPU time = IC × (CPI_execution + (Misses / Instruction) × Miss penalty) × Clock cycle time
We can now explore the consequences of caches on performance.
Example

Let's use the VAX-11/780 as a first example. The cache miss penalty is 6 clock cycles, and all instructions normally take 8.5 clock cycles (ignoring memory stalls). Assume the miss rate is 11%, and there is an average of 3.0 memory references per instruction. What is the impact on performance when behavior of the cache is included?
Answer

CPU time = IC × (CPI_execution + (Memory-stall clock cycles / Instruction)) × Clock cycle time
The performance, including cache misses, is

CPU time_with cache = IC × (8.5 + 3.0 × 11% × 6) × Clock cycle time
                    = Instruction count × 10.5 × Clock cycle time

The clock cycle time and instruction count are the same, with or without a cache, so CPU time increases with CPI from 8.5 to 10.5. Hence, the impact of the memory hierarchy is to stretch the CPU time by 24%.
Example

Let's now calculate the impact on performance when behavior of the cache is included on a machine with a lower CPI. Assume that the cache miss penalty is 10 clock cycles and, on average, instructions take 1.5 clock cycles; the miss rate is 11%, and there is an average of 1.4 memory references per instruction.
Answer

CPU time = IC × (CPI_execution + (Memory-stall clock cycles / Instruction)) × Clock cycle time
Making the same assumptions as in the previous example on cache hits, the performance, including cache misses, is

CPU time_with cache = IC × (1.5 + 1.4 × 11% × 10) × Clock cycle time
                    = Instruction count × 3.0 × Clock cycle time

The clock cycle time and instruction count are the same, with or without a cache, so CPU time increases with CPI from 1.5 to 3.0. Including cache behavior doubles execution time.
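Both examples are easy to check mechanically. This C sketch reproduces the arithmetic (the quoted CPIs round 10.48 and 3.04 to 10.5 and 3.0):

    #include <stdio.h>

    /* Effective CPI = base CPI + references/instruction × miss rate × miss penalty */
    int main(void) {
        double vax  = 8.5 + 3.0 * 0.11 * 6.0;    /* = 10.48, quoted as 10.5 */
        double fast = 1.5 + 1.4 * 0.11 * 10.0;   /* = 3.04,  quoted as 3.0  */
        printf("VAX-11/780:      CPI %.2f (stretch %.0f%%)\n",
               vax, (10.5 / 8.5 - 1) * 100);     /* 24%, as in the text */
        printf("low-CPI machine: CPI %.2f (time roughly doubles)\n", fast);
        return 0;
    }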
As these examples illustrate, cache-behavior penalties range from significant to enormous. Furthermore, cache misses have a double-barreled impact on a CPU with a low CPI and a fast clock:

1. The lower the CPI, the more pronounced the impact is.

2. Independent of the CPU, main memories have similar memory-access times, since they are built from the same memory chips. When calculating CPI, the cache miss penalty is measured in CPU clock cycles needed for a miss. Therefore, a higher CPU clock rate leads to a larger miss penalty, even if main memories are the same speed.

The importance of the cache for CPUs with low CPI and high clock rates is thus greater; and, consequently, greater is the danger of neglecting cache behavior in assessing performance of such machines.

Average memory-access time = Hit time + Miss rate × Miss penalty

While minimizing average memory-access time is a reasonable goal and we will use it in much of this chapter, keep in mind that the final goal is to reduce CPU execution time.
Example

What is the impact of two different cache organizations on the performance of a CPU? Assume that the CPI is normally 1.5 with a clock cycle time of 20 ns, that there are 1.3 memory references per instruction, and that the size of both caches is 64 KB. One cache is direct mapped and the other is two-way set associative. Since the speed of the CPU is tied directly to the speed of the caches, assume the CPU clock cycle time must be stretched 8.5% to accommodate the selection multiplexer of the set-associative cache (step 4 in Figure 8.11 on page 415). To the first approximation, the cache miss penalty is 200 ns for either cache organization. (In practice, it must be rounded up or down to an integer number of clock cycles.) First, calculate the average memory-access time, and then CPU performance.
Answer

Figure 8.12 on page 421 shows that the miss rate of a direct-mapped 64-KB cache is 3.9% and the miss rate for a two-way-set-associative cache of the same size is 3.0%. Average memory-access time is
Average memory-access time = Hit time + Miss rate × Miss penalty

Thus, the time for each organization is

Average memory-access time_1-way = 20 + .039 × 200 = 27.8 ns
Average memory-access time_2-way = 20 × 1.085 + .030 × 200 = 27.7 ns

The average memory-access time is better for the two-way-set-associative cache.

CPU performance is

CPU time = IC × (CPI_execution + (Misses / Instruction) × Miss penalty) × Clock cycle time
         = IC × (CPI_execution × Clock cycle time
                 + (Memory accesses / Instruction) × Miss rate × Miss penalty × Clock cycle time)

Substituting 200 ns for (Miss penalty × Clock cycle time), the performance of each cache organization is

CPU time_1-way = IC × (1.5 × 20 + 1.3 × 0.039 × 200) = 40.1 × IC
CPU time_2-way = IC × (1.5 × 20 × 1.085 + 1.3 × 0.030 × 200) = 40.4 × IC

and relative performance is

CPU time_2-way / CPU time_1-way = 40.4 × Instruction count / 40.1 × Instruction count = 1.01

In contrast to the results of the average access-time comparison, the direct-mapped cache leads to slightly better performance. Since CPU time is our bottom-line evaluation (and direct mapped is simpler to build), the preferred cache is direct mapped in this example. (See the fallacy on page 481 for more on this kind of trade-off.)
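The example's arithmetic can again be checked in a few lines of C:

    #include <stdio.h>

    /* Direct mapped versus two-way set associative at 64 KB, with the
       two-way clock stretched 8.5% for the selection multiplexer. */
    int main(void) {
        double clk = 20.0, penalty = 200.0, refs = 1.3, cpi = 1.5;
        double stretch = 1.085;
        double miss1 = 0.039, miss2 = 0.030;

        printf("avg access, 1-way: %.1f ns\n", clk + miss1 * penalty);            /* 27.8 */
        printf("avg access, 2-way: %.1f ns\n", clk * stretch + miss2 * penalty);  /* 27.7 */
        /* CPU time per instruction, in ns */
        printf("CPU time, 1-way: %.2f ns\n",
               cpi * clk + refs * miss1 * penalty);            /* 40.14, quoted as 40.1 */
        printf("CPU time, 2-way: %.2f ns\n",
               cpi * clk * stretch + refs * miss2 * penalty);  /* 40.35, quoted as 40.4 */
        return 0;
    }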
`
The Three Sources of Cache Misses: Compulsory, Capacity, and Conflicts

An intuitive model of cache behavior attributes all misses to one of three sources:

- Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses.

- Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.

- Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses.
`—
`
`Figure 8J2 shows the relative frequency of cache misses. bruit!
`the “ch Cs." To show the benefit of associativttt . conflict muses -
`into misses caused by each decrease in asaoeiativiry. The cutgmies a
`n--way, meaning the misses caused by going In the |fl't\i:l' level of
`--
`from the next one above. Here are the four categories:
`
`8-way: from fully associative (no conflicts} to 8-way Mancini“:
`
`4—way: from 8-way associative to 4rway associative
`
`eray: from 4—way associative to 2—way associative
`
`leway: from 2-way associative to l~way associative (direct mappodi
`
`Figure 8.13 (page 42?.) presents the same data graphicaliy. The {up w '
`absolute miss rates: the bottom graph plots percentage of all the misc: t5-
`size.
`
`'
`
`‘
`
`I
`_
`-
`
Having identified the three Cs, what can a computer designer do about them? Conceptually, conflicts are the easiest: Fully associative placement avoids all conflict misses. Associativity is expensive in hardware, however, and may slow access time (see the example above or the fallacy on page 481), leading to lower overall performance. There is little to be done about capacity except to buy larger memory chips. If the upper-level memory is much smaller than what is needed for a program, and a significant percentage of the time is spent moving data between two levels in the hierarchy, the memory hierarchy is said to thrash. Because so many replacements are required, thrashing means the machine runs close to the speed of the lower-level memory, or maybe even slower due to the miss overhead. Making blocks larger reduces the number of compulsory misses, but it can increase conflict misses.

The three C's give insight into the cause of misses, but this simple model has its limits. For example, increasing cache size reduces conflict misses as well as capacity misses, since a larger cache spreads out references. Thus, a miss might move from one category to the other as parameters change. The three C's ignore replacement policy, since it is difficult to model and since, in general, it is of less significance. In specific circumstances the replacement policy can actually lead to anomalous behavior, such as poorer miss rates for larger associativity, which is directly contradictory to the three C's model.
[Figure 8.12: a table giving, for each cache size (from 1 KB up) and degree of associativity (1-way, 2-way, 4-way, 8-way), the total miss rate and its miss-rate components (compulsory, capacity, and conflict), with the components also shown as relative percentages summing to 100% of the total miss rate.]