COMPUTER ARCHITECTURE
A QUANTITATIVE APPROACH

JOHN L. HENNESSY & DAVID A. PATTERSON
Qualcomm, Ex. 1009, Page 1
Sponsoring Editor Bruce Spatz
Production Manager Shirley Jowell
Technical Writer Walker Cunningham
Text Design Gary Head
Cover Design David Lance Goines
Copy Editor Linda Medoff
Proofreader Paul Medoff
Computer Typesetting and Graphics Fifth Street Computer Services
Library of Congress Cataloging-in-Publication Data
Patterson, David A.
Computer architecture : a quantitative approach / David A. Patterson, John L. Hennessy
p. cm.
Includes bibliographical references
ISBN 1-55860-069-8
1. Computer architecture. 2. Electronic digital computers--Design and construction. I. Hennessy, John L. II. Title.
QA76.9.A73P377 1990
004.2'2--dc20
Morgan Kaufmann Publishers, Inc.
Editorial Office: 2929 Campus Drive, San Mateo, CA 94403
Order from: P.O. Box 50490, Palo Alto, CA 94303-9953

©1990 by Morgan Kaufmann Publishers, Inc.
All rights reserved.
89-85227
CIP
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, recording, or otherwise—without the prior permission of the publisher.

All instruction sets and other design information of the DLX computer system contained herein is copyrighted by the publisher and may not be incorporated in other publications or distributed by media without formal acknowledgement and written consent from the publisher. Use of the DLX in other publications for educational purposes is encouraged and application for permission is welcomed.

ADVICE, PRAISE, & ERRORS: Any correspondence related to this publication or intended for the authors should be addressed to the editorial offices of Morgan Kaufmann Publishers, Inc., Dept. P&H APE. Information regarding error sightings is encouraged. Any error sightings that are accepted for correction in subsequent printings will be rewarded by the authors with a payment of $1.00 (U.S.) per correction. Electronic mail can be sent to bugs@vsop.stanford.edu.

INSTRUCTOR SUPPORT: For information on classroom software and other instructor materials available to adopters, please contact the editorial offices of Morgan Kaufmann Publishers, Inc.
8.3 Caches

Cache: a safe place for hiding or storing things.

Webster's New World Dictionary of the American Language, Second College Edition
Cache is the name first chosen to represent the level of the memory hierarchy between the CPU and main memory, and that is the dominant use of the term. While the concept of caches is younger than the IBM 360 architecture, caches appear today in every class of computer and in some computers more than once. In fact, the word has become so popular that it has replaced "buffer" in many computer-science circles.

The general terms defined in the prior section can be used for caches, although the word line is often used instead of block. Figure 8.5 shows the typical range of memory-hierarchy parameters for caches.
Block (line) size      4-128 bytes
Hit time               1-4 clock cycles (normally 1)
Miss penalty           8-32 clock cycles
  (Access time)        (6-10 clock cycles)
  (Transfer time)      (2-22 clock cycles)
Miss rate              1%-20%
Cache size             1 KB-256 KB

FIGURE 8.5 Typical values of key memory-hierarchy parameters for caches in workstations and minicomputers.
Now let's examine caches in more detail by answering the four memory-hierarchy questions.

Q1: Where Can a Block Be Placed in a Cache?

Restrictions on where a block is placed create three categories of cache organization:
If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The mapping is usually (block-frame address) modulo (number of blocks in cache).

If a block can be placed anywhere in the cache, the cache is said to be fully associative.
Memory-Hierarchy Design
FIGURE 8.6 The cache has 8 blocks, while memory has 32 blocks. The set-associative organization has 4 sets with 2 blocks per set, called two-way set associative. (Real caches contain hundreds of blocks and real memories contain hundreds of thousands of blocks.) Assume that there is nothing in the cache and that the block-frame address in question identifies lower-level block 12. The three options for caches are shown left to right. In fully associative, block 12 from the lower level can go into any of the 8 blocks of the cache. With direct mapped, block 12 can only be placed into block 4 (12 modulo 8). Set associative, which has some of both features, allows the block to be placed anywhere in set 0 (12 modulo 4). With two blocks per set, this means block 12 can be placed either in block 0 or block 1 of the cache.
If a block can be placed in a restricted set of places in the cache, the cache is said to be set associative. A set is a group of two or more blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within the set. The set is usually chosen by bit selection; that is, (block-frame address) modulo (number of sets in cache). If there are n blocks in a set, the cache placement is called n-way set associative.

The range of caches from direct mapped to fully associative is really a continuum of levels of set associativity: Direct mapped is simply one-way set associative and a fully associative cache with m blocks could be called m-way set associative. Figure 8.6 shows where block 12 can be placed in a cache according to the block-placement policy.
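The three placement policies can be sketched in a few lines of Python (a sketch of ours, not from the book; the function name and interface are illustrative). It reproduces the choices Figure 8.6 shows for lower-level block 12 and an 8-block cache:

```python
# Sketch: where can a memory block go in an n_blocks-block cache,
# given the number of sets? Sets occupy contiguous cache blocks.

def candidate_blocks(block_frame_addr, n_blocks, n_sets):
    """Return the cache block numbers a memory block may occupy.

    n_sets == n_blocks -> direct mapped (one-way set associative)
    n_sets == 1        -> fully associative (n_blocks-way)
    otherwise          -> set associative, n_blocks // n_sets blocks per set
    """
    blocks_per_set = n_blocks // n_sets
    set_no = block_frame_addr % n_sets      # bit selection: address modulo sets
    first = set_no * blocks_per_set
    return list(range(first, first + blocks_per_set))

print(candidate_blocks(12, 8, 8))  # direct mapped: [4]     (12 modulo 8)
print(candidate_blocks(12, 8, 4))  # two-way set associative: [0, 1] (set 12 modulo 4 = 0)
print(candidate_blocks(12, 8, 1))  # fully associative: [0, 1, 2, 3, 4, 5, 6, 7]
```

Direct mapped and fully associative fall out as the two ends of the continuum: the number of sets equal to the number of blocks, and a single set holding every block.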
Q2: How Is a Block Found If It Is in the Cache?

Caches include an address tag on each block that gives the block-frame address. The tag of every cache block that might contain the desired information is checked to see if it matches the block-frame address from the CPU. Figure 8.7 gives an example. Because speed is of the essence, all possible tags are searched in parallel; serial search would make set associativity counterproductive.
FIGURE 8.7 In fully associative placement, the block for block-frame address 12 can appear in any of the 8 blocks; thus, all 8 tags must be searched. The desired data is found in cache block 6 in this example. In direct-mapped placement there is only one cache block where memory block 12 can be found. In set-associative placement, with 4 sets, memory block 12 must be in set 0 (12 mod 4); thus, the tags of cache blocks 0 and 1 are checked. In this case the data is found in cache block 1. Speed of cache access dictates that searching must be performed in parallel for fully associative and set-associative mappings.
There must be a way to know that a cache block does not have valid information. The most common procedure is to add a valid bit to the tag to say whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address.

A common omission in finding the cost of caches is to forget the cost of the tag memory. One tag is required for each block. An advantage of larger block sizes is that the tag overhead per cache entry becomes a smaller fraction of the total cost of the cache.

Before proceeding to the next question, let's explore the relationship of the CPU address to the cache. Figure 8.8 shows how an address is divided into three fields to find data in a set-associative cache: the block-offset field used to select the desired data from the block, the index field used to select the set, and the tag field used for the comparison. While the comparison could be made on more of the address than the tag, there is no need:
Checking the index would be redundant, since it was used to select the set to be checked (an address stored in set 0, for example, must have 0 in the index field or it couldn't be stored in set 0).

The offset is unnecessary in the comparison because all block offsets match and the entire block is present or not.

If the total size is kept the same, increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag. That is, the tag/index boundary in Figure 8.8 moves to the right with increasing associativity.
FIGURE 8.8 The 3 portions of an address in a set-associative or direct-mapped cache. The tag is used to check all the blocks in the set and the index is used to select the set. The block offset is the address of the desired data within the block.
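The three fields of Figure 8.8 can be extracted with shifts and masks, as in this sketch (ours, not from the book; it assumes power-of-two block and set counts):

```python
# Sketch: split a byte address into (tag, index, offset) fields.

def split_address(addr, block_size, n_sets):
    """Return (tag, index, offset) for a cache with power-of-two sizes."""
    offset_bits = block_size.bit_length() - 1   # log2(block size)
    index_bits = n_sets.bit_length() - 1        # log2(number of sets)
    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & (n_sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# VAX-11/780-style parameters: 8-byte blocks, 512 sets
# -> 3-bit offset, 9-bit index, and the rest of the address as tag.
tag, index, offset = split_address(0x12345678, block_size=8, n_sets=512)
print(tag, index, offset)  # -> 74565 207 0
```

Increasing associativity at the same total size halves the number of sets per step, so `index_bits` shrinks and the tag grows, which is exactly the rightward movement of the tag/index boundary described above.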
Q3: Which Block Should Be Replaced on a Cache Miss?

If the choice were between a block that has valid data and a block that doesn't, then it would be easy to select which block to replace. Alas, the high hit rate of caches means that the overwhelming decision is between blocks that have valid data.

A benefit of direct-mapped placement is that hardware decisions are simplified. In fact, so simple that there is no choice: Only one block is checked for a hit, and only that block can be replaced. With fully associative or set-associative placement, there are several blocks to choose from on a miss. There are two primary strategies employed for selecting which block to replace:

Random—To spread allocation uniformly, candidate blocks are randomly selected. Some systems use a scheme for spreading data across a set of blocks in a pseudorandomized manner to get reproducible behavior, which is particularly useful during hardware debugging.

Least-recently used (LRU)—To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. The block replaced is the one that has been unused for the longest time. This makes use of a corollary of temporal locality: If recently used blocks are likely to be used again, then the best candidate for disposal is the least recently used. Figure 8.9 (page 412) shows which block is the least-recently used for a sequence of block-frame addresses in a fully associative memory hierarchy.
A virtue of random is that it is simple to build in hardware. As the number of blocks to keep track of increases, LRU becomes increasingly expensive and is frequently only approximated. Figure 8.10 shows the difference in miss rates between LRU and random replacement. Replacement policy plays a greater role in smaller caches than in larger caches where there are more choices of what to replace.
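A minimal sketch of LRU bookkeeping (ours, not from the book) for a fully associative cache, in the spirit of Figure 8.9, using an ordered map so the least-recently used block is always at the front:

```python
# Sketch: fully associative cache with exact LRU replacement.
from collections import OrderedDict

class LRUCache:
    def __init__(self, n_blocks):
        self.n_blocks = n_blocks
        self.blocks = OrderedDict()   # least recently used entry first

    def access(self, block_frame_addr):
        """Touch a block; return the evicted block on a miss that replaces, else None."""
        if block_frame_addr in self.blocks:        # hit: mark most recent
            self.blocks.move_to_end(block_frame_addr)
            return None
        victim = None
        if len(self.blocks) == self.n_blocks:      # full: evict the LRU block
            victim, _ = self.blocks.popitem(last=False)
        self.blocks[block_frame_addr] = True
        return victim

cache = LRUCache(4)
for addr in [0, 1, 2, 3, 0, 4]:
    victim = cache.access(addr)
# After re-touching block 0, block 1 is least recently used, so 4 evicts it.
print(victim)  # -> 1
```

Real hardware rarely keeps this exact ordering; as the text notes, LRU is usually approximated once the number of blocks per set grows.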
FIGURE 8.9 Least-recently used blocks for a sequence of block-frame addresses in a fully associative memory hierarchy. This assumes that there are 4 blocks and that in the beginning the LRU block is number 0. The LRU block number is shown below each new block reference. Another policy, first-in—first-out (FIFO), simply discards the block that was used N unique accesses before, independent of its reference pattern in the last N - 1 references. Random replacement generally outperforms FIFO and it is easier to implement.
Associativity:        2-way              4-way              8-way
Size             LRU      Random    LRU      Random    LRU      Random
16 KB            5.18%    5.69%     4.67%    5.29%     4.39%    4.96%
64 KB            1.88%    2.01%     1.54%    1.66%     1.39%    1.53%
256 KB           1.15%    1.17%     1.13%    1.13%     1.12%    1.12%

FIGURE 8.10 Miss rates comparing least-recently used versus random replacement for several sizes and associativities. This data was collected for a block size of 16 bytes using one of the VAX traces containing user and operating system code (SAVE0). That trace is included in the software supplement for course use. There is little difference between LRU and random for larger size caches in this trace.
Q4: What Happens on a Write?

Reads dominate cache accesses. All instruction accesses are reads, and most instructions don't write to memory. Figure 4.34 (page 181) suggests a mix of 9% stores and 17% loads for four DLX programs, making writes less than 10% of the memory traffic. Making the common case fast means optimizing caches for reads, but Amdahl's Law reminds us that high-performance designs cannot neglect the speed of writes.

Fortunately, the common case is also the easy case to make fast. The block can be read at the same time that the tag is read and compared, so the block read begins as soon as the block-frame address is available. If the read is a hit, the block is passed on to the CPU immediately. If it is a miss, there is no benefit, but also no harm.
Such is not the case for writes. The processor specifies the size of the write, usually between 1 and 8 bytes; only that portion of a block can be changed. In general this means a read-modify-write sequence of operations on the block: read the original block, modify one portion, and write the new block value. Moreover, modifying a block cannot begin until the tag is checked to see if it is a hit. Because tag checking cannot occur in parallel, then, writes normally take longer than reads.

Thus, it is the write policies that distinguish many cache designs. There are two basic options when writing to the cache:

Write through (or store through)—The information is written to both the block in the cache and to the block in the lower-level memory.

Write back (also called copy back or store in)—The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
Write-back cache blocks are called clean or dirty, depending on whether the information in the cache differs from that in lower-level memory. To reduce the frequency of writing back blocks on replacement, a feature called the dirty bit is commonly used. This status bit indicates whether or not the block was modified while in the cache. If it wasn't, the block is not written, since the lower level has the same information as the cache.

Both write back and write through have their advantages. With write back, writes occur at the speed of the cache memory, and multiple writes within a block require only one write to the lower-level memory. Since every write doesn't go to memory, write back uses less memory bandwidth, making write back attractive in multiprocessors. With write through, read misses don't result in writes to the lower level, and write through is easier to implement than write back. Write through also has the advantage that main memory has the most current copy of the data. This is important in multiprocessors and for I/O, which we shall examine in Section 8.8. Hence, multiprocessors want write back to reduce the memory traffic per processor and write through to keep the cache and memory consistent.

When the CPU must wait for writes to complete during write through, the CPU is said to write stall. A common optimization to reduce write stalls is a write buffer, which allows the processor to continue while the memory is updated. As we shall see in Section 8.8, write stalls can occur even with write buffers.
There are two options on a write miss:

Write allocate (also called fetch on write)—The block is loaded, followed by the write-hit actions above. This is similar to a read miss.

No write allocate (also called write around)—The block is modified in the lower level and not loaded into the cache.
While either write-miss policy could be used with write through or write back, generally write-back caches use write allocate (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use no write allocate (since subsequent writes to that block will still have to go to memory).
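A toy sketch (ours, not from the book) of the traffic difference behind that pairing: replay a stream of stores to cached blocks and count the words sent to the lower level. Repeated writes to one block cost a single lower-level write with write back but one per store with write through.

```python
# Sketch: lower-level write traffic under the two write policies,
# assuming every store hits and every dirty block is eventually replaced.

def lower_level_writes(store_trace, policy):
    """store_trace: block numbers written, in order."""
    if policy == "write through":
        return len(store_trace)        # every store goes to memory
    if policy == "write back":
        return len(set(store_trace))   # each dirty block written once, on replacement
    raise ValueError(policy)

trace = [7, 7, 7, 7, 3, 3]             # repeated stores to blocks 7 and 3
print(lower_level_writes(trace, "write through"))  # -> 6
print(lower_level_writes(trace, "write back"))     # -> 2
```

This is the bandwidth saving that makes write back attractive in multiprocessors, at the cost of main memory not always holding the most current copy.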
An Example Cache: The VAX-11/780 Cache

To give substance to these ideas, Figure 8.11 shows the organization of the cache on the VAX-11/780. The cache contains 8192 bytes of data in 8-byte blocks with two-way-set-associative placement, random replacement, write through with a one-word write buffer, and no write allocate on a write miss.

Let's trace a cache hit through the steps of a hit as labeled in Figure 8.11. (The five steps are shown as circled numbers.) The address coming into the cache is divided into two fields: the 29-bit block-frame address and 3-bit block offset. The block-frame address is further divided into an address tag and cache index. Step 1 shows this division.

The cache index selects the set to be tested to see if the block is in the cache. (A set is one block from each bank in Figure 8.11.) The size of the index depends on cache size, block size, and set associativity. In this case, a 9-bit index results:
Blocks/Bank = Cache size / (Block size * Set associativity) = 8192 / (8 * 2) = 512 = 2^9
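The same calculation as executable arithmetic (values taken from the VAX-11/780 cache described in the text):

```python
# 8-KB cache, 8-byte blocks, two-way set associative -> 512 sets, 9-bit index.
import math

cache_size = 8192     # bytes
block_size = 8        # bytes per block
associativity = 2     # blocks per set

n_sets = cache_size // (block_size * associativity)
index_bits = int(math.log2(n_sets))
print(n_sets, index_bits)  # -> 512 9
```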
In a two-way-set-associative cache, the index is sent to both banks. This is step 2.

After reading an address tag from each bank, the tag portion of the block-frame address is compared to the tags. This is step 3 in the figure. To be sure the tag contains valid information, the valid bit must be set, or the results of the comparison are ignored.

Assuming one of the tags does match, a 2:1 multiplexer (step 4) is set to select the block from the matching set. Why can't both tags match? It is the job of the replacement algorithm to make sure that an address appears in only one block. To reduce the hit time, the data is read at the same time as the address tags; thus, by the time the block multiplexer is ready, the data is also ready. This step is needed in set-associative caches, but it can be omitted from direct-mapped caches since there is no selection to be made. The multiplexer used in this step can be on the critical timing path, endangering the clock cycle time of the CPU. (The example on pages 418-419 and the fallacy on page 481 explore the trade-off of lower miss rates and higher clock cycle time.)

In the final step the word is sent to the CPU. All five steps occur within a single CPU clock cycle.

What happens on a miss? The cache sends a stall signal to the CPU telling it to wait, and two words (eight bytes) are read from memory. That takes 6 clock cycles on the VAX-11/780 (ignoring bus interference). When the data arrives,
the cache must pick a block to replace; the VAX-11/780 selects one of the two sets at random. Replacing a block means updating the data, the address tag, and the valid bit. Once this is done, the cache goes through a regular hit cycle and returns the data to the CPU.

Writes are more complicated in the VAX-11/780, as they are in any cache. If the word to be written is in the cache, the first four steps are the same. The next step is to write the data in the block, then write the changed-data portion into the
FIGURE 8.11 The organization of the VAX-11/780 cache. The 8-KB cache is two-way set associative with 8-byte blocks. It has 512 sets with two blocks per set; the set is selected by the 9-bit index. The five steps of a read hit, shown as circled numbers in order of occurrence, label this organization. The line from memory to the cache is used on a miss to load the cache. Such multiplexing is not needed in a direct-mapped cache. Note that the offset is connected to chip select of the data SRAMs to allow the proper words to be sent to the 2:1 multiplexer.
cache. The VAX-11/780 uses no write allocate. Consequently, on a write miss the CPU writes "around" the cache to lower-level memory and does not affect the cache.

Since this is a write-through cache, the process isn't yet over. The word is also sent to a one-word write buffer. If the write buffer is empty, the word and full address are written in the buffer, and we are finished. The CPU continues working while the write buffer writes the word to memory. If the buffer is full, the cache (and CPU) must wait until the buffer is empty.
Cache Performance

CPU time can be divided into the clock cycles the CPU spends executing the program and the clock cycles the CPU spends waiting for the memory system. Thus,

CPU time = (CPU-execution clock cycles + Memory-stall clock cycles) * Clock cycle time

To simplify evaluation of cache alternatives, sometimes designers assume that all memory stalls are due to the cache. This is true for many machines; on machines where this is not true, the cache still dominates the stalls, even though they are not exclusively due to the cache. We use this simplifying assumption here, but it is important to account for all memory stalls when calculating final performance.

The formula above raises the question whether the clock cycles for a cache access should be considered part of CPU-execution clock cycles or part of memory-stall clock cycles. While either convention is defensible, the most widely accepted is to include hit clock cycles in CPU-execution clock cycles.

Memory-stall clock cycles can then be defined in terms of the number of memory accesses per program, miss penalty (in clock cycles), and miss rate for reads and writes:

Memory-stall clock cycles = (Reads / Program) * Read miss rate * Read miss penalty
                          + (Writes / Program) * Write miss rate * Write miss penalty

We simplify the complete formula by combining the reads and writes together:

Memory-stall clock cycles = (Memory accesses / Program) * Miss rate * Miss penalty

Factoring instruction count (IC) from execution time and memory-stall clock cycles, we now get a CPU-time formula that includes memory accesses per instruction, miss rate, and miss penalty:

CPU time = IC * (CPI_execution + (Memory accesses / Instruction) * Miss rate * Miss penalty) * Clock cycle time
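The final formula translates directly into a function, as in this sketch (ours, not from the book's software; parameter names are illustrative):

```python
# Sketch: CPU time with memory stalls folded in per instruction.

def cpu_time(ic, cpi_execution, mem_accesses_per_instr,
             miss_rate, miss_penalty, clock_cycle_time):
    """CPU time = IC * (CPI_execution + accesses/instruction
                        * miss rate * miss penalty) * clock cycle time."""
    stall_cpi = mem_accesses_per_instr * miss_rate * miss_penalty
    return ic * (cpi_execution + stall_cpi) * clock_cycle_time

# Illustrative numbers (not from the text): 1M instructions, CPI 2.0,
# 1.33 accesses per instruction, 5% miss rate, 20-cycle penalty, 10 ns clock.
t = cpu_time(1_000_000, 2.0, 1.33, 0.05, 20, 10e-9)
print(round(t, 4))  # about 0.0333 s: the stalls add 1.33 cycles per instruction
```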
Some designers prefer measuring miss rate as misses per instruction rather than misses per memory reference:

Misses / Instruction = (Memory accesses / Instruction) * Miss rate

The advantage of this measure is that it is independent of the hardware implementation. For example, the VAX-11/780 instruction unit can make repeated references to a single byte (see Section 8.7), which can artificially reduce the miss rate if measured as misses per memory reference rather than per instruction executed. The drawback is that this measure is architecture dependent; thus it is most popular with architects working with a single computer family. They then use this version of the CPU-time formula:

CPU time = IC * (CPI_execution + (Misses / Instruction) * Miss penalty) * Clock cycle time

We can now explore the consequences of caches on performance.
Example
Let's use the VAX-11/780 as a first example. The cache miss penalty is 6 clock cycles, and all instructions normally take 8.5 clock cycles (ignoring memory stalls). Assume the miss rate is 11%, and there is an average of 3.0 memory references per instruction. What is the impact on performance when behavior of the cache is included?

Answer
CPU time = IC * (CPI_execution + Memory-stall clock cycles / Instruction) * Clock cycle time

The performance, including cache misses, is

CPU time with cache = IC * (8.5 + 3.0 * 11% * 6) * Clock cycle time
                    = Instruction count * 10.5 * Clock cycle time

The clock cycle time and instruction count are the same, with or without a cache, so CPU time increases with CPI from 8.5 to 10.5. Hence, the impact of the memory hierarchy is to stretch the CPU time by 24%.
Example
Let's now calculate the impact on performance when behavior of the cache is included on a machine with a lower CPI. Assume that the cache miss penalty is 10 clock cycles and, on average, instructions take 1.5 clock cycles; the miss rate is 11%, and there is an average of 1.4 memory references per instruction.

Answer
CPU time = IC * (CPI_execution + Memory-stall clock cycles / Instruction) * Clock cycle time

Making the same assumptions as in the previous example on cache hits, the performance, including cache misses, is

CPU time with cache = IC * (1.5 + 1.4 * 11% * 10) * Clock cycle time
                    = Instruction count * 3.0 * Clock cycle time

The clock cycle time and instruction count are the same, with or without a cache, so CPU time increases with CPI from 1.5 to 3.0. Including cache behavior doubles execution time.
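Both examples can be checked directly (our arithmetic; the text rounds 10.48 up to 10.5 and 3.04 down to 3.0):

```python
# CPI with cache = CPI_execution + references/instruction * miss rate * miss penalty.

def cpi_with_cache(cpi_execution, refs_per_instr, miss_rate, miss_penalty):
    return cpi_execution + refs_per_instr * miss_rate * miss_penalty

vax = cpi_with_cache(8.5, 3.0, 0.11, 6)    # first example
fast = cpi_with_cache(1.5, 1.4, 0.11, 10)  # second example, lower-CPI machine

print(round(vax, 2), round(vax / 8.5, 2))    # 10.48 1.23  (about a 24% stretch)
print(round(fast, 2), round(fast / 1.5, 2))  # 3.04 2.03   (execution time doubles)
```

The same absolute-looking stall term hurts the low-CPI machine far more in relative terms, which is the point the next paragraphs make.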
As these examples illustrate, cache-behavior penalties range from significant to enormous. Furthermore, cache misses have a double-barreled impact on a CPU with a low CPI and a fast clock:

1. The lower the CPI, the more pronounced the impact is.

2. Independent of the CPU, main memories have similar memory-access times, since they are built from the same memory chips. When calculating CPI, the cache miss penalty is measured in CPU clock cycles needed for a miss. Therefore, a higher CPU clock rate leads to a larger miss penalty, even if main memories are the same speed.

The importance of the cache for CPUs with low CPI and high clock rates is thus greater; and, consequently, greater is the danger of neglecting cache behavior in assessing performance of such machines. While minimizing average memory-access time is a reasonable goal and we will use it in much of this chapter, keep in mind that the final goal is to reduce CPU execution time.

Average memory-access time = Hit time + Miss rate * Miss penalty
Example
What is the impact of two different cache organizations on the performance of a CPU? Assume that the CPI is normally 1.5 with a clock cycle time of 20 ns, that there are 1.3 memory references per instruction, and that the size of both caches is 64 KB. One cache is direct mapped and the other is two-way set associative. Since the speed of the CPU is tied directly to the speed of the caches, assume the CPU clock cycle time must be stretched 8.5% to accommodate the selection multiplexer of the set-associative cache (step 4 in Figure 8.11 on page 415). To the first approximation, the cache miss penalty is 200 ns for either cache organization. (In practice it must be rounded up or down to an integer number of clock cycles.) First, calculate the average memory-access time, and then CPU performance.

Answer
Figure 8.12 on page 421 shows that the miss rate of a direct-mapped 64-KB cache is 3.9% and the miss rate for a two-way-set-associative cache of the same size is 3.0%. Average memory-access time is
Average memory-access time = Hit time + Miss rate * Miss penalty

Thus, the time for each organization is

Average memory-access time_1-way = 20 + .039 * 200 = 27.8 ns
Average memory-access time_2-way = 20 * 1.085 + .030 * 200 = 27.7 ns

The average memory-access time is better for the two-way-set-associative cache.

CPU performance is

CPU time = IC * (CPI_execution + (Misses / Instruction) * Miss penalty) * Clock cycle time
         = IC * (CPI_execution * Clock cycle time
                 + (Memory accesses / Instruction) * Miss rate * Miss penalty * Clock cycle time)

Substituting 200 ns for (Miss penalty * Clock cycle time), the performance of each cache organization is

CPU time_1-way = IC * (1.5 * 20 + 1.3 * 0.039 * 200) = 40.1 * IC
CPU time_2-way = IC * (1.5 * 20 * 1.085 + 1.3 * 0.030 * 200) = 40.4 * IC
and relative performance is

CPU time_2-way / CPU time_1-way = (40.4 * Instruction count) / (40.1 * Instruction count) = 1.01

In contrast to the results of the average access-time comparison, the direct-mapped cache leads to slightly better performance. Since CPU time is our bottom-line evaluation (and direct mapped is simpler to build), the preferred cache is direct mapped in this example. (See the fallacy on page 481 for more on this kind of trade-off.)
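The example's two comparisons can be checked directly (our sketch; times in nanoseconds, with 200 ns standing for miss penalty * clock cycle time as in the text):

```python
# Average memory-access time versus CPU time per instruction for the
# direct-mapped and two-way-set-associative 64-KB caches of the example.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

def cpu_time_per_instr(cpi, cycle, refs_per_instr, miss_rate, penalty_ns):
    return cpi * cycle + refs_per_instr * miss_rate * penalty_ns

print(amat(20, 0.039, 200))          # 1-way: 27.8 ns
print(amat(20 * 1.085, 0.030, 200))  # 2-way: 27.7 ns, better access time...

print(cpu_time_per_instr(1.5, 20, 1.3, 0.039, 200))          # 1-way: about 40.1 ns
print(cpu_time_per_instr(1.5, 20 * 1.085, 1.3, 0.030, 200))  # 2-way: about 40.4 ns
# ...but the direct-mapped cache wins on CPU time, because the stretched
# clock cycle multiplies every instruction, not just the misses.
```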
The Three Sources of Cache Misses: Compulsory, Capacity, and Conflicts

An intuitive model of cache behavior attributes all misses to one of three sources:

Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses.

Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved.
Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These are also called collision misses.

Figure 8.12 shows the relative frequency of cache misses, broken down by the "three C's." To show the benefit of associativity, conflict misses are divided into misses caused by each decrease in associativity. The categories are labeled n-way, meaning the misses caused by going to the lower level of associativity from the next one above. Here are the four categories:

8-way: from fully associative (no conflicts) to 8-way associative
4-way: from 8-way associative to 4-way associative
2-way: from 4-way associative to 2-way associative
1-way: from 2-way associative to 1-way associative (direct mapped)
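A toy sketch (ours; the book's figures come from measuring real traces) of how misses can be attributed to the three C's: replay the same reference stream against the real cache and against a fully associative LRU cache of the same total size. First references are compulsory, misses that even the fully associative cache takes are capacity, and the rest are conflict.

```python
# Sketch: classify the misses of a direct-mapped cache into the three C's.

def classify_misses(trace, n_blocks):
    seen = set()      # blocks ever referenced (compulsory check)
    full = []         # fully associative LRU contents, most recent last
    direct = {}       # index -> block currently cached (direct mapped)
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}
    for block in trace:
        fa_hit = block in full
        if fa_hit:
            full.remove(block)
        full.append(block)
        if len(full) > n_blocks:
            full.pop(0)                     # evict LRU block
        index = block % n_blocks
        if direct.get(index) == block:      # direct-mapped hit
            continue
        direct[index] = block
        if block not in seen:
            counts["compulsory"] += 1
            seen.add(block)
        elif not fa_hit:
            counts["capacity"] += 1         # even full associativity missed
        else:
            counts["conflict"] += 1         # placement restriction alone
    return counts

# Blocks 0 and 4 collide in a 4-block direct-mapped cache (both map to index 0).
print(classify_misses([0, 4, 0, 4, 1], n_blocks=4))
# -> {'compulsory': 3, 'capacity': 0, 'conflict': 2}
```

Note the simplification the text warns about: a miss can migrate between categories as cache parameters change, and replacement policy is ignored entirely.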
`
Figure 8.13 (page 422) presents the same data graphically. The top graph shows absolute miss rates; the bottom graph plots the percentage of all the misses by cache size.

Having identified the three C's, what can a computer designer do about them? Conceptually, conflicts are the easiest: Fully associative placement avoids all conflict misses. Associativity is expensive in hardware, however, and may slow access time (see the example above or the second fallacy in the Fallacies and Pitfalls section of this chapter), leading to lower overall performance. There is little to be done about capacity except to buy larger memory chips. If the upper-level memory is much smaller than what is needed for a program, and a significant percentage of the time is spent moving data between two levels in the hierarchy, the memory hierarchy is said to thrash. Because so many replacements are required, thrashing means the machine runs close to the speed of the lower-level memory, or maybe even slower due to the miss overhead. Making blocks larger reduces the number of compulsory misses, but it can increase conflict misses.

The three C's give insight into the cause of misses, but this simple model has its limits. For example, increasing cache size reduces conflict misses as well as capacity misses, since a larger cache spreads out references. Thus, a miss might move from one category to the other as parameters change. The three C's ignore replacement policy, since it is difficult to model and since, in general, it is of less significance. In specific circumstances the replacement policy can actually lead to anomalous behavior, such as poorer miss rates for larger associativity, which is directly contradictory to the three C's model.
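The block-size tradeoff mentioned above, that larger blocks cut compulsory misses but can add conflict misses, can be demonstrated with a toy direct-mapped model (illustrative code, not from the text; the function name, traces, and sizes are invented). Total capacity is held fixed, so fewer, larger blocks mean fewer sets.

```python
def dm_misses(trace, cache_bytes, block_bytes):
    """Direct-mapped cache over byte addresses.

    Returns (total_misses, compulsory_misses); the difference is misses
    that occur even though the blocks were referenced before.
    """
    nsets = cache_bytes // block_bytes
    tags = [None] * nsets               # one block frame per set
    seen = set()
    total = compulsory = 0
    for addr in trace:
        blk = addr // block_bytes       # block address
        s = blk % nsets                 # set index
        if tags[s] != blk:
            total += 1
            if blk not in seen:
                compulsory += 1         # first-ever reference to this block
            tags[s] = blk
        seen.add(blk)
    return total, compulsory

# A sequential scan: larger blocks bring in neighbors, so compulsory
# misses drop (16 misses with 4-byte blocks, 4 with 16-byte blocks).
seq = list(range(64))

# Ping-pong between two addresses that sit in different sets with 4-byte
# blocks but land in the same set once blocks grow to 16 bytes.
ping = [0, 68] * 8
```

With a 64-byte cache, `dm_misses(seq, 64, 16)` shows the compulsory misses falling from 16 to 4, while `dm_misses(ping, 64, 16)` shows the same two blocks now evicting each other on every access: the extra misses are conflicts, since a fully associative cache of the same size would hold both.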
`
`
Memory-Hierarchy Design                                                      421

[Figure 8.12: Total miss rate and miss-rate components (relative percent; the compulsory, capacity, and conflict components sum to 100% of the total miss rate) for cache sizes from 1 KB to 16 KB and degrees of associativity of 1-way, 2-way, 4-way, and 8-way.]
