`
`Computer Architecture
`A Quantitative Approach
`
`Fourth Edition
`
John L. Hennessy is the president of Stanford University, where he has been a member of the faculty since 1977 in the departments of electrical engineering and computer science. Hennessy is a Fellow of the IEEE and ACM, a member of the National Academy of Engineering and the National Academy of Science, and a Fellow of the American Academy of Arts and Sciences. Among his many awards are the 2001 Eckert-Mauchly Award for his contributions to RISC technology, the 2001 Seymour Cray Computer Engineering Award, and the 2000 John von Neumann Award, which he shared with David Patterson. He has also received seven honorary doctorates.
`
In 1981, he started the MIPS project at Stanford with a handful of graduate students. After completing the project in 1984, he took a one-year leave from the university to cofound MIPS Computer Systems, which developed one of the first commercial RISC microprocessors. After MIPS was acquired by Silicon Graphics in 1991, MIPS Technologies became an independent company in 1998, focusing on microprocessors for the embedded marketplace. As of 2006, over 500 million MIPS microprocessors have been shipped in devices ranging from video games and palmtop computers to laser printers and network switches.
`
David A. Patterson has been teaching computer architecture at the University of California, Berkeley, since joining the faculty in 1977, where he holds the Pardee Chair of Computer Science. His teaching has been honored by the Abacus Award from Upsilon Pi Epsilon, the Distinguished Teaching Award from the University of California, the Karlstrom Award from ACM, and the Mulligan Education Medal and Undergraduate Teaching Award from IEEE. Patterson received the IEEE Technical Achievement Award for contributions to RISC and shared the IEEE Johnson Information Storage Award for contributions to RAID. He then shared the IEEE John von Neumann Medal and the C & C Prize with John Hennessy. Like his co-author, Patterson is a Fellow of the American Academy of Arts and Sciences, ACM, and IEEE, and he was elected to the National Academy of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall of Fame. He served on the Information Technology Advisory Committee to the U.S. President, as chair of the CS division in the Berkeley EECS department, as chair of the Computing Research Association, and as President of ACM. This record led to a Distinguished Service Award from CRA.
`
At Berkeley, Patterson led the design and implementation of RISC I, likely the first VLSI reduced instruction set computer. This research became the foundation of the SPARC architecture, currently used by Sun Microsystems, Fujitsu, and others. He was a leader of the Redundant Arrays of Inexpensive Disks (RAID) project, which led to dependable storage systems from many companies. He was also involved in the Network of Workstations (NOW) project, which led to cluster technology used by Internet companies. These projects earned three dissertation awards from the ACM. His current research projects are the RAD Lab, which is inventing technology for reliable, adaptive, distributed Internet services, and the Research Accelerator for Multiple Processors (RAMP) project, which is developing and distributing low-cost, highly scalable, parallel computers based on FPGAs and open-source hardware and software.
`
`Computer Architecture
`A Quantitative Approach
`
`Fourth Edition
`
`John L. Hennessy
`Stanford University
`
`David A. Patterson
`University of California at Berkeley
`
`With Contributions by
`Andrea C. Arpaci-Dusseau
`University of Wisconsin–Madison
`Remzi H. Arpaci-Dusseau
`University of Wisconsin–Madison
`Krste Asanovic
`Massachusetts Institute of Technology
`Robert P. Colwell
`R&E Colwell & Associates, Inc.
`Thomas M. Conte
`North Carolina State University
`José Duato
Universitat Politècnica de València and Simula
`
`Diana Franklin
`California Polytechnic State University, San Luis Obispo
`David Goldberg
`Xerox Palo Alto Research Center
`Wen-mei W. Hwu
`University of Illinois at Urbana–Champaign
`Norman P. Jouppi
`HP Labs
`Timothy M. Pinkston
`University of Southern California
`John W. Sias
`University of Illinois at Urbana–Champaign
`David A. Wood
`University of Wisconsin–Madison
`
`Amsterdam • Boston • Heidelberg • London
`New York • Oxford • Paris • San Diego
`San Francisco • Singapore • Sydney • Tokyo
`
`
Publisher  Denise E. M. Penrose
Project Manager  Dusty Friedman, The Book Company
In-house Senior Project Manager  Brandy Lilly
Developmental Editor  Nate McFadden
Editorial Assistant  Kimberlee Honjo
Cover Design  Elisabeth Beller and Ross Carron Design
Cover Image  Richard I’Anson’s Collection: Lonely Planet Images
Composition  Nancy Logan
Text Design  Rebecca Evans & Associates
Technical Illustration  David Ruppe, Impact Publications
Copyeditor  Ken Della Penta
Proofreader  Jamie Thaman
Indexer  Nancy Ball
Printer  Maple-Vail Book Manufacturing Group
`
`Morgan Kaufmann Publishers is an Imprint of Elsevier
`500 Sansome Street, Suite 400, San Francisco, CA 94111
`
`This book is printed on acid-free paper.
`
`© 1990, 1996, 2003, 2007 by Elsevier, Inc.
`All rights reserved.
`
`Published 1990. Fourth edition 2007
`
Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.com. You may also complete your request on-line via the Elsevier Science homepage (http://elsevier.com), by selecting “Customer Support” and then “Obtaining Permissions.”
`
`Library of Congress Cataloging-in-Publication Data
`
`Hennessy, John L.
` Computer architecture : a quantitative approach / John L. Hennessy, David
`A. Patterson ; with contributions by Andrea C. Arpaci-Dusseau . . . [et al.].
`—4th ed.
` p.cm.
` Includes bibliographical references and index.
` ISBN 13: 978-0-12-370490-0 (pbk. : alk. paper)
` ISBN 10: 0-12-370490-1 (pbk. : alk. paper) 1. Computer architecture. I.
`Patterson, David A. II. Arpaci-Dusseau, Andrea C. III. Title.
`
`QA76.9.A73P377 2006
`004.2'2—dc22
`
`2006024358
`
For all information on all Morgan Kaufmann publications, visit our website at www.mkp.com or www.books.elsevier.com.
`
`Printed in the United States of America
`06 07 08 09 10 5 4 3 2 1
`
Contents

Foreword  ix
Preface  xv
Acknowledgments  xxiii

Chapter 1  Fundamentals of Computer Design
1.1  Introduction  2
1.2  Classes of Computers  4
1.3  Defining Computer Architecture  8
1.4  Trends in Technology  14
1.5  Trends in Power in Integrated Circuits  17
1.6  Trends in Cost  19
1.7  Dependability  25
1.8  Measuring, Reporting, and Summarizing Performance  28
1.9  Quantitative Principles of Computer Design  37
1.10  Putting It All Together: Performance and Price-Performance  44
1.11  Fallacies and Pitfalls  48
1.12  Concluding Remarks  52
1.13  Historical Perspectives and References  54
      Case Studies with Exercises by Diana Franklin  55

Chapter 2  Instruction-Level Parallelism and Its Exploitation
2.1  Instruction-Level Parallelism: Concepts and Challenges  66
2.2  Basic Compiler Techniques for Exposing ILP  74
2.3  Reducing Branch Costs with Prediction  80
2.4  Overcoming Data Hazards with Dynamic Scheduling  89
2.5  Dynamic Scheduling: Examples and the Algorithm  97
2.6  Hardware-Based Speculation  104
2.7  Exploiting ILP Using Multiple Issue and Static Scheduling  114
`
2.8  Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation  118
2.9  Advanced Techniques for Instruction Delivery and Speculation  121
2.10  Putting It All Together: The Intel Pentium 4  131
2.11  Fallacies and Pitfalls  138
2.12  Concluding Remarks  140
2.13  Historical Perspective and References  141
      Case Studies with Exercises by Robert P. Colwell  142

Chapter 3  Limits on Instruction-Level Parallelism
3.1  Introduction  154
3.2  Studies of the Limitations of ILP  154
3.3  Limitations on ILP for Realizable Processors  165
3.4  Crosscutting Issues: Hardware versus Software Speculation  170
3.5  Multithreading: Using ILP Support to Exploit Thread-Level Parallelism  172
3.6  Putting It All Together: Performance and Efficiency in Advanced Multiple-Issue Processors  179
3.7  Fallacies and Pitfalls  183
3.8  Concluding Remarks  184
3.9  Historical Perspective and References  185
     Case Study with Exercises by Wen-mei W. Hwu and John W. Sias  185

Chapter 4  Multiprocessors and Thread-Level Parallelism
4.1  Introduction  196
4.2  Symmetric Shared-Memory Architectures  205
4.3  Performance of Symmetric Shared-Memory Multiprocessors  218
4.4  Distributed Shared Memory and Directory-Based Coherence  230
4.5  Synchronization: The Basics  237
4.6  Models of Memory Consistency: An Introduction  243
4.7  Crosscutting Issues  246
4.8  Putting It All Together: The Sun T1 Multiprocessor  249
4.9  Fallacies and Pitfalls  257
4.10  Concluding Remarks  262
4.11  Historical Perspective and References  264
      Case Studies with Exercises by David A. Wood  264

Chapter 5  Memory Hierarchy Design
5.1  Introduction  288
5.2  Eleven Advanced Optimizations of Cache Performance  293
5.3  Memory Technology and Optimizations  310
`
5.4  Protection: Virtual Memory and Virtual Machines  315
5.5  Crosscutting Issues: The Design of Memory Hierarchies  324
5.6  Putting It All Together: AMD Opteron Memory Hierarchy  326
5.7  Fallacies and Pitfalls  335
5.8  Concluding Remarks  341
5.9  Historical Perspective and References  342
     Case Studies with Exercises by Norman P. Jouppi  342

Chapter 6  Storage Systems
6.1  Introduction  358
6.2  Advanced Topics in Disk Storage  358
6.3  Definition and Examples of Real Faults and Failures  366
6.4  I/O Performance, Reliability Measures, and Benchmarks  371
6.5  A Little Queuing Theory  379
6.6  Crosscutting Issues  390
6.7  Designing and Evaluating an I/O System—The Internet Archive Cluster  392
6.8  Putting It All Together: NetApp FAS6000 Filer  397
6.9  Fallacies and Pitfalls  399
6.10  Concluding Remarks  403
6.11  Historical Perspective and References  404
      Case Studies with Exercises by Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau  404

Appendix A  Pipelining: Basic and Intermediate Concepts
A.1  Introduction  A-2
A.2  The Major Hurdle of Pipelining—Pipeline Hazards  A-11
A.3  How Is Pipelining Implemented?  A-26
A.4  What Makes Pipelining Hard to Implement?  A-37
A.5  Extending the MIPS Pipeline to Handle Multicycle Operations  A-47
A.6  Putting It All Together: The MIPS R4000 Pipeline  A-56
A.7  Crosscutting Issues  A-65
A.8  Fallacies and Pitfalls  A-75
A.9  Concluding Remarks  A-76
A.10  Historical Perspective and References  A-77

Appendix B  Instruction Set Principles and Examples
B.1  Introduction  B-2
B.2  Classifying Instruction Set Architectures  B-3
B.3  Memory Addressing  B-7
B.4  Type and Size of Operands  B-13
B.5  Operations in the Instruction Set  B-14
`
B.6  Instructions for Control Flow  B-16
B.7  Encoding an Instruction Set  B-21
B.8  Crosscutting Issues: The Role of Compilers  B-24
B.9  Putting It All Together: The MIPS Architecture  B-32
B.10  Fallacies and Pitfalls  B-39
B.11  Concluding Remarks  B-45
B.12  Historical Perspective and References  B-47

Appendix C  Review of Memory Hierarchy
C.1  Introduction  C-2
C.2  Cache Performance  C-15
C.3  Six Basic Cache Optimizations  C-22
C.4  Virtual Memory  C-38
C.5  Protection and Examples of Virtual Memory  C-47
C.6  Fallacies and Pitfalls  C-56
C.7  Concluding Remarks  C-57
C.8  Historical Perspective and References  C-58

Companion CD Appendices
Appendix D  Embedded Systems (updated by Thomas M. Conte)
Appendix E  Interconnection Networks (revised by Timothy M. Pinkston and José Duato)
Appendix F  Vector Processors (revised by Krste Asanovic)
Appendix G  Hardware and Software for VLIW and EPIC
Appendix H  Large-Scale Multiprocessors and Scientific Applications
Appendix I  Computer Arithmetic (by David Goldberg)
Appendix J  Survey of Instruction Set Architectures
Appendix K  Historical Perspectives and References

Online Appendix (textbooks.elsevier.com/0123704901)
Appendix L  Solutions to Case Study Exercises

References  R-1
Index  I-1
`
`5
`
`Memory Hierarchy
`Design
`
`Ideally one would desire an indefinitely large memory capacity such
`that any particular . . . word would be immediately available. . . . We
`are . . . forced to recognize the possibility of constructing a hierarchy of
`memories, each of which has greater capacity than the preceding but
`which is less quickly accessible.
`
A. W. Burks, H. H. Goldstine, and J. von Neumann
Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946)
`
`288
`
`I
`
`Chapter Five
`Memory Hierarchy Design
`
`5.1
`
`Introduction
`
Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. An economical solution to that desire is a memory hierarchy, which takes advantage of locality and cost-performance of memory technologies. The principle of locality, presented in the first chapter, says that most programs do not access all code or data uniformly. Locality occurs in time (temporal locality) and in space (spatial locality). This principle, plus the guideline that smaller hardware can be made faster, led to hierarchies based on memories of different speeds and sizes. Figure 5.1 shows a multilevel memory hierarchy, including typical sizes and speeds of access.

Since fast memory is expensive, a memory hierarchy is organized into several levels—each smaller, faster, and more expensive per byte than the next lower level. The goal is to provide a memory system with cost per byte almost as low as the cheapest level of memory and speed almost as fast as the fastest level.

Note that each level maps addresses from a slower, larger memory to a smaller but faster memory higher in the hierarchy. As part of address mapping, the memory hierarchy is given the responsibility of address checking; hence, protection schemes for scrutinizing addresses are also part of the memory hierarchy.

The importance of the memory hierarchy has increased with advances in performance of processors. Figure 5.2 plots processor performance projections against the historical performance improvement in time to access main memory. Clearly, computer architects must try to close the processor-memory gap.

The increasing size and thus importance of this gap led to the migration of the basics of memory hierarchy into undergraduate courses in computer architecture, and even to courses in operating systems and compilers. Thus, we’ll start with a quick review of caches. The bulk of the chapter, however, describes more advanced innovations that address the processor-memory performance gap.
`advanced innovations that address the processor-memory performance gap.
When a word is not found in the cache, the word must be fetched from the memory and placed in the cache before continuing. Multiple words, called a block (or line), are moved for efficiency reasons. Each cache block includes a tag to see which memory address it corresponds to.
[Figure 5.1 is a block diagram of the hierarchy: CPU registers (500 bytes, 250 ps access), cache (64 KB, 1 ns), main memory (1 GB, 100 ns), and disk (1 TB, 10 ms), connected by the memory bus and the I/O bus to the I/O devices.]

Figure 5.1 The levels in a typical memory hierarchy in embedded, desktop, and server computers. As we move farther away from the processor, the memory in the level below becomes slower and larger. Note that the time units change by factors of 10—from picoseconds to milliseconds—and that the size units change by factors of 1000—from bytes to terabytes.
`
`
[Figure 5.2 is a plot of performance (log scale, 1 to 100,000) versus year (1980 to 2010) with separate processor and memory curves.]

Figure 5.2 Starting with 1980 performance as a baseline, the gap in performance between memory and processors is plotted over time. Note that the vertical axis must be on a logarithmic scale to record the size of the processor-DRAM performance gap. The memory baseline is 64 KB DRAM in 1980, with a 1.07 per year performance improvement in latency (see Figure 5.13 on page 313). The processor line assumes a 1.25 improvement per year until 1986, a 1.52 improvement until 2004, and a 1.20 improvement thereafter; see Figure 1.1 in Chapter 1.
`
A key design decision is where blocks (or lines) can be placed in a cache. The most popular scheme is set associative, where a set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. Finding a block consists of first mapping the block address to the set, and then searching the set—usually in parallel—to find the block. The set is chosen by the address of the data:

   (Block address) MOD (Number of sets in cache)

If there are n blocks in a set, the cache placement is called n-way set associative. The end points of set associativity have their own names. A direct-mapped cache has just one block per set (so a block is always placed in the same location), and a fully associative cache has just one set (so a block can be placed anywhere).
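As a concrete illustration of the placement rule just described, the short Python sketch below maps a byte address to its set index and tag. The cache parameters (32 KB capacity, 64-byte blocks, 4-way set associative) are invented for the example and do not describe any particular machine.

```python
# Sketch: map a byte address to (set index, tag) for an n-way set-associative cache.
# The parameters below are illustrative, not taken from any particular processor.

CACHE_SIZE = 32 * 1024      # total capacity in bytes
BLOCK_SIZE = 64             # block (line) size in bytes
ASSOCIATIVITY = 4           # blocks per set (n-way)

NUM_SETS = CACHE_SIZE // (BLOCK_SIZE * ASSOCIATIVITY)

def locate(address):
    """Return (set_index, tag) for a byte address."""
    block_address = address // BLOCK_SIZE      # drop the block offset
    set_index = block_address % NUM_SETS       # (Block address) MOD (Number of sets in cache)
    tag = block_address // NUM_SETS            # remaining high-order bits
    return set_index, tag

if __name__ == "__main__":
    # Two addresses 128 KB apart land in the same set with different tags.
    for addr in (0x1F40, 0x20000 + 0x1F40):
        print(hex(addr), locate(addr))
```

A direct-mapped cache is the special case ASSOCIATIVITY = 1, and a fully associative cache is the case where NUM_SETS = 1.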
Caching data that is only read is easy, since the copy in the cache and memory will be identical. Caching writes is more difficult: how can the copy in the cache and memory be kept consistent? There are two main strategies. A write-through cache updates the item in the cache and writes through to update main memory. A write-back cache only updates the copy in the cache. When the block is about to be replaced, it is copied back to memory. Both write strategies can use a write buffer to allow the cache to proceed as soon as the data is placed in the buffer rather than wait the full latency to write the data into memory.
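To make the difference between the two write strategies concrete, here is a minimal sketch that counts how many blocks reach main memory under each policy for a small, made-up trace of writes. The tiny fully associative cache, the LRU replacement, and the trace itself are assumptions for illustration; a real cache also services reads and uses a write buffer, which are omitted.

```python
# Sketch: compare memory write traffic for write-through vs. write-back on a toy trace.

def write_through_traffic(trace):
    # Every store is written through to memory.
    return len(trace)

def write_back_traffic(trace, num_blocks=4):
    # Only dirty blocks that are evicted (or flushed at the end) reach memory.
    # A tiny fully associative cache with LRU replacement, for illustration only.
    cache = []                      # block addresses, most recently used last
    dirty = set()
    writes_to_memory = 0
    for block in trace:
        if block in cache:
            cache.remove(block)     # refresh LRU position
        elif len(cache) == num_blocks:
            victim = cache.pop(0)   # evict least recently used block
            if victim in dirty:
                writes_to_memory += 1
                dirty.discard(victim)
        cache.append(block)
        dirty.add(block)
    return writes_to_memory + len(dirty)   # flush remaining dirty blocks

trace = [1, 2, 1, 1, 3, 2, 4, 1, 5, 1]     # block addresses being written
print("write-through:", write_through_traffic(trace))   # 10 writes to memory
print("write-back:   ", write_back_traffic(trace))      # 5 writes to memory
```

With this trace, write-through sends all 10 writes to memory, while write-back sends only the dirty blocks that are evicted or remain dirty at the end.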
`
`
One measure of the benefits of different cache organizations is miss rate. Miss rate is simply the fraction of cache accesses that result in a miss—that is, the number of accesses that miss divided by the number of accesses.

To gain insights into the causes of high miss rates, which can inspire better cache designs, the three Cs model sorts all misses into three simple categories:

■ Compulsory—The very first access to a block cannot be in the cache, so the block must be brought into the cache. Compulsory misses are those that occur even if you had an infinite cache.

■ Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses (in addition to compulsory misses) will occur because of blocks being discarded and later retrieved.

■ Conflict—If the block placement strategy is not fully associative, conflict misses (in addition to compulsory and capacity misses) will occur because a block may be discarded and later retrieved if conflicting blocks map to its set.

Figures C.8 and C.9 on pages C-23 and C-24 show the relative frequency of cache misses broken down by the “three C’s.” (Chapter 4 adds a fourth C, for Coherency misses due to cache flushes to keep multiple caches coherent in a multiprocessor; we won’t consider those here.)

Alas, miss rate can be a misleading measure for several reasons. Hence, some designers prefer measuring misses per instruction rather than misses per memory reference (miss rate). These two are related:
`
`×
`Misses
`Miss rate Memory accesses
`
`
`--------------------------
`=
`-----------------------------------------------------------------------
`
`
`Instruction
`Instruction count
`
`=
`
`Miss rate
`
`×
`
`Memory accesses
`------------------------------------------
`Instruction
`
(It is often reported as misses per 1000 instructions to use integers instead of fractions.) For speculative processors, we only count instructions that commit.
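As a quick numeric check of the relation above, the lines below convert an assumed miss rate and an assumed number of memory accesses per instruction into misses per 1000 instructions; both input values are hypothetical.

```python
# Sketch: relate miss rate to misses per instruction (hypothetical numbers).
miss_rate = 0.02                  # 2% of memory accesses miss
mem_accesses_per_instr = 1.5      # e.g., 1 instruction fetch plus 0.5 data accesses

misses_per_instr = miss_rate * mem_accesses_per_instr
print(misses_per_instr * 1000)    # 30.0 misses per 1000 instructions
```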
`The problem with both measures is that they don’t factor in the cost of a miss.
A better measure is the average memory access time:

   Average memory access time = Hit time + Miss rate × Miss penalty

where Hit time is the time to hit in the cache and Miss penalty is the time to replace the block from memory (that is, the cost of a miss). Average memory access time is still an indirect measure of performance; although it is a better measure than miss rate, it is not a substitute for execution time. For example, in Chapter 2 we saw that speculative processors may execute other instructions during a miss, thereby reducing the effective miss penalty.
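A short worked instance of the formula, using assumed values (a 1 ns hit time, a 2% miss rate, and a 100 ns miss penalty) rather than measurements from any real cache:

```python
# Sketch: average memory access time = hit time + miss rate * miss penalty.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Hypothetical single-level cache: 1 ns hit, 2% miss rate, 100 ns miss penalty.
print(amat(1.0, 0.02, 100.0))     # 3.0 ns per access on average
```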
`If this material is new to you, or if this quick review moves too quickly, see
`Appendix C. It covers the same introductory material in more depth and includes
`examples of caches from real computers and quantitative evaluations of their
`effectiveness.
Section C.3 in Appendix C also presents six basic cache optimizations, which we quickly review here. The appendix also gives quantitative examples of the benefits of these optimizations.
`
`
`1.
`
`2.
`
`3.
`
`4.
`
`5.
`
`6.
`
`—The simplest way to reduce the miss
`Larger block size to reduce miss rate
`rate is to take advantage of spatial locality and increase the block size. Note
`that larger blocks also reduce compulsory misses, but they also increase the
`miss penalty.
`—The obvious way to reduce capacity
`Bigger caches to reduce miss rate
`misses is to increase cache capacity. Drawbacks include potentially longer hit
`time of the larger cache memory and higher cost and power.
`—Obviously, increasing associativity
`Higher associativity to reduce miss rate
`reduces conflict misses. Greater associativity can come at the cost of
`increased hit time.
`A difficult decision is whether to
`Multilevel caches to reduce miss penalty—
`make the cache hit time fast, to keep pace with the increasing clock rate of
`processors, or to make the cache large, to overcome the widening gap
`between the processor and main memory. Adding another level of cache
`between the original cache and memory simplifies the decision (see Figure
`5.3). The first-level cache can be small enough to match a fast clock cycle
`time, yet the second-level cache can be large enough to capture many
`accesses that would go to main memory. The focus on misses in second-level
`caches leads to larger blocks, bigger capacity, and higher associativity. If L1
`and L2 refer, respectively, to first- and second-level caches, we can redefine
`the average memory access time:
`×
` (Hit time
` + Miss rate
`
`L2
`L2
`
` + Miss rate
`
`Hit time
`L1
`L1
`
`×
`
` Miss penalty
`)
`L2
`
`
`
`—A write
`Giving priority to read misses over writes to reduce miss penalty
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`buffer is a good place to implement this optimization. Write buffers create
`hazards because they hold the updated value of a location needed on a read
`miss—that is, a read-after-write hazard through memory. One solution is to
`check the contents of the write buffer on a read miss. If there are no conflicts,
`and if the memory system is available, sending the read before the writes
`reduces the miss penalty. Most processors give reads priority over writes.
`Avoiding address translation during indexing of the cache to reduce hit
`—Caches must cope with the translation of a virtual address from the
`time
`processor to a physical address to access memory. (Virtual memory is cov-
`ered in Sections 5.4 and C.4.) Figure 5.3 shows a typical relationship between
`caches, translation lookaside buffers (TLBs), and virtual memory. A common
`optimization is to use the page offset—the part that is identical in both virtual
`and physical addresses—to index the cache. The virtual part of the address is
`translated while the cache is read using that index, so the tag match can use
`physical addresses. This scheme allows the cache read to begin immediately,
`and yet the tag comparison still uses physical addresses. The drawback of this
` optimization is that the size of the page
`virtually indexed, physically tagged
`limits the size of the cache. For example, a direct-mapped cache can be no
`bigger than the page size. Higher associativity can keep the cache index in the
`physical part of the address and yet still support a cache larger than a page.
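The two formulas above, the two-level average memory access time in item 4 and the index-size relation in item 6, can be checked with a few lines of Python. The latencies, miss rates, and cache configurations below are hypothetical values chosen only to make the arithmetic concrete.

```python
import math

# Sketch: two-level average memory access time (item 4) and cache index size (item 6).
# All parameter values are hypothetical.

def amat_two_level(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, miss_penalty_l2):
    return hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * miss_penalty_l2)

def index_bits(cache_size, block_size, associativity):
    # 2^Index = Cache size / (Block size * Set associativity)
    return int(math.log2(cache_size / (block_size * associativity)))

# Hypothetical: 1 ns L1 hit, 5% L1 miss rate, 10 ns L2 hit, 20% L2 local miss rate,
# 100 ns main-memory penalty.
print(amat_two_level(1.0, 0.05, 10.0, 0.20, 100.0))   # 2.5 ns

# Doubling size and associativity together keeps the index the same width:
print(index_bits(8 * 1024, 64, 1))    # 7 bits (8 KB direct mapped, 64-byte blocks)
print(index_bits(16 * 1024, 64, 2))   # 7 bits (16 KB two-way, 64-byte blocks)
```

With these assumed values the two-level hierarchy averages 2.5 ns per access, and the 16 KB two-way cache keeps the same 7-bit index as the 8 KB direct-mapped one.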
`
[Figure 5.3 is a block diagram tracing a virtual address through the TLB and then the L1 and L2 caches, showing the widths in bits of the virtual page number, page offset, TLB index, TLB tag and data, cache indexes, block offsets, and tag compare addresses.]

Figure 5.3 The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 8 KB. The TLB is direct mapped with 256 entries. The L1 cache is a direct-mapped 8 KB, and the L2 cache is a direct-mapped 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 40 bits. The primary difference between this figure and a real memory hierarchy, as in Figure 5.18 on page 327, is higher associativity for caches and TLBs and a smaller virtual address than 64 bits.
`
`Note that each of these six optimizations above has a potential disadvantage
`that can lead to increased, rather than decreased, average memory access time.
`
`
The rest of this chapter assumes familiarity with the material above, including Figure 5.3. To put cache ideas into practice, throughout this chapter (and Appendix C) we show examples from the memory hierarchy of the AMD Opteron microprocessor. Toward the end of the chapter, we evaluate the impact of this hierarchy on performance using the SPEC2000 benchmark programs.

The Opteron is a microprocessor designed for desktops and servers. Even these two related classes of computers have different concerns in a memory hierarchy. Desktop computers are primarily running one application at a time on top of an operating system for a single user, whereas server computers may have hundreds of users running potentially dozens of applications simultaneously. These characteristics result in more context switches, which effectively increase miss rates. Thus, desktop computers are concerned more with average latency from the memory hierarchy, whereas server computers are also concerned about memory bandwidth.
`
`5.2
`
`Eleven Advanced Optimizations of Cache Performance
`
`The average memory access time formula above gives us three metrics for cache
`optimizations: hit time, miss rate, and miss penalty. Given the popularity of super-
`scalar processors, we add cache bandwidth to this list. Hence, we group 11
`advanced cache optimizations into the following categories:
`
■ Reducing the hit time: small and simple caches, way prediction, and trace caches

■ Increasing cache bandwidth: pipelined caches, multibanked caches, and nonblocking caches

■ Reducing the miss penalty: critical word first and merging write buffers

■ Reducing the miss rate: compiler optimizations

■ Reducing the miss penalty or miss rate via parallelism: hardware prefetching and compiler prefetching
`
We will conclude with a summary of the implementation complexity and the performance benefits of the 11 techniques presented (Figure 5.11 on page 309).

First Optimization: Small and Simple Caches to Reduce Hit Time

A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address. Smaller hardware can be faster, so a small cache can help the hit time. It is also critical to keep an L2 cache small enough to fit on the same chip as the processor to avoid the time penalty of going off chip.

The second suggestion is to keep the cache simple, such as using direct mapping. One benefit of direct-mapped caches is that the designer can overlap the tag check with the transmission of the data. This effectively reduces hit time.
`
`
Hence, the pressure of a fast clock cycle encourages small and simple cache designs for first-level caches. For lower-level caches, some designs strike a compromise by keeping the tags on chip and the data off chip, promising a fast tag check, yet providing the greater capacity of separate memory chips.

Although the amount of on-chip cache increased with new generations of microprocessors, the size of the L1 caches has recently not increased between generations. The L1 caches are the same size for three generations of AMD microprocessors: K6, Athlon, and Opteron. The emphasis is on fast clock rate while hiding L1 misses with dynamic execution and using L2 caches to avoid going to memory.

One approach to determining the impact on hit time in advance of building a chip is to use CAD tools. CACTI is a program to estimate the access time of alternative cache structures on CMOS microprocessors within 10% of more detailed CAD tools. For a given minimum feature size, it estimates the hit time of caches as you vary cache size, associativity, and number of read/write ports. Figure 5.4 shows the estimated impact on hit time as cache size and associativity are varied. Depending on cache size, for these parameters the model suggests that hit time for direct mapped is 1.2–1.5 times faster than two-way set associative; two-way is 1.02–1.11 times faster than four-way; and four-way is 1.0–1.08 times faster than fully associative.
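Ratios like these let a designer ask whether a more associative cache's lower miss rate pays for its slower hit. Below is a minimal sketch comparing average memory access time for a direct-mapped and a two-way set-associative cache; the hit-time ratio, miss rates, and miss penalty are assumed values for illustration, not figures taken from the CACTI data.

```python
# Sketch: does higher associativity pay off once its slower hit time is counted?
# All numbers below are assumed for illustration.

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

hit_dm = 1.00                 # direct-mapped hit time (ns)
hit_2w = 1.35 * hit_dm        # assume the two-way cache is 1.35x slower to hit
miss_penalty = 50.0           # ns to service a miss

miss_rate_dm = 0.040          # assumed miss rate, direct mapped
miss_rate_2w = 0.033          # assumed miss rate, two-way

print("direct mapped:", amat(hit_dm, miss_rate_dm, miss_penalty))   # 3.00 ns
print("two-way:      ", amat(hit_2w, miss_rate_2w, miss_penalty))   # 3.00 ns
```

With these assumed numbers the two organizations break even; a larger miss-rate gap or a smaller hit-time ratio would tip the balance toward the two-way cache.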
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
[Figure 5.4 plots access time in nanoseconds (0 to 2.50) against cache size (16 KB to 1 MB) for 1-way, 2-way, 4-way, and 8-way set-associative caches.]

Figure 5.4 Access times as size and associativity vary in a CMOS cache. These data are based on the CACTI model 4.0 by Tarjan, Thoziyoor, and Jouppi [2006]. They assumed 90 nm feature size, a single bank, and 64-byte blocks. The median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way associative caches, respectively.
`
`
`Example
`
`Assume that the hit time of a two-way set-associative first-level