Computer Architecture
A Quantitative Approach

Fourth Edition

John L. Hennessy is the president of Stanford University, where he has been a member of the faculty since 1977 in the departments of electrical engineering and computer science. Hennessy is a Fellow of the IEEE and ACM, a member of the National Academy of Engineering and the National Academy of Sciences, and a Fellow of the American Academy of Arts and Sciences. Among his many awards are the 2001 Eckert-Mauchly Award for his contributions to RISC technology, the 2001 Seymour Cray Computer Engineering Award, and the 2000 John von Neumann Award, which he shared with David Patterson. He has also received seven honorary doctorates.

In 1981, he started the MIPS project at Stanford with a handful of graduate students. After completing the project in 1984, he took a one-year leave from the university to cofound MIPS Computer Systems, which developed one of the first commercial RISC microprocessors. After being acquired by Silicon Graphics in 1991, MIPS Technologies became an independent company in 1998, focusing on microprocessors for the embedded marketplace. As of 2006, over 500 million MIPS microprocessors have been shipped in devices ranging from video games and palmtop computers to laser printers and network switches.

David A. Patterson has been teaching computer architecture at the University of California, Berkeley, since joining the faculty in 1977, where he holds the Pardee Chair of Computer Science. His teaching has been honored by the Abacus Award from Upsilon Pi Epsilon, the Distinguished Teaching Award from the University of California, the Karlstrom Award from ACM, and the Mulligan Education Medal and Undergraduate Teaching Award from IEEE. Patterson received the IEEE Technical Achievement Award for contributions to RISC and shared the IEEE Johnson Information Storage Award for contributions to RAID. He then shared the IEEE John von Neumann Medal and the C & C Prize with John Hennessy. Like his co-author, Patterson is a Fellow of the American Academy of Arts and Sciences, ACM, and IEEE, and he was elected to the National Academy of Engineering, the National Academy of Sciences, and the Silicon Valley Engineering Hall of Fame. He served on the Information Technology Advisory Committee to the U.S. President, as chair of the CS division in the Berkeley EECS department, as chair of the Computing Research Association, and as President of ACM. This record led to a Distinguished Service Award from CRA.

At Berkeley, Patterson led the design and implementation of RISC I, likely the first VLSI reduced instruction set computer. This research became the foundation of the SPARC architecture, currently used by Sun Microsystems, Fujitsu, and others. He was a leader of the Redundant Arrays of Inexpensive Disks (RAID) project, which led to dependable storage systems from many companies. He was also involved in the Network of Workstations (NOW) project, which led to cluster technology used by Internet companies. These projects earned three dissertation awards from the ACM. His current research projects are the RAD Lab, which is inventing technology for reliable, adaptive, distributed Internet services, and the Research Accelerator for Multiple Processors (RAMP) project, which is developing and distributing low-cost, highly scalable, parallel computers based on FPGAs and open-source hardware and software.

Computer Architecture
A Quantitative Approach

Fourth Edition

John L. Hennessy
Stanford University

David A. Patterson
University of California at Berkeley

With Contributions by
Andrea C. Arpaci-Dusseau
University of Wisconsin–Madison
Remzi H. Arpaci-Dusseau
University of Wisconsin–Madison
Krste Asanovic
Massachusetts Institute of Technology
Robert P. Colwell
R&E Colwell & Associates, Inc.
Thomas M. Conte
North Carolina State University
José Duato
Universitat Politècnica de València and Simula
Diana Franklin
California Polytechnic State University, San Luis Obispo
David Goldberg
Xerox Palo Alto Research Center
Wen-mei W. Hwu
University of Illinois at Urbana–Champaign
Norman P. Jouppi
HP Labs
Timothy M. Pinkston
University of Southern California
John W. Sias
University of Illinois at Urbana–Champaign
David A. Wood
University of Wisconsin–Madison

Amsterdam • Boston • Heidelberg • London
New York • Oxford • Paris • San Diego
San Francisco • Singapore • Sydney • Tokyo

Publisher: Denise E. M. Penrose
Project Manager: Dusty Friedman, The Book Company
In-house Senior Project Manager: Brandy Lilly
Developmental Editor: Nate McFadden
Editorial Assistant: Kimberlee Honjo
Cover Design: Elisabeth Beller and Ross Carron Design
Cover Image: Richard I’Anson’s Collection: Lonely Planet Images
Composition: Nancy Logan
Text Design: Rebecca Evans & Associates
Technical Illustration: David Ruppe, Impact Publications
Copyeditor: Ken Della Penta
Proofreader: Jamie Thaman
Indexer: Nancy Ball
Printer: Maple-Vail Book Manufacturing Group

Morgan Kaufmann Publishers is an Imprint of Elsevier
500 Sansome Street, Suite 400, San Francisco, CA 94111

This book is printed on acid-free paper.

© 1990, 1996, 2003, 2007 by Elsevier, Inc.
All rights reserved.

Published 1990. Fourth edition 2007

Designations used by companies to distinguish their products are often claimed as trademarks or registered trademarks. In all instances in which Morgan Kaufmann Publishers is aware of a claim, the product names appear in initial capital or all capital letters. Readers, however, should contact the appropriate companies for more complete information regarding trademarks and registration.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, e-mail: permissions@elsevier.com. You may also complete your request on-line via the Elsevier Science homepage (http://elsevier.com), by selecting “Customer Support” and then “Obtaining Permissions.”

Library of Congress Cataloging-in-Publication Data

Hennessy, John L.
  Computer architecture : a quantitative approach / John L. Hennessy, David A. Patterson ; with contributions by Andrea C. Arpaci-Dusseau . . . [et al.].—4th ed.
    p. cm.
  Includes bibliographical references and index.
  ISBN 13: 978-0-12-370490-0 (pbk. : alk. paper)
  ISBN 10: 0-12-370490-1 (pbk. : alk. paper) 1. Computer architecture. I. Patterson, David A. II. Arpaci-Dusseau, Andrea C. III. Title.

QA76.9.A73P377 2006
004.2'2—dc22

2006024358

For all information on all Morgan Kaufmann publications, visit our website at www.mkp.com or www.books.elsevier.com

Printed in the United States of America
06 07 08 09 10   5 4 3 2 1

Contents

Foreword  ix
Preface  xv
Acknowledgments  xxiii

Chapter 1  Fundamentals of Computer Design
    1.1  Introduction  2
    1.2  Classes of Computers  4
    1.3  Defining Computer Architecture  8
    1.4  Trends in Technology  14
    1.5  Trends in Power in Integrated Circuits  17
    1.6  Trends in Cost  19
    1.7  Dependability  25
    1.8  Measuring, Reporting, and Summarizing Performance  28
    1.9  Quantitative Principles of Computer Design  37
    1.10  Putting It All Together: Performance and Price-Performance  44
    1.11  Fallacies and Pitfalls  48
    1.12  Concluding Remarks  52
    1.13  Historical Perspectives and References  54
          Case Studies with Exercises by Diana Franklin  55

Chapter 2  Instruction-Level Parallelism and Its Exploitation
    2.1  Instruction-Level Parallelism: Concepts and Challenges  66
    2.2  Basic Compiler Techniques for Exposing ILP  74
    2.3  Reducing Branch Costs with Prediction  80
    2.4  Overcoming Data Hazards with Dynamic Scheduling  89
    2.5  Dynamic Scheduling: Examples and the Algorithm  97
    2.6  Hardware-Based Speculation  104
    2.7  Exploiting ILP Using Multiple Issue and Static Scheduling  114
    2.8  Exploiting ILP Using Dynamic Scheduling, Multiple Issue, and Speculation  118
    2.9  Advanced Techniques for Instruction Delivery and Speculation  121
    2.10  Putting It All Together: The Intel Pentium 4  131
    2.11  Fallacies and Pitfalls  138
    2.12  Concluding Remarks  140
    2.13  Historical Perspective and References  141
          Case Studies with Exercises by Robert P. Colwell  142

Chapter 3  Limits on Instruction-Level Parallelism
    3.1  Introduction  154
    3.2  Studies of the Limitations of ILP  154
    3.3  Limitations on ILP for Realizable Processors  165
    3.4  Crosscutting Issues: Hardware versus Software Speculation  170
    3.5  Multithreading: Using ILP Support to Exploit Thread-Level Parallelism  172
    3.6  Putting It All Together: Performance and Efficiency in Advanced Multiple-Issue Processors  179
    3.7  Fallacies and Pitfalls  183
    3.8  Concluding Remarks  184
    3.9  Historical Perspective and References  185
         Case Study with Exercises by Wen-mei W. Hwu and John W. Sias  185

Chapter 4  Multiprocessors and Thread-Level Parallelism
    4.1  Introduction  196
    4.2  Symmetric Shared-Memory Architectures  205
    4.3  Performance of Symmetric Shared-Memory Multiprocessors  218
    4.4  Distributed Shared Memory and Directory-Based Coherence  230
    4.5  Synchronization: The Basics  237
    4.6  Models of Memory Consistency: An Introduction  243
    4.7  Crosscutting Issues  246
    4.8  Putting It All Together: The Sun T1 Multiprocessor  249
    4.9  Fallacies and Pitfalls  257
    4.10  Concluding Remarks  262
    4.11  Historical Perspective and References  264
          Case Studies with Exercises by David A. Wood  264

Chapter 5  Memory Hierarchy Design
    5.1  Introduction  288
    5.2  Eleven Advanced Optimizations of Cache Performance  293
    5.3  Memory Technology and Optimizations  310
    5.4  Protection: Virtual Memory and Virtual Machines  315
    5.5  Crosscutting Issues: The Design of Memory Hierarchies  324
    5.6  Putting It All Together: AMD Opteron Memory Hierarchy  326
    5.7  Fallacies and Pitfalls  335
    5.8  Concluding Remarks  341
    5.9  Historical Perspective and References  342
         Case Studies with Exercises by Norman P. Jouppi  342

Chapter 6  Storage Systems
    6.1  Introduction  358
    6.2  Advanced Topics in Disk Storage  358
    6.3  Definition and Examples of Real Faults and Failures  366
    6.4  I/O Performance, Reliability Measures, and Benchmarks  371
    6.5  A Little Queuing Theory  379
    6.6  Crosscutting Issues  390
    6.7  Designing and Evaluating an I/O System—The Internet Archive Cluster  392
    6.8  Putting It All Together: NetApp FAS6000 Filer  397
    6.9  Fallacies and Pitfalls  399
    6.10  Concluding Remarks  403
    6.11  Historical Perspective and References  404
          Case Studies with Exercises by Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau  404

Appendix A  Pipelining: Basic and Intermediate Concepts
    A.1  Introduction  A-2
    A.2  The Major Hurdle of Pipelining—Pipeline Hazards  A-11
    A.3  How Is Pipelining Implemented?  A-26
    A.4  What Makes Pipelining Hard to Implement?  A-37
    A.5  Extending the MIPS Pipeline to Handle Multicycle Operations  A-47
    A.6  Putting It All Together: The MIPS R4000 Pipeline  A-56
    A.7  Crosscutting Issues  A-65
    A.8  Fallacies and Pitfalls  A-75
    A.9  Concluding Remarks  A-76
    A.10  Historical Perspective and References  A-77

Appendix B  Instruction Set Principles and Examples
    B.1  Introduction  B-2
    B.2  Classifying Instruction Set Architectures  B-3
    B.3  Memory Addressing  B-7
    B.4  Type and Size of Operands  B-13
    B.5  Operations in the Instruction Set  B-14
    B.6  Instructions for Control Flow  B-16
    B.7  Encoding an Instruction Set  B-21
    B.8  Crosscutting Issues: The Role of Compilers  B-24
    B.9  Putting It All Together: The MIPS Architecture  B-32
    B.10  Fallacies and Pitfalls  B-39
    B.11  Concluding Remarks  B-45
    B.12  Historical Perspective and References  B-47

Appendix C  Review of Memory Hierarchy
    C.1  Introduction  C-2
    C.2  Cache Performance  C-15
    C.3  Six Basic Cache Optimizations  C-22
    C.4  Virtual Memory  C-38
    C.5  Protection and Examples of Virtual Memory  C-47
    C.6  Fallacies and Pitfalls  C-56
    C.7  Concluding Remarks  C-57
    C.8  Historical Perspective and References  C-58

Companion CD Appendices
Appendix D  Embedded Systems (Updated by Thomas M. Conte)
Appendix E  Interconnection Networks (Revised by Timothy M. Pinkston and José Duato)
Appendix F  Vector Processors (Revised by Krste Asanovic)
Appendix G  Hardware and Software for VLIW and EPIC
Appendix H  Large-Scale Multiprocessors and Scientific Applications
Appendix I  Computer Arithmetic (by David Goldberg)
Appendix J  Survey of Instruction Set Architectures
Appendix K  Historical Perspectives and References

Online Appendix (textbooks.elsevier.com/0123704901)
Appendix L  Solutions to Case Study Exercises

References  R-1
Index  I-1
`

`

`
`
`
`
`
`
`
`
`
`
`
`
5
Memory Hierarchy Design

Ideally one would desire an indefinitely large memory capacity such that any particular . . . word would be immediately available. . . . We are . . . forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.

A. W. Burks, H. H. Goldstine, and J. von Neumann
Preliminary Discussion of the Logical Design of an Electronic Computing Instrument (1946)

5.1 Introduction

Computer pioneers correctly predicted that programmers would want unlimited amounts of fast memory. An economical solution to that desire is a memory hierarchy, which takes advantage of locality and cost-performance of memory technologies. The principle of locality, presented in the first chapter, says that most programs do not access all code or data uniformly. Locality occurs in time (temporal locality) and in space (spatial locality). This principle, plus the guideline that smaller hardware can be made faster, led to hierarchies based on memories of different speeds and sizes. Figure 5.1 shows a multilevel memory hierarchy, including typical sizes and speeds of access.

Since fast memory is expensive, a memory hierarchy is organized into several levels—each smaller, faster, and more expensive per byte than the next lower level. The goal is to provide a memory system with cost per byte almost as low as the cheapest level of memory and speed almost as fast as the fastest level.

Note that each level maps addresses from a slower, larger memory to a smaller but faster memory higher in the hierarchy. As part of address mapping, the memory hierarchy is given the responsibility of address checking; hence, protection schemes for scrutinizing addresses are also part of the memory hierarchy.

The importance of the memory hierarchy has increased with advances in performance of processors. Figure 5.2 plots processor performance projections against the historical performance improvement in time to access main memory. Clearly, computer architects must try to close the processor-memory gap.

The increasing size and thus importance of this gap led to the migration of the basics of memory hierarchy into undergraduate courses in computer architecture, and even to courses in operating systems and compilers. Thus, we’ll start with a quick review of caches. The bulk of the chapter, however, describes more advanced innovations that address the processor-memory performance gap.

When a word is not found in the cache, the word must be fetched from the memory and placed in the cache before continuing. Multiple words, called a block (or line), are moved for efficiency reasons. Each cache block includes a tag to see which memory address it corresponds to.

Figure 5.1 The levels in a typical memory hierarchy in embedded, desktop, and server computers. As we move farther away from the processor, the memory in the level below becomes slower and larger. Note that the time units change by factors of 10—from picoseconds to milliseconds—and that the size units change by factors of 1000—from bytes to terabytes. (The figure shows CPU registers of about 500 bytes accessed in 250 ps, a 64 KB cache at 1 ns, a 1 GB main memory at 100 ns, and a 1 TB disk at 10 ms, connected by the memory and I/O buses.)

Figure 5.2 Starting with 1980 performance as a baseline, the gap in performance between memory and processors is plotted over time. Note that the vertical axis must be on a logarithmic scale to record the size of the processor-DRAM performance gap. The memory baseline is 64 KB DRAM in 1980, with a 1.07 per year performance improvement in latency (see Figure 5.13 on page 313). The processor line assumes a 1.25 improvement per year until 1986, a 1.52 improvement per year until 2004, and a 1.20 improvement per year thereafter; see Figure 1.1 in Chapter 1. (The plot shows relative performance on a log scale from 1 to 100,000 for the years 1980 to 2010.)

A key design decision is where blocks (or lines) can be placed in a cache. The most popular scheme is set associative, where a set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set. Finding a block consists of first mapping the block address to the set, and then searching the set—usually in parallel—to find the block. The set is chosen by the address of the data:

    (Block address) MOD (Number of sets in cache)

If there are n blocks in a set, the cache placement is called n-way set associative. The end points of set associativity have their own names. A direct-mapped cache has just one block per set (so a block is always placed in the same location), and a fully associative cache has just one set (so a block can be placed anywhere).

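As a concrete illustration, the following short C sketch computes the set to which a block maps using the relation above; the cache geometry (32 KB, 4-way set associative, 64-byte blocks) and the address are assumed values chosen only for illustration.

/* Sketch of set-associative placement: (Block address) MOD (Number of sets).
 * The cache geometry and the address below are illustrative assumptions.   */
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    uint64_t cache_bytes = 32 * 1024;   /* assumed 32 KB cache             */
    uint64_t block_bytes = 64;          /* assumed 64-byte blocks (lines)  */
    uint64_t assoc       = 4;           /* assumed 4-way set associative   */
    uint64_t num_sets    = cache_bytes / (block_bytes * assoc);   /* = 128 */

    uint64_t byte_addr     = 0x2ABCD40;                /* arbitrary example address */
    uint64_t block_address = byte_addr / block_bytes;  /* strip the block offset    */
    uint64_t set           = block_address % num_sets; /* (Block address) MOD sets  */

    printf("number of sets = %" PRIu64 "\n", num_sets);
    printf("block address 0x%" PRIx64 " maps to set %" PRIu64 "\n", block_address, set);

    /* Direct mapped is the special case assoc == 1 (one block per set);
     * fully associative is the special case num_sets == 1 (one set).       */
    return 0;
}
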
Caching data that is only read is easy, since the copy in the cache and memory will be identical. Caching writes is more difficult: how can the copy in the cache and memory be kept consistent? There are two main strategies. A write-through cache updates the item in the cache and writes through to update main memory. A write-back cache only updates the copy in the cache. When the block is about to be replaced, it is copied back to memory. Both write strategies can use a write buffer to allow the cache to proceed as soon as the data is placed in the buffer rather than wait the full latency to write the data into memory.

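As a rough sketch of the two policies, the C fragment below models a single cache line handling stores under write-through and under write-back; the line layout, the mem_write_block helper, and the values in main are hypothetical stand-ins, not a description of any real cache.

/* Sketch of the two write policies for a single cache line.  Everything
 * here (line layout, helper, values) is an assumed, illustrative model.  */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 64

typedef struct {
    uint64_t tag;               /* which memory block this line holds      */
    bool     valid;
    bool     dirty;             /* used only by the write-back policy      */
    uint8_t  data[BLOCK_BYTES];
} cache_line_t;

/* Stand-in for writing a whole block back to main memory. */
static void mem_write_block(uint64_t tag, const uint8_t *data)
{
    (void)data;
    printf("memory write of block with tag 0x%llx\n", (unsigned long long)tag);
}

/* Write-through: update the cached copy and main memory on every store
 * (in practice the memory update usually sits in a write buffer).        */
static void store_write_through(cache_line_t *line, unsigned offset, uint8_t value)
{
    line->data[offset] = value;
    mem_write_block(line->tag, line->data);
}

/* Write-back: update only the cached copy and mark it dirty. */
static void store_write_back(cache_line_t *line, unsigned offset, uint8_t value)
{
    line->data[offset] = value;
    line->dirty = true;
}

/* On replacement, a write-back cache copies a dirty block to memory. */
static void evict(cache_line_t *line)
{
    if (line->valid && line->dirty)
        mem_write_block(line->tag, line->data);
    line->valid = line->dirty = false;
}

int main(void)
{
    cache_line_t line = { .tag = 0xAAF35, .valid = true };
    store_write_through(&line, 0, 1);   /* memory written immediately           */
    store_write_back(&line, 1, 2);      /* memory written only when evicted ... */
    evict(&line);                       /* ... which happens here               */
    return 0;
}
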
One measure of the benefits of different cache organizations is miss rate. Miss rate is simply the fraction of cache accesses that result in a miss—that is, the number of accesses that miss divided by the number of accesses.

To gain insights into the causes of high miss rates, which can inspire better cache designs, the three Cs model sorts all misses into three simple categories:

■ Compulsory—The very first access to a block cannot be in the cache, so the block must be brought into the cache. Compulsory misses are those that occur even if you had an infinite cache.

■ Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses (in addition to compulsory misses) will occur because of blocks being discarded and later retrieved.

■ Conflict—If the block placement strategy is not fully associative, conflict misses (in addition to compulsory and capacity misses) will occur because a block may be discarded and later retrieved if conflicting blocks map to its set.

Figures C.8 and C.9 on pages C-23 and C-24 show the relative frequency of cache misses broken down by the “three C’s.” (Chapter 4 adds a fourth C, for coherency misses due to cache flushes to keep multiple caches coherent in a multiprocessor; we won’t consider those here.)

Alas, miss rate can be a misleading measure for several reasons. Hence, some designers prefer measuring misses per instruction rather than misses per memory reference (miss rate). These two are related:

    Misses/Instruction = (Miss rate × Memory accesses) / Instruction count
                       = Miss rate × (Memory accesses / Instruction)

(It is often reported as misses per 1000 instructions to use integers instead of fractions.) For speculative processors, we only count instructions that commit.

The problem with both measures is that they don’t factor in the cost of a miss. A better measure is the average memory access time:

    Average memory access time = Hit time + Miss rate × Miss penalty

where Hit time is the time to hit in the cache and Miss penalty is the time to replace the block from memory (that is, the cost of a miss). Average memory access time is still an indirect measure of performance; although it is a better measure than miss rate, it is not a substitute for execution time. For example, in Chapter 2 we saw that speculative processors may execute other instructions during a miss, thereby reducing the effective miss penalty.

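The short C sketch below plugs sample numbers into the two relations above—misses per instruction and average memory access time; the miss rate, accesses per instruction, hit time, and miss penalty are illustrative assumptions, not measured values.

/* Sketch: misses per instruction and average memory access time (AMAT).
 * All four inputs are assumed, illustrative values.                        */
#include <stdio.h>

int main(void)
{
    double miss_rate            = 0.02;   /* 2% of memory accesses miss           */
    double mem_accesses_per_ins = 1.5;    /* instruction fetch plus data accesses */
    double hit_time_ns          = 1.0;    /* time to hit in the cache             */
    double miss_penalty_ns      = 100.0;  /* time to fetch the block from memory  */

    /* Misses/Instruction = Miss rate x (Memory accesses / Instruction) */
    double misses_per_1000_ins = miss_rate * mem_accesses_per_ins * 1000.0;

    /* Average memory access time = Hit time + Miss rate x Miss penalty */
    double amat_ns = hit_time_ns + miss_rate * miss_penalty_ns;

    printf("misses per 1000 instructions = %.0f\n", misses_per_1000_ins);  /* 30  */
    printf("average memory access time   = %.1f ns\n", amat_ns);           /* 3.0 */
    return 0;
}
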
If this material is new to you, or if this quick review moves too quickly, see Appendix C. It covers the same introductory material in more depth and includes examples of caches from real computers and quantitative evaluations of their effectiveness.

Section C.3 in Appendix C also presents six basic cache optimizations, which we quickly review here. The appendix also gives quantitative examples of the benefits of these optimizations.

1. Larger block size to reduce miss rate—The simplest way to reduce the miss rate is to take advantage of spatial locality and increase the block size. Note that larger blocks also reduce compulsory misses, but they also increase the miss penalty.

2. Bigger caches to reduce miss rate—The obvious way to reduce capacity misses is to increase cache capacity. Drawbacks include potentially longer hit time of the larger cache memory and higher cost and power.

3. Higher associativity to reduce miss rate—Obviously, increasing associativity reduces conflict misses. Greater associativity can come at the cost of increased hit time.

4. Multilevel caches to reduce miss penalty—A difficult decision is whether to make the cache hit time fast, to keep pace with the increasing clock rate of processors, or to make the cache large, to overcome the widening gap between the processor and main memory. Adding another level of cache between the original cache and memory simplifies the decision (see Figure 5.3). The first-level cache can be small enough to match a fast clock cycle time, yet the second-level cache can be large enough to capture many accesses that would go to main memory. The focus on misses in second-level caches leads to larger blocks, bigger capacity, and higher associativity. If L1 and L2 refer, respectively, to first- and second-level caches, we can redefine the average memory access time:

       Average memory access time = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)

5. Giving priority to read misses over writes to reduce miss penalty—A write buffer is a good place to implement this optimization. Write buffers create hazards because they hold the updated value of a location needed on a read miss—that is, a read-after-write hazard through memory. One solution is to check the contents of the write buffer on a read miss. If there are no conflicts, and if the memory system is available, sending the read before the writes reduces the miss penalty. Most processors give reads priority over writes.

6. Avoiding address translation during indexing of the cache to reduce hit time—Caches must cope with the translation of a virtual address from the processor to a physical address to access memory. (Virtual memory is covered in Sections 5.4 and C.4.) Figure 5.3 shows a typical relationship between caches, translation lookaside buffers (TLBs), and virtual memory. A common optimization is to use the page offset—the part that is identical in both virtual and physical addresses—to index the cache. The virtual part of the address is translated while the cache is read using that index, so the tag match can use physical addresses. This scheme allows the cache read to begin immediately, and yet the tag comparison still uses physical addresses. The drawback of this virtually indexed, physically tagged optimization is that the size of the page limits the size of the cache. For example, a direct-mapped cache can be no bigger than the page size. Higher associativity can keep the cache index in the physical part of the address and yet still support a cache larger than a page.

Figure 5.3 The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access. The page size is 8 KB. The TLB is direct mapped with 256 entries. The L1 cache is a direct-mapped 8 KB, and the L2 cache is a direct-mapped 4 MB. Both use 64-byte blocks. The virtual address is 64 bits and the physical address is 40 bits. The primary difference between this figure and a real memory hierarchy, as in Figure 5.18 on page 327, is higher associativity for caches and TLBs and a smaller virtual address than 64 bits. (Field widths shown in the figure: virtual address <64> = virtual page number <51> + page offset <13>; TLB tag compare address <43>, TLB index <8>, TLB tag <43>, TLB data <27>; L1 cache index <7>, block offset <6>, L1 cache tag <43>, L1 tag compare address <27>, L1 data <512>; physical address <41>; L2 tag compare address <19>, L2 cache index <16>, block offset <6>, L2 cache tag <19>, L2 data <512>.)

For example, doubling associativity while doubling the cache size maintains the size of the index, since it is controlled by this formula:

    2^Index = Cache size / (Block size × Set associativity)

A seemingly obvious alternative is to just use virtual addresses to access the cache, but this can cause extra overhead in the operating system.

Note that each of these six optimizations above has a potential disadvantage that can lead to increased, rather than decreased, average memory access time. (The sketch following this paragraph plugs sample numbers into the index-size formula above and the multilevel access time formula from the fourth optimization.)
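The C sketch below recomputes the index widths of the hypothetical hierarchy of Figure 5.3 from the index-size relation, notes the page-size limit on a virtually indexed, physically tagged cache, and evaluates the two-level average memory access time; the timing values and miss rates are assumptions chosen only for illustration.

/* Sketch: the index-size relation and the two-level AMAT formula.
 * Cache, block, and page sizes follow the hypothetical hierarchy of
 * Figure 5.3; the timing numbers and miss rates are assumed values.       */
#include <stdio.h>
#include <math.h>

/* 2^index = cache size / (block size x set associativity)  =>  index bits */
static int index_bits(double cache_bytes, double block_bytes, double assoc)
{
    return (int)round(log2(cache_bytes / (block_bytes * assoc)));
}

int main(void)
{
    double block = 64.0;                 /* 64-byte blocks */
    double page  = 8.0 * 1024;           /* 8 KB pages     */

    printf("L1 index bits = %d  (Figure 5.3 shows <7>)\n",
           index_bits(8.0 * 1024, block, 1));            /* direct-mapped 8 KB */
    printf("L2 index bits = %d  (Figure 5.3 shows <16>)\n",
           index_bits(4.0 * 1024 * 1024, block, 1));      /* direct-mapped 4 MB */

    /* Virtually indexed, physically tagged: index + block offset must fit in
     * the page offset, so a direct-mapped cache can be at most one page.    */
    printf("largest direct-mapped VIPT cache = %.0f KB\n", page / 1024);

    /* Two-level AMAT with assumed times (in clock cycles) and miss rates:
     * AMAT = Hit_L1 + Miss rate_L1 x (Hit_L2 + Miss rate_L2 x Miss penalty_L2) */
    double hit_L1 = 1.0,  miss_L1 = 0.05;
    double hit_L2 = 10.0, miss_L2 = 0.20, penalty_L2 = 200.0;
    double amat = hit_L1 + miss_L1 * (hit_L2 + miss_L2 * penalty_L2);
    printf("two-level AMAT = %.1f cycles\n", amat);       /* 1 + 0.05*(10+40) = 3.5 */
    return 0;
}
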
The rest of this chapter assumes familiarity with the material above, including Figure 5.3. To put cache ideas into practice, throughout this chapter (and Appendix C) we show examples from the memory hierarchy of the AMD Opteron microprocessor. Toward the end of the chapter, we evaluate the impact of this hierarchy on performance using the SPEC2000 benchmark programs.

The Opteron is a microprocessor designed for desktops and servers. Even these two related classes of computers have different concerns in a memory hierarchy. Desktop computers are primarily running one application at a time on top of an operating system for a single user, whereas server computers may have hundreds of users running potentially dozens of applications simultaneously. These characteristics result in more context switches, which effectively increase miss rates. Thus, desktop computers are concerned more with average latency from the memory hierarchy, whereas server computers are also concerned about memory bandwidth.

5.2 Eleven Advanced Optimizations of Cache Performance

The average memory access time formula above gives us three metrics for cache optimizations: hit time, miss rate, and miss penalty. Given the popularity of superscalar processors, we add cache bandwidth to this list. Hence, we group 11 advanced cache optimizations into the following categories:

■ Reducing the hit time: small and simple caches, way prediction, and trace caches

■ Increasing cache bandwidth: pipelined caches, multibanked caches, and nonblocking caches

■ Reducing the miss penalty: critical word first and merging write buffers

■ Reducing the miss rate: compiler optimizations

■ Reducing the miss penalty or miss rate via parallelism: hardware prefetching and compiler prefetching

We will conclude with a summary of the implementation complexity and the performance benefits of the 11 techniques presented (Figure 5.11 on page 309).

First Optimization: Small and Simple Caches to Reduce Hit Time

A time-consuming portion of a cache hit is using the index portion of the address to read the tag memory and then compare it to the address. Smaller hardware can be faster, so a small cache can help the hit time. It is also critical to keep an L2 cache small enough to fit on the same chip as the processor to avoid the time penalty of going off chip.

The second suggestion is to keep the cache simple, such as using direct mapping. One benefit of direct-mapped caches is that the designer can overlap the tag check with the transmission of the data. This effectively reduces hit time.

Hence, the pressure of a fast clock cycle encourages small and simple cache designs for first-level caches. For lower-level caches, some designs strike a compromise by keeping the tags on chip and the data off chip, promising a fast tag check, yet providing the greater capacity of separate memory chips.

Although the amount of on-chip cache increased with new generations of microprocessors, the size of the L1 caches has recently not increased between generations. The L1 caches are the same size for three generations of AMD microprocessors: K6, Athlon, and Opteron. The emphasis is on fast clock rate while hiding L1 misses with dynamic execution and using L2 caches to avoid going to memory.

One approach to determining the impact on hit time in advance of building a chip is to use CAD tools. CACTI is a program to estimate the access time of alternative cache structures on CMOS microprocessors within 10% of more detailed CAD tools. For a given minimum feature size, it estimates the hit time of caches as you vary cache size, associativity, and number of read/write ports. Figure 5.4 shows the estimated impact on hit time as cache size and associativity are varied. Depending on cache size, for these parameters the model suggests that hit time for direct mapped is 1.2–1.5 times faster than two-way set associative; two-way is 1.02–1.11 times faster than four-way; and four-way is 1.0–1.08 times faster than fully associative.

Figure 5.4 Access times as size and associativity vary in a CMOS cache. These data are based on the CACTI model 4.0 by Tarjan, Thoziyoor, and Jouppi [2006]. They assumed 90 nm feature size, a single bank, and 64-byte blocks. The median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way associative caches, respectively. (The plot shows access time in nanoseconds, from 0 to 2.50 ns, for 1-way, 2-way, 4-way, and 8-way caches ranging from 16 KB to 1 MB.)

Example

Assume that the hit time of a two-way set-associative first-level
