`COMPUTING SURVEYS
`
`The Survey and Tutorial Journal of the ACM
`
`Volume 14
`
`Number 3
`
`September 1982
`
`Editor-in-Chief
`ADELE GOLDBERG
`Xerox PARC
`3333 Coyote Hill Road
`Palo Alto, CA 94304
`
`Executive Editor
JANET G. BENTON
ACM, 11 West 42nd Street
New York, NY 10036
`
`Computing Surveys Associate Editors
`
`ALAN BORNING
`University of Washington
`Department of Computer
`Science FR-35
`Seattle, WA 98195
`
`HERBERT S. BRIGHT
`Computation Planning, Inc.
`7840 Aberdeen Road
`Bethesda, MD 20014
`
`ROY LEVIN
`Xerox PARC
`3333 Coyote Hill Road
`Palo Alto, CA 94304
`
`RICHARD R. MUNTZ
`Dept. Computer Science
`UCLA
`Los Angeles, CA 90024
`
`ANTHONY I. WASSERMAN
`Medical Info Science
`Room A-16
`University of California
`San Francisco, CA 94143
`
`BRUCE W. WEIDE
`Computer and Information
`Sciences
`Ohio State University
`Columbus, OH 43210
`
`ERIC A. WEISS
`Sun Company
`100 Matsonford Road
`Radnor, PA 19087
`
`Headquarters Editorial Staff
`Jane Carlton, Administrative Assistant
`Marilyn Salmansohn, Copy Chief
`Donald Werkheiser, Production Editor
`Catherine Yunque, Publications Production Coordinator
Regular contributions or short notes should be sent to the Editor-in-Chief. Submit four copies of manuscripts, typed double-spaced. Comments on papers previously published in this journal should be sent to the Editor-in-Chief to be considered for the Surveyors' Forum. Correspondence about accepted papers should be addressed to the Executive Editor.
Subscription rates (annual): ACM members, $10.00 (single copies $7.00); nonmembers, $50.00 (single copies $14.00).
When payment is included with membership applications, renewal notices, or invoices, please use the address Association for Computing Machinery, P.O. Box 12115, Church Street Station, New York, NY 10249.
Notices of address changes and requests for membership applications and subscription information should be addressed to Association for Computing Machinery, 11 West 42nd Street, New York, NY 10036. Allow 6 to 8 weeks for change of name and address or new membership to become effective. Send old label with new address notification. To avoid service interruption, notify your local Post Office before change of residence; for a fee, the Post Office will forward 2d and 3d class periodicals. The expiration date of each subscription appears on the top line of the mailing label. Note the four-digit code used: yymm (the first two digits indicate the year and the last two the month of expiration).
Computing Surveys (ISSN 0010-4892) is published quarterly in March, June, September, and December at Mt. Royal & Guilford Aves., Baltimore, MD 21202 by the Association for Computing Machinery, Inc. Copyright © 1982 by the Association for Computing Machinery, Inc. Copying without fee is permitted provided that the copies are not made or distributed for direct commercial advantage and credit to the source is given. Abstracting with credit is permitted. For other copying of articles that carry a code at the bottom of the first page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, P.O. Box 765, Schenectady, NY 12301.
For permission to republish write to: Director of Publications, Association for Computing Machinery. To copy otherwise, or republish, requires a fee and/or specific permission.
Postmaster: Send Form 3579 to Computing Surveys, ACM, 11 West 42nd Street, New York, NY 10036.
`
`ASSOCIATION FOR COMPUTING
`MACHINERY
Founded in 1947 as the society of the
`computing community, the
`Association for Computing Machinery
`is dedicated to the development of
`information processing as a discipline,
`and to the responsible use of
`computers in an increasing diversity
`of applications.
`
`COUNCIL
`David H. Brandin, President
`Paul W. Abrahams, Vice-President
Adele J. Goldberg, Secretary
Aaron Finerman, Treasurer
M. Stuart Lynn, Chairman,
Publications Board
Chairman, SIG Board (vacant)
Peter J. Denning, Past President
`Members-at-Large
`Robert L. Ashenhurst (1982-84)
Herbert R. J. Grosch (1980-84)
`David K. Hsiao (1980-84)
`John A. N. (JAN) Lee (1982-86)
`Toni Shetler (1982-86)
`Evelyn A. Swan (1982-86)
`Regional Representatives
`Thomas A. D'Auria (1982-85)
`Dennis Frailey (1981-84)
`Frank L. Friedman (1982-85)
`Marsha D. Hopwood (1982-85)
Bryan S. Kocher (1981-84)
`Lanie Mischke (1980-83)
`Robert D. Parslow (1980-83)
`Robert E. Sargent (1981-84)
`Charles S. Williams (1980-83)
`David C. Wood (1980-83)
`Stephen N. Zilles (1981-84)
`Stuart H. Zweben (1982-85)
`
`EXECUTIVE DIRECTOR
`Sidney Weinstein
`DEPUTY EXECUTIVE
`DIRECTOR
`John McKegney
`EXECUTIVE SECRETARY
`Irene Hollister
`DIRECTOR OF PUBLICATIONS
`Mark Mandelbaum
`PUBLICATIONS BOARD
`M. Stuart Lynn, Chairman
`Robert L. Ashenhurst
`Michael D. Cooper
`John A. Gosden
Susan L. Graham
`Raymond E. Miller
`Christine A. Montgomery
`John R. Rice
Evelyn A. Swan
`Stephen N. Zilles
`
OTHER ACM PUBLICATIONS
`Communications of the ACM
`Computing Reviews
`Journal of the Association for
`Computing Machinery
`Transactions on Computer Systems
`Transactions on Database Systems
`Transactions on Graphics
`Transactions on Mathematical
`Software
`Transactions on Office Information
`Systems
`Transactions on Programming
`Languages and Systems
`ACM Guide to Computing Literature
(published annually; formerly
`Computing Reviews Bibliography
`and Subject Index of Current
`Computing Literature)
`Collected Algorithms from ACM
`ACM Monograph Series
`Proceedings of conferences
`Special publications
`
`Cache Memories
`
`ALAN JAY SMITH
`
`University of California, Berkeley, California 94720
`
Cache memories are used in modern, medium and high-speed CPUs to hold temporarily those portions of the contents of main memory which are (believed to be) currently in use. Since instructions and data in cache memories can usually be referenced in 10 to 25 percent of the time required to access main memory, cache memories permit the execution rate of the machine to be substantially increased. In order to function effectively, cache memories must be carefully designed and implemented. In this paper, we explain the various aspects of cache memories and discuss in some detail the design features and trade-offs. A large number of original, trace-driven simulation results are presented. Consideration is given to practical implementation questions as well as to more abstract design issues.
Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
`
Categories and Subject Descriptors: B.3.2 [Memory Structures]: Design Styles-cache memories; B.3.3 [Memory Structures]: Performance Analysis and Design Aids; C.0 [Computer Systems Organization]: General; C.4 [Computer Systems Organization]: Performance of Systems
General Terms: Design, Experimentation, Measurement, Performance
Additional Key Words and Phrases: Buffer memory, paging, prefetching, TLB, store-through, Amdahl 470, IBM 3033, BIAS
`
`INTRODUCTION
`
`Definition and Rationale
Cache memories are small, high-speed buffer memories used in modern computer systems to hold temporarily those portions of the contents of main memory which are (believed to be) currently in use. Information located in cache memory may be accessed in much less time than that located in main memory (for reasons discussed throughout this paper). Thus, a central processing unit (CPU) with a cache memory needs to spend far less time waiting for instructions and operands to be fetched and/or stored. For example, in typical large, high-speed computers (e.g., Amdahl 470V/7, IBM 3033), main memory can be accessed in 300 to 600 nanoseconds; information can be obtained from a cache, on the other hand, in 50 to 100 nanoseconds. Since the performance of such machines is already limited in instruction execution rate by cache memory access time, the absence of any cache memory at all would produce a very substantial decrease in execution speed.
`
`Permission to copy without fee all or part of this material is granted provided that the copies are not made or
`distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its
`date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To
`copy otherwise, or to republish, requires a fee and/or specific permission.
`© 1982 ACM 0010-4892/82/0900-0473 $00.75
`
`
`CONTENTS
`
INTRODUCTION
Definition and Rationale
Overview of Cache Design
Cache Aspects
1. DATA AND MEASUREMENTS
1.1 Rationale
1.2 Trace-Driven Simulation
1.3 Simulation Evaluation
1.4 The Traces
1.5 Simulation Methods
2. ASPECTS OF CACHE DESIGN AND OPERATION
2.1 Cache Fetch Algorithm
2.2 Placement Algorithm
2.3 Line Size
2.4 Replacement Algorithm
2.5 Write-Through versus Copy-Back
2.6 Effect of Multiprogramming: Cold-Start and Warm-Start
2.7 Multicache Consistency
2.8 Data/Instruction Cache
2.9 Virtual Address Cache
2.10 User/Supervisor Cache
2.11 Input/Output through the Cache
2.12 Cache Size
2.13 Cache Bandwidth, Data Path Width, and Access Resolution
2.14 Multilevel Cache
2.15 Pipelining
2.16 Translation Lookaside Buffer
2.17 Translator
2.18 Memory-Based Cache
2.19 Specialized Caches and Cache Components
3. DIRECTIONS FOR RESEARCH AND DEVELOPMENT
3.1 On-Chip Cache and Other Technology Advances
3.2 Multicache Consistency
3.3 Implementation Evaluation
3.4 Hit Ratio versus Size
3.5 TLB Design
3.6 Cache Parameters versus Architecture and Workload
APPENDIX. EXPLANATION OF TRACE NAMES
ACKNOWLEDGMENTS
REFERENCES
`
Virtually all modern large computer systems have cache memories; for example, the Amdahl 470, the IBM 3081 [IBM82, REIL82, GUST82], 3033, 370/168, 360/195, the Univac 1100/80, and the Honeywell 66/80. Also, many medium and small size machines have cache memories; for example, the DEC VAX 11/780, 11/750 [ARMS81], and PDP-11/70 [STRE76, SNOW78], and the Apollo, which uses a Motorola 68000 microprocessor. We believe that within two to four years, circuit speed and density will progress sufficiently to permit cache memories in one-chip microcomputers. (On-chip addressable memory is planned for the Texas Instruments 99000 [LAFF81, ELEC81].) Even microcomputers could benefit substantially from an on-chip cache, since on-chip access times are much smaller than off-chip access times. Thus, the material presented in this paper should be relevant to almost the full range of computer architecture implementations.
The success of cache memories has been explained by reference to the "property of locality" [DENN72]. The property of locality has two aspects, temporal and spatial. Over short periods of time, a program distributes its memory references nonuniformly over its address space, and which portions of the address space are favored remain largely the same for long periods of time. This first property, called temporal locality, or locality by time, means that the information which will be in use in the near future is likely to be in use already. This type of behavior can be expected from program loops in which both data and instructions are reused. The second property, locality by space, means that portions of the address space which are in use generally consist of a fairly small number of individually contiguous segments of that address space. Locality by space, then, means that the loci of reference of the program in the near future are likely to be near the current loci of reference. This type of behavior can be expected from common knowledge of programs: related data items (variables, arrays) are usually stored together, and instructions are mostly executed sequentially. Since the cache memory buffers segments of information that have been recently used, the property of locality implies that needed information is also likely to be found in the cache.
Optimizing the design of a cache memory generally has four aspects:
(1) Maximizing the probability of finding a memory reference's target in the cache (the hit ratio),
(2) minimizing the time to access information that is indeed in the cache (access time),
(3) minimizing the delay due to a miss, and
`(4) minimizing the overheads of updating
`main memory, maintaining multicache
`consistency, etc.
`(All of these have to be accomplished
`under suitable cost constraints, of course.)
There is also a trade-off between hit ratio and access time. This trade-off has not been sufficiently stressed in the literature, and it is one of our major concerns in this paper; the sketch below illustrates how the two quantities combine to determine performance.
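To see the orders of magnitude involved, consider the effective access time of a cache/main memory pair. The following sketch is our illustration, not the paper's; the timing values are simply midpoints of the ranges quoted above, and the model ignores overlap and line transfer time.

    # Effective access time of a two-level memory system (illustrative only).
    T_CACHE = 80e-9   # cache access time in seconds (assumed)
    T_MAIN = 450e-9   # main memory access time in seconds (assumed)

    def effective_access_time(hit_ratio):
        # Every reference pays the cache access time; a miss additionally
        # pays a main memory access.
        return T_CACHE + (1.0 - hit_ratio) * T_MAIN

    for h in (0.90, 0.95, 0.99):
        print("hit ratio %.2f -> %3.0f ns" % (h, effective_access_time(h) * 1e9))

Even at a 99 percent hit ratio, miss traffic contributes measurably to the average, which is why hit ratio and access time must be traded off together rather than optimized in isolation.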
In this paper, each aspect of cache memories is discussed at length and, where available, measurement results are presented. In order for these detailed discussions to be meaningful, a familiarity with many of the aspects of cache design is required. In the remainder of this section, we explain the operation of a typical cache memory, and then we briefly discuss several aspects of cache memory design. These discussions are expanded upon in Section 2. At the end of this paper, there is an extensive bibliography in which we have attempted to cite all relevant literature. Not all of the items in the bibliography are referenced in the paper, although we have referred to items there as appropriate. The reader may wish in particular to refer to BADE79, BARS72, GIBS67, and KAPL73 for other surveys of some aspects of cache design. CLAR81 is particularly interesting as it discusses the design details of a real cache. (See also LAMP80.)
`
`Overview of Cache Design
`
Many CPUs can be partitioned, conceptually and sometimes physically, into three parts: the I-unit, the E-unit, and the S-unit. The I-unit (instruction) is responsible for instruction fetch and decode. It may have some local buffers for lookahead prefetching of instructions. The E-unit (execution) does most of what is commonly referred to as executing an instruction, and it contains the logic for arithmetic and logical operations. The S-unit (storage) provides the memory interface between the I-unit and E-unit. (IBM calls the S-unit the PSCF, or processor storage control function.)

The S-unit is the part of the CPU of primary interest in this paper. It contains several parts or functions, some of which are shown in Figure 1. The major component of the S-unit is the cache memory; there is usually a translator, which translates virtual to real memory addresses, and a TLB (translation lookaside buffer) which buffers (caches) recently generated (virtual address, real address) pairs. Depending on machine design, there can be an ASIT (address space identifier table), a BIAS (buffer invalidation address stack), and some write-through buffers. Each of these is discussed in later sections of this paper.

[Figure 1. A typical CPU design and the S-unit: the I-unit and E-unit access main memory through the S-unit, which contains the cache, the TLB, the translator, and the write-through buffers.]
Figure 2 is a diagram of portions of a typical S-unit, showing only the more important parts and data paths, in particular the cache and the TLB. This design is typical of that used by IBM (in the 370/168 and 3033) and by Amdahl (in the 470 series). Figure 3 is a flowchart that corresponds to the operation of the design in Figure 2. A discussion of this flowchart follows.
The operation of the cache commences with the arrival of a virtual address, generally from the CPU, and the appropriate control signal. The virtual address is passed to both the TLB and the cache storage. The TLB is a small associative memory which maps virtual to real addresses. It is often organized as shown, as a number of groups (sets) of elements, each consisting of a virtual address and a real address. The TLB accepts the virtual page number, randomizes it, and uses that hashed number to select a set of elements. That set of elements is then searched associatively for a match to the virtual address. If a match is found, the corresponding real address is passed along to the comparator to determine whether the target line is in the cache. Finally, the replacement status of each entry in the TLB set is updated.

If the TLB does not contain the (virtual address, real address) pair needed for the translation, then the translator (not shown in Figure 2) is invoked. It uses the high-order bits of the virtual address as an entry into the segment and page tables for the process and then returns the address pair to the TLB (which retains it for possible future use), thus replacing an existing TLB entry.

[Figure 2. A typical cache and TLB design: the virtual address from the CPU is split into page number, line number, and byte fields; the hashed page number selects a TLB set, which is searched associatively to produce the real address, while the line number selects a cache set whose address tags are compared with that real address to select, byte-align, and read out the target data.]

[Figure 3. Cache operation flow chart: receive the virtual address; hash the page number and search the TLB, sending the virtual address to the translator (which uses the page and segment tables) on a TLB miss; use the line number to select a cache set and read out its address tags; compare addresses and, on a hit, update the replacement status and select the correct bytes from the line; on a miss, send the real address to main memory, receive the line, store it in the cache, and then select the correct bytes.]
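A software model makes this lookup sequence concrete. The sketch below is ours and only approximates the hardware: the set count, set size, hash function, and the page_table dictionary standing in for the segment and page tables are all assumptions.

    NUM_SETS = 64   # TLB sets selected by the hashed page number (assumed)
    SET_SIZE = 4    # entries searched associatively within a set (assumed)

    # Each set is a list of (virtual page, real page) pairs kept most
    # recently used first, so list order is the replacement status.
    tlb = [[] for _ in range(NUM_SETS)]

    def translate(virtual_page, page_table):
        s = hash(virtual_page) % NUM_SETS            # randomize the page number
        for i, (v, r) in enumerate(tlb[s]):          # associative search of the set
            if v == virtual_page:
                tlb[s].insert(0, tlb[s].pop(i))      # update replacement status
                return r
        real_page = page_table[virtual_page]         # miss: invoke the translator
        tlb[s].insert(0, (virtual_page, real_page))  # retain the pair for reuse,
        if len(tlb[s]) > SET_SIZE:                   # replacing an existing entry
            tlb[s].pop()
        return real_page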
The virtual address is also passed along initially to a mechanism which uses the middle part of the virtual address (the line number) as an index to select a set of entries in the cache. Each entry consists primarily of a real address tag and a line of data (see Figure 4). The line is the quantum of storage in the cache. The tags of the elements of the selected set are read into a comparator and compared with the real address from the TLB. (Sometimes the cache storage stores the data and address tags together, as shown in Figures 2 and 4. Other times, the address tags and data are stored separately in the "address array" and "data array," respectively.) If a match is found, the line (or a part of it) containing the target locations is read into a shift register and the replacement status of the entries in the cache set is updated. The shift register is then shifted to select the target bytes, which are in turn transmitted to the source of the original data request.
If a miss occurs (i.e., the address tags in the cache do not match), then the real address of the desired line is transmitted to the main memory. The replacement status information is used to determine which line to remove from the cache to make room for the target line. If the line to be removed from the cache has been modified, and main memory has not yet been updated with the modification, then the line is copied back to main memory; otherwise, it is simply deleted from the cache. After some number of machine cycles, the target line arrives from main memory and is loaded into the cache storage. The line is also passed to the shift register for the target bytes to be selected.
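The read path just described can be summarized in a few lines of code. This is our sketch, not any machine's implementation: the geometry is invented, the shift register is reduced to byte indexing, and main memory is modeled as a dictionary of lines.

    LINE_SIZE = 32   # bytes per line (assumed)
    NUM_SETS = 128   # sets selected by the line-number field (assumed)
    SET_SIZE = 4     # entries per set (assumed)

    # Each set is a list of entries, most recently used first; an entry is
    # [tag, dirty, data] (cf. Figure 4, with residence in the list standing
    # in for the valid bit).
    cache = [[] for _ in range(NUM_SETS)]

    def read_byte(real_addr, main_memory):
        line_addr, byte = divmod(real_addr, LINE_SIZE)
        s = line_addr % NUM_SETS         # middle bits select the set
        tag = line_addr // NUM_SETS      # high bits form the address tag
        for i, entry in enumerate(cache[s]):
            if entry[0] == tag:                       # compare address tags
                cache[s].insert(0, cache[s].pop(i))   # update replacement status
                return entry[2][byte]                 # select the target byte
        # Miss: fetch the target line from main memory.
        data = main_memory[line_addr]
        cache[s].insert(0, [tag, False, data])
        if len(cache[s]) > SET_SIZE:
            vtag, dirty, line = cache[s].pop()        # line chosen for removal
            if dirty:                                 # modified line: copy back
                main_memory[vtag * NUM_SETS + s] = line
        return data[byte]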
`
`Cache Aspects
`
`The cache description given above is both
`simplified and specific; it does not show
`design alternatives. Below, we point out
`some of the design alternatives for the
`cache memory.
[Figure 4. Structure of cache entry and cache set: a cache entry consists of a real address tag, the data, and a valid bit; a cache set consists of entries 1 through E together with their replacement status.]

Cache Fetch Algorithm. The cache fetch algorithm is used to decide when to bring information into the cache. Several possibilities exist: information can be fetched on
`demand (when it is needed) or prefetched
(before it is needed). Prefetch algorithms attempt to guess what information will soon be needed and obtain it in advance. It is also possible for the cache fetch algorithm to omit fetching some information (selective fetch) and designate some information, such as shared writeable code (semaphores), as unfetchable. Further, there may be no fetch-on-write in systems which use write-through (see below).
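As a concrete contrast between demand fetching and prefetching, the sketch below (ours; the names resident and fetch_line are assumptions, with fetch_line taken to bring a line into the cache and add it to resident) implements the simplest of the prefetch rules discussed later in the paper, one-block lookahead.

    def reference_line(line_addr, resident, fetch_line):
        # Demand fetch: bring the line in only when it is actually referenced.
        if line_addr not in resident:
            fetch_line(line_addr)
        # One-block-lookahead prefetch: guess that the sequentially next
        # line will soon be needed and obtain it in advance.
        if line_addr + 1 not in resident:
            fetch_line(line_addr + 1)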
Cache Placement Algorithm. Information is generally retrieved from the cache associatively, and because large associative memories are usually very expensive and somewhat slow, the cache is generally organized as a group of smaller associative memories. Thus, only one of the associative memories has to be searched to determine whether the desired information is located in the cache. Each such (small) associative memory is called a set and the number of elements over which the associative search is conducted is called the set size. The placement algorithm is used to determine in which set a piece (line) of information will be placed. Later in this paper we consider the problem of selecting the number of sets, the set size, and the placement algorithm in such a set-associative memory.
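For the common bit-selection placement, the computation is trivial; the sketch below (our illustration, with invented parameters) shows how the number of sets and the set size interact and how the two extremes of the design space arise.

    NUM_SETS = 64    # number of small associative memories (assumed)
    SET_SIZE = 2     # elements searched associatively per set (assumed)
    LINE_SIZE = 32   # bytes per line; capacity = NUM_SETS * SET_SIZE * LINE_SIZE

    def placement(real_addr):
        # Bit selection: the low-order bits of the line address name the one
        # set that must be searched, so only SET_SIZE tags are ever compared.
        # SET_SIZE = 1 would give a direct-mapped cache; NUM_SETS = 1 would
        # give a fully associative one.
        return (real_addr // LINE_SIZE) % NUM_SETS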
Line Size. The fixed-size unit of information transfer between the cache and main memory is called the line. The line corresponds conceptually to the page, which is the unit of transfer between the main memory and secondary storage. Selecting the line size is an important part of the memory system design. (A line is also sometimes referred to as a block.)
Replacement Algorithm. When information is requested by the CPU from main memory and the cache is full, some information in the cache must be selected for replacement. Various replacement algorithms are possible, such as FIFO (first in, first out), LRU (least recently used), and random. Later, we consider the first two of these.
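The operational difference between the first two policies is small enough to state in code. In this sketch (ours), a set's entries live in an ordered dictionary; LRU renews an entry's position on every reference, whereas FIFO fixes its position when the line is first placed.

    from collections import OrderedDict

    def access(stack, line, policy, set_size):
        # Returns the victim line if this access forces a replacement.
        if line in stack:
            if policy == "LRU":
                stack.move_to_end(line)   # any reference renews the entry
            return None                   # under FIFO, hits change nothing
        stack[line] = True                # place the newly fetched line
        if len(stack) > set_size:
            victim, _ = stack.popitem(last=False)  # evict the oldest entry
            return victim
        return None

    s = OrderedDict()
    for ref in (1, 2, 3, 1, 4):   # with set_size=3: LRU evicts 2, FIFO evicts 1
        access(s, ref, "LRU", 3)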
Main Memory Update Algorithm. When the CPU performs a write (store) to memory, that operation can actually be reflected in the cache and main memories in a number of ways. For example, the cache memory can receive the write and the main memory can be updated when that line is replaced in the cache. This strategy is known as copy-back. Copy-back may also require that the line be fetched if it is absent from the cache (i.e., fetch-on-write). Another strategy, known as write-through, immediately updates main memory when a write occurs. Write-through may specify that if the information is in the cache, the cache be either updated or purged. If the information is not in the cache, it may or may not be fetched. The choice between copy-back and write-through strategies is also influenced by the need to maintain consistency among the cache memories in a tightly coupled multiprocessor system. This requirement is discussed later.
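The two update strategies differ in when main memory sees the store; the sketch below (ours, with a dictionary-of-lines memory model and an assumed dirty flag) contrasts them. For brevity it takes the write-through variant that updates, rather than purges, a resident cache copy, and it assumes copy-back has already performed any fetch-on-write.

    def write_byte(entry, main_memory, line_addr, byte, value, policy):
        if policy == "write-through":
            main_memory[line_addr][byte] = value  # memory updated immediately
            if entry is not None:                 # if the line is cached,
                entry["data"][byte] = value       # update the cache copy too
        elif policy == "copy-back":
            entry["data"][byte] = value
            entry["dirty"] = True   # memory is updated only when the line is
                                    # later replaced (see the read-path sketch)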
Cold-Start versus Warm-Start Miss Ratios and Multiprogramming. Most computer systems with cache memories are multiprogrammed; many processes run on the CPU, though only one can run at a time, and they alternate every few milliseconds. This means that a significant fraction of the cache miss ratio is due to loading data and instructions for a new process, rather than to a single process which has been running for some time. Miss ratios that are measured when starting with an empty cache are called cold-start miss ratios, and those that are measured from the time the cache becomes full are called warm-start miss ratios. Our simulation studies consider this multiprogramming environment.
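In a trace-driven simulation the two ratios come from the same run; only the point at which counting begins differs. The sketch below (ours) uses a fully associative LRU cache for simplicity and treats "warm" as beginning once the cache has filled.

    from collections import OrderedDict

    def miss_ratios(trace, capacity):
        # trace: sequence of line addresses; capacity: lines in the cache.
        cache = OrderedDict()
        refs = cold_misses = warm_refs = warm_misses = 0
        for line in trace:
            refs += 1
            warm = len(cache) >= capacity   # measure only after the cache fills
            if warm:
                warm_refs += 1
            if line in cache:
                cache.move_to_end(line)     # LRU update on a hit
            else:
                cold_misses += 1
                if warm:
                    warm_misses += 1
                cache[line] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)
        return cold_misses / refs, warm_misses / max(warm_refs, 1)

The cold-start ratio counts every miss from the empty cache, so for short traces it overstates the steady-state behavior that the warm-start ratio tries to capture.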
User/Supervisor Cache. The frequent switching between user and supervisor state in most systems results in high miss ratios because the cache is often reloaded (i.e., cold-start). One way to address this is to incorporate two cache memories, and allow the supervisor to use one cache and the user programs to use the other. Potentially, this could result in both the supervisor and the user programs more frequently finding upon initiation what they need in the cache.
Multicache Consistency. A multiprocessor system with multiple caches faces the problem of making sure that all copies of a given piece of information (which potentially could exist in every cache, as well as in the main memory) are the same. A modification of any one of these copies should somehow be reflected in all others. A number of solutions to this problem are possible. The three most popular solutions are essentially: (1) to transmit all stores to all caches and memories, so that all copies are updated; (2) to transmit the addresses of all stores to all other caches, and purge the corresponding lines from all other caches; or (3) to permit data that are writeable (page or line flagged to permit modification) to be in only one cache at a time. A centralized or distributed directory may be used to control making and updating of copies.
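Of the three, solution (2) is the easiest to state compactly. The sketch below (ours; each cache is modeled as a dictionary of resident lines, and stores are assumed to be written through to memory) broadcasts the store address and purges the line from every other cache.

    def store(local_cache, other_caches, main_memory, line_addr, data):
        local_cache[line_addr] = data   # update the writer's own copy
        main_memory[line_addr] = data   # write-through keeps memory current
        for cache in other_caches:
            # Broadcast the address of the store; any other cache holding
            # the line purges its now-stale copy rather than updating it.
            cache.pop(line_addr, None)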
Input/Output. Input/output (from and to I/O devices) is an additional source of references to information in memory. It is important that an output request stream reference the most current values for the information transferred. Similarly, it is also important that input data be immediately reflected in any and all copies of those lines in memory. Several solutions to this problem are possible. One is to direct the I/O stream through the cache itself (in a single processor system); another is to use a write-through policy and broadcast all writes so as to update or invalidate the target line wherever found. In the latter case, the channel accesses main memory rather than the cache.
Data/Instruction Cache. Another cache design strategy is to split the cache into two parts: one for data and one for instructions. This has the advantages that the bandwidth of the cache is increased and the access time (for reasons discussed later) can be decreased. Several problems occur: the overall miss ratio may increase, the two caches must be kept consistent, and self-modifying code and execute instructions must be accommodated.
`
Virtual versus Real Addressing. In computer systems with virtual memory, the cache may potentially be accessed either with a real address (real address cache) or a virtual address (virtual address cache). If real addresses are to be used, the virtual addresses generated by the processor must first be translated as in the example above (Figure 2); this is generally done by a TLB. The TLB is itself a cache memory which stores recently used address translation information, so that translation can occur quickly. Direct virtual address access is faster (since no translation is needed), but causes some problems. In a virtual address cache, inverse mapping (real to virtual address) is sometimes needed; this can be done by an RTB (reverse translation buffer).
Cache Size. It is obvious that the larger the cache, the higher the probability of finding the needed information in it. Cache sizes cannot be expanded without limit, however, for several reasons: cost (the most important reason in many machines, especially small ones), physical size (the cache must fit on the boards and in the cabinets), and access time. (The larger the cache, the slower it may become. Reasons for this are discussed in Section 2.12.) Later, we address the question of how large is large enough.
Multilevel Cache. As the cache grows in size, there comes a point where it may be usefully split into two levels: a small, high-level cache, which is faster, smaller, and more expensive per byte, and a larger, second-level cache. This two-level cache structure solves some of the problems that afflict caches when they become too large.
Cache Bandwidth. The cache bandwidth is the rate at which data can be read from and written to the cache. The bandwidth must be sufficient to support the proposed rate of instruction execution and I/O. Bandwidth can be improved by increasing the width of the data path, interleaving the cache, and decreasing access time.
`
`1. DATA AND MEASUREMENTS
`
`1.1 Rationale
As noted earlier, our in-depth studies of some aspects of cache design and optimization are based on extensive trace-driven simulation. In this section, we explain the importance of this approach, and then discuss the presentation of our results.
One difficulty in providing definitive statements about aspects of cache operation is that the effectiveness of a cache memory depends on the workload of the computer system; further, to our knowledge, there has never been any (public) effort to characterize that workload with respect to its effect on the cache memory. Along the same lines, there is no generally accepted model for program behavior, and still less is there one for its effect on the uppermost level of the memory hierarchy. (But see AROR72 for some measurements, and LEHM78 and LEHM80, in which a model is used.)
For these reasons, we believe that for many aspects of cache design it is possible to make statements about relative performance only when those statements are based on trace-driven simulation or direct measurement. We have therefore tried throughout, when examining certain aspects of cache memories, to present a large number of simulation results and, if possible, to generalize from those measurements. We have also made an effort to locate and reference other measurement and trace-driven simulation results reported in the literature. The reader may wish, for example, to read WIND73, in which that author discusses the set of data used for his simulations.
`
`1.2 Trace-Driven Simulation
Trace-driven simulation is an effective method for evaluating the behavior of a memory hierarchy. A trace is usually gathered by interpretively executing a program and recording every main memory location referenced by the program during its execution. (Each address may be tagged in any way desired, e.g., instructio