`
`•
`
`A. J. Smith
`
`VARY MP QUANTUM SIZE
`
`8:b~~
`0.080
`0.070
`lSl 0.060
`........ 0.050
`I-
`< 0.040
`0.030
`
`~
`
`(f)
`(f)
`........
`:I:
`
`0.020
`
`8:Bb~
`
`0
`
`CG01 ... 3, PG02
`64 SETS, 32 BYTE LINES
`
`.... --
`
`. ..... -.. ..... ·-
`
`Q-100000
`
`Q-25~0~~ / - - - - - - - - - ,
`
`'
`
`60000
`
`40000
`20000
`MEM0RY CAPACITY
`Figure 23. Miss ratio versus memory capacity for range of multiprogramming intervals Q.
`
[Figure 24. Miss ratio versus memory capacity for variety of multiprogramming intervals Q. Traces WFV-APL-WTX-FFT; 64 sets, 32-byte lines.]

Each processor may have zero, one, or several caches. Unfortunately, in such a multiple-processor system, a given piece of information may exist in several places at a given time, and it is important that all processors have access (as necessary) to the same, unique (at a given time) value. Several solutions exist and/or have been proposed for this problem. In this section, we discuss many of these solutions; the interested reader should refer to BEAN79, CENS78, DRIM81a, DUBO82, JONE76, JONE77a, MAZA77, MCWI77, NGAI81, and TANG76 for additional explanation.

As a basis for discussion, consider a single CPU with a cache and with a main memory behind the cache. The CPU reads an item into the cache and then modifies it. A second CPU (of similar design and using the same main memory) also reads the item and modifies it. Even if the CPUs were using store-through, the modification performed by the second CPU would not be reflected in the cache of the first CPU unless special steps were taken. There are several possible special steps.
`
1. Shared Cache. All processors in the system can use the same cache. In general, this solution is infeasible: the bandwidth of a single cache usually is not sufficient to support more than one CPU, and additional access time delays may be incurred because the cache may not be physically close enough to both (or all) processors. This solution is employed successfully in the Amdahl 470 computers, where the CPU and the channels all use the same cache; the 470 series does not, however, permit tightly coupled CPUs. The UNIVAC 1100/80 [BORG79] permits two CPUs to share one cache.
2. Broadcast Writes. Every time a CPU performs a write to the cache, it also sends that write to all other caches in the system. If the target of the store is found in some other cache, it may be either updated or invalidated. Invalidation may be less likely to create inconsistencies, since updates can "cross": CPU A updates its own cache and then B's cache, while CPU B simultaneously updates its own cache and then A's, leaving the two caches with different final values. Updates also require more data transfer. The IBM 370/168 and 3033 processors use invalidation. A store by a CPU or channel is broadcast to all caches sharing the same main memory. This broadcast store is placed in the buffer invalidation address stack (BIAS), which is a list of addresses to be invalidated in the cache. The buffer invalidation address stack has a high priority for cache cycles, and if the target line is found in that cache, it is invalidated.
The major difficulty with broadcasting store addresses is that every cache memory in the system is forced to surrender a cycle for invalidation lookup to any processor which performs a write. The memory interference that occurs is generally acceptable for two processors (e.g., IBM's current MP systems), but significant performance degradation is likely with more than two processors. A clever way to minimize this problem appears in a recent patent [BEAN79], which proposes a BIAS Filter Memory (BFM). A BFM is associated with each cache in a tightly coupled MP system; it works by filtering out repeated requests to invalidate the same block in a cache, as the sketch below illustrates.
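
The following is a minimal sketch of such a filter, assuming a small FIFO of recently invalidated block addresses; the structure, its size, and the function names are illustrative assumptions of ours, not details of the BEAN79 patent.

#include <stdbool.h>
#include <stddef.h>

#define FILTER_SLOTS 8           /* assumed filter size, not from the paper */

typedef unsigned long addr_t;

typedef struct {
    addr_t block[FILTER_SLOTS];  /* block addresses already invalidated */
    bool   valid[FILTER_SLOTS];
    size_t next;                 /* FIFO replacement pointer */
} bias_filter;

/* Called when a remote CPU broadcasts a store to 'block_addr'.  Returns
   true if this cache must surrender a cycle for an invalidation lookup. */
bool must_invalidate(bias_filter *f, addr_t block_addr)
{
    for (size_t i = 0; i < FILTER_SLOTS; i++)
        if (f->valid[i] && f->block[i] == block_addr)
            return false;        /* repeated request: already invalidated */

    /* Record the block so later stores to it are filtered out. */
    f->block[f->next] = block_addr;
    f->valid[f->next] = true;
    f->next = (f->next + 1) % FILTER_SLOTS;
    return true;                 /* steal one cache cycle and invalidate */
}

/* On a local miss that refetches 'block_addr', the filter entry must be
   cleared, since the block is once again present in this cache. */
void filter_clear(bias_filter *f, addr_t block_addr)
{
    for (size_t i = 0; i < FILTER_SLOTS; i++)
        if (f->valid[i] && f->block[i] == block_addr)
            f->valid[i] = false;
}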
3. Software Control. If a system is being written from scratch and the architecture can be designed to support it, then software control can be used to guarantee consistency. Specifically, certain information can be designated noncacheable, and can be accessed only from main memory. Such items are usually semaphores and perhaps data structures such as the job queue. For efficiency, some shared writeable data has to be cached. The CPU must therefore be equipped with commands that permit it to purge any such information from its own cache as necessary. Access to shared writeable cacheable data is possible only within critical sections, protected by noncacheable semaphores. Within the critical regions, the code is responsible for restoring all modified items to main memory before releasing the lock. Just such a scheme is intended for the S-1 multiprocessor system under construction at the Lawrence Livermore Laboratory [HAIL79, MCWI77]. The Honeywell Series 66 machines use a similar mechanism. In some cases, the simpler alternative of making shared writeable information noncacheable may be acceptable.
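
As an illustration only, here is a minimal sketch of such a critical-section protocol, assuming a noncacheable spin lock and hypothetical purge/restore primitives (flush_cache_range, invalidate_cache_range) standing in for the architecture's cache-control commands; the stubs merely make the sketch self-contained.

#include <stdatomic.h>

/* Hypothetical cache-control commands; a real machine would provide
   privileged instructions.  No-op stubs keep the sketch compilable. */
static void flush_cache_range(void *p, unsigned long n)      { (void)p; (void)n; }
static void invalidate_cache_range(void *p, unsigned long n) { (void)p; (void)n; }

/* Semaphore assumed to be mapped noncacheable, so every CPU sees the
   single main-memory copy. */
static atomic_flag job_queue_lock = ATOMIC_FLAG_INIT;

struct job_queue {               /* shared, writeable, cacheable data */
    int head, tail;
    int slot[64];
};

void enqueue_job(struct job_queue *q, int job)
{
    while (atomic_flag_test_and_set(&job_queue_lock))
        ;                                    /* spin on the noncacheable lock */
    invalidate_cache_range(q, sizeof *q);    /* discard possibly stale copies */
    q->slot[q->tail++ % 64] = job;           /* operate on current values     */
    flush_cache_range(q, sizeof *q);         /* restore modified lines to main
                                                memory before releasing       */
    atomic_flag_clear(&job_queue_lock);
}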
4. Directory Methods. It is possible to keep a centralized and/or distributed directory of all main memory lines, and use it to ensure that no lines are write-shared. One such scheme is as follows, though several variants are possible. The main memory maintains k + 1 bits for each line in main memory, where there are k caches in the system. Bit i, i = 1, ..., k, is set to 1 if the corresponding cache contains the line. The (k + 1)th bit is 1 if the line is being or has been modified in a cache, and otherwise is 0. If the (k + 1)th bit is on, then exactly one of the other bits is on. Each CPU has associated with each line in its cache a single bit (called the private bit). If that bit is on, that CPU has the only valid copy of that line. If the bit is off, other caches and main memory may also contain current copies. Exactly one private bit is set if and only if the main memory directory bit k + 1 is set.
A CPU can do several things which provoke activity in this directory system. If a CPU attempts to read a line which is not in its cache, the main memory directory is queried. There are two possibilities: either bit k + 1 is off, in which case the line is transferred to the requesting cache and the corresponding bit set to indicate this; or bit k + 1 is on, in which case the main memory directory must recall the line from the cache which contains the modified copy, update main memory, invalidate the copy in the cache that modified it, send the line to the requesting CPU/cache, and finally update itself to reflect these changes. (Bit k + 1 is then set to zero, since the request was a read.)
An attempt to perform a write causes one of three possible actions. If the line is already in cache and has already been modified, the private bit is on and the write takes place immediately. If the line is not in cache, then the main memory directory must be queried. If the line is in any other cache, it must be invalidated (in all other caches), and main memory must be updated if necessary. The main memory directory is then set to reflect the fact that the new cache contains the modified copy of the data, the line is transmitted to the requesting cache, and the private bit is set. The third possibility is that the cache already contains the line but does not have its private bit set. In this case, permission must be requested from the main memory directory to perform the write. The main memory directory invalidates any other copies of the line in the system, marks its own directory suitably, and then gives permission to modify the data. These transitions are sketched below.
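
A minimal sketch of this directory state machine follows, assuming k caches and reducing the actual data movement (recall, memory update, line transfer) to comments; the names and sizes are illustrative.

#include <stdbool.h>

#define K     4                   /* number of caches (assumption)       */
#define LINES 1024                /* memory lines tracked (assumption)   */

typedef struct {
    bool present[K];              /* bit i: cache i holds the line       */
    bool modified;                /* bit k+1: some cache holds it dirty  */
} dir_entry;

static dir_entry dir[LINES];
static bool private_bit[K][LINES]; /* per-cache private bits             */

/* Cache c read-misses on 'line'. */
void dir_read_miss(int c, int line)
{
    dir_entry *e = &dir[line];
    if (e->modified) {            /* exactly one other cache holds it dirty */
        for (int i = 0; i < K; i++)
            if (e->present[i]) {
                /* recall line from cache i, update main memory,
                   then invalidate cache i's copy */
                e->present[i] = false;
                private_bit[i][line] = false;
            }
        e->modified = false;      /* the request was a read */
    }
    e->present[c] = true;         /* send line to requesting cache */
}

/* Cache c wants to write 'line'. */
void dir_write(int c, int line)
{
    dir_entry *e = &dir[line];
    if (e->present[c] && private_bit[c][line])
        return;                   /* private and dirty: write immediately */

    for (int i = 0; i < K; i++)   /* invalidate every other copy; if cache i
                                     held the line dirty, main memory is
                                     updated first */
        if (i != c && e->present[i]) {
            e->present[i] = false;
            private_bit[i][line] = false;
        }
    e->present[c] = true;         /* transmit line to cache c if absent */
    e->modified = true;           /* cache c now holds the modified copy */
    private_bit[c][line] = true;  /* permission to modify is granted     */
}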
The performance implications of this method are as follows. The cost of a miss may increase significantly due to the need to query the main memory directory and possibly retrieve data from other caches. The use of shared writeable information becomes expensive due to the high miss ratios that are likely to be associated with such information. In CENS78, there is some attempt at a quantitative analysis of these performance problems.
Another problem is that I/O overruns may occur. Specifically, an I/O data stream may be delayed while directory operations take place. In the meantime, some I/O data are lost. Care must be taken to avoid this problem. Either substantial I/O buffering or write-through is clearly needed.
Other variants of method 4 are possible. (1) The purpose of the central directory is to minimize the queries to the caches of other CPUs. The central directory is not logically necessary; sufficient information exists in the individual caches. It is also possible to transmit information from cache to cache, without going through main memory. (2) If the number of main memory directory bits is felt to be too high, locking could be on a page instead of on a line basis. (3) Store-through may be used instead of copy-back; thus main memory always has a valid copy, and data do not have to be fetched from the other caches, but can simply be invalidated in these other caches.
The IBM 3081D, which contains two CPUs, essentially uses the directory scheme described. The higher performance 3081K functions similarly, but passes the necessary information from cache to cache rather than going through main memory.
Another version of the directory method is called the broadcast search [DRIM81b]. In this case, a miss is sent not only to the main memory but to all caches. Whichever memories (cache or main) contain the desired information send it to the requesting processor.
Liu [LIU82] proposes a multicache scheme to minimize the overhead of directory operations. He suggests that all CPUs have two caches, only one of which can contain shared data. The overhead of directory access and maintenance would thus only be incurred when the shared data cache is accessed.
There are two practical methods among the above alternatives: method 4 and the BIAS Filter Memory version of method 2. Method 4 is quite general, but is potentially very complex, and it may also have performance problems. No detailed comparison exists, and other and better designs may yet remain to be discovered. For new machines, it is not known whether software control is better than hardware control; clearly, for existing architectures and software, hardware control is required.
`
`2.8 Data/Instruction Cache
Two aspects of the cache having to do with its performance are cache bandwidth and access time. Both of these can be improved by splitting the cache into two parts, one for data and one for instructions. This doubles the bandwidth, since the cache can now service two requests in the time it formerly required for one. In addition, the two requests served are generally complementary. Fast computers are pipelined, which means that several instructions are simultaneously in the process of being decoded and executed. Typically, there are several stages in a pipeline, including instruction fetch, instruction decode, operand address generation, operand fetch, execution, and transmission of the results to their destination (e.g., to a register). Therefore, while one instruction is being fetched (from the instruction cache), another can be having its operands fetched from the operand cache, as sketched below. In addition, the logic that arbitrates priority between instruction and data accesses to the cache can be simplified or eliminated.
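
A minimal sketch of this overlap, assuming two independent single-ported cache halves; the types and the cache_probe stand-in are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t addr_t;

typedef struct {     /* one cache half: its own port, directory, and array */
    int dummy;       /* tag and data arrays omitted in this sketch         */
} cache_half;

/* Stand-in for a directory probe plus data readout on one cache port. */
static bool cache_probe(cache_half *c, addr_t a) { (void)c; (void)a; return true; }

/* One pipeline cycle: the instruction-fetch stage and the operand-fetch
   stage each use their own half.  With a unified, single-ported cache
   these two probes would contend for the one port and serialize. */
void pipeline_cycle(cache_half *icache, cache_half *dcache,
                    addr_t fetch_pc, addr_t operand_addr)
{
    bool ifetch_hit  = cache_probe(icache, fetch_pc);     /* I-fetch stage  */
    bool operand_hit = cache_probe(dcache, operand_addr); /* op-fetch stage */
    (void)ifetch_hit; (void)operand_hit;                  /* rest of pipe   */
}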
A split instruction/data cache also provides access time advantages. The CPU of a high-speed computer typically contains (exclusive of the S-unit) more than 100,000 logic gates and is physically large. Further, the logic having to do with instruction fetch and decode has little to do with operand fetch and store, except for execute instructions and possibly for the targets of branches. With a single cache system, it is not always possible to place the cache immediately adjacent to all of the logic which will access it. A split cache, on the other hand, can have each of its halves placed in the physical location which is most useful, thereby saving from a fraction of a nanosecond to several nanoseconds.
There are, of course, some problems introduced by the split cache organization. First, there is the problem of consistency. Two copies now exist of information which formerly existed only in one place. Specifically, instructions can be modified, and this modification must be reflected before the instructions are executed. Further, even if the programs are not self-modifying, both data and instructions may coexist in the same line, either for reasons of storage efficiency or because of immediate operands. The solutions for this problem are the same as those discussed in the section on multicache consistency (Section 2.7), and they work here as well. It is imperative that they be implemented in such a way as not to impair the access time advantage given by this organization.
Another problem of this cache organization is that it results in inefficient use of the cache memory. The size of the working set of a program varies constantly, and in particular, the fraction devoted to data and to instructions also varies. (A dynamically split design is suggested by FAVR78.) If the instructions and data are not stored together, each must exist within its own memory, unable to share a larger amount of that resource. The extent of this problem has been studied both experimentally and analytically. In SHED76, a set of formulas is provided which can be used to estimate the performance of the unified cache from the performance of the individual ones. The experimental results were not found to agree with the mathematical ones, although the reason was not investigated. We believe that the nonstationarity of the workload was the major problem.
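
SHED76's formulas are not reproduced here; purely to illustrate the kind of estimate involved (this form is an assumption of ours, not SHED76's result), one might combine the halves' miss ratios weighted by the reference mix:

$$ m_{\mathrm{unified}}(C) \;\approx\; f_I\, m_I(\alpha C) \;+\; (1 - f_I)\, m_D\big((1-\alpha)C\big), $$

where $C$ is the total capacity, $f_I$ is the fraction of references that are instruction fetches, $m_I$ and $m_D$ are the miss ratio functions of the individual halves, and $\alpha$ is the fraction of the unified cache effectively occupied by instructions. A static estimate of this form ignores exactly the workload nonstationarity mentioned above, which could account for disagreement with experiment.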
We compared the miss ratio of the split cache to that of the unified cache for each of the sets of traces; some of the results appear in Figures 25-28. (See BELL74 and THAK78 for additional results.) We note that there are several possible ways to split and manage the cache, and the various alternatives have been explored. One could split the cache in two equal parts (labeled "SPLIT EQUAL"), or the observed miss ratios could be used (from this particular run) to determine the optimal static split ("SPLIT OPT"). Also, when a line is found to be in the side of the cache other than the one currently accessed, it could either be duplicated in the remaining side of the cache or it could be moved; this latter case is labeled "NO DUPLICATES." (No special label is shown if duplicates are permitted.) The results bearing on these distinctions are given in Figures 25-28.

[Figure 25. Miss ratio versus memory capacity for unified cache, cache split equally between instruction and data halves, cache split according to static optimum partition between instruction and data halves, and miss ratios individually for instruction and data halves. Traces CGO1...3, PGO2; Q=10000, 64 sets, 32 bytes/line.]

[Figure 26. Miss ratio versus memory capacity for instruction/data cache and for unified cache. Traces CGO1...3, PGO2; Q=10000, 64 sets, 32 bytes/line; no duplicates.]

[Figure 27. Miss ratio versus memory capacity for instruction/data cache and for unified cache. Traces FGO1...4; Q=10000, 64 sets, 32 bytes/line.]

[Figure 28. Miss ratio versus memory capacity for instruction/data cache and for unified cache. Traces FGO1...4; Q=10000, 64 sets, 32 bytes/line; no duplicates.]

We observe in Figures 25 and 26 that the unified cache, the equally split cache, and the cache split unequally (split optimally) all perform about equally well with respect to miss ratio. Note that the miss ratios for the instruction and data halves of the cache individually are also shown. Further, comparing Figures 25 and 26, it seems that barring duplicate lines has only a small negative effect on the miss ratio.
In sharp contrast to the measurements of Figures 25 and 26 are those of Figures 27 and 28. In Figure 27, although the equally and optimally split caches are comparable, the unified cache is significantly better. The unified cache is better by an order of magnitude when duplicates are not permitted (Figure 28), because the miss ratio is sharply increased by the constant movement of lines between the two halves. It appears that lines sharing instructions and data are very common in programs compiled with IBM's FORTRAN G compiler and are not common in programs compiled using the IBM COBOL or PL/I compilers. (Results similar to the FORTRAN case have been found for the two other sets of IBM traces, all of which include FORTRAN code but are not shown.)
Based on these experimental results, we can say that the miss ratio may increase significantly if the caches are split, but that the effect depends greatly on the workload. Presumably, compilers can be designed to minimize this effect by ensuring that data and instructions are in separate lines, and perhaps even in separate pages.
Despite the possible miss ratio penalty of splitting the cache, there are at least two experimental machines and two commercial ones which do so. The S-1 [HAIL79, MCWI77] at Lawrence Livermore Laboratory is being built with just such a cache; it relies on (new) software to minimize the problems discussed here. The 801 minicomputer, built at IBM Research (Yorktown Heights) [ELEC76, RADI82], also has a split cache. The Hitachi H200 and Itel AS/6 [ROSS79] both have a split data/instruction cache. No measurements have been publicly reported for any of these machines.
`
`2.9 Virtual Address Cache
`
Most cache memories address the cache using the real address (see Figure 2). As the reader recalls, we discussed (Introduction, Section 2.3) the fact that the virtual address was translated by the TLB to the real address, and that the line lookup and readout could not be completed until the real address was available. This suggests that the cache access time could be significantly reduced if the translation step could be eliminated. The way to do this is to address the cache directly with the virtual address. We call a cache organized this way a virtual address cache. The MU-5 [IBBE77] uses this organization for its name store. The S-1, the IBM 801, and the ICL 2900 series machines also use this idea. It is discussed in BEDE79; see also OLBE79.
There are some additional considerations in building a virtual address cache, and there is one serious problem. First, all addresses must be tagged with the identifier of the address space with which they are associated, or else the cache must be purged every time task switching occurs. Tagging is not a problem, but the address tag in the cache must be extended to include the address space ID, as sketched below. Second, the translation mechanism must still exist and must still be efficient, since virtual addresses must be translated to real addresses whenever main memory is accessed, specifically for misses and for writes in a write-through cache. Thus the TLB cannot be eliminated.
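
A minimal sketch of such an extended tag; the field widths and names are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    uint16_t asid;   /* address-space identifier tag */
    uint32_t vtag;   /* virtual-address tag          */
} vcache_tag;

/* Hit test: no TLB translation precedes the compare, which is the whole
   access-time advantage of this organization. */
static bool vcache_hit(const vcache_tag *t, uint16_t cur_asid, uint32_t vtag)
{
    return t->valid && t->asid == cur_asid && t->vtag == vtag;
}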
The most serious problem is that of "synonyms," two or more virtual addresses that map to the same real address. Synonyms occur whenever two address spaces share code or data. (Since the lines have address space tags, the virtual addresses are different even if the line occurs in the same place in both address spaces.) Also, the supervisor may exist in the address space of each user, and it is important that only one copy of supervisor tables be kept. The only way to detect synonyms is to take the virtual address, map it into a real address, and then see if any other virtual addresses in the cache map into the same real address. For this to be feasible, the inverse mapping must be available for every line in the cache; this inverse mapping is accessed by real address and indicates all virtual addresses associated with that real address. Since this inverse mapping is the opposite of the TLB, we choose to call the inverse mapping buffer (if a separate one is used) the RTB, or reverse translation buffer. When a miss occurs, the virtual address is translated into the real address by the TLB. The access to main memory for the miss is overlapped with a similar search of the RTB to see if that line is already in the cache under a different name (virtual address). If it is, it must be renamed and moved to its new location, since multiple copies of the same line in the cache are clearly undesirable for reasons of consistency.
The severity of the synonym problem can be decreased if shared information can be forced to have the same location in all address spaces. Such information can be given a unique address space identifier, and the lookup algorithm always considers such a tag to match the current address space ID. A scheme like this is feasible only for the supervisor, since other shared code could not conceivably be so allocated. Shared supervisor code does have a unique location in all address spaces in IBM's MVS operating system.
The RTB may or may not be a simple structure, depending on the structure of the rest of the cache. In one case it is fairly simple: if the bits used to select the set of the cache are the same for the real and virtual address (i.e., if none of the bits used to select the set undergo translation), the RTB can be implemented by associating with each cache line two address tags [BEDE79]. If a match is not found on the virtual address, then a search is made in that set on the real address. If that search finds the line, then the virtual address tag is changed to the current virtual address; a sketch follows. A more complex design would involve a separate mapping buffer for the reverse translation.
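
A minimal sketch of this two-tag set lookup, under the stated assumption that the set-index bits are untranslated; the associativity and field widths are illustrative assumptions.

#include <stdbool.h>
#include <stdint.h>

#define WAYS 4                            /* assumed set size */

typedef struct {
    bool     valid;
    uint16_t asid;                        /* address-space ID tag */
    uint32_t vtag;                        /* virtual-address tag  */
    uint32_t rtag;                        /* real-address tag     */
} line_tag;

/* Probe one set.  Returns the way holding the line, or -1 on a true miss.
   The real tag (obtained from the TLB, overlapped with the main-memory
   access) is searched only after the virtual tags miss; a real-tag hit
   is a synonym, and the line is renamed in place, with no data movement
   needed because both names select the same set. */
int set_lookup(line_tag set[WAYS], uint16_t asid,
               uint32_t vtag, uint32_t rtag)
{
    for (int w = 0; w < WAYS; w++)        /* normal virtual-address lookup */
        if (set[w].valid && set[w].asid == asid && set[w].vtag == vtag)
            return w;

    for (int w = 0; w < WAYS; w++)        /* synonym check on real address */
        if (set[w].valid && set[w].rtag == rtag) {
            set[w].asid = asid;           /* rename tag to the current     */
            set[w].vtag = vtag;           /* virtual address               */
            return w;
        }
    return -1;                            /* genuine miss */
}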
`
2.10 User/Supervisor Cache
It was suggested earlier that a significant fraction of the miss ratio is due to task-switching. A possible solution to this problem is to use a cache which has been split into two parts, one of which is used only by the supervisor and the other of which is used primarily by the user state programs. If the scheduler were programmed to restart, when possible, the same user program that was running before an interrupt, then the user state miss ratio would drop appreciably. Further, if the same interrupts recur frequently, the supervisor state miss ratio may also drop. In particular, neither the user nor the supervisor would purge the cache of the other's lines. (See PEUT77 for some data relevant to this problem.) The supervisor cache may have a high miss ratio in any case due to its large working set. (See MILA75 for an example.)
Despite the appealing rationale of the above comments, there are a number of problems with the user/supervisor cache. First, it is actually unlikely to cut down the miss ratio. Most misses occur in supervisor state [MILA75], and a supervisor cache half as large as the unified cache is likely to be worse, since the maximum cache capacity is no longer available to the supervisor. Further, it is not clear what fraction of the time the scheduling algorithm can restart the same program. Second, the information used by the user and the supervisor is not entirely distinct, and cross-access must be permitted. This overlap introduces the problem of consistency.
We are aware of only one evaluation of the split user/supervisor cache [ROSS80]. In that case, an experiment was run on a Hitachi M180. The results seemed to show that the split cache performed about as well as a unified one, but poor experimental design makes the results questionable. We do not expect that a split cache will prove to be useful.
`
2.11 Input/Output through the Cache
In Section 2.7, the problem of multicache consistency was discussed. We noted that if all accesses to main memory use the same cache, then there would be no consistency problem. Precisely this approach has been used in one manufacturer's computers (Amdahl Corporation).
`
`2.11.1 Overruns
While putting all input/output through the cache solves the consistency problem, it introduces other difficulties. First, there is the overrun problem. An overrun occurs when for some reason the I/O stream cannot be properly transmitted between the memory (cache or main) and the I/O device. Transmitting the I/O through the cache can cause an overrun when the line accessed by the I/O stream is not in the cache (and is thus a miss) and for some reason cannot be obtained quickly enough. Most I/O devices involve physical movement, and when the buffering capacity embedded in the I/O path is exhausted, the transfer must be aborted and then restarted. Overruns can be provoked when: (1) the cache is already processing one or more misses and cannot process the current (additional) one quickly enough; (2) more than one I/O transfer is in progress, and more active (in use) lines map into one set than the set size can accommodate; or (3) the cache bandwidth is not adequate to handle the current burst of simultaneous I/O from several devices. Overruns can be minimized if the set size of the cache is large enough, the bandwidth is high enough, and the ability to process misses is sufficient. Sufficient buffering should also be provided in the I/O paths to the devices.
`
2.11.2 Miss Ratio
`
`Directing the input/ output data streams
`through the cache also has an effect on the
`miss ratio. This 1/0 data occupies some
`fraction of the space in the cache, and this
`increases the miss ratio for the other users
`of the cache. Some experiments along these
`lines were run by the author and results are
`shown in · Figures 29-32. IORATE is the
`ratio of the rate ofl/0 accesses to the cache
`to the rate of CPU accesses. (1/0 activity
`is simulated by a purely sequential syn(cid:173)
`thetic address stream referencing a distinct
`address space from the other programs.)
`The miss ratio as a function of memory size
`and 1/0 transfer rate is shown in Figures
`29 and 30 for two of the sets of traces. The
`
`Computing Surveys, Vol. 14, No. 3, September 1982
`
`data has been rearranged to show more
`directly the effect on the miss ratio in Fig-.
`ures 31 and 32. The results displayed here
`show no clear mathematical pattern, and
`we were unable to derive a useful and ver(cid:173)
`ifiable formula to predict the effect on the
`miss ratio by an 1/0 stream.
`Examination of the results presented in
`Figures 29-32 suggests that for reasonable
`1/0 rates (less than 0.05; see PowE77 for
`some 1/0 rate data) the miss ratio is not
`affected to any large extent. This observa(cid:173)
`tion is consistent with the known perform(cid:173)
`ance of the Amdahl computers, which are
`not seriously degraded· by high 1/0 rates.
`
`2.12 Cache Size
Two very important questions when selecting a cache design are how large the cache should be and what kind of performance we can expect. The cache size is usually dictated by a number of criteria having to do with the cost and performance of the machine. The cache should not be so large that it represents an expense out of proportion to the added performance, nor should it occupy an unreasonable fraction of the physical space within the processor. A very large cache may also require more access circuitry, which may increase access time.
Aside from the warnings given in the paragraph above, one can generally assume that the larger the cache, the higher the hit ratio, and therefore the better the performance. The issue is then one of the relation between cache size and hit ratio. This is a very difficult problem, since the cache hit ratio varies with the workload and the machine architecture. A cache that might yield a 99.8 percent hit ratio on a PDP-11 program could result in a 90 percent or lower hit ratio for IBM (MVS) supervisor state code. This problem cannot be usefully studied using trace-driven simulation because the miss ratio varies tremendously from program to program and only a small number of traces can possibly be analyzed. Typical trace-driven simulation results appear throughout this paper, however, and the reader may wish to scan that data for insight. There is also a variety of data available in the literature, and the reader may wish to inspect the results presented in ALSA78, BELL74, BERG76, GIBS67, LEE69,
`
[Figure 29. Miss ratio versus memory capacity while I/O occurs through the cache at specified rate; IORATE refers to the fraction of all memory references due to I/O. Traces CGO1...3, PGO2; Q=10000, 64 sets, 32-byte lines; IORATE = 0, .01, .02, .05, .10, .20.]
`
[Figure 30. Miss ratio versus memory capacity while I/O occurs through the cache. Traces WFV-APL-WTX-FFT; Q=10000, 64 sets, 32-byte lines; IORATE = .01, .025.]
`
`514
`
`•
`
`A. J. Smith
`MISS RA TI0 VS. I/0 RA TE
`
`0.04
`
`CSI 0.03
`1----4
`I(cid:173)
`<
`Ct'.
`(/)
`(f) 0.02
`1----4
`l:
`
`0.01
`
`16K
`................. .
`········
`····
`24K~------
`.,,
`..,, /
`
`.,,,,,,.
`
`(
`.
`..
`.·
`
`0
`
`0.05
`
`0.1
`
`0.15
`
`0.05
`0
`0.2
`I/0 RATE
`Figure 31. Miss ratio versus 1/0 rate for variety of memory capacities.
`
`0.1
`
`0.15
`
`0.2
`
`INCREASE IN M