504 • A. J. Smith

[Figure 23 omitted: plot of miss ratio versus memory capacity (0 to 60000 bytes); traces CG01...3, PG02; 64 sets, 32-byte lines; one curve per multiprogramming interval Q, up to Q = 100000.]

Figure 23. Miss ratio versus memory capacity for range of multiprogramming intervals Q.
`
[Figure 24 omitted: plot of miss ratio versus memory capacity (0 to 60000 bytes); traces WFV-APL-WTX-FFT; 64 sets, 32-byte lines; one curve per multiprogramming interval Q.]

Figure 24. Miss ratio versus memory capacity for variety of multiprogramming intervals Q.
`
Computing Surveys, Vol. 14, No. 3, September 1982

Netflix, Inc. - Ex. 1010, Page 000034
Each processor may have zero, one, or several caches. Unfortunately, in such a multiple processor system, a given piece of information may exist in several places at a given time, and it is important that all processors have access (as necessary) to the same, unique (at a given time) value. Several solutions exist and/or have been proposed for this problem. In this section, we discuss many of these solutions; the interested reader should refer to BEAN79, CENS78, DRIM81a, DUBO82, JONE76, JONE77a, MAZA77, MCWI77, NGAI81, and TANG76 for additional explanation.

As a basis for discussion, consider a single CPU with a cache and with a main memory behind the cache. The CPU reads an item into the cache and then modifies it. A second CPU (of similar design and using the same main memory) also reads the item and modifies it. Even if the CPUs were using store-through, the modification performed by the second CPU would not be reflected in the cache of the first CPU unless special steps were taken. There are several possible special steps.
`
1. Shared Cache. All processors in the system can use the same cache. In general, this solution is infeasible because the bandwidth of a single cache usually is not sufficient to support more than one CPU, and because additional access time delays may be incurred because the cache may not be physically close enough to both (or all) processors. This solution is employed successfully in the Amdahl 470 computers, where the CPU and the channels all use the same cache; the 470 series does not, however, permit tightly coupled CPUs. The UNIVAC 1100/80 [BORG79] permits two CPUs to share one cache.
2. Broadcast Writes. Every time a CPU performs a write to the cache, it also sends that write to all other caches in the system. If the target of the store is found in some other cache, it may be either updated or invalidated. Invalidation may be less likely to create inconsistencies, since updates can possibly "cross": CPU A updates its own cache and then B's cache, while CPU B simultaneously updates its own cache and then A's. Updates also require more data transfer. The IBM 370/168 and 3033 processors use invalidation. A store by a CPU or channel is broadcast to all caches sharing the same main memory. This broadcast store is placed in the buffer invalidation address stack (BIAS), which is a list of addresses to be invalidated in the cache. The buffer invalidation address stack has a high priority for cache cycles, and if the target line is found in that cache, it is invalidated.

The major difficulty with broadcasting store addresses is that every cache memory in the system is forced to surrender a cycle for invalidation lookup to any processor which performs a write. The memory interference that occurs is generally acceptable for two processors (e.g., IBM's current MP systems), but significant performance degradation is likely with more than two processors. A clever way to minimize this problem appears in a recent patent [BEAN79]. In that patent, a BIAS Filter Memory (BFM) is proposed. A BFM is associated with each cache in a tightly coupled MP system. This filter memory works by filtering out repeated requests to invalidate the same block in a cache.
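The filtering idea can be sketched in a few lines. This is a toy model with invented names, not the mechanism claimed in the BEAN79 patent: a small FIFO of recently forwarded block addresses suppresses repeated invalidation requests to the same block, saving the target cache an invalidation-lookup cycle.

```python
class BiasFilter:
    """Toy BIAS Filter Memory: remembers the last few block addresses
    already forwarded for invalidation and filters out repeats.
    (Invented names; a sketch of the idea, not the patented design.)"""

    def __init__(self, size=4):
        self.size = size
        self.recent = []          # recently forwarded block addresses

    def should_forward(self, block):
        if block in self.recent:
            return False          # repeated request: filter it out
        self.recent.append(block)
        if len(self.recent) > self.size:
            self.recent.pop(0)    # FIFO replacement
        return True

# A burst of stores to the same block generates only one invalidation.
f = BiasFilter()
stores = [0x10, 0x10, 0x10, 0x20, 0x10, 0x30]
forwarded = [b for b in stores if f.should_forward(b)]
print(forwarded)   # only blocks not recently invalidated reach the cache
```

A real filter must also be cleared when the local cache refetches a block, since a later store to it must again be forwarded; that bookkeeping is omitted here.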
3. Software Control. If a system is being written from scratch and the architecture can be designed to support it, then software control can be used to guarantee consistency. Specifically, certain information can be designated noncacheable, and can be accessed only from main memory. Such items are usually semaphores and perhaps data structures such as the job queue. For efficiency, some shared writeable data has to be cached. The CPU must therefore be equipped with commands that permit it to purge any such information from its own cache as necessary. Access to shared writeable cacheable data is possible only within critical sections, protected by noncacheable semaphores. Within the critical regions, the code is responsible for restoring all modified items to main memory before releasing the lock. Just such a scheme is intended for the S-1 multiprocessor system under construction at the Lawrence Livermore Laboratory [HAIL79, MCWI77]. The Honeywell Series 66 machines use a similar mechanism. In some cases, the simpler alternative of making shared writeable information noncacheable may be acceptable.
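The critical-section discipline can be sketched in ordinary code. Everything here is invented for illustration (no real machine exposes this API): the lock stands in for a semaphore held in noncacheable storage, and `flush_modified_lines` stands in for the cache-purge commands the CPU must provide before the lock is released.

```python
import threading

job_queue_lock = threading.Lock()     # stands for a noncacheable semaphore
shared_job_queue = []                 # shared writeable, cacheable data

def flush_modified_lines(data):
    """Placeholder for the CPU command that restores all modified
    lines of `data` to main memory (a purge/flush instruction)."""
    pass

def enqueue_job(job):
    with job_queue_lock:              # enter critical section
        shared_job_queue.append(job)  # cached access is safe inside
        flush_modified_lines(shared_job_queue)  # purge before release
    # lock released only after modified lines reach main memory
```

The essential point is the ordering: the flush must complete before the semaphore is released, so the next processor to acquire it sees current values in main memory.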
4. Directory Methods. It is possible to keep a centralized and/or distributed directory of all main memory lines, and use it to ensure that no lines are write-shared. One such scheme is as follows, though several variants are possible. The main memory maintains k + 1 bits for each line in main memory, when there are k caches in the system. Bit i, i = 1, ..., k, is set to 1 if the corresponding cache contains the line. The (k + 1)th bit is 1 if the line is being or has been modified in a cache, and otherwise is 0. If the (k + 1)th bit is on, then exactly one of the other bits is on. Each CPU has associated with each line in its cache a single bit (called the private bit). If that bit is on, that CPU has the only valid copy of that line. If the bit is off, other caches and main memory may also contain current copies. Exactly one private bit is set if and only if the main memory directory bit k + 1 is set.
A CPU can do several things which provoke activity in this directory system. If a CPU attempts to read a line which is not in its cache, the main memory directory is queried. There are two possibilities: either bit k + 1 is off, in which case the line is transferred to the requesting cache and the corresponding bit set to indicate this; or bit k + 1 is on, in which case the main memory directory must recall the line from the cache which contains the modified copy, update main memory, invalidate the copy in the cache that modified it, send the line to the requesting CPU/cache, and finally update itself to reflect these changes. (Bit k + 1 is then set to zero, since the request was a read.)
An attempt to perform a write causes one of three possible actions. If the line is already in cache and has already been modified, the private bit is on and the write takes place immediately. If the line is not in cache, then the main memory directory must be queried. If the line is in any other cache, it must be invalidated (in all other caches), and main memory must be updated if necessary. The main memory directory is then set to reflect the fact that the new cache contains the modified copy of the data, the line is transmitted to the requesting cache, and the private bit is set. The third possibility is that the cache already contains the line but that it does not have its private bit set. In this case, permission must be requested from the main memory directory to perform the write. The main memory directory invalidates any other copies of the line in the system, marks its own directory suitably, and then gives permission to modify the data.

The performance implications of this method are as follows. The cost of a miss may increase significantly due to the need to query the main memory directory and possibly retrieve data from other caches. The use of shared writeable information becomes expensive due to the high miss ratios that are likely to be associated with such information. In CENS78, there is some attempt at a quantitative analysis of these performance problems.
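The directory bookkeeping described above can be sketched as a behavioral model. The class and method names are invented, data movement and the per-line private bits in each cache are ignored; only the presence bits and the modified bit of the main memory directory are tracked.

```python
class Directory:
    """Toy model of the (k + 1)-bit main-memory directory: a set of
    cache IDs holding each line (the k presence bits) plus a modified
    flag (the extra bit).  A sketch, not a hardware design."""

    def __init__(self, k):
        self.k = k
        self.present = {}    # line -> set of cache IDs holding it
        self.modified = {}   # line -> True if one cache holds it modified

    def read(self, cache, line):
        holders = self.present.setdefault(line, set())
        if self.modified.get(line):
            # Recall the line from the modifying cache, update main
            # memory, invalidate that copy, clear the modified bit.
            holders.clear()
            self.modified[line] = False
        holders.add(cache)   # record the new (shared) copy

    def write(self, cache, line):
        holders = self.present.setdefault(line, set())
        holders.intersection_update({cache})  # invalidate other copies
        holders.add(cache)
        self.modified[line] = True            # writer now holds it private

d = Directory(k=2)
d.read(0, line=7)      # cache 0 reads line 7
d.read(1, line=7)      # cache 1 reads the same line; now shared
d.write(0, line=7)     # cache 0 writes: cache 1's copy is invalidated
print(d.present[7], d.modified[7])
```

A subsequent read by cache 1 would again trigger the recall path, since the modified bit is set and cache 0 holds the only valid copy.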
Another problem is that I/O overruns may occur. Specifically, an I/O data stream may be delayed while directory operations take place. In the meantime, some I/O data are lost. Care must be taken to avoid this problem. Either substantial I/O buffering or write-through is clearly needed.

Other variants of method 4 are possible. (1) The purpose of the central directory is to minimize the queries to the caches of other CPUs. The central directory is not logically necessary; sufficient information exists in the individual caches. It is also possible to transmit information from cache to cache, without going through main memory. (2) If the number of main memory directory bits is felt to be too high, locking could be on a page instead of on a line basis. (3) Store-through may be used instead of copy-back; thus main memory always has a valid copy, and data do not have to be fetched from the other caches, but can simply be invalidated in these other caches.

The IBM 3081D, which contains two CPUs, essentially uses the directory scheme described. The higher performance 3081K functions similarly, but passes the necessary information from cache to cache rather than going through main memory.

Another version of the directory method is called the broadcast search [DRIM81b]. In this case, a miss is sent not only to the main memory but to all caches. Whichever memories (cache or main) contain the desired information send it to the requesting processor.

Liu [LIU82] proposes a multicache scheme to minimize the overhead of directory operations. He suggests that all CPUs
have two caches, only one of which can contain shared data. The overhead of directory access and maintenance would thus only be incurred when the shared data cache is accessed.

There are two practical methods among the above alternatives: method 4 and the BIAS Filter Memory version of method 2. Method 4 is quite general, but is potentially very complex. It may also have performance problems. No detailed comparison exists, and other and better designs may yet remain to be discovered. For new machines, it is not known whether software control is better than hardware control; clearly, for existing architectures and software, hardware control is required.
`
2.8 Data/Instruction Cache

Two aspects of the cache having to do with its performance are cache bandwidth and access time. Both of these can be improved by splitting the cache into two parts, one for data and one for instructions. This doubles the bandwidth, since the cache can now service two requests in the time it formerly required for one. In addition, the two requests served are generally complementary. Fast computers are pipelined, which means that several instructions are simultaneously in the process of being decoded and executed. Typically, there are several stages in a pipeline, including instruction fetch, instruction decode, operand address generation, operand fetch, execution, and transmission of the results to their destination (e.g., to a register). Therefore, while one instruction is being fetched (from the instruction cache), another can be having its operands fetched from the operand cache. In addition, the logic that arbitrates priority between instruction and data accesses to the cache can be simplified or eliminated.
A split instruction/data cache also provides access time advantages. The CPU of a high-speed computer typically contains (exclusive of the S-unit) more than 100,000 logic gates and is physically large. Further, the logic having to do with instruction fetch and decode has little to do with operand fetch and store, except for execute instructions and possibly for the targets of branches. With a single cache system, it is not always possible simultaneously to place the cache immediately adjacent to all of the logic which will access it. A split cache, on the other hand, can have each of its halves placed in the physical location which is most useful, thereby saving from a fraction of a nanosecond to several nanoseconds.
There are, of course, some problems introduced by the split cache organization. First, there is the problem of consistency. Two copies now exist of information which formerly existed only in one place. Specifically, instructions can be modified, and this modification must be reflected before the instructions are executed. Further, it is possible that even if the programs are not self-modifying, both data and instructions may coexist in the same line, either for reasons of storage efficiency or because of immediate operands. The solutions for this problem are the same as those discussed in the section on multicache consistency (Section 2.7), and they work here as well. It is imperative that they be implemented in such a way as not to impair the access time advantage given by this organization.
Another problem of this cache organization is that it results in inefficient use of the cache memory. The size of the working set of a program varies constantly, and in particular, the fraction devoted to data and to instructions also varies. (A dynamically split design is suggested by FAVR78.) If the instructions and data are not stored together, they must each exist within their own memory, and be unable to share a larger amount of that resource. The extent of this problem has been studied both experimentally and analytically. In SHED76, a set of formulas is provided which can be used to estimate the performance of the unified cache from the performance of the individual ones. The experimental results were not found to agree with the mathematical ones, although the reason was not investigated. We believe that the nonstationarity of the workload was the major problem.

We compared the miss ratio of the split cache to that of the unified cache for each of the sets of traces; some of the results appear in Figures 25-28. (See BELL74 and THAK78 for additional results.) We note that there are several possible ways to split and manage the cache, and the various alternatives have been explored. One could split the cache in two equal parts (labeled "SPLIT EQUAL"), or the observed miss
`
`

`

[Figure 25 omitted: plot of miss ratio versus memory capacity (0 to 60000 bytes); traces CG01...3, PG02; Q=10000, 64 sets, 32 bytes/line; curves labeled UNIFIED, SPLIT OPT, SPLIT EQUAL, and INSTRUCTIONS.]

Figure 25. Miss ratio versus memory capacity for unified cache, cache split equally between instruction and data halves, cache split according to static optimum partition between instruction and data halves, and miss ratios individually for instruction and data halves.
`
[Figure 26 omitted: plot of miss ratio versus memory capacity (0 to 60000 bytes); traces CG01...3, PG02; Q=10000, 64 sets, 32 bytes/line; no duplicates; curves labeled UNIFIED, SPLIT EQUAL, and INSTRUCTIONS.]

Figure 26. Miss ratio versus memory capacity for instruction/data cache and for unified cache.
`
`

`

[Figure 27 omitted: plot of miss ratio versus memory capacity (0 to 60000 bytes); traces FG01...4; Q=10000, 64 sets, 32 bytes/line.]

Figure 27. Miss ratio versus memory capacity for instruction/data cache and for unified cache.
`
[Figure 28 omitted: plot of miss ratio versus memory capacity (0 to 60000 bytes); traces FG01...4; Q=10000, 64 sets, 32 bytes/line; no duplicates; curves labeled SPLIT EQUAL and DATA.]

Figure 28. Miss ratio versus memory capacity for instruction/data cache and for unified cache.
`
`

`

`
ratios could be used (from this particular run) to determine the optimal static split ("SPLIT OPT"). Also, when a line is found to be in the side of the cache other than the one currently accessed, it could either be duplicated in the remaining side of the cache or it could be moved; this latter case is labeled "NO DUPLICATES." (No special label is shown if duplicates are permitted.) The results bearing on these distinctions are given in Figures 25-28.

We observe in Figures 25 and 26 that the unified cache, the equally split cache, and the cache split unequally (split optimally) all perform about equally well with respect to miss ratio. Note that the miss ratios for the instruction and data halves of the cache individually are also shown. Further, comparing Figures 25 and 26, it seems that barring duplicate lines has only a small negative effect on the miss ratio.

In sharp contrast to the measurements of Figures 25 and 26 are those of Figures 27 and 28. In Figure 27, although the equally and optimally split caches are comparable, the unified cache is significantly better. The unified cache is better by an order of magnitude when duplicates are not permitted (Figure 28), because the miss ratio is sharply increased by the constant movement of lines between the two halves. It appears that lines sharing instruction and data are very common in programs compiled with IBM's FORTRAN G compiler and are not common in programs compiled using the IBM COBOL or PL/I compilers. (Results similar to the FORTRAN case have been found for the two other sets of IBM traces, all of which include FORTRAN code but are not shown.)

Based on these experimental results, we can say that the miss ratio may increase significantly if the caches are split, but that the effect depends greatly on the workload. Presumably, compilers can be designed to minimize this effect by ensuring that data and instructions are in separate lines, and perhaps even in separate pages.
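The working-set imbalance underlying these results can be illustrated with a toy LRU model. This sketch is fully associative (unlike the 64-set caches simulated for Figures 25-28) and uses a synthetic trace, so the numbers are illustrative only: a small looping instruction stream plus a slightly larger data working set fits comfortably in a unified cache but thrashes the data half of an equal split.

```python
from collections import OrderedDict

def miss_ratio(refs, capacity):
    """Fully associative LRU cache miss ratio (toy model)."""
    cache, misses = OrderedDict(), 0
    for line in refs:
        if line in cache:
            cache.move_to_end(line)        # refresh LRU position
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses / len(refs)

# Synthetic trace: 12 instruction lines, 20 data lines, interleaved.
inst = [i % 12 for i in range(600)]
data = [100 + (i % 20) for i in range(600)]
mixed = [x for pair in zip(inst, data) for x in pair]

unified     = miss_ratio(mixed, capacity=32)   # all 32 lines fit
split_equal = (miss_ratio(inst, 16) + miss_ratio(data, 16)) / 2
# The data half cycles 20 lines through 16 slots, which thrashes
# under LRU, so the equal split shows a far higher miss ratio here.
```

The same trace under the optimal static split (12 instruction slots, 20 data slots) would again incur only compulsory misses, which is the "SPLIT OPT" idea.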
Despite the possible miss ratio penalty of splitting the cache, there are at least two experimental machines and two commercial ones which do so. The S-1 [HAIL79, MCWI77] at Lawrence Livermore Laboratory is being built with just such a cache; it relies on (new) software to minimize the problems discussed here. The 801 minicomputer, built at IBM Research (Yorktown Heights) [ELEC76, RADI82], also has a split cache. The Hitachi H200 and Itel AS/6 [ROSS79] both have a split data/instruction cache. No measurements have been publicly reported for any of these machines.
`
2.9 Virtual Address Cache

Most cache memories address the cache using the real address (see Figure 2). As the reader recalls, we discussed (Introduction, Section 2.3) the fact that the virtual address was translated by the TLB to the real address, and that the line lookup and readout could not be completed until the real address was available. This suggests that the cache access time could be significantly reduced if the translation step could be eliminated. The way to do this is to address the cache directly with the virtual address. We call a cache organized this way a virtual address cache. The MU-5 [IBBE77] uses this organization for its name store. The S-1, the IBM 801, and the ICL 2900 series machines also use this idea. It is discussed in BEDE79. See also OLBE79.
There are some additional considerations in building a virtual address cache, and there is one serious problem. First, all addresses must be tagged with the identifier of the address space with which they are associated, or else the cache must be purged every time task switching occurs. Tagging is not a problem, but the address tag in the cache must be extended to include the address space ID. Second, the translation mechanism must still exist and must still be efficient, since virtual addresses must be translated to real addresses whenever main memory is accessed, specifically for misses and for writes in a write-through cache. Thus the TLB cannot be eliminated.
The most serious problem is that of "synonyms," two or more virtual addresses that map to the same real address. Synonyms occur whenever two address spaces share code or data. (Since the lines have address space tags, the virtual addresses are different even if the line occurs in the same place in both address spaces.) Also, the supervisor may exist in the address space of each user, and it is important that only one copy of supervisor tables be kept. The only way to detect synonyms is to take the virtual address, map it into a real address, and then see if any other virtual addresses in the cache map into the same real address. For this to be feasible, the inverse mapping must be available for every line in the cache; this inverse mapping is accessed on real address and indicates all virtual addresses associated with that real address. Since this inverse mapping is the opposite of the TLB, we choose to call the inverse mapping buffer (if a separate one is used) the RTB or reverse translation buffer. When a miss occurs, the virtual address is translated into the real address by the TLB. The access to main memory for the miss is overlapped with a similar search of the RTB to see if that line is already in the cache under a different name (virtual address). If it is, it must be renamed and moved to its new location, since multiple copies of the same line in the cache are clearly undesirable for reasons of consistency.
The severity of the synonym problem can be decreased if shared information can be forced to have the same location in all address spaces. Such information can be given a unique address space identifier, and the lookup algorithm always considers such a tag to match the current address space ID. A scheme like this is feasible only for the supervisor, since other shared code could not conceivably be so allocated. Shared supervisor code does have a unique location in all address spaces in IBM's MVS operating system.
The RTB may or may not be a simple structure, depending on the structure of the rest of the cache. In one case it is fairly simple: if the bits used to select the set of the cache are the same for the real and virtual address (i.e., if none of the bits used to select the set undergo translation), the RTB can be implemented by associating with each cache line two address tags [BEDE79]. If a match is not found on the virtual address, then a search is made in that set on the real address. If that search finds the line, then the virtual address tag is changed to the current virtual address. A more complex design would involve a separate mapping buffer for the reverse translation.
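The dual-tag arrangement can be sketched as follows. This is a direct-mapped toy with invented names (real designs are set-associative, store data, and carry address-space IDs); it assumes the set-index bits are untranslated, so virtual and real addresses select the same set. On a virtual-tag miss, the set is probed again on the real tag, and a hit there renames the line instead of fetching a duplicate.

```python
class VCacheLine:
    def __init__(self, vtag, rtag):
        self.vtag, self.rtag = vtag, rtag   # virtual and real address tags

class VirtualCache:
    """Toy direct-mapped virtual-address cache with a second, real-
    address tag per line for synonym detection (a sketch of the
    dual-tag scheme in BEDE79, not a real design)."""

    def __init__(self, sets):
        self.sets = sets
        self.lines = [None] * sets

    def access(self, vaddr, translate):
        idx = vaddr % self.sets            # untranslated set-index bits
        line = self.lines[idx]
        if line and line.vtag == vaddr:
            return "hit"                   # fast path: no translation
        raddr = translate(vaddr)           # TLB lookup on the miss path
        if line and line.rtag == raddr:
            line.vtag = vaddr              # synonym: rename the line
            return "synonym-hit"
        self.lines[idx] = VCacheLine(vaddr, raddr)
        return "miss"

# Two virtual addresses mapping to one real address (a synonym):
tlb = {0x1040: 0x9040, 0x2040: 0x9040}.get
c = VirtualCache(sets=64)
print(c.access(0x1040, tlb))   # miss
print(c.access(0x2040, tlb))   # synonym-hit: same set, same real tag
print(c.access(0x2040, tlb))   # hit
```

Note that the synonym probe only works because both virtual addresses fall in the same set; if translated bits chose the set, a separate RTB structure would be needed.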
`
2.10 User/Supervisor Cache

It was suggested earlier that a significant fraction of the miss ratio is due to task-switching. A possible solution to this problem is to use a cache which has been split into two parts, one of which is used only by the supervisor and the other of which is used primarily by the user state programs. If the scheduler were programmed to restart, when possible, the same user program that was running before an interrupt, then the user state miss ratio would drop appreciably. Further, if the same interrupts recur frequently, the supervisor state miss ratio may also drop. In particular, neither the user nor the supervisor would purge the cache of the other's lines. (See PEUT77 for some data relevant to this problem.) The supervisor cache may have a high miss ratio in any case due to its large working set. (See MILA75 for an example.)
Despite the appealing rationale of the above comments, there are a number of problems with the user/supervisor cache. First, it is actually unlikely to cut down the miss ratio. Most misses occur in supervisor state [MILA75], and a supervisor cache half as large as the unified cache is likely to be worse, since the maximum cache capacity is no longer available to the supervisor. Further, it is not clear what fraction of the time the scheduling algorithm can restart the same program. Second, the information used by the user and the supervisor is not entirely distinct, and cross-access must be permitted. This overlap introduces the problem of consistency.

We are aware of only one evaluation of the split user/supervisor cache [ROSS80]. In that case, an experiment was run on a Hitachi M180. The results seemed to show that the split cache performed about as well as a unified one, but poor experimental design makes the results questionable. We do not expect that a split cache will prove to be useful.
`
2.11 Input/Output through the Cache

In Section 2.7, the problem of multicache consistency was discussed. We noted that if all accesses to main memory use the same cache, then there would be no consistency problem. Precisely this approach has been used in one manufacturer's computers (Amdahl Corporation).
`
2.11.1 Overruns

While putting all input/output through the cache solves the consistency problem, it introduces other difficulties. First, there is the overrun problem. An overrun occurs when for some reason the I/O stream cannot be properly transmitted between the memory (cache or main) and the I/O device. Transmitting the I/O through the cache can cause an overrun when the line accessed by the I/O stream is not in the cache (and is thus a miss) and for some reason cannot be obtained quickly enough. Most I/O devices involve physical movement, and when the buffering capacity embedded in the I/O path is exhausted, the transfer must be aborted and then restarted. Overruns can be provoked when: (1) the cache is already processing one or more misses and cannot process the current (additional) one quickly enough; (2) more than one I/O transfer is in progress, and more active (in use) lines map into one set than the set size can accommodate; or (3) the cache bandwidth is not adequate to handle the current burst of simultaneous I/O from several devices. Overruns can be minimized if the set size of the cache is large enough, the bandwidth is high enough, and the ability to process misses is sufficient. Sufficient buffering should also be provided in the I/O paths to the devices.
`
2.11.2 Miss Ratio

Directing the input/output data streams through the cache also has an effect on the miss ratio. This I/O data occupies some fraction of the space in the cache, and this increases the miss ratio for the other users of the cache. Some experiments along these lines were run by the author, and results are shown in Figures 29-32. IORATE is the ratio of the rate of I/O accesses to the cache to the rate of CPU accesses. (I/O activity is simulated by a purely sequential synthetic address stream referencing a distinct address space from the other programs.) The miss ratio as a function of memory size and I/O transfer rate is shown in Figures 29 and 30 for two of the sets of traces. The data have been rearranged to show more directly the effect on the miss ratio in Figures 31 and 32. The results displayed here show no clear mathematical pattern, and we were unable to derive a useful and verifiable formula to predict the effect on the miss ratio of an I/O stream.

Examination of the results presented in Figures 29-32 suggests that for reasonable I/O rates (less than 0.05; see POWE77 for some I/O rate data) the miss ratio is not affected to any large extent. This observation is consistent with the known performance of the Amdahl computers, which are not seriously degraded by high I/O rates.
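The flavor of these experiments can be reproduced with a toy LRU model. This is invented illustration code, not the trace-driven simulator behind Figures 29-32: a purely sequential, never-reused I/O stream shares the cache with a looping CPU stream, and the CPU miss ratio is measured at two interleaving rates.

```python
from collections import OrderedDict

def cpu_miss_ratio(trace, capacity, io_floor=1000):
    """LRU cache shared by CPU and I/O references; miss ratio is
    counted over CPU references only (lines below io_floor)."""
    cache, cpu_refs, cpu_miss = OrderedDict(), 0, 0
    for line in trace:
        is_cpu = line < io_floor
        cpu_refs += is_cpu
        if line in cache:
            cache.move_to_end(line)
        else:
            cpu_miss += is_cpu
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict least recently used
    return cpu_miss / cpu_refs

def trace(io_every):
    """CPU loop over 24 lines; one fresh sequential I/O line after
    every io_every CPU references (io_every=0 means no I/O)."""
    out, io_line = [], 1000
    for i in range(2000):
        out.append(i % 24)
        if io_every and i % io_every == 0:
            out.append(io_line)
            io_line += 1                    # sequential, never reused
    return out

low  = cpu_miss_ratio(trace(20), capacity=32)  # IORATE = 0.05
high = cpu_miss_ratio(trace(1),  capacity=32)  # IORATE = 0.5
```

In this model the low rate leaves the CPU working set intact (only compulsory misses), while the high rate pushes CPU lines out before reuse and the CPU miss ratio collapses toward 1, consistent with the qualitative picture in Figures 31 and 32.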
`
2.12 Cache Size

Two very important questions when selecting a cache design are how large the cache should be and what kind of performance we can expect. The cache size is usually dictated by a number of criteria having to do with the cost and performance of the machine. The cache should not be so large that it represents an expense out of proportion to the added performance, nor should it occupy an unreasonable fraction of the physical space within the processor. A very large cache may also require more access circuitry, which may increase access time.

Aside from the warnings given in the paragraph above, one can generally assume that the larger the cache, the higher the hit ratio, and therefore the better the performance. The issue is then one of the relation between cache size and hit ratio. This is a very difficult problem, since the cache hit ratio varies with the workload and the machine architecture. A cache that might yield a 99.8 percent hit ratio on a PDP-11 program could result in a 90 percent or lower hit ratio for IBM (MVS) supervisor state code. This problem cannot be usefully studied using trace-driven simulation, because the miss ratio varies tremendously from program to program and only a small number of traces can possibly be analyzed. Typical trace-driven simulation results appear throughout this paper, however, and the reader may wish to scan that data for insight. There is also a variety of data available in the literature, and the reader may wish to inspect the results presented in ALSA78, BELL74, BERG76, GIBS67, LEE69,
`
`
`

`

[Figure 29 omitted: plot of miss ratio versus memory capacity (0 to 60000 bytes); traces CG01...3, PG02; Q=10000, 64 sets, 32-byte lines; curves for IORATE = 0, .01, .02, .05, .10, and .20.]

Figure 29. Miss ratio versus memory capacity while I/O occurs through cache at specified rate. IORATE refers to fraction of all memory references due to I/O.
`
[Figure 30 omitted: plot of miss ratio versus memory capacity (0 to 60000 bytes); traces WFV-APL-WTX-FFT; Q=10000, 64 sets, 32-byte lines; curves for IORATE = .01 and .025.]

Figure 30. Miss ratio versus memory capacity while I/O occurs through cache.
`
`

`

[Figure 31 omitted: plot of miss ratio versus I/O rate (0 to 0.2); curves for memory capacities including 16K and 24K.]

Figure 31. Miss ratio versus I/O rate for variety of memory capacities.
`