throbber
Ulllted States Patent [19]
`Sharma et al.
`
`US006055605A
`[11] Patent Number:
`[45] Date of Patent:
`
`6,055,605
`Apr. 25, 2000
`
`[54] TECHNIQUE FOR REDUCING LATENCY ()F
`INTER_REFERENCE ORDERING USING
`COMMIT SIGNALS IN A MULTIPROCESSOR
`
`SYSTEM HAVING SHARED CACHES
`
`_
`.
`[75] Inventors‘ Madhfmfltra Shanna’ Shrewsbury>
`Mass» Slmon C- Steely, Jr» Hudson’
`N.H.; Kourosh Gharachorloo,
`Stanford, Calif.; Stephen R. Van
`Daren’ Northborough’ Mass‘
`[73] Assigneez Compaq Computer Corporation,
`Houston, TeX.
`
`[21] App] NO; 08/957,544
`
`[22] Filed:
`
`Oct. 24, 1997
`
`[51] Int. c1.7 ........................... .. G06F 13/00; G06F 12/00
`[52] US. Cl. ........................ .. 711/130; 711/147; 711/151;
`711/152; 712/10
`711/130 147
`[58] Field of Search
`711/10 1’ 4 16
`’
`’
`’
`’
`’
`’
`References Cited
`
`[56]
`
`U-S~ PATENT DOCUMENTS
`5 182 808 1/1993 Bagnoli et al
`571937167
`3/1993
`' """"""""""
`5Z197Z148
`3/1993
`
`710/119
`
`5,504,900
`5,546,582
`5,551,005
`5,555,382
`5,761,731
`
`4/1996 RaZ ......................................... .. 707/10
`8/1996 Brockmeyer etal. .
`. 709/300
`8/1996 Sarangdhar et a1.
`711/145
`9/1996 Thaller et al.
`710/113
`6/1998 Van Doren et al. .................. .. 711/155
`
`5,787,480
`7/1998 Scales et a1. .......................... .. 711/148
`5,829,051 10/1998 Steely, Jr. et al.
`711/216
`5,841,973 11/1998 Kessler et al. ........................ .. 709/250
`
`OTHER PUBLICATIONS
`Shared Memory Consistency Models: A Tutorial, Sarita V.
`Adve, et al., Western Research Laboratory, 1995, pp. 1—28.
`Primary Examiner—Eddie P. Chan
`Assistant Examiner—Hong Kim
`Attorney, Agent, or Firm—Cesari and McKenna
`
`[57]
`
`ABSTRACT
`
`A technique reduces the latency of inter-reference Ordering
`between Sets Of memory reference Operations in a multipro
`cessor system having a shared memory that is distributed
`among a plurality of processors that share a cache. Accord
`ing to the technique, each processor sharing a cache inherits
`a commit-signal that is generated by control logic of the
`multiprocessor system in response to a memory reference
`operation issued by another processor sharing that cache.
`The commit-signal facilitates serialization among the pro
`cessors and shared memory entities of the multiprocessor
`system by indicating the apparent completion of the memory
`reference operation to those entities of the system.
`
`
`
`5,313,591 5,390,316
`
`
`
`5/1994 2/1995 Cramer et a1. ........................ .. 709/201
`
`17 Claims, 11 Drawing Sheets
`
`202
`
`204
`
`206
`
`208
`
`100
`/
`
`P
`
`I
`
`]
`
`P
`
`P
`
`P
`
`l
`
`SWITCH
`
`300
`
`IOP
`130
`
`CACHE
`1 32
`
`1 4O
`
`COHERENCE
`CONTROLLER
`180
`
`ARB
`Bus
`Jwo
`
`DTAG
`1 6O
`
`r
`
`SHARED 1 50
`MEMORY
`
`NETAPP, INC. EXHIBIT 1002
`Page 1 of 22
`
`

`
`U.S. Patent
`
`Apr. 25,2000
`
`Sheet 1 0f 11
`
`6,055,605
`
`00?
`
`o
`
`w
`
`om wow
`
`om?
`
`F 6E
`
`wOZmmmIOO
`
`EwIIOEPZOO
`
`ow?
`
`Q9
`
`09.
`
`NETAPP, INC. EXHIBIT 1002
`Page 2 of 22
`
`

`
`U.S. Patent
`
`Apr. 25,2000
`
`Sheet 2 0f 11
`
`6,055,605
`
`SHARED CACHE 250
`
`FIG 2
`
`200 \
`
`NETAPP, INC. EXHIBIT 1002
`Page 3 of 22
`
`

`
`U.S. Patent
`
`Apr. 25,2000
`
`Sheet 3 0f 11
`
`6,055,605
`
`mom
`
`wow wow wow
`
`om_.
`
`m. .QE
`
`vmm
`
`mmtmmz
`
`wOZmmwIOO
`
`EMIIOEPZOO
`
`om:
`
`_ _ _ _
`
`NETAPP, INC. EXHIBIT 1002
`Page 4 of 22
`
`

`
`U.S. Patent
`
`Apr. 25,2000
`
`Sheet 4 0f 11
`
`6,055,605
`
`w .GE
`
`F F .GE
`
`02.32560
`
`1/ 8:
`
`NETAPP, INC. EXHIBIT 1002
`Page 5 of 22
`
`

`
`U.S. Patent
`
`Apr. 25,2000
`
`Sheet 5 0f 11
`
`6,055,605
`
`N
`
`mmm
`
`
`
`mmjOmFZOO wIO<O
`
`Omm<Im
`
`mIO<O
`
`0mm
`
`m .GE
`
`ma
`
`EPZO
`
`NE
`
`E
`
`EFZO mom
`
`can
`
`NETAPP, INC. EXHIBIT 1002
`Page 6 of 22
`
`

`
`U.S. Patent
`
`Apr. 25,2000
`
`Sheet 6 0f 11
`
`6,055,605
`
`cow
`
`wmw
`
`mmm
`
`@5200
`
`mum
`
`wwmmnnz GZ< GEO
`
`m5 ~
`
`m @E
`
`NETAPP, INC. EXHIBIT 1002
`Page 7 of 22
`
`

`
`U.S. Patent
`
`Apr. 25,2000
`
`Sheet 7 0f 11
`
`6,055,605
`
`00w
`
`
`
`“Em “Em
`
`
`
`M502 M502
`
`05 N5
`
`x .UE
`
`NETAPP, INC. EXHIBIT 1002
`Page 8 of 22
`
`

`
`U.S. Patent
`oom
`
`Apr. 25,2000
`
`Sheet 8 0f 11
`
`6,055,605
`
`.SmZ
`
`mwnEDm
`
`%
`
`m .GE
`
`.SmZ_
`
`EmEmDm
`
`w'wu.
`
`NETAPP, INC. EXHIBIT 1002
`Page 9 of 22
`
`

`
`U.S. Patent
`
`Apr. 25,2000
`
`Sheet 9 0f 11
`
`6,055,605
`
`mO<uEmP2_
`
`05
`
`@620209 N5
`
`om?
`
`wOZwmwIOO
`
`EMJJOEFZOO
`
`om?
`
`wow
`
`wow
`
`wow
`
`wow
`
`oom
`
`NETAPP, INC. EXHIBIT 1002
`Page 10 of 22
`
`

`
`U.S. Patent
`
`Apr. 25,2000
`
`Sheet 10 0f 11
`
`6,055,605
`
`2. .GE
`
`NETAPP, INC. EXHIBIT 1002
`Page 11 of 22
`
`

`
`U.S. Patent
`
`M
`
`MH.m_h__S
`
`5096
`
`5m
`
`mmmmnmn5%.I
`
`NQIIIIIIIIIII
`5:-I-ml-II-Illll-I!
`2-I-I-II--I-ll-II:
`
`r.-I-I--|--I--ml-I'll-Ill-=MrIII-In--I--I--I-=ml--II-=
`_|_fla_._....I......II........I......._I.......I.......I......
`-I-I-I---I-=-I---In--II--=
`
`59N»GE
`
`
`
`
`
`mm_u_u5m_mmrrsmmmtamHmmmuim
`
`|«om_
`
`
`
`
`
`5&2.._.Dn_Z_._.Dn_Z_llorfiS12.
`
`
`
`
`
`NETAPP, INC. EXHIBIT 1002
`
`Page 12 of 22
`
`NETAPP, INC. EXHIBIT 1002
`Page 12 of 22
`
`

`
`6,055,605
`
`1
`TECHNIQUE FOR REDUCING LATENCY OF
`INTER-REFERENCE ORDERING USING
`COMMIT SIGNALS IN A MULTIPROCESSOR
`SYSTEM HAVING SHARED CACHES
`
`CROSS-REFERENCE TO RELATED
`APPLICATION
`
`This invention is related to the US. patent application Ser.
`No. 08/957,097 titled, Method and Apparatus for Reducing
`Latency of Inter-Reference Ordering in a Multiprocessor
`System by Sharma et al., Which Was ?led on even date
`hereWith and assigned to the assignee of the present
`invention, and Which application is hereby incorporated by
`reference as though fully set forth herein.
`
`10
`
`15
`
`FIELD OF THE INVENTION
`
`The invention relates to multiprocessor systems and, more
`particularly, to the ef?cient ordering of memory reference
`operations issued by a processor of a multiprocessor system
`having a shared cache.
`
`20
`
`BACKGROUND OF THE INVENTION
`
`Multiprocessing systems, such as symmetric multi
`processors, provide a computer environment Wherein soft
`Ware applications may operate on a plurality of processors
`using a single address space or shared memory abstraction.
`In a shared memory system, each processor can access any
`data item Without a programmer having to Worry about
`Where the data is or hoW to obtain its value; this frees the
`programmer to focus on program development, e.g.,
`algorithms, rather than managing partitioned data sets and
`communicating values. Interprocessor synchronization is
`typically accomplished in a shared memory system betWeen
`processors performing read and Write operations to “syn
`chronization variables” either before and after accesses to
`“data variables”.
`For instance, consider the case of a processor P1 updating
`a data structure and processor P2 reading the updated
`structure after synchronization. Typically, this is accom
`plished by P1 updating data values and subsequently setting
`a semaphore or ?ag variable to indicate to P2 that the data
`values have been updated. P2 checks the value of the ?ag
`variable and, if set, subsequently issues read operations
`(requests) to retrieve the neW data values. Note the signi?
`cance of the term “subsequently” used above; if P1 sets the
`?ag before it completes the data updates or if P2 retrieves the
`data before it checks the value of the ?ag, synchronization
`is not achieved. The key is that each processor must indi
`vidually impose an order on its memory references for such
`synchronization techniques to Work. The order described
`above is referred to as a processor’s inter-reference order.
`Commonly used synchronization techniques require that
`each processor be capable of imposing an inter-reference
`order on its issued memory reference operations.
`
`P1
`
`Store
`Store
`
`Data, New-value
`Flag, 0
`
`P2
`
`L1:
`
`Load
`BNZ
`Load
`
`Flag
`L1
`Data
`
`25
`
`30
`
`35
`
`40
`
`45
`
`55
`
`60
`
`The inter-reference order imposed by a processor is
`de?ned by its memory reference ordering model or, more
`commonly, its consistency model. The consistency model
`
`65
`
`2
`for a processor architecture speci?es, in part, a means by
`Which the inter-reference order is speci?ed. Typically, the
`means is realized by inserting a special memory reference
`ordering instruction, such as a Memory Barrier (MB) or
`“fence”, betWeen sets of memory reference instructions.
`Alternatively, the means may be implicit in other opcodes,
`such as in “test-and-set”. In addition, the model speci?es the
`precise semantics (meaning) of the means. TWo commonly
`used consistency models include sequential consistency and
`Weak-ordering, although those skilled in the art Will recog
`nize that there are other models that may be employed, such
`as release consistency.
`Sequential Consistency
`In a sequentially consistent system, the order in Which
`memory reference operations appear in an execution path of
`the program (herein referred to as the “I-stream order”) is the
`inter-reference order. Additional instructions are not
`required to denote the order simply because each load or
`store instruction is considered ordered before its succeeding
`operation in the I-stream order.
`Consider the program example beloW. The program per
`forms as expected on a sequentially consistent system
`because the system imposes the necessary inter-reference
`order. That is, Pl’s ?rst store instruction is ordered before
`Pl’s store-to-?ag instruction. Similarly, P2’s load ?ag
`instruction is ordered before P2’s load data instruction.
`Thus, if the system imposes the correct inter-reference
`ordering and P2 retrieves the value 0 for the ?ag, P2 Will also
`retrieve the neW value for data.
`Weak Ordering
`In a Weakly-ordered system, an order is imposed betWeen
`selected sets of memory reference operations, While other
`operations are considered unordered. One or more MB
`instructions are used to indicate the required order. In the
`case of an MB instruction de?ned by the Alpha® 21264
`processor instruction set, the MB denotes that all memory
`reference instructions above the MB (i.e., pre-MB
`instructions) are ordered before all reference instructions
`after the MB (i.e., post-MB instructions). HoWever, no order
`is required betWeen reference instructions that are not sepa
`rated by an MB.
`
`P1:
`
`Store
`Store
`MB
`Store
`
`Data1, NeW—value1
`Data2, NeW—value2
`
`Flag, 0
`
`P2:
`
`L1:
`
`Load
`
`Flag
`
`BNZ
`MB
`Load
`Load
`
`L1
`
`Data1
`Data2
`
`In above example, the MB instruction implies that each of
`Pl’s tWo pre-MB store instructions are ordered before Pl’s
`store-to-?ag instruction. HoWever, there is no logical order
`required betWeen the tWo pre-MB store instructions.
`Similarly, P2’s tWo post-MB load instructions are ordered
`after the Load ?ag; hoWever, there is no order required
`betWeen the tWo post-MB loads. It can thus be appreciated
`that Weak ordering reduces the constraints on logical order
`ing of memory references, thereby alloWing a processor to
`gain higher performance by potentially executing the unor
`dered sets concurrently.
`The prior art includes other types of barriers as described
`in literature and as implemented on commercial processors.
`For example, a Write-MB (WMB) instruction on an Alpha
`
`NETAPP, INC. EXHIBIT 1002
`Page 13 of 22
`
`

`
`6,055,605
`
`3
`microprocessor requires only that pre-WMB store instruc
`tions be logically ordered before any post-WMB stores. In
`other Words, the WMB instruction places no constraints at
`all on load instructions occurring before or after the WMB.
`In order to increase performance, modern processors do
`not execute memory reference instructions one at a time. It
`is desirable that a processor keep a large number of memory
`references outstanding and issue, as Well as complete,
`memory reference operations out-of-order. This is accom
`plished by vieWing the consistency model as a “logical
`order”, i.e., the order in Which memory reference operations
`appear to happen, rather than the order in Which those
`references are issued or completed. More precisely, a con
`sistency model de?nes only a logical order on memory
`references; it alloWs for a variety of optimiZations in imple
`mentation. It is thus desired to increase performance by
`reducing latency and alloWing (on average) a large number
`of outstanding references, While preserving the logical order
`implied by the consistency model.
`In prior systems, a memory barrier instruction is typically
`contingent upon “completion” of an operation. For example,
`When a source processor issues a read operation, the opera
`tion is considered complete When data is received at the
`source processor. When executing a store instruction, the
`source processor issues a memory reference operation to
`acquire exclusive oWnership of the data; in response to the
`issued operation, system control logic generates “probes” to
`invalidate old copies of the data at other processors and to
`request forWarding of the data from the oWner processor to
`the source processor. Here the operation completes only
`When all probes reach their destination processors and the
`data is received at the source processor.
`Broadly stated, these prior systems rely on completion to
`impose inter-reference ordering. For instance, in a Weakly
`ordered system employing MB instructions, all pre-MB
`operations must be complete before the MB is passed and
`post-MB operations may be considered. Essentially,
`“completion” of an operation requires actual completion of
`all activity, including receipt of data and acknowledgments
`for probes, corresponding to the operation. Such an arrange
`ment is inef?cient and, in the context of inter-reference
`ordering, adversely affects latency.
`
`SUMMARY OF THE INVENTION
`
`The invention relates to a technique for reducing the
`latency of inter-reference ordering in a multiprocessor sys
`tem Wherein processor modules comprise at least tWo pro
`cessors sharing a cache. According to the technique, each
`processor sharing a cache selectively inherits a commit
`signal that is generated by control logic of the multiproces
`sor system in response to a memory reference operation
`issued by the other processor sharing that cache. The
`commit-signal facilitates inter-reference ordering;
`moreover, the commit signal indicates the apparent comple
`tion of the memory reference operation, rather than actual
`completion of the operation. Notably, the apparent comple
`tion of an operation occurs substantially sooner than the
`actual completion of an operation, thereby improving per
`formance of the multiprocessor system.
`In the illustrative embodiment, inter-reference ordering
`may be imposed by a memory barrier (MB) instruction
`inserted betWeen memory reference instructions of a pro
`gram executed by a processor. Execution of these instruc
`tions Within the processor may cause out-of-order issuance
`and completion of external memory reference operations
`due to operational latencies throughout the system. To
`
`4
`ensure correct implementation of the consistency model,
`prior systems inhibit program execution past the MB
`instruction until actual completion of all pre-MB operations
`have been con?rmed to the processor. According to the
`present invention, hoWever, program execution may proceed
`past the MB instruction once the commit signals for the
`operations have been received by the processor.
`As described herein, the multiprocessing system may
`comprise
`a symmetric multiprocessing (SMP) node
`Wherein the processor and shared memory entities are inter
`connected by a local sWitch or (ii) a SMP system Wherein a
`plurality of nodes are interconnected by a hierarchical
`sWitch. Each processor preferably has a private cache for
`storing data and changes to the data as a result of the
`memory reference operations are re?ected among the enti
`ties via the transmission of probe commands in accordance
`With a conventional cache coherence protocol. Notably,
`associated With the system is an ordering point. Speci?cally,
`in the SMP node, the ordering point is associated With the
`local sWitch. In the SMP system, the ordering point is
`associated With the hierarchical sWitch.
`As an example of a SMP node With an oWnership-based,
`Write-invalidate cache coherence protocol, a requesting pro
`cessor issues a memory reference operation to the system
`requesting particular data. Upon determining Which entity is
`the oWner of the data and Which entities have valid copies of
`the data, the ordering point totally orders the memory
`reference operation With respect to the other issued refer
`ences using, e.g., a conventional arbitration or prioritiZation
`policy. Thereafter, the ordering point generates probes to
`invalidate any copies of the data at appropriate processors
`and/or to request forWarding of the data from the oWner
`processor to the requesting processor, as required by the
`memory operation. Signi?cantly, the ordering point also
`generates the commit-signal at this time for transmission to
`the requesting processor. Probes and commit signals are
`generated atomically for transmission to the appropriate
`processors. The net result is that all memory operations
`appear totally ordered.
`Ordering of the requested memory reference operation
`With respect to memory references issued by other proces
`sors of the system constitutes a commit-event for the
`requested operation. For the SMP node embodiment, the
`commit-event is the point at Which the memory reference
`operation is ordered at the local sWitch, Whereas for the SMP
`system the commit-event occurs When the memory reference
`operation is ordered at the hierarchical sWitch. In accordance
`With the present invention, the commit-signal is preferably
`transmitted to the requesting processor upon the occurrence
`of, or after, such a commit-event.
`Speci?cally, issuance of the memory reference operation
`by the requesting processor preferably increments a counter
`and receipt by the processor of the commit-signal responsive
`to the issued memory reference decrements the counter.
`When program execution reaches the MB instruction and the
`counter realiZes a value of Zero, the previously issued
`operations are considered committed and execution of the
`program may proceed past the MB. In accordance With the
`present invention Where at least tWo processors share a
`cache, each processor has its oWn counter and each selec
`tively inherits the other’s commit-signal before it may
`proceed past the MB instruction.
`As an example using an oWnership-based, Write
`invalidate cache coherence protocol, a ?rst processor
`accesses the shared cache for a data item. If the data is not
`present in the cache, the ?rst processor issues a memory
`
`10
`
`15
`
`25
`
`35
`
`45
`
`55
`
`65
`
`NETAPP, INC. EXHIBIT 1002
`Page 14 of 22
`
`

`
`6,055,605
`
`5
`reference operation to the system requesting the data and
`increments its counter. The issued memory reference opera
`tion is ordered by the ordering point, a request for the data
`is forWarded (if necessary) to the appropriate entity and that
`entity returns the data to the shared cache. Substantially
`simultaneously, a commit-signal is generated by the order
`ing point in response to the issued operation and returned to
`the ?rst processor. MeanWhile, a second processor accesses
`the shared cache for the data item and “hits” on the cache
`since the data has returned. If the commit-signal for the ?rst
`processor’s outstanding operation has yet to return, the
`second processor increments its counter to ensure that the
`inter-reference ordering imposed by the MB instruction is
`maintained. Once the commit-signal is received, each pro
`cessor decrements its counter signifying that the operation
`has committed.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The above and further advantages of the invention may be
`better understood by referring to the folloWing description in
`conjunction With the accompanying draWings in Which like
`reference numbers indicate identical or functionally similar
`elements:
`FIG. 1 is a schematic block diagram of a ?rst multipro
`cessor node embodiment comprising a plurality of processor
`modules coupled to an input/output processor and a memory
`by a local sWitch;
`FIG. 2 is a schematic diagram of a processor module
`comprising at least tWo processors coupled to a shared
`cache;
`FIG. 3 is a schematic block diagram of the local sWitch
`comprising a plurality of ports coupled to the respective
`processor modules of FIG. 1;
`FIG. 4 is a schematic diagram of an embodiment of a
`commit-signal implemented as a special probe-type,
`commit-signal packet;
`FIG. 5 is a schematic diagram of an illustrative embodi
`ment of a processor module including a miss address table
`(MAT) used for implementing a novel commit-signal inher
`iting technique in accordance With the invention;
`FIG. 6 is a schematic diagram of the MAT in accordance
`With the invention;
`FIG. 7 is a schematic block diagram of a second multi
`processing system embodiment comprising a plurality of
`multiprocessor nodes interconnected by a hierarchical
`sWitch Which may be advantageously used With the present
`invention;
`FIG. 8 is a schematic block diagram of the hierarchical
`sWitch;
`FIG. 9 is a schematic block diagram of an augmented
`multiprocessor node comprising a plurality of processors
`interconnected With a shared memory, an IOP and a global
`port interface via a local sWitch;
`FIG. 10 illustrates an embodiment of a IJoopComSig
`table;
`FIG. 11 is a schematic diagram of an incoming command
`packet modi?ed With a multicast-vector; and
`FIG. 12 is a schematic block diagram illustrating a total
`ordering property of an illustrative embodiment of the
`hierarchical sWitch.
`
`10
`
`15
`
`25
`
`35
`
`45
`
`55
`
`DETAILED DESCRIPTION OF THE
`ILLUSTRAT IVE EMBODIMENTS
`As described herein, a hierarchical symmetric multi
`processing (SMP) system includes a number of SMP nodes
`
`65
`
`6
`interconnected via a high performance sWitch. Each SMP
`node thus functions as a building block in the SMP system.
`BeloW, the structure and operation of an SMP node embodi
`ment that may be advantageously used With the present
`invention is ?rst described, folloWed by a description of the
`SMP system embodiment.
`SMP Node
`FIG. 1 is a schematic block diagram of a ?rst multipro
`cessing system embodiment, such as a small SMP node 100,
`comprising a plurality of processor modules (P) 202—208
`coupled to an input/output (I/O) processor 130 and a
`memory 150 by a local sWitch 300. The memory 150 is
`preferably organiZed as a single address space that is shared
`by the processors and apportioned into a number of blocks,
`each of Which may include, e.g., 64 bytes of data. The U0
`processor, or IOP 130, controls the transfer of data betWeen
`external devices (not shoWn) and the system via an I/O bus
`140. Data is transferred betWeen the components of the SMP
`node in the form of packets. As used herein, the term
`“system” refers to all components of the SMP node exclud
`ing the processors and IOP. In an embodiment of the
`invention, the I/O bus may operate according to the con
`ventional Peripheral Computer Interconnect (PCI) protocol.
`FIG. 2 is a schematic diagram of a processor module 200
`comprising at least tWo processors P1 and P2 coupled to a
`shared cache 250. Each processor is a modern processor
`comprising a central processing unit (CPU) that preferably
`incorporates a traditional reduced instruction set computer
`(RISC) load/store architecture. In the illustrative embodi
`ment described herein, the CPUs are Alpha® 21264 proces
`sor chips manufactured by Digital Equipment
`Corporation®, although other types of processor chips may
`be advantageously used. For example in an alternate
`embodiment, each processor may comprise a multi-threaded
`modern processor capable of executing multiple threads of
`instructions.
`The load/store instructions executed by the processors are
`issued to the system as memory reference, e.g., read and
`Write, operations. Each operation may comprise a series of
`commands (or command packets) that are exchanged
`betWeen the processors and the system. As described further
`herein, characteristics of modern processors include the
`ability to issue memory reference operations out-of-order, to
`have more than one memory reference outstanding at a time
`and to accommodate completion of the memory reference
`operations in arbitrary order.
`In addition, each processor module 200 employs a cache
`250 that is shared among its processors (or threads) and that
`stores data determined likely to be accessed in the future.
`The shared cache 250 is preferably organiZed as a Write-back
`cache apportioned into, e.g., 64-byte cache lines accessible
`by the processors; it should be noted, hoWever, that other
`cache organiZations, such as Write-through caches, may be
`used in connection With the principles of the invention. It
`should be further noted that memory reference operations
`issued by the processors are preferably directed to a 64-byte
`cache line granularity. Since the processors may update data
`from their shared cache Without updating shared memory
`150, a cache coherence protocol is utiliZed to maintain
`consistency among the caches.
`The cache coherence protocol of the illustrative embodi
`ment is preferably a conventional Write-invalidate,
`oWnership-based protocol. “Write-Invalidate” implies that
`When a processor modi?es a cache line, it invalidates stale
`copies in other processors’ caches rather than updating them
`With the neW value. The protocol is termed an “oWnership
`
`NETAPP, INC. EXHIBIT 1002
`Page 15 of 22
`
`

`
`6,055,605
`
`7
`protocol” because there is always an identi?able oWner for
`a cache line, Whether it is shared memory, one of the
`processors or the IOP entities of the system. The oWner of
`the cache line is responsible for supplying the up-to-date
`value of the cache line When requested. A processor/IOP
`may oWn a cache line in one of tWo states: “exclusively” or
`“shared”. If a processor has exclusive oWnership of a cache
`line, it may update it Without informing the system.
`OtherWise, it must inform the system and potentially invali
`date copies in the other caches.
`A shared data structure 160 is provided for capturing and
`maintaining status information corresponding to the states of
`data used by the system. In the illustrative embodiment, the
`shared data structure is con?gured as a conventional dupli
`cate tag store (DTAG) 160 that cooperates With the indi
`vidual caches of the system to de?ne the coherence protocol
`states of the data in the system. The protocol states of the
`DTAG 160 are administered by a coherence controller 180,
`Which may be implemented as a plurality of hardWare
`registers and combinational logic con?gured to produce a
`sequential logic circuit, such as a state machine. It should be
`noted, hoWever, that other con?gurations of the controller
`and shared data structure may be advantageously used
`herein.
`The DTAG 160, coherence controller 180, IOP 130 and
`shared memory 150 are interconnected by a logical bus
`referred to an Arb bus 170. Memory reference operations
`issued by the processors are routed via the local sWitch 200
`to the Arb bus 170. The order in Which the actual memory
`reference commands appear on the Arb bus is the order in
`Which processors perceive the results of those commands. In
`accordance With this embodiment of the invention, though,
`the Arb bus 170 and the coherence controller 180 cooperate
`to provide an ordering point, as described herein.
`The commands described herein are de?ned by the
`Alpha® memory system interface and may be classi?ed into
`three types: requests, probes, and responses. Requests are
`commands that are issued by a processor When, as a result
`of executing a load or store instruction, it must obtain a copy
`of data. Requests are also used to gain exclusive oWnership
`to a data item (cache line) from the system. Requests include
`Read (Rd) commands, Read/Modify (RdMod) commands,
`Change-to-Dirty (CTD) commands, Victim commands, and
`Evict commands, the latter of Which specify removal of a
`cache line from a respective cache.
`Probes are commands issued by the system to one or more
`processors requesting data and/or cache tag status updates.
`Probes include ForWarded Read (Frd) commands, For
`Warded Read Modify (FRdMod) commands and Invalidate
`(Inval) commands. When a processor P issues a request to
`the system, the system may issue one or more probes (via
`probe packets) to other processors. For example if P requests
`a copy of a cache line (a Rd request), the system sends a
`probe to the oWner processor (if any). If P requests exclusive
`oWnership of a cache line (a CTD request), the system sends
`Inval probes to one or more processors having copies of the
`cache line. If P requests both a copy of the cache line as Well
`as exclusive oWnership of the cache line (a RdMod request)
`the system sends a FRd probe to a processor currently
`storing a dirty copy of a cache line of data. In response to the
`Frd probe, the dirty copy of the cache line is returned to the
`system. A FRdMod probe is also issued by the system to a
`processor storing a dirty copy of a cache line. In response to
`the FRdMod probe, the dirty cache line is returned to the
`system and the dirty copy stored in the cache is invalidated.
`An Inval probe may be issued by the system to a processor
`storing a copy of the cache line in its cache When the cache
`line is to be updated by another processor.
`
`10
`
`15
`
`25
`
`35
`
`45
`
`55
`
`65
`
`8
`Responses are commands from the system to processors/
`lOPs Which carry the data requested by the processor or an
`acknowledgment corresponding to a request. For Rd and
`RdMod requests, the response is a Fill and FillMod
`response, respectively, each of Which carries the requested
`data. For a CTD request, the response is a CTD-Success
`(Ack) or CTD-Failure (Nack) response, indicating success
`or failure of the CTD, Whereas for a Victim request, the
`response is a Victim-Release response.
`
`FIG. 3 is a schematic block diagram of the local sWitch
`300 comprising a plurality of ports 302—310, each of Which
`is coupled to a respective processor module (P) 202—208 and
`IOP 130 via a full-duplex, bi-directional clock forWarded
`data link. Each port includes a respective input queue
`312—320 for receiving, e.g., a memory reference request
`issued by its processor module and a respective output queue
`322—330 for receiving, e.g., a memory reference probe
`issued by system control logic associated With the sWitch.
`An arbiter 340 arbitrates among the input queues to grant
`access to the Arb bus 170 Where the requests are ordered into
`a memory reference request stream. In the illustrative
`embodiment, the arbiter selects the requests stored in the
`input queues for access to the bus in accordance With an
`arbitration policy, such as a conventional round-robin algo
`rithm.
`The folloWing example illustrates the typical operation of
`multiprocessing system including sWitch 300. A Rd request
`for data item x is received at the sWitch 300 from P1 of, e.g.,
`processor module 202, and loaded into input queue 312. The
`arbiter 340 selects the request in accordance With the arbi
`tration algorithm. Upon gaining access to the Arb bus 170,
`the selected request is routed to the ordering point 350
`Wherein the states of the corresponding cache lines are
`interrogated in the DTAG 160. Speci?cally, the coherence
`controller 180 examines the DTAG to determine Which
`entity of the system “oWns” the cache line and Which entities
`have copies of the line. If a processor of module 206 is the
`oWner of the cache line x and a processor of module 208 has
`a copy, the coherence controller generates the necessary
`probes (e.g., a Fill x and Inval x) and forWards them to the
`output queues 326 and 328 for transmission to the proces
`sors.
`
`Because of operational latencies through the sWitch and
`data paths of the system, memory reference requests issued
`by P1 may complete out-of-order. In some cases, out-of
`order completion may affect the consistency of data in the
`system, particularly for updates to a cache line. Memory
`consistency models provide formal speci?cations of hoW
`such updates become visible to the entities of the multipro
`cessor system. In the illustrative embodiment of the present
`invention, a Weak ordering consistency model is described,
`although it Will be apparent to those skilled in the art that
`other consistency models may be used.
`
`In a Weakly-ordered system, inter-reference ordering is
`typically imposed by a memory barrier (MB) instruction
`inserted betWeen memory reference instructions of a pro
`gram executed by a processor. The MB instruction separates
`and groups those instructions of a program that need order
`ing from the rest of the instructions. The semantics of Weak
`ordering mandate that all pre-MB memory reference opera
`tions are logically ordered before all post-MB references.
`
`NETAPP, INC. EXHIBIT 1002
`Pa

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket