Sharma et al.

US006055605A

[11] Patent Number:   6,055,605
[45] Date of Patent:  Apr. 25, 2000
`
[54] TECHNIQUE FOR REDUCING LATENCY OF
     INTER-REFERENCE ORDERING USING
     COMMIT SIGNALS IN A MULTIPROCESSOR
     SYSTEM HAVING SHARED CACHES

[75] Inventors: Madhumitra Sharma, Shrewsbury, Mass.;
                Simon C. Steely, Jr., Hudson, N.H.;
                Kourosh Gharachorloo, Stanford, Calif.;
                Stephen R. Van Doren, Northborough, Mass.

[73] Assignee:  Compaq Computer Corporation,
                Houston, Tex.
`
[21] Appl. No.: 08/957,544

[22] Filed:     Oct. 24, 1997

[51] Int. Cl.7 .......................... G06F 13/00; G06F 12/00
[52] U.S. Cl. ........................... 711/130; 711/147; 711/151;
                                          711/152; 712/10
[58] Field of Search .................... 711/130, 147
[56]                  References Cited

                U.S. PATENT DOCUMENTS
 5,182,808   1/1993  Bagnoli et al. .................... 710/119
 5,193,167   3/1993
 5,197,148   3/1993
 5,313,591   5/1994
 5,390,316   2/1995  Cramer et al. ..................... 709/201
 5,504,900   4/1996  Raz ............................... 707/10
 5,546,582   8/1996  Brockmeyer et al. ................. 709/300
 5,551,005   8/1996  Sarangdhar et al. ................. 711/145
 5,555,382   9/1996  Thaller et al. .................... 710/113
 5,761,731   6/1998  Van Doren et al. .................. 711/155
 5,787,480   7/1998  Scales et al. ..................... 711/148
 5,829,051  10/1998  Steely, Jr. et al. ................ 711/216
 5,841,973  11/1998  Kessler et al. .................... 709/250
`
`OTHER PUBLICATIONS
`Shared Memory Consistency Models: A Tutorial, Sarita V.
`Adve, et al., Western Research Laboratory, 1995, pp. 1—28.
`Primary Examiner—Eddie P. Chan
`Assistant Examiner—Hong Kim
`Attorney, Agent, or Firm—Cesari and McKenna
`
`[57]
`
`ABSTRACT
`
`A technique reduces the latency of inter-reference Ordering
`between Sets Of memory reference Operations in a multipro
`cessor system having a shared memory that is distributed
`among a plurality of processors that share a cache. Accord
`ing to the technique, each processor sharing a cache inherits
`a commit-signal that is generated by control logic of the
`multiprocessor system in response to a memory reference
`operation issued by another processor sharing that cache.
`The commit-signal facilitates serialization among the pro
`cessors and shared memory entities of the multiprocessor
`system by indicating the apparent completion of the memory
`reference operation to those entities of the system.
`
`
`
`5,313,591 5,390,316
`
`
`
`5/1994 2/1995 Cramer et a1. ........................ .. 709/201
`
`17 Claims, 11 Drawing Sheets
`
`202
`
`204
`
`206
`
`208
`
`100
`/
`
`P
`
`I
`
`]
`
`P
`
`P
`
`P
`
`l
`
`SWITCH
`
`300
`
`IOP
`130
`
`CACHE
`1 32
`
`1 4O
`
`COHERENCE
`CONTROLLER
`180
`
`ARB
`Bus
`Jwo
`
`DTAG
`1 6O
`
`r
`
`SHARED 1 50
`MEMORY
`
`NETAPP, INC. EXHIBIT 1002
`Page 1 of 22
`
`
`
[Drawing sheets 1 through 11, containing FIGS. 1-12, which are described in the Brief Description of the Drawings below.]
`
`
`
`6,055,605
`
`1
TECHNIQUE FOR REDUCING LATENCY OF
INTER-REFERENCE ORDERING USING
COMMIT SIGNALS IN A MULTIPROCESSOR
SYSTEM HAVING SHARED CACHES

CROSS-REFERENCE TO RELATED APPLICATION

This invention is related to the U.S. patent application Ser. No. 08/957,097 titled, Method and Apparatus for Reducing Latency of Inter-Reference Ordering in a Multiprocessor System by Sharma et al., which was filed on even date herewith and assigned to the assignee of the present invention, and which application is hereby incorporated by reference as though fully set forth herein.
`
`10
`
`15
`
`FIELD OF THE INVENTION
`
The invention relates to multiprocessor systems and, more particularly, to the efficient ordering of memory reference operations issued by a processor of a multiprocessor system having a shared cache.
`
`BACKGROUND OF THE INVENTION
`
Multiprocessing systems, such as symmetric multiprocessors, provide a computer environment wherein software applications may operate on a plurality of processors using a single address space or shared memory abstraction. In a shared memory system, each processor can access any data item without a programmer having to worry about where the data is or how to obtain its value; this frees the programmer to focus on program development, e.g., algorithms, rather than managing partitioned data sets and communicating values. Interprocessor synchronization is typically accomplished in a shared memory system between processors performing read and write operations to "synchronization variables" either before and after accesses to "data variables".

For instance, consider the case of a processor P1 updating a data structure and processor P2 reading the updated structure after synchronization. Typically, this is accomplished by P1 updating data values and subsequently setting a semaphore or flag variable to indicate to P2 that the data values have been updated. P2 checks the value of the flag variable and, if set, subsequently issues read operations (requests) to retrieve the new data values. Note the significance of the term "subsequently" used above; if P1 sets the flag before it completes the data updates or if P2 retrieves the data before it checks the value of the flag, synchronization is not achieved. The key is that each processor must individually impose an order on its memory references for such synchronization techniques to work. The order described above is referred to as a processor's inter-reference order. Commonly used synchronization techniques require that each processor be capable of imposing an inter-reference order on its issued memory reference operations.
`
    P1:        Store   Data, New-value
               Store   Flag, 0

    P2:   L1:  Load    Flag
               BNZ     L1
               Load    Data
`25
`
`30
`
`35
`
`40
`
`45
`
`55
`
`60
`
The inter-reference order imposed by a processor is defined by its memory reference ordering model or, more commonly, its consistency model. The consistency model for a processor architecture specifies, in part, a means by which the inter-reference order is specified. Typically, the means is realized by inserting a special memory reference ordering instruction, such as a Memory Barrier (MB) or "fence", between sets of memory reference instructions. Alternatively, the means may be implicit in other opcodes, such as in "test-and-set". In addition, the model specifies the precise semantics (meaning) of the means. Two commonly used consistency models include sequential consistency and weak-ordering, although those skilled in the art will recognize that there are other models that may be employed, such as release consistency.
Sequential Consistency

In a sequentially consistent system, the order in which memory reference operations appear in an execution path of the program (herein referred to as the "I-stream order") is the inter-reference order. Additional instructions are not required to denote the order simply because each load or store instruction is considered ordered before its succeeding operation in the I-stream order.

Consider the program example above. The program performs as expected on a sequentially consistent system because the system imposes the necessary inter-reference order. That is, P1's first store instruction is ordered before P1's store-to-flag instruction. Similarly, P2's load flag instruction is ordered before P2's load data instruction. Thus, if the system imposes the correct inter-reference ordering and P2 retrieves the value 0 for the flag, P2 will also retrieve the new value for data.
Weak Ordering

In a weakly-ordered system, an order is imposed between selected sets of memory reference operations, while other operations are considered unordered. One or more MB instructions are used to indicate the required order. In the case of an MB instruction defined by the Alpha® 21264 processor instruction set, the MB denotes that all memory reference instructions above the MB (i.e., pre-MB instructions) are ordered before all reference instructions after the MB (i.e., post-MB instructions). However, no order is required between reference instructions that are not separated by an MB.
`
    P1:        Store   Data1, New-value1
               Store   Data2, New-value2
               MB
               Store   Flag, 0

    P2:   L1:  Load    Flag
               BNZ     L1
               MB
               Load    Data1
               Load    Data2
`
In the above example, the MB instruction implies that each of P1's two pre-MB store instructions is ordered before P1's store-to-flag instruction. However, there is no logical order required between the two pre-MB store instructions. Similarly, P2's two post-MB load instructions are ordered after the Load flag; however, there is no order required between the two post-MB loads. It can thus be appreciated that weak ordering reduces the constraints on logical ordering of memory references, thereby allowing a processor to gain higher performance by potentially executing the unordered sets concurrently.
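
For readers who think in terms of modern high-level memory models, the following is a minimal sketch of the same producer/consumer pattern using C11 atomics; the explicit fences play the role of the MB (the Alpha MB is a full barrier, so release/acquire fences are merely sufficient for this particular pattern), and the names data1, data2 and flag are illustrative assumptions rather than anything defined in this description.

    #include <stdatomic.h>

    int data1, data2;          /* ordinary data, corresponding to Data1/Data2 */
    atomic_int flag = 1;       /* nonzero until the data has been published   */

    void producer(void)        /* corresponds to P1 */
    {
        data1 = 1;             /* pre-MB stores: no order required between them */
        data2 = 2;
        atomic_thread_fence(memory_order_release);                /* plays the role of the MB */
        atomic_store_explicit(&flag, 0, memory_order_relaxed);    /* Store Flag, 0 */
    }

    void consumer(void)        /* corresponds to P2 */
    {
        while (atomic_load_explicit(&flag, memory_order_relaxed) != 0)
            ;                  /* Load Flag; BNZ L1 */
        atomic_thread_fence(memory_order_acquire);                /* plays the role of the MB */
        int a = data1;         /* post-MB loads: no order required between them */
        int b = data2;
        (void)a; (void)b;
    }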
The prior art includes other types of barriers as described in literature and as implemented on commercial processors. For example, a Write-MB (WMB) instruction on an Alpha microprocessor requires only that pre-WMB store instructions be logically ordered before any post-WMB stores. In other words, the WMB instruction places no constraints at all on load instructions occurring before or after the WMB.
In order to increase performance, modern processors do not execute memory reference instructions one at a time. It is desirable that a processor keep a large number of memory references outstanding and issue, as well as complete, memory reference operations out-of-order. This is accomplished by viewing the consistency model as a "logical order", i.e., the order in which memory reference operations appear to happen, rather than the order in which those references are issued or completed. More precisely, a consistency model defines only a logical order on memory references; it allows for a variety of optimizations in implementation. It is thus desired to increase performance by reducing latency and allowing (on average) a large number of outstanding references, while preserving the logical order implied by the consistency model.

In prior systems, a memory barrier instruction is typically contingent upon "completion" of an operation. For example, when a source processor issues a read operation, the operation is considered complete when data is received at the source processor. When executing a store instruction, the source processor issues a memory reference operation to acquire exclusive ownership of the data; in response to the issued operation, system control logic generates "probes" to invalidate old copies of the data at other processors and to request forwarding of the data from the owner processor to the source processor. Here the operation completes only when all probes reach their destination processors and the data is received at the source processor.

Broadly stated, these prior systems rely on completion to impose inter-reference ordering. For instance, in a weakly-ordered system employing MB instructions, all pre-MB operations must be complete before the MB is passed and post-MB operations may be considered. Essentially, "completion" of an operation requires actual completion of all activity, including receipt of data and acknowledgments for probes, corresponding to the operation. Such an arrangement is inefficient and, in the context of inter-reference ordering, adversely affects latency.
`
SUMMARY OF THE INVENTION

The invention relates to a technique for reducing the latency of inter-reference ordering in a multiprocessor system wherein processor modules comprise at least two processors sharing a cache. According to the technique, each processor sharing a cache selectively inherits a commit-signal that is generated by control logic of the multiprocessor system in response to a memory reference operation issued by the other processor sharing that cache. The commit-signal facilitates inter-reference ordering; moreover, the commit signal indicates the apparent completion of the memory reference operation, rather than actual completion of the operation. Notably, the apparent completion of an operation occurs substantially sooner than the actual completion of an operation, thereby improving performance of the multiprocessor system.

In the illustrative embodiment, inter-reference ordering may be imposed by a memory barrier (MB) instruction inserted between memory reference instructions of a program executed by a processor. Execution of these instructions within the processor may cause out-of-order issuance and completion of external memory reference operations due to operational latencies throughout the system. To ensure correct implementation of the consistency model, prior systems inhibit program execution past the MB instruction until actual completion of all pre-MB operations has been confirmed to the processor. According to the present invention, however, program execution may proceed past the MB instruction once the commit signals for the operations have been received by the processor.
As described herein, the multiprocessing system may comprise (i) a symmetric multiprocessing (SMP) node wherein the processor and shared memory entities are interconnected by a local switch or (ii) an SMP system wherein a plurality of nodes are interconnected by a hierarchical switch. Each processor preferably has a private cache for storing data, and changes to the data as a result of the memory reference operations are reflected among the entities via the transmission of probe commands in accordance with a conventional cache coherence protocol. Notably, associated with the system is an ordering point. Specifically, in the SMP node, the ordering point is associated with the local switch. In the SMP system, the ordering point is associated with the hierarchical switch.

As an example of an SMP node with an ownership-based, write-invalidate cache coherence protocol, a requesting processor issues a memory reference operation to the system requesting particular data. Upon determining which entity is the owner of the data and which entities have valid copies of the data, the ordering point totally orders the memory reference operation with respect to the other issued references using, e.g., a conventional arbitration or prioritization policy. Thereafter, the ordering point generates probes to invalidate any copies of the data at appropriate processors and/or to request forwarding of the data from the owner processor to the requesting processor, as required by the memory operation. Significantly, the ordering point also generates the commit-signal at this time for transmission to the requesting processor. Probes and commit signals are generated atomically for transmission to the appropriate processors. The net result is that all memory operations appear totally ordered.

Ordering of the requested memory reference operation with respect to memory references issued by other processors of the system constitutes a commit-event for the requested operation. For the SMP node embodiment, the commit-event is the point at which the memory reference operation is ordered at the local switch, whereas for the SMP system the commit-event occurs when the memory reference operation is ordered at the hierarchical switch. In accordance with the present invention, the commit-signal is preferably transmitted to the requesting processor upon the occurrence of, or after, such a commit-event.
Specifically, issuance of the memory reference operation by the requesting processor preferably increments a counter, and receipt by the processor of the commit-signal responsive to the issued memory reference decrements the counter. When program execution reaches the MB instruction and the counter realizes a value of zero, the previously issued operations are considered committed and execution of the program may proceed past the MB. In accordance with the present invention, where at least two processors share a cache, each processor has its own counter and each selectively inherits the other's commit-signal before it may proceed past the MB instruction.
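
To make the counter mechanism concrete, here is a minimal hypothetical sketch in C of the per-processor bookkeeping described above; the structure and function names are illustrative assumptions, and an actual implementation would of course be processor and switch hardware rather than software.

    #include <stdbool.h>

    /* Hypothetical per-processor state for commit-signal counting. */
    struct processor_state {
        int outstanding;   /* incremented on issue, decremented on commit-signal */
    };

    /* Called when this processor issues a memory reference operation. */
    void on_issue(struct processor_state *p)
    {
        p->outstanding++;
    }

    /* Called when a commit-signal for one of this processor's own
     * (or inherited) operations arrives from the ordering point. */
    void on_commit_signal(struct processor_state *p)
    {
        p->outstanding--;
    }

    /* An MB may be passed only once every previously issued (or
     * inherited) operation has committed, i.e., the counter is zero. */
    bool may_pass_mb(const struct processor_state *p)
    {
        return p->outstanding == 0;
    }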
As an example using an ownership-based, write-invalidate cache coherence protocol, a first processor accesses the shared cache for a data item. If the data is not present in the cache, the first processor issues a memory reference operation to the system requesting the data and increments its counter. The issued memory reference operation is ordered by the ordering point, a request for the data is forwarded (if necessary) to the appropriate entity and that entity returns the data to the shared cache. Substantially simultaneously, a commit-signal is generated by the ordering point in response to the issued operation and returned to the first processor. Meanwhile, a second processor accesses the shared cache for the data item and "hits" on the cache since the data has returned. If the commit-signal for the first processor's outstanding operation has yet to return, the second processor increments its counter to ensure that the inter-reference ordering imposed by the MB instruction is maintained. Once the commit-signal is received, each processor decrements its counter signifying that the operation has committed.
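
Building on the counter sketch above, the following hypothetical fragment illustrates the shared-cache scenario just described: a second processor that hits in the shared cache on a line whose data has returned but whose commit-signal is still outstanding inherits that commit-signal by incrementing its own counter, and every waiting processor decrements its counter when the commit-signal arrives. The names are again illustrative assumptions.

    #include <stdbool.h>

    /* Hypothetical bookkeeping for one shared-cache line with a miss
     * outstanding; processor_state is the structure sketched earlier. */
    struct line_state {
        bool data_present;                   /* fill has returned to the shared cache */
        bool commit_pending;                 /* commit-signal not yet received        */
        struct processor_state *waiters[2];  /* issuer plus any inheriting processor  */
        int nwaiters;
    };

    /* A second processor hits in the shared cache while the issuing
     * processor's commit-signal is outstanding: inherit the signal. */
    void on_shared_cache_hit(struct line_state *l, struct processor_state *p)
    {
        if (l->data_present && l->commit_pending) {
            p->outstanding++;
            l->waiters[l->nwaiters++] = p;
        }
    }

    /* The commit-signal arrives from the ordering point: every waiting
     * processor decrements its counter, signifying the operation committed. */
    void on_line_commit_signal(struct line_state *l)
    {
        for (int i = 0; i < l->nwaiters; i++)
            l->waiters[i]->outstanding--;
        l->nwaiters = 0;
        l->commit_pending = false;
    }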
`
BRIEF DESCRIPTION OF THE DRAWINGS

The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numbers indicate identical or functionally similar elements:

FIG. 1 is a schematic block diagram of a first multiprocessor node embodiment comprising a plurality of processor modules coupled to an input/output processor and a memory by a local switch;

FIG. 2 is a schematic diagram of a processor module comprising at least two processors coupled to a shared cache;

FIG. 3 is a schematic block diagram of the local switch comprising a plurality of ports coupled to the respective processor modules of FIG. 1;

FIG. 4 is a schematic diagram of an embodiment of a commit-signal implemented as a special probe-type, commit-signal packet;

FIG. 5 is a schematic diagram of an illustrative embodiment of a processor module including a miss address table (MAT) used for implementing a novel commit-signal inheriting technique in accordance with the invention;

FIG. 6 is a schematic diagram of the MAT in accordance with the invention;

FIG. 7 is a schematic block diagram of a second multiprocessing system embodiment comprising a plurality of multiprocessor nodes interconnected by a hierarchical switch which may be advantageously used with the present invention;

FIG. 8 is a schematic block diagram of the hierarchical switch;

FIG. 9 is a schematic block diagram of an augmented multiprocessor node comprising a plurality of processors interconnected with a shared memory, an IOP and a global port interface via a local switch;

FIG. 10 illustrates an embodiment of a LoopComSig table;

FIG. 11 is a schematic diagram of an incoming command packet modified with a multicast-vector; and

FIG. 12 is a schematic block diagram illustrating a total ordering property of an illustrative embodiment of the hierarchical switch.
`
`10
`
`15
`
`25
`
`35
`
`45
`
`55
`
DETAILED DESCRIPTION OF THE ILLUSTRATIVE EMBODIMENTS

As described herein, a hierarchical symmetric multiprocessing (SMP) system includes a number of SMP nodes interconnected via a high performance switch. Each SMP node thus functions as a building block in the SMP system. Below, the structure and operation of an SMP node embodiment that may be advantageously used with the present invention is first described, followed by a description of the SMP system embodiment.
SMP Node

FIG. 1 is a schematic block diagram of a first multiprocessing system embodiment, such as a small SMP node 100, comprising a plurality of processor modules (P) 202-208 coupled to an input/output (I/O) processor 130 and a memory 150 by a local switch 300. The memory 150 is preferably organized as a single address space that is shared by the processors and apportioned into a number of blocks, each of which may include, e.g., 64 bytes of data. The I/O processor, or IOP 130, controls the transfer of data between external devices (not shown) and the system via an I/O bus 140. Data is transferred between the components of the SMP node in the form of packets. As used herein, the term "system" refers to all components of the SMP node excluding the processors and IOP. In an embodiment of the invention, the I/O bus may operate according to the conventional Peripheral Computer Interconnect (PCI) protocol.
FIG. 2 is a schematic diagram of a processor module 200 comprising at least two processors P1 and P2 coupled to a shared cache 250. Each processor is a modern processor comprising a central processing unit (CPU) that preferably incorporates a traditional reduced instruction set computer (RISC) load/store architecture. In the illustrative embodiment described herein, the CPUs are Alpha® 21264 processor chips manufactured by Digital Equipment Corporation®, although other types of processor chips may be advantageously used. For example, in an alternate embodiment, each processor may comprise a multi-threaded modern processor capable of executing multiple threads of instructions.

The load/store instructions executed by the processors are issued to the system as memory reference, e.g., read and write, operations. Each operation may comprise a series of commands (or command packets) that are exchanged between the processors and the system. As described further herein, characteristics of modern processors include the ability to issue memory reference operations out-of-order, to have more than one memory reference outstanding at a time and to accommodate completion of the memory reference operations in arbitrary order.
In addition, each processor module 200 employs a cache 250 that is shared among its processors (or threads) and that stores data determined likely to be accessed in the future. The shared cache 250 is preferably organized as a write-back cache apportioned into, e.g., 64-byte cache lines accessible by the processors; it should be noted, however, that other cache organizations, such as write-through caches, may be used in connection with the principles of the invention. It should be further noted that memory reference operations issued by the processors are preferably directed to a 64-byte cache line granularity. Since the processors may update data from their shared cache without updating shared memory 150, a cache coherence protocol is utilized to maintain consistency among the caches.
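
Purely to illustrate the 64-byte line granularity mentioned above, this hypothetical helper shows how a physical address would typically be split into a line-aligned address and a byte offset; the constant and function names are assumptions for the sketch.

    #include <stdint.h>

    #define CACHE_LINE_BYTES 64u   /* line size of the illustrative embodiment */

    /* Align an address down to the start of its 64-byte cache line;
     * memory reference operations are directed at this granularity. */
    static inline uint64_t line_address(uint64_t paddr)
    {
        return paddr & ~(uint64_t)(CACHE_LINE_BYTES - 1);
    }

    /* Byte offset of an access within its cache line. */
    static inline unsigned line_offset(uint64_t paddr)
    {
        return (unsigned)(paddr & (CACHE_LINE_BYTES - 1));
    }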
The cache coherence protocol of the illustrative embodiment is preferably a conventional write-invalidate, ownership-based protocol. "Write-invalidate" implies that when a processor modifies a cache line, it invalidates stale copies in other processors' caches rather than updating them with the new value. The protocol is termed an "ownership protocol" because there is always an identifiable owner for a cache line, whether it is shared memory, one of the processors or the IOP entities of the system. The owner of the cache line is responsible for supplying the up-to-date value of the cache line when requested. A processor/IOP may own a cache line in one of two states: "exclusively" or "shared". If a processor has exclusive ownership of a cache line, it may update it without informing the system. Otherwise, it must inform the system and potentially invalidate copies in the other caches.
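
To make the ownership rules concrete, here is a small hypothetical sketch of the per-line check implied by the paragraph above; the enums and structure are illustrative assumptions rather than the actual data structures of the protocol.

    #include <stdbool.h>

    /* An identifiable owner always exists: memory, a processor, or the IOP. */
    enum owner_kind { OWNER_MEMORY, OWNER_PROCESSOR, OWNER_IOP };

    /* A processor/IOP may own a line "exclusively" or "shared". */
    enum own_state { OWN_EXCLUSIVE, OWN_SHARED };

    struct line_ownership {
        enum owner_kind owner;
        enum own_state  state;
        int             owner_id;   /* which processor/IOP, when applicable */
    };

    /* A write may proceed without informing the system only under
     * exclusive ownership; otherwise the system must be informed so
     * that stale copies in other caches can be invalidated. */
    bool may_write_silently(const struct line_ownership *l, int requester_id)
    {
        return l->owner == OWNER_PROCESSOR
            && l->owner_id == requester_id
            && l->state == OWN_EXCLUSIVE;
    }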
A shared data structure 160 is provided for capturing and maintaining status information corresponding to the states of data used by the system. In the illustrative embodiment, the shared data structure is configured as a conventional duplicate tag store (DTAG) 160 that cooperates with the individual caches of the system to define the coherence protocol states of the data in the system. The protocol states of the DTAG 160 are administered by a coherence controller 180, which may be implemented as a plurality of hardware registers and combinational logic configured to produce a sequential logic circuit, such as a state machine. It should be noted, however, that other configurations of the controller and shared data structure may be advantageously used herein.
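
As an illustration of the kind of status the DTAG makes available to the coherence controller, the following hypothetical entry layout mirrors the description above; the field names and widths are assumptions made only for the sketch.

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical duplicate tag store (DTAG) entry: enough duplicated
     * tag/state information for the coherence controller to determine
     * the owner of a line and which entities hold copies of it. */
    struct dtag_entry {
        uint64_t tag;          /* cache-line address tag                  */
        bool     valid;        /* entry tracks a currently cached line    */
        bool     dirty;        /* an owner holds a modified copy          */
        uint8_t  owner_id;     /* owning processor/IOP, if not memory     */
        uint16_t sharer_mask;  /* bitmask of processors holding copies    */
    };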
The DTAG 160, coherence controller 180, IOP 130 and shared memory 150 are interconnected by a logical bus referred to as an Arb bus 170. Memory reference operations issued by the processors are routed via the local switch 300 to the Arb bus 170. The order in which the actual memory reference commands appear on the Arb bus is the order in which processors perceive the results of those commands. In accordance with this embodiment of the invention, though, the Arb bus 170 and the coherence controller 180 cooperate to provide an ordering point, as described herein.
The commands described herein are defined by the Alpha® memory system interface and may be classified into three types: requests, probes, and responses. Requests are commands that are issued by a processor when, as a result of executing a load or store instruction, it must obtain a copy of data. Requests are also used to gain exclusive ownership to a data item (cache line) from the system. Requests include Read (Rd) commands, Read/Modify (RdMod) commands, Change-to-Dirty (CTD) commands, Victim commands, and Evict commands, the latter of which specify removal of a cache line from a respective cache.

Probes are commands issued by the system to one or more processors requesting data and/or cache tag status updates. Probes include Forwarded Read (Frd) commands, Forwarded Read Modify (FRdMod) commands and Invalidate (Inval) commands. When a processor P issues a request to the system, the system may issue one or more probes (via probe packets) to other processors. For example, if P requests a copy of a cache line (a Rd request), the system sends a probe to the owner processor (if any). If P requests exclusive ownership of a cache line (a CTD request), the system sends Inval probes to one or more processors having copies of the cache line. If P requests both a copy of the cache line as well as exclusive ownership of the cache line (a RdMod request) the system sends a FRd probe to a processor currently storing a dirty copy of a cache line of data. In response to the Frd probe, the dirty copy of the cache line is returned to the system. A FRdMod probe is also issued by the system to a processor storing a dirty copy of a cache line. In response to the FRdMod probe, the dirty cache line is returned to the system and the dirty copy stored in the cache is invalidated. An Inval probe may be issued by the system to a processor storing a copy of the cache line in its cache when the cache line is to be updated by another processor.
`
`10
`
`15
`
`25
`
`35
`
`45
`
`55
`
`65
`
`8
Responses are commands from the system to processors/IOPs which carry the data requested by the processor or an acknowledgment corresponding to a request. For Rd and RdMod requests, the response is a Fill and FillMod response, respectively, each of which carries the requested data. For a CTD request, the response is a CTD-Success (Ack) or CTD-Failure (Nack) response, indicating success or failure of the CTD, whereas for a Victim request, the response is a Victim-Release response.
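
The three command classes and the specific commands named above can be summarized in a small hypothetical taxonomy; this sketches only the naming, not the packet formats, which are not given here.

    /* Requests: issued by a processor or IOP to the system. */
    enum request_cmd  { REQ_RD, REQ_RDMOD, REQ_CTD, REQ_VICTIM, REQ_EVICT };

    /* Probes: issued by the system to one or more processors. */
    enum probe_cmd    { PRB_FRD, PRB_FRDMOD, PRB_INVAL };

    /* Responses: returned by the system to processors/IOPs. */
    enum response_cmd { RSP_FILL, RSP_FILLMOD, RSP_CTD_SUCCESS,
                        RSP_CTD_FAILURE, RSP_VICTIM_RELEASE };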
`
FIG. 3 is a schematic block diagram of the local switch 300 comprising a plurality of ports 302-310, each of which is coupled to a respective processor module (P) 202-208 and IOP 130 via a full-duplex, bi-directional clock forwarded data link. Each port includes a respective input queue 312-320 for receiving, e.g., a memory reference request issued by its processor module and a respective output queue 322-330 for receiving, e.g., a memory reference probe issued by system control logic associated with the switch. An arbiter 340 arbitrates among the input queues to grant access to the Arb bus 170 where the requests are ordered into a memory reference request stream. In the illustrative embodiment, the arbiter selects the requests stored in the input queues for access to the bus in accordance with an arbitration policy, such as a conventional round-robin algorithm.
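
As an illustration of the arbitration policy mentioned above, the sketch below models a conventional round-robin selection over the five input queues in C; it is a hypothetical software model of behavior that is attributed to the switch hardware.

    #include <stdbool.h>

    #define NUM_PORTS 5   /* four processor-module ports plus the IOP port */

    /* Conventional round-robin arbitration: starting one past the last
     * winner, grant the Arb bus to the first port whose input queue holds
     * a request; pending[i] models "input queue i is non-empty". */
    int arbitrate(const bool pending[NUM_PORTS], int last_granted)
    {
        for (int i = 1; i <= NUM_PORTS; i++) {
            int port = (last_granted + i) % NUM_PORTS;
            if (pending[port])
                return port;
        }
        return -1;   /* no request pending this cycle */
    }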
The following example illustrates the typical operation of the multiprocessing system including switch 300. A Rd request for data item x is received at the switch 300 from P1 of, e.g., processor module 202, and loaded into input queue 312. The arbiter 340 selects the request in accordance with the arbitration algorithm. Upon gaining access to the Arb bus 170, the selected request is routed to the ordering point 350 wherein the states of the corresponding cache lines are interrogated in the DTAG 160. Specifically, the coherence controller 180 examines the DTAG to determine which entity of the system "owns" the cache line and which entities have copies of the line. If a processor of module 206 is the owner of the cache line x and a processor of module 208 has a copy, the coherence controller generates the necessary probes (e.g., a Fill x and Inval x) and forwards them to the output queues 326 and 328 for transmission to the processors.
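
Tying the pieces together, the following hypothetical sketch (reusing the DTAG entry and command names sketched earlier, with send_probe and send_commit_signal as assumed hooks into the output queues) outlines how the ordering point might handle one ordered request: interrogate the DTAG, generate the needed probes, and generate the commit-signal for the requester at the same time, so that probes and commit-signals are produced together and all memory operations appear totally ordered.

    /* Hypothetical hooks into the switch's output queues. */
    void send_probe(int dest_id, enum probe_cmd cmd, int requester_id);
    void send_commit_signal(int requester_id);

    /* Sketch of the ordering point's handling of one ordered request,
     * following the Rd-for-x example above. */
    void ordering_point_handle(const struct dtag_entry *e, int requester_id)
    {
        if (e->valid && e->dirty) {
            /* A processor owns a dirty copy: probe it to forward the data. */
            send_probe(e->owner_id, PRB_FRD, requester_id);
        }
        /* For an ownership-gaining request (e.g., RdMod or CTD), Inval
         * probes to the sharers in e->sharer_mask would be generated here. */

        /* The commit-signal is generated at the same time as the probes,
         * marking apparent (not actual) completion for the requester. */
        send_commit_signal(requester_id);
    }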
`
Because of operational latencies through the switch and data paths of the system, memory reference requests issued by P1 may complete out-of-order. In some cases, out-of-order completion may affect the consistency of data in the system, particularly for updates to a cache line. Memory consistency models provide formal specifications of how such updates become visible to the entities of the multiprocessor system. In the illustrative embodiment of the present invention, a weak ordering consistency model is described, although it will be apparent to those skilled in the art that other consistency models may be used.

In a weakly-ordered system, inter-reference ordering is typically imposed by a memory barrier (MB) instruction inserted between memory reference instructions of a program executed by a processor. The MB instruction separates and groups those instructions of a program that need ordering from the rest of the instructions. The semantics of weak ordering mandate that all pre-MB memory reference operations are logically ordered before all post-MB references.
`