(12) United States Patent                    (10) Patent No.:     US 7,603,490 B2
     Biran et al.                            (45) Date of Patent:      Oct. 13, 2009

(54) BARRIER AND INTERRUPT MECHANISM FOR HIGH LATENCY AND OUT OF ORDER DMA DEVICE

(75) Inventors: Giora Biran, Zichron-Yaakov (IL); Luis E. De la Torre, Austin, TX (US);
                Bernard C. Drerup, Austin, TX (US); Jyoti Gupta, Austin, TX (US);
                Richard Nicholas, Pflugerville, TX (US)

(73) Assignee:  International Business Machines Corporation, Armonk, NY (US)

(*) Notice:     Subject to any disclaimer, the term of this patent is extended or adjusted
                under 35 U.S.C. 154(b) by 256 days.

(21) Appl. No.: 11/621,776

(22) Filed:     Jan. 10, 2007

(65)            Prior Publication Data
     US 2008/0168191 A1        Jul. 10, 2008

(51) Int. Cl.
     G06F 13/28     (2006.01)
     G06F 13/32     (2006.01)
(52) U.S. Cl. ........................... 710/23; 710/24
(58) Field of Classification Search ... 710/23, 710/24
     See application file for complete search history.

(56)            References Cited

                U.S. PATENT DOCUMENTS

     6,848,029     B2    1/2005   Coldewey
     6,981,074     B2   12/2005   Oner et al.
     7,076,578     B2    7/2006   Poisner et al.
     7,218,566     B1    5/2007   Totolos, Jr. et al.
     2003/0172208        9/2003   Fidler
     2004/0034718        2/2004   Goldenberg et al.
     2004/0187122        9/2004   Gosalia et al. ............ 718/100
     2005/0027902        2/2005   King et al. ................ 710/24
     2005/0108446        5/2005   Inogai
     2006/0206635        9/2006   Alexander et al.
     2007/0073915  A1    3/2007   Go et al.

                     (Continued)

                FOREIGN PATENT DOCUMENTS

     CN    1794214 A    6/2006

                OTHER PUBLICATIONS

     U.S. Appl. No. 11/532,562, filed Sep. 18, 2006, Biran et al.

                     (Continued)

     Primary Examiner—Henry W. H. Tsai
     Assistant Examiner—Hyun Nam
     (74) Attorney, Agent, or Firm—Stephen R. Tkacs; Stephen J. Walder, Jr.; Matthew B. Talpis

(57)                    ABSTRACT

A direct memory access (DMA) device includes a barrier and interrupt mechanism that allows interrupt and mailbox operations to occur in such a way that ensures correct operation, but still allows for high performance out-of-order data moves to occur whenever possible. Certain descriptors are defined to be "barrier descriptors." When the DMA device encounters a barrier descriptor, it ensures that all of the previous descriptors complete before the barrier descriptor completes. The DMA device further ensures that any interrupt generated by a barrier descriptor will not assert until the data move associated with the barrier descriptor completes. The DMA controller only permits interrupts to be generated by barrier descriptors. The barrier descriptor concept also allows software to embed mailbox completion messages into the scatter/gather linked list of descriptors.

                        20 Claims, 8 Drawing Sheets
`
[Representative drawing (FIG. 4): a scatter/gather list of descriptors 401-405 is processed by the DMA device, which converts the descriptors into DMA requests; the bus engine issues read and write transactions to the bus 440 and returns barrier clear signals.]
`
`
`
                U.S. PATENT DOCUMENTS (continued)

     2007/0074091 A1    3/2007   Go et al.
     2007/0079185 A1    4/2007   Totolos, Jr.
     2007/0162652 A1    7/2007   Go et al.
     2007/0204091 A1    8/2007   Hofmann et al.

                OTHER PUBLICATIONS (continued)

     U.S. Appl. No. 11/621,789, filed Jan. 10, 2007, Biran et al.

* cited by examiner
`
`
`
`
[FIG. 1 (drawing sheet): block diagram of data processing system 100, including external buses/devices.]
`
`
`
[FIG. 2 (drawing sheet, Sheet 2 of 8): exemplary data processing system 200, with processing unit 202, memory 208, bus 204, buses 238 and 240, audio adapter 216, SIO 236, keyboard and mouse adapter 220, modem 222, ROM 224, disk 226, CD-ROM 230, USB and other ports 232, PCI/PCIe devices 234, and network adapter 212.]
[FIG. 3 (drawing sheet): south bridge 300 containing a DMA device with DMA engine (DE) 312 and read/write bus engine (BE) 314, a local memory controller with local memory, and bus unit devices, connected to a processing unit and system memory.]
`
`
`
[FIG. 4 (drawing sheet, Sheet 3 of 8): a scatter/gather list of descriptors 401-405 is processed by the DMA device, which converts the descriptors into DMA requests; the bus engine returns barrier clear signals.]
`
`
`
[FIGS. 5A and 5B (drawing sheet, Sheet 4 of 8): example descriptor fields and DMA request attributes; the field labels are not legible in this copy.]
`
`
`
[FIG. 6 (drawing sheet, Sheet 5 of 8): bus engine queue structure with read queue 602, write queue 606, registers 610, interface to DE 615 (DE request and attributes, DE acknowledge), and bus interface unit 630.]
`
`
`
[FIG. 7 (drawing sheet, Sheet 6 of 8): DMA engine descriptor processing; when a descriptor is available (704), it is converted into one or more DMA requests, and if it is a barrier descriptor, the barrier attribute is set for the last request of the barrier descriptor (708).]
[FIG. 8 (drawing sheet, Sheet 6 of 8): DMA engine request issue; when a DMA channel is ready to issue a request, the flow checks whether the barrier attribute is set and whether a barrier is pending for the channel (808), waits for a barrier clear from the BE if so, toggles the barrier tag (810), and issues the DMA request to the BE.]
`
`
`
[FIG. 9A (drawing sheet, Sheet 7 of 8): bus engine barrier enforcement while barrier tag = 0; the flow sets barrier flag = 0 (902), receives DMA requests from the queue, places a request with barrier tag = 0 and barrier attribute = 1 in the holding register until all barrier tag = 0, barrier attribute = 0 requests have been sent, and otherwise issues the DMA transaction to the bus; once the held request is issued, it sends a barrier clear signal to the DE and, if the interrupt bit is set, sends an interrupt to software, then continues to FIG. 9B.]
`
`
`
[FIG. 9B (drawing sheet, Sheet 8 of 8): bus engine barrier enforcement while barrier tag = 1; the flow sets barrier flag = 1 (BE "on one"), receives DMA requests from the queue, holds a request with barrier tag = 1 and barrier attribute = 1 until all barrier tag = 1, barrier attribute = 0 requests have been sent, and otherwise issues the DMA transaction to the bus; once the held request is issued, it sends a barrier clear signal to the DE and, if the interrupt bit is set, sends an interrupt to software, then sets barrier flag = 0 (BE "on zero") and returns to FIG. 9A.]
`
`
`
BARRIER AND INTERRUPT MECHANISM FOR HIGH LATENCY AND OUT OF ORDER DMA DEVICE
`
`BACKGROUND
`
1. Technical Field

The present application relates generally to an improved data processing system and method. More specifically, the present application is directed to a direct memory access controller with a barrier and interrupt mechanism for a high latency and out of order direct memory access device.

2. Description of Related Art

Many system-on-a-chip (SOC) designs contain a device called a direct memory access (DMA) controller. The purpose of DMA is to efficiently move blocks of data from one location in memory to another. DMA controllers are usually used to move data between system memory and an input/output (I/O) device, but are also used to move data between one region in system memory and another. A DMA controller is called "direct" because a processor is not involved in moving the data.
Without a DMA controller, data blocks may be moved by having a processor copy data piece-by-piece from one memory space to another under software control. This usually is not preferable for large blocks of data. When a processor copies large blocks of data piece-by-piece, it is slow because the processor does not have large memory buffers and must move data in small, inefficient sizes, such as 32 bits at a time. Also, while the processor is doing the copy, it is not free to do other work. Therefore, the processor is tied up until the move is completed. It is more efficient to offload these data block moves to a DMA controller, which can do them much faster and in parallel with other work.
DMA controllers usually have multiple "channels." As used herein, a "channel" is an independent stream of data to be moved by the DMA controller. Thus, DMA controllers may be programmed to perform several block moves on different channels simultaneously, allowing the DMA device to transfer data to or from several I/O devices at the same time.
Another feature that is typical of DMA controllers is a scatter/gather operation. A scatter/gather operation is one in which the DMA controller does not need to be programmed by the processor for each block of data to be moved from some source to some destination. Rather, the processor sets up a descriptor table or descriptor linked list in system memory. A descriptor table or linked list is a set of descriptors. Each descriptor describes a data block move, including source address, destination address, and number of bytes to transfer. Non-scatter/gather block moves, which are programmed via the DMA registers directly, are referred to as "single programming DMA block moves."
A linked list architecture of a DMA controller is more flexible and dynamic than the table architecture. In the linked list architecture, the processor refers one of the DMA channels to the first descriptor in the chain, and each descriptor in the linked list contains a pointer to the next descriptor in memory. The descriptors may be anywhere in system memory, and the processor may add onto the list dynamically as the transfers occur. The DMA controller automatically traverses the table or list and executes the data block moves described by each descriptor until the end of the table or list is reached.
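By way of illustration only, the following C sketch shows one plausible in-memory layout for such a descriptor linked list; the field names, widths, and flag bits are assumptions made for this example rather than details taken from the patent.

    #include <stdint.h>

    /* Hypothetical layout of one scatter/gather descriptor (illustrative only). */
    struct dma_descriptor {
        uint64_t source_address;        /* where the block is read from            */
        uint64_t destination_address;   /* where the block is written to           */
        uint32_t byte_count;            /* number of bytes to transfer             */
        uint32_t control_flags;         /* e.g., barrier / interrupt control bits  */
        uint64_t next_descriptor;       /* address of the next descriptor, 0 = end */
    };

    /* Software builds the chain in system memory and points a DMA channel at the
     * head; the controller then walks the list without further processor help. */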
Modern DMA devices may be connected to busses that allow read data to be returned out of order. That is, the DMA controller may issue several read transactions to the bus that are all part of the same or different data block moves, and the data may be returned by the target devices in a different order than the order in which the reads were issued. Typically, each read transaction is assigned a "tag" number by the initiator so that when read data comes back from the bus, the initiator will know, based on the tag, to which transaction the data belongs. The queued transactions can be completed in any order. This allows the DMA device to achieve the best performance by queuing many transactions to the bus at once, including queuing different transactions to different devices. The read transactions can complete in any order and their associated writes started immediately when the read data arrives. Allowing the reads and their associated writes to complete in any order achieves the best performance possible on a given bus, but can cause certain problems.
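The tagging scheme described above can be pictured with a small table of outstanding reads indexed by tag. This is only a sketch of the general idea, not the patent's implementation; the names and queue depth are invented.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_TAGS 16                  /* illustrative number of outstanding reads */

    /* One outstanding read transaction, identified by the tag it was issued with. */
    struct outstanding_read {
        bool     in_use;                 /* tag currently assigned to a read on the bus */
        uint64_t write_address;          /* where the returned data should be written   */
        uint32_t byte_count;
    };

    static struct outstanding_read pending[NUM_TAGS];

    /* Called when read data returns from the bus, possibly out of order.  The tag
     * identifies the transaction, so the associated write can start immediately. */
    void on_read_data_return(unsigned tag, const void *data, uint32_t length)
    {
        struct outstanding_read *r = &pending[tag];
        /* ... start the write of 'data' (length bytes) to r->write_address ... */
        (void)data;
        (void)length;
        r->in_use = false;               /* the tag can now be reused for a new read */
    }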
When system software sets up a large block of memory to be moved between an I/O device and memory or from one region in memory to another, the software will want to know when that block of data has been completely moved so that it can act on the data. Because the processor or some other device may act on the data when the transfer is complete, it is imperative that the interrupt not be generated until all of the data associated with the move has been transferred; otherwise, the processor may try to act on data that is not yet transferred and will, thus, read incorrect data. With out-of-order execution, a DMA device cannot simply generate an interrupt when the last transaction in a block completes.
Some systems work by having "completion codes" moved to "mailboxes" when a series of data moves have been completed. A mailbox is a messaging device that acts as a first-in-first-out (FIFO) for messages. When the DMA controller delivers messages to the mailbox by writing to the mailbox address, the DMA controller may deliver messages to the processor in order. Messages are typically small, on the order of eight or sixteen bytes. When software sets up a series of block moves in a scatter/gather list, the software can input the completion messages in the descriptor linked list so that the DMA device may move both the data blocks and the completion code messages via the same list of scatter/gather descriptors.
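To make the idea concrete, here is a rough sketch of how software might append such a completion-message descriptor to the end of a scatter/gather list; the descriptor layout, mailbox address handling, and message contents are all invented for this example.

    #include <stdint.h>

    /* Illustrative descriptor layout (same assumptions as the earlier sketch). */
    struct dma_descriptor {
        uint64_t source_address;
        uint64_t destination_address;
        uint32_t byte_count;
        uint32_t control_flags;
        uint64_t next_descriptor;        /* 0 marks the end of the list */
    };

    /* Completion code to be delivered to the mailbox after the block moves. */
    static const uint8_t completion_msg[16] = "BLOCK DONE";

    void append_completion_descriptor(struct dma_descriptor *last_data_desc,
                                      struct dma_descriptor *mbox_desc,
                                      uint64_t mailbox_address)
    {
        mbox_desc->source_address      = (uint64_t)(uintptr_t)completion_msg;
        mbox_desc->destination_address = mailbox_address;   /* FIFO-style mailbox */
        mbox_desc->byte_count          = sizeof(completion_msg);
        mbox_desc->control_flags       = 0;  /* ordering concerns are discussed below */
        mbox_desc->next_descriptor     = 0;  /* end of the scatter/gather list */

        last_data_desc->next_descriptor = (uint64_t)(uintptr_t)mbox_desc;
    }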
However, in order for software to work correctly, when the DMA controller writes a completion message to the mailbox, it is imperative that all descriptors prior to the descriptor writing to the mailbox have completed, because the mailbox, like an interrupt, tells the processor that a certain amount of data has been moved. Because all transactions can complete out of order for performance, the DMA device can write a completion message to the mailbox prior to some of the other transactions from previous descriptors having completed unless there is a mechanism to prevent it.
`
`SUMMARY
`
In one illustrative embodiment, a method is provided in a direct memory access engine in a direct memory access device for performing a direct memory access block move. The method comprises receiving a direct memory access block move descriptor. The direct memory access block move descriptor indicates a source and a target. The direct memory access block move descriptor is identified as a barrier descriptor. The method further comprises converting the direct memory access block move descriptor into one or more direct memory access requests for the direct memory access block move descriptor, identifying a last direct memory access request within the one or more direct memory access requests, and setting a barrier attribute associated with the last direct memory access request. For each given direct memory access request in the one or more direct memory access requests, the method determines whether the barrier attribute is set for the given direct memory access request, determines whether a barrier is pending for a channel associated with the given direct memory access request if the barrier attribute is set, and issues the given direct memory access request to a bus engine in the direct memory access device if a barrier is not pending for the channel associated with the given direct memory access request.
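A minimal sketch, in C, of the DMA-engine-side behavior just summarized; the helper functions, types, and the per-channel wait are assumptions made for this illustration, not the patent's implementation.

    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_REQS 8                /* illustrative upper bound of requests per descriptor */

    struct dma_descriptor;            /* opaque here: source, target, byte count, flags */

    struct dma_request {
        bool barrier_attribute;       /* set only on the last request of a barrier descriptor */
        /* source, target, length, channel, ... omitted for brevity */
    };

    /* Assumed helper routines, not defined by the patent: */
    extern size_t convert_descriptor(const struct dma_descriptor *d, struct dma_request *out);
    extern bool   is_barrier_descriptor(const struct dma_descriptor *d);
    extern bool   barrier_pending(int channel);
    extern void   wait_for_barrier_clear(int channel);
    extern void   issue_to_bus_engine(const struct dma_request *r, int channel);

    void process_descriptor(const struct dma_descriptor *desc, int channel)
    {
        struct dma_request reqs[MAX_REQS];
        size_t n = convert_descriptor(desc, reqs);      /* one or more requests */

        if (is_barrier_descriptor(desc))
            reqs[n - 1].barrier_attribute = true;       /* mark only the last request */

        for (size_t i = 0; i < n; i++) {
            /* A request carrying the barrier attribute is issued only once any
             * previously pending barrier for this channel has been cleared. */
            if (reqs[i].barrier_attribute && barrier_pending(channel))
                wait_for_barrier_clear(channel);
            issue_to_bus_engine(&reqs[i], channel);
        }
    }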
In another illustrative embodiment, a method is provided in a bus engine in a direct memory access device for performing a direct memory access block move. The method comprises receiving a direct memory access request from a direct memory access queue. A direct memory access engine in the direct memory access device converts a direct memory access block move descriptor into one or more direct memory access requests, sets a barrier attribute for a last direct memory access request within the one or more direct memory access requests to mark a barrier, and stores the one or more direct memory access requests in the direct memory access queue. The method further comprises determining whether the direct memory access request has a barrier attribute set, determining whether all direct memory access requests before the barrier have completed if the direct memory access request has a barrier attribute set, and holding the direct memory access request from completing if all direct memory access requests before the barrier have not completed.
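Correspondingly, the bus-engine-side check might look roughly like the following sketch; again, the queue interface and helper names are assumed for illustration.

    #include <stdbool.h>

    struct dma_request {
        bool barrier_attribute;       /* last request of a barrier descriptor */
        bool interrupt_bit;           /* interrupt requested when the barrier completes */
        /* address, length, read/write, channel, ... omitted for brevity */
    };

    /* Assumed helper routines, not defined by the patent: */
    extern struct dma_request *next_request_from_queue(void);
    extern bool all_prior_requests_complete(void);   /* everything ahead of the barrier done? */
    extern void issue_transaction_to_bus(struct dma_request *r);
    extern void hold_request(struct dma_request *r); /* e.g., park it in a holding register */
    extern void send_barrier_clear_to_de(void);
    extern void send_interrupt_to_software(void);

    void bus_engine_step(void)
    {
        struct dma_request *req = next_request_from_queue();

        if (req->barrier_attribute && !all_prior_requests_complete()) {
            /* The barrier request must not complete until every earlier request
             * has completed, so it is held back rather than issued now. */
            hold_request(req);
            return;
        }

        issue_transaction_to_bus(req);

        if (req->barrier_attribute) {
            send_barrier_clear_to_de();              /* barrier satisfied; DE may proceed */
            if (req->interrupt_bit)
                send_interrupt_to_software();        /* interrupt asserts only at the barrier */
        }
    }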
In yet another illustrative embodiment, a direct memory access device comprises a direct memory access engine and a bus engine. The direct memory access engine is configured to receive a direct memory access block move descriptor. The direct memory access block move descriptor indicates a source and a target. The direct memory access block move descriptor is identified as a barrier descriptor. The direct memory access engine is further configured to convert the direct memory access block move descriptor into one or more direct memory access requests for the direct memory access block move descriptor, identify a last direct memory access request within the one or more direct memory access requests, set a barrier attribute for the last direct memory access request to mark a barrier, and issue the one or more direct memory access requests to a direct memory access queue. The bus engine is configured to receive a direct memory access request from the direct memory access queue, determine whether the received direct memory access request has a barrier attribute set, determine whether all direct memory access requests before the barrier have completed if the received direct memory access request has a barrier attribute set, and hold the received direct memory access request from completing if all direct memory access requests before the barrier have not completed.

These and other features and advantages of the present invention will be described in, or will become apparent to those of ordinary skill in the art in view of, the following detailed description of the exemplary embodiments of the present invention.
`
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
The invention, as well as a preferred mode of use and further objectives and advantages thereof, will best be understood by reference to the following detailed description of illustrative embodiments when read in conjunction with the accompanying drawings, wherein:

FIG. 1 is an exemplary block diagram of a data processing system in which aspects of the illustrative embodiments may be implemented;
`
FIG. 2 is a block diagram of an exemplary data processing system in which aspects of the illustrative embodiments may be implemented;

FIG. 3 is a block diagram illustrating a south bridge in accordance with an illustrative embodiment;

FIG. 4 illustrates overall operation of a direct memory access device with barrier descriptors in accordance with an illustrative embodiment;

FIG. 5A depicts an example descriptor in accordance with an illustrative embodiment;

FIG. 5B depicts an example of DMA request attributes in accordance with an illustrative embodiment;

FIG. 6 illustrates an overall bus engine queue structure in accordance with one illustrative embodiment;

FIG. 7 is a flowchart illustrating the operation of a direct memory access engine processing descriptors in accordance with an illustrative embodiment;

FIG. 8 is a flowchart illustrating the operation of a direct memory access engine issuing requests to a bus engine in accordance with an illustrative embodiment; and

FIGS. 9A and 9B are flowcharts illustrating the operation of a bus engine enforcing a barrier in accordance with an illustrative embodiment.
`
`DETAILED DESCRIPTION OF THE
`ILLUSTRATIVE EMBODIMENTS
`
With reference now to the figures and in particular with reference to FIGS. 1-2, exemplary diagrams of data processing environments are provided in which illustrative embodiments of the present invention may be implemented. It should be appreciated that FIGS. 1-2 are only exemplary and are not intended to assert or imply any limitation with regard to the environments in which aspects or embodiments of the present invention may be implemented. Many modifications to the depicted environments may be made without departing from the spirit and scope of the present invention.

FIG. 1 is an exemplary block diagram of a data processing system in which aspects of the illustrative embodiments may be implemented. The exemplary data processing system shown in FIG. 1 is an example of the Cell Broadband Engine (CBE) data processing system. While the CBE will be used in the description of the preferred embodiments of the present invention, the present invention is not limited to such, as will be readily apparent to those of ordinary skill in the art upon reading the following description.

As shown in FIG. 1, the CBE 100 includes a power processor element (PPE) 110 having a processor (PPU) 116 and its L1 and L2 caches 112 and 114, and multiple synergistic processor elements (SPEs) 120-134 that each has its own synergistic processor unit (SPU) 140-154, memory flow control 155-162, local memory or store (LS) 163-170, and bus interface unit (BIU unit) 180-194 which may be, for example, a combination direct memory access (DMA), memory management unit (MMU), and bus interface unit. A high bandwidth internal element interconnect bus (EIB) 196, a bus interface controller (BIC) 197, and a memory interface controller (MIC) 198 are also provided.
The local memory or local store (LS) 163-170 is a non-coherent addressable portion of a large memory map which, physically, may be provided as small memories coupled to the SPUs 140-154. The local stores 163-170 may be mapped to different address spaces. These address regions are continuous in a non-aliased configuration. A local store 163-170 is associated with its corresponding SPU 140-154 and SPE 120-134 by its address location, such as via the SPU Identification Register, described in greater detail hereafter. Any resource in the system has the ability to read-write from/to the local store 163-170 as long as the local store is not placed in a secure mode of operation, in which case only its associated SPU may access the local store 163-170 or a designated secured portion of the local store 163-170.
The CBE 100 may be a system-on-a-chip such that each of the elements depicted in FIG. 1 may be provided on a single microprocessor chip. Moreover, the CBE 100 is a heterogeneous processing environment in which each of the SPUs may receive different instructions from each of the other SPUs in the system. Moreover, the instruction set for the SPUs is different from that of the PPU, e.g., the PPU may execute Reduced Instruction Set Computer (RISC) based instructions while the SPUs execute vectorized instructions.

The SPEs 120-134 are coupled to each other and to the L2 cache 114 via the EIB 196. In addition, the SPEs 120-134 are coupled to MIC 198 and BIC 197 via the EIB 196. The MIC 198 provides a communication interface to shared memory 199. The BIC 197 provides a communication interface between the CBE 100 and other external buses and devices.

The PPE 110 is a dual threaded PPE 110. The combination of this dual threaded PPE 110 and the eight SPEs 120-134 makes the CBE 100 capable of handling 10 simultaneous threads and over 128 outstanding memory requests. The PPE 110 acts as a controller for the other eight SPEs 120-134 which handle most of the computational workload. The PPE 110 may be used to run conventional operating systems while the SPEs 120-134 perform vectorized floating point code execution, for example.
The SPEs 120-134 comprise a synergistic processing unit (SPU) 140-154, memory flow control units 155-162, local memory or store 163-170, and an interface unit 180-194. The local memory or store 163-170, in one exemplary embodiment, comprises a 256 KB instruction and data memory which is visible to the PPE 110 and can be addressed directly by software.

The PPE 110 may load the SPEs 120-134 with small programs or threads, chaining the SPEs together to handle each step in a complex operation. For example, a set-top box incorporating the CBE 100 may load programs for reading a DVD, video and audio decoding, and display, and the data would be passed off from SPE to SPE until it finally ended up on the output display. At 4 GHz, each SPE 120-134 gives a theoretical 32 GFLOPS of performance with the PPE 110 having a similar level of performance.
The memory flow control units (MFCs) 155-162 serve as an interface for an SPU to the rest of the system and other elements. The MFCs 155-162 provide the primary mechanism for data transfer, protection, and synchronization between main storage and the local storages 163-170. There is logically an MFC for each SPU in a processor. Some implementations can share resources of a single MFC between multiple SPUs. In such a case, all the facilities and commands defined for the MFC must appear independent to software for each SPU. The effects of sharing an MFC are limited to implementation-dependent facilities and commands.
With reference now to FIG. 2, a block diagram of an exemplary data processing system is shown in which aspects of the illustrative embodiments may be implemented. In the depicted example, data processing system 200 employs a hub architecture including south bridge and input/output (I/O) controller hub (SB/ICH) 204. Processing unit 202 is connected to system memory 208 via memory interface controller (MIC) 210. Processing unit 202 is connected to SB/ICH 204 through bus interface controller (BIC) 206.
`
In the depicted example, local area network (LAN) adapter 212 connects to SB/ICH 204. Audio adapter 216, keyboard and mouse adapter 220, modem 222, read only memory (ROM) 224, hard disk drive (HDD) 226, CD-ROM drive 230, universal serial bus (USB) ports and other communication ports 232, and PCI/PCIe devices 234 connect to SB/ICH 204 through bus 238 and bus 240. PCI/PCIe devices may include, for example, Ethernet adapters, add-in cards, and PC cards for notebook computers. PCI uses a card bus controller, while PCIe does not. ROM 224 may be, for example, a flash binary input/output system (BIOS).

HDD 226 and CD-ROM drive 230 connect to SB/ICH 204 through bus 240. HDD 226 and CD-ROM drive 230 may use, for example, an integrated drive electronics (IDE) or serial advanced technology attachment (SATA) interface. Super I/O (SIO) device 236 may be connected to SB/ICH 204.
An operating system runs on processing unit 202. The operating system coordinates and provides control of various components within the data processing system 200 in FIG. 2. As a client, the operating system may be a commercially available operating system. An object-oriented programming system, such as the Java programming system, may run in conjunction with the operating system and provides calls to the operating system from Java™ programs or applications executing on data processing system 200 (Java is a trademark of Sun Microsystems, Inc. in the United States, other countries, or both).

As a server, data processing system 200 may be, for example, an IBM® eServer™ pSeries® computer system, running the Advanced Interactive Executive (AIX®) operating system or the LINUX® operating system (eServer, pSeries and AIX are trademarks of International Business Machines Corporation in the United States, other countries, or both, while LINUX is a trademark of Linus Torvalds in the United States, other countries, or both). Data processing system 200 may include a plurality of processors in processing unit 202. Alternatively, a single processor system may be employed.
Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as HDD 226, and may be loaded into main memory 208 for execution by processing unit 202. The processes for illustrative embodiments of the present invention may be performed by processing unit 202 using computer usable program code, which may be located in a memory such as, for example, main memory 208, ROM 224, or in one or more peripheral devices 226 and 230, for example.

A bus system, such as bus 238 or bus 240 as shown in FIG. 2, may be comprised of one or more buses. Of course, the bus system may be implemented using any type of communication fabric or architecture that provides for a transfer of data between different components or devices attached to the fabric or architecture. A communication unit, such as modem 222 or network adapter 212 of FIG. 2, may include one or more devices used to transmit and receive data. A memory may be, for example, main memory 208, ROM 224, or a cache such as found in NB/MCH 202 in FIG. 2.
Those of ordinary skill in the art will appreciate that the hardware in FIGS. 1-2 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash memory, equivalent non-volatile memory, or optical disk drives and the like, may be used in addition to or in place of the hardware depicted in FIGS. 1-2. Also, the processes of the illustrative embodiments may be applied to a multiprocessor data processing system, other than the SMP system mentioned previously, without departing from the spirit and scope of the present invention.
Moreover, the data processing system 200 may take the form of any of a number of different data processing systems including client computing devices, server computing devices, a tablet computer, laptop computer, telephone or other communication device, a personal digital assistant (PDA), video game console, or the like. In some illustrative examples, data processing system 200 may be a portable computing device which is configured with flash memory to provide non-volatile memory for storing operating system files and/or user-generated data, for example. Essentially, data processing system 200 may be any known or later developed data processing system without architectural limitation.
South bridge 204 may include a direct memory access (DMA) controller. DMA controllers are usually used to move data between system memory and an input/output (I/O) device, but are also used to move data between one region in system memory and another. High latency devices present unique challenges if high bus utilization is desired. When talking to a high latency device, there must be enough simultaneous transactions outstanding so that the time it takes to receive data from the high latency device is less than or equal to the amount of time it takes to transfer the data from all of the other outstanding transactions queued ahead of it. If this criterion is met, then there seldom will be gaps or stalls on the bus where the DMA is waiting for data and does not have any other data available to transfer.
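As a back-of-the-envelope illustration of that criterion (all figures invented for this example), the minimum number of transactions that must be queued ahead of a high latency read can be estimated as follows.

    #include <stdio.h>

    /* Rough estimate of how many transactions must be queued ahead of a read to
     * a high latency device so that the bus never goes idle while waiting.
     * All figures are invented for illustration. */
    int main(void)
    {
        long device_latency_ns = 2000;   /* time for the device to return read data     */
        long transfer_time_ns  = 250;    /* bus time consumed by one queued transaction */

        /* The latency is hidden when:
         *     device_latency <= (transactions queued ahead) * transfer_time */
        long queued_ahead = (device_latency_ns + transfer_time_ns - 1) / transfer_time_ns;

        printf("Need at least %ld transactions queued ahead of each high-latency read\n",
               queued_ahead);
        return 0;
    }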
With trends towards further integration, particularly with systems-on-a-chip, many devices in FIG. 2 may be integrated within south bridge 204. For example, a single bus may be integrated within south bridge 204. Also, controllers and interfaces, such as USB controllers, PCI and PCIe controllers, memory controllers, and the like, may be integrated within south bridge 204 and attached to the internal bus. Furthermore, south bridge 204 may include a memory controller to which a memory module may be connected for local memory. Also note that processing unit 202 may include an internal bus, such as EIB 196 in FIG. 1, through which the DMA device may access system memory 208.
FIG. 3 is a block diagram illustrating a south bridge in accordance with an illustrative embodiment. Processing unit 302, for example, issues DMA commands to bus 320 in south bridge 300. DMA device 310 within south bridge 300 may then execute the DMA commands by performing read operations from source devices, such as bus unit device 322, and write operations to target devices, such as bus unit device 324. In an alternative example, a DMA command may request to move a block of data from bus unit device 322 to system memory 304, or according to yet another example, a DMA command may request to move a block of data from memory 304 to bus unit device 324. Bus unit device 322 and bus unit device 324 may be, for example, memory controllers, USB controllers, PCI controllers, storage device controllers, and the like, or combinations thereof.

The source devices and target devices may include low latency devices, such as memory, and high latency devices, such as hard disk drives. Note, however, that devices that are generally low latency, such as memory devices, may also be high latency in some instances depending upon their locations within the bus and bridge hierarchy. Many of the components of south bridge 300 are not shown for simplicity. A person of ordinary skill in the art will recognize that south bridge 300 will include