US006760833B1

Dowling

(10) Patent No.: US 6,760,833 B1
(45) Date of Patent: Jul. 6, 2004
`
`(54)
`
`SPLIT EMBEDDED DRAM PROCESSOR
`
`(75)
`
`Inventor: Eric M. Dowling, Richardson, TX (US)
`
`(73)
`
`Assignee: Micron Technology, Inc., Boise, ID
`(US)
`
`(*)
`
`Notice:
`
`Subject to any disclaimer, the term ofthis
`patent is extended or adjusted under 35
`US.C. 154(b) by 853 days.
`
`(21)
`
`(22)
`
`(60)
`
`(60)
`
`(51)
`(52)
`(58)
`
`(56)
`
`Appl. No.: 09/652,638
`
`Filed:
`
`Aug. 31, 2000
`
`Related U.S. Application Data
`
`Continuation of application No. 09/487,639,filed on Jan. 19,
`2000, now Pat. No. 6,226,738, which is a division of
`application No. 08/997,364,filed on Dec. 23, 1997, nowPat.
`No. 6,026,478.
`Provisional application No. 60/054,546, filed on Aug. 1,
`1997,
`
`Int. Cl.’ ...
`.. GO6F 19/00
`
`US. Che cesses
`.. 712/34; 711/105
`Field of Search .....0.0.cccc 712/34; 711/105
`
`References Cited
`
`U.S. PATENT DOCUMENTS
`
`10/1987 Thompsonetal.
`4,701,844 A
`1/1989 Matsuo etal.
`4,796,175 A
`/1991 Omadaetal.
`5,010,477 A
`12/1992 Beigheetal.
`5,175,835 A
`4/1993 Wetzel
`5,203,002 A
`5/1994 Sugimoto
`5,317,709 A
`/1995 Jobst et al.
`5,396,641 A
`5,475,631 A * 12/1995 Parkinson et al.
`5,485,624 A
`1/1996 Steinmetz etal.
`5,584,034 A
`12/1996 _Usamietal.
`5,588,118 A
`12/1996 Mandavaetal.
`5,594,917 A
`1/1997 Palermoetal.
`5,619,665 A
`/1997 Emma
`5,787,303 A
`/1998 Ishikawa
`5,805,850 A
`9/1998 Luick
`5,852,741 A
`12/1998 Jacobset al.
`
`............ 712/15
`
`1/1999 Motomura
`5,862,396 A
`7/1999 Moyeretal.
`5,923,893 A
`11/1999 Mohamedetal.
`5,978,838 A
`2/2000 Dowling
`6,026,478 A
`FOREIGN PATENT DOCUMENTS
`
EP   0 584 783 A2    8/1993
GB   2 244 828 A      /1990
`
`OTHER PUBLICATIONS
`
Keeton, et al., "IRAM and SmartSIMM: Overcoming the I/O Bus Bottleneck," Computer Science Division, University of California at Berkeley (Jun. 1997) pp. 1-9.
Papamichalis, Panos, et al., "The TMS320C30 Floating-Point Digital Signal Processor," IEEE Micro (Dec. 1988) pp. 13-29.
`
`* cited by examiner
`
`Primary Examiner—William Treat
`(74) Attorney, Agent, or Firm—Gazdzinski & Associates
`
`(57)
`
`ABSTRACT
`
A processing architecture includes a first CPU core portion coupled to a second embedded dynamic random access memory (DRAM) portion. These architectural components jointly implement a single processor and instruction set. Advantageously, the embedded logic on the DRAM chip implements the memory intensive processing tasks, thus reducing the amount of traffic that needs to be bussed back and forth between the CPU core and the embedded DRAM chips. The embedded DRAM logic monitors and manipulates the instruction stream into the CPU core. The architecture of the instruction set, data paths, addressing, control, caching, and interfaces are developed to allow the system to operate using a standard programming model. Specialized video and graphics processing systems are developed. Also, an extended very long instruction word (VLIW) architecture implemented as a primary VLIW processor coupled to an embedded DRAM VLIW extension processor efficiently deals with memory intensive tasks. In different embodiments, standard software can be accelerated either with or without the express knowledge of the processor.

20 Claims, 11 Drawing Sheets
`
Google Exhibit 1015
Google v. Valtrus

[FIG. 1 (Sheet 1 of 11): STANDARD DRAM SIMM SLOTS; EXTENSION CONTROL BUS; CPU; EXTENSION LOGIC; PROCESSOR/MEMORY BUS; BIU; EMBEDDED DRAM.]
`
`
`
[FIG. 2 (Sheet 2 of 11): INSTRUCTION FETCH; PREFETCH UNIT; DRAM ARRAY; MEMORY BUS; MONITOR/MODIFY; FORK; PREFETCH; CPU CORE FUNCTIONAL UNITS; EMBEDDED DRAM FUNCTIONAL UNITS.]
`
`
`
`
[FIG. 3 (Sheet 3 of 11): INSTRUCTION FETCH; PREFETCH; CACHE; MONITOR/MODIFY; FORK; READY; CPU CORE FUNCTIONAL UNITS; TYPE II; PREFETCH; FUNCTIONAL UNITS; REGISTERS; EMBEDDED DRAM.]
`
`
`
[FIG. 4 (Sheet 4 of 11): CPU CORE PROGRAM SPACE; EMBEDDED DRAM PROGRAM SPACE.]
`
`
`
[FIG. 5A (Sheet 5 of 11): TYPE II OPCODE: indicator for CPU core; opcode; logical address of EDRAM instruction. FIG. 5B: TYPE II OPCODE MICROSEQUENCE: indicator for embedded DRAM; opcode; command; stop signal.]
`
`
`
[FIG. 6 (Sheet 6 of 11): CACHE; FUNCTIONAL UNITS; PROGRAM INTERFACE REGISTERS; CONTROL INTERFACE; TYPE II INTERFACE; REGISTER FILE; MEMORY/INTERFACE REGISTERS.]
[FIG. 7 (Sheet 7 of 11): DRAM ARRAY; PROGRAM CACHE; PREFETCH UNIT; ADDRESS UNIT; DATA; TYPE II; DISPATCH.]
`
`
[FIG. 8 (Sheet 8 of 11): VLIW EXTENSION; VLIW DISPATCH UNIT; PREFETCH; REGISTER SET A; REGISTER SET B; BRANCH INTERFACE; ON-CHIP DATA MEMORY; DRAM ARRAY.]
`
`
`
[FIG. 9 (Sheet 9 of 11): APPLICATION; MESSAGE BASED API, VDI, GDI; HARDWIRED API, VDI, GDI; DRIVER PROGRAM; CPU/EDRAM ROUTINES.]
`
`
`
[FIG. 10 (Sheet 10 of 11): APPLICATION; MMX, etc.; EMBEDDED DRAM SOFTWARE; CPU PROFILE; PROFILER; CONSTRUCT MODIFICATION TABLE.]
`
`
`
[FIG. 11 (Sheet 11 of 11): STANDARD DRAM SIMM SLOTS.]
`
`
`
`SPLIT EMBEDDED DRAM PROCESSOR
`
`REFERENCE TO RELATED APPLICATIONS
`
The present application is a continuation of U.S. application Ser. No. 09/487,639 filed Jan. 19, 2000, entitled "Split Embedded DRAM Processor", now U.S. Pat. No. 6,226,738, which is a divisional application of U.S. application Ser. No. 08/997,364 filed Dec. 23, 1997, now U.S. Pat. No. 6,026,478 issued Feb. 15, 2000, which claims priority benefit of Provisional Application No. 60/054,546 filed Aug. 1, 1997.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`
The present invention relates to the fields of microprocessor and embedded DRAM architectures. More particularly, the invention pertains to a split processor architecture whereby a CPU portion performs standard processing and control functions, an embedded DRAM portion performs memory-intensive manipulations, and the CPU and embedded DRAM portions function in concert to execute a single program.
`2. Description of the Prior Art
Microprocessor technology continues to evolve rapidly. Every few years processor circuit speeds double, and the amount of logic that can be implemented on a single chip increases similarly. In addition, RISC, superscalar, very long instruction word (VLIW), and other architectural advances enable the processor to perform more useful work per clock cycle. Meanwhile, the number of DRAM cells per chip doubles and the required refresh rate halves every few years. The fact that DRAM access times do not double every few years results in a processor-DRAM speed mismatch. If the processor is to execute a program and manipulate data stored in a DRAM, it will have to insert wait states into its bus cycles to work with the slower DRAM. To combat this, hierarchical cache structures or large on-board SRAM banks are used so that on average, much less time is spent waiting for the large but slower DRAM.
Real-time multimedia capabilities are becoming increasingly important in microcomputer systems. Especially with video and image data, it is not practical to build caches large enough to hold the requisite data structures while they are being processed. This gives rise to large amounts of data traffic between the memory and the processor and decreases cache efficiency. For example, the Intel Pentium processors employ MMX technology, which essentially provides a vector processor subsystem that can process multiple pixels in parallel. However, even with faster synchronous DRAM, the problem remains that performance is limited by the DRAM access time needed to transfer data to and from the processor.
Other applications where external DRAM presents a system bottleneck are database applications. Database processing involves such algorithms as searching, sorting, and list processing in general. A key identifying requirement is the frequent use of memory indirect addressing. In memory indirect addressing, a pointer is stored in memory. The pointer must be retrieved from memory and then used to determine the address of another pointer located in memory. This addressing mode is used extensively in linked list searching and in dealing with recursive data structures such as trees and heaps. In these situations, cache performance diminishes as the processor is burdened with having to manipulate large data structures distributed across large areas in memory. In many cases, these memory accesses are interleaved with disk accesses, further reducing system performance.
Several prior art approaches have been used to increase processing speed in microsystems involving a fast processor and a slower DRAM. Many of these techniques, especially cache oriented solutions, are detailed in "Computer Architecture: A Quantitative Approach, 2nd Ed.," by John Hennessy and David Patterson (Morgan Kaufmann Publishers, 1996). This reference also discusses pipelined processing architectures together with instruction-level parallel processing techniques, as embodied in superscalar and VLIW architectures. These concepts are extended herein to provide improved performance by providing split caching and instruction-level parallel processing structures and methods that employ a CPU core and embedded DRAM logic.
The concept of using a coprocessor to extend a processor architecture is known in the art. Floating point coprocessors, such as the Intel 80x87 family, monitor the instruction stream from the memory into the processor, and, when certain coprocessor instructions are detected, the coprocessor latches and executes the coprocessor instructions. Upon completion, the coprocessor presents the results to the processor. In such systems, the processor is aware of the presence of the coprocessor, and the two work together to accelerate processing. However, the coprocessor is external from the memory, and no increase in effective memory bandwidth is realized. Rather, this solution speeds up computation by employing a faster arithmetic processor than could be integrated onto a single die at the time. Also, this solution does not provide for the important situation when the CPU involves a cache. In such situations, the coprocessor instructions cannot be intercepted, for example, when the CPU executes looped floating point code from cache. Another deficiency with this prior art is its inability to provide a solution for situations where the processor is not aware of the presence of the coprocessor. Such a situation becomes desirable in light of the present invention, whereby a standard DRAM may be replaced by an embedded DRAM to accelerate processing without modification of preexisting application software.
Motorola employed a different coprocessor interface for the MC68020 and MC68030 processors. In this protocol, when the processor executes a coprocessor instruction, a specialized sequence of bus cycles is initiated to pass the coprocessor instruction and any required operands across the coprocessor interface. If, for example, the coprocessor is a floating point processor, then the combination of the processor and the coprocessor appears as an extended processor with floating point capabilities. This interface serves as a good starting point, but does not define a protocol to fork execution threads or to jointly execute instructions on both sides of the interface. Furthermore, it does not define a protocol to allow the coprocessor to interact with the instruction sequence before it arrives at the processor. Moreover, the interface requires the processor to wait while a sequence of slow bus transactions are performed. This interface concept is not sufficient to support the features and required performance needed of the embedded DRAM coprocessors.
U.S. Pat. No. 5,485,624 discloses a coprocessor architecture for CPUs that are unaware of the presence of a coprocessor. In this architecture, the coprocessor monitors addresses generated by the CPU while fetching instructions, and when certain addresses are detected, interprets an opcode field not used by the CPU as a coprocessor instruction. In this system, the coprocessor then performs DMA transfers between memory and an interface card. This system does not involve an embedded DRAM that can speed processing by minimizing the bottleneck between the CPU and DRAM. Moreover, the coprocessor interface is designed to monitor the address bus and to respond only to specific preprogrammed addresses. When one of these addresses is identified, then an unused portion of an opcode is needed in which to insert coprocessor instructions. This system is thus not suited to systems that use large numbers of coprocessor instructions as in the split processor architecture of the present invention. A very large content addressable memory (CAM) would be required to handle all the coprocessor instruction addresses, and this CAM would need to be flushed and loaded on each task switch. The need for a large CAM eliminates the DRAM area advantage associated with an embedded DRAM solution. Moreover, introduction of a large task switching overhead eliminates the acceleration advantages. Finally, this technique involves a CPU unaware of the coprocessor but having opcodes that include unused fields that can be used by the coprocessor. A more powerful and general solution is needed.
The concept of memory based processors is also known in the art. The term "intelligent memories" is often used to describe such systems. For example, U.S. Pat. No. 5,396,641 discloses a memory based processor that is designed to increase processor-memory bandwidth. In this system, a set of bit serial processor elements function as a single instruction, multiple data (SIMD) parallel machine. Data is accessed in the memory based processor using normal row address and column address strobe oriented bus protocols. SIMD instructions are additionally latched in along with row addresses to control the operation of the SIMD machine under control by a host CPU. Hence, the description in U.S. Pat. No. 5,396,641 views the intelligent memory as a separate parallel processor controlled via write operations from the CPU. While this system may be useful as an attached vector processor, it does not serve to accelerate the normal software executed on a host processor. This architecture requires the CPU to execute instructions to explicitly control and route data to and from the memory based coprocessor. This architecture does not provide a tightly coupled acceleration unit that can accelerate performance with specialized instruction set extensions, and it cannot be used to accelerate existing applications software unaware of the existence of the embedded DRAM coprocessor. This architecture requires a very specialized form of programming where SIMD parallelism is expressly identified and coded into the application program.
It would be desirable to have an architecture that could accelerate the manipulation of data stored in a slower DRAM. It would also be desirable to be able to program such a system in a high level language programming model whereby the acceleration means are transparent to the programmer. It would also be desirable to maintain the processing features and capabilities of current microprocessors, to include caching systems, instruction pipelining, superscalar or VLIW operation, and the like. It would also be desirable to have a general purpose processor core that could implement operating system and applications programs so that this core could be mixed with different embedded DRAM coprocessors to accelerate the memory intensive processing of, for example, digital signal processing, multimedia or database algorithms. Finally, it would be desirable if a standard DRAM module could be replaced by an embedded DRAM module with processor architectural extensions, whereby existing software would be accelerated by the embedded DRAM extension.
`SUMMARY OF THE INVENTION
`
One aspect of the present invention is a processor whose architecture is partitioned into a CPU core portion and an embedded DRAM portion. The CPU core portion handles the main processing and control functions, while the embedded DRAM portion performs memory-intensive data manipulations. In the architecture, instructions execute either on the CPU core portion of the processor, the embedded DRAM portion of the processor, or across both portions of the processor.
In another aspect of the present invention, the CPU portion is able to effectively cache instructions and data while still sharing the instruction stream with the embedded DRAM portion of the processor implemented in the embedded DRAM. A separate caching structure is employed for a different program space on the embedded DRAM. Using this system, the separation of the CPU and embedded DRAM portions of the architecture is transparent to the programmer, allowing standard high level language software to run. In one embodiment, a special compiler is used to segment the code into a plurality of instruction types. The processor architecture takes advantage of the embedded DRAM, advantageously employing multiple address spaces that are transparent to the first portion of the processor, and that minimize data bussing traffic between the processors.
Another aspect of the present invention is an apparatus and method to execute standard available software on a split architecture. For example, in the personal computer and workstation markets there are already multi-billion dollar investments in preexisting software. In this aspect of the invention, an embedded DRAM module may be inserted into an existing single in line memory module (SIMM) slot. Thus, an accelerator may be added without needing to modify existing application software, and the upgrade can be performed effortlessly in the field. This functionality is enabled by allowing the embedded DRAM coprocessor to monitor the instruction stream and to replace certain instruction sequences with read and write commands. In one embodiment a profiler analyzes uniprocessor execution either statistically or dynamically and then constructs modification tables to reassign certain code segments to the embedded DRAM coprocessor. In another embodiment, the embedded DRAM performs the analysis in real-time. In still another embodiment, the embedded DRAM is exercised by standard software through the use of preloaded driver programs accessed via operating system calls.
Another aspect of the present invention is a computer system which comprises a central processing unit and an external memory coupled to the central processor. The external memory comprises one or more dynamic random access memory (DRAM) arrays, a set of local functional units, a local program prefetch unit, and a monitor/modify unit. The monitor/modify unit is operative to evaluate each instruction opcode as it is fetched from the DRAM array, and, in response to the opcode, to perform one of the following actions:
(i) sending the opcode to the central processing unit;
(ii) sending the opcode to the set of local functional units; and
(iii) sending the opcode to the local program prefetch unit to fork a separate execution thread for execution by the set of local functional units.
Preferably, in response to the opcode, the monitor/modify unit also performs the actions of sending the opcode to the set of local functional units, substituting at least one different opcode for the opcode, and sending the at least one different opcode to the central processing unit. Also preferably, the at least one different opcode instructs the central processing unit to read values from the external memory representative
`
`
of the register contents that would have been present in the central processing unit had the central processing unit executed the original instruction stream.
Another aspect of the present invention is an embedded dynamic random access memory (DRAM) coprocessor designed to be coupled to a central processing unit. The embedded DRAM coprocessor comprises one or more DRAM arrays. An external memory interface is responsive to address and control signals generated from an external source to transfer data between the DRAM arrays and the external source. A set of local functional units execute program instructions. A local program prefetch unit fetches program instructions. A monitor/modify unit evaluates each instruction opcode as it is fetched under control of the external source from the DRAM array, and, in response to the opcode, performs one of the following actions:
(i) sending the opcode to the external source;
(ii) sending the opcode to the set of local functional units; and
(iii) sending the opcode to the local program prefetch unit to fork a separate execution thread for execution by the set of local functional units.
Preferably, in response to the opcode, the monitor/modify unit also performs the actions of sending the opcode to the set of local functional units, substituting one or more different opcodes for the opcode, and sending the one or more different opcodes to the external source.
Another aspect of the present invention is a computer system which comprises a central processing unit coupled to an external memory. The central processor unit comprises a first set of functional units responsive to program instructions. A first program cache memory has at least one level of caching and provides high speed access to the program instructions. A first prefetch unit controls the fetching of a sequence of instructions to be executed by the first set of functional units. The instructions are fetched from the external memory unless the program instructions are found in the first program cache memory; in which case, the program instructions are fetched from the first program cache memory. The external memory comprises one or more dynamic random access memory (DRAM) arrays, a second set of local functional units, a second program prefetch unit, and a second program cache memory. The first program cache memory only caches instructions executed by the functional units on the central processing unit, and the second program cache memory only caches instructions executed by the second set of functional units on the external memory device. Preferably, the first program cache memory is a unified cache which also serves as a data cache. Also preferably, the central processing unit sends one or more attribute signals to identify certain memory read signals to be instruction fetch cycles. The attribute signals are decoded by logic embedded in the external memory so that the second program cache memory can identify opcode fetch cycles. In particular embodiments, the external memory further includes a monitor/modify unit which intercepts opcodes fetched by the first prefetch unit and passes the opcodes to the second prefetch unit to cause the second prefetch unit to fetch a sequence of program instructions for execution. The opcodes of the sequence of program instructions are fetched from the one or more DRAM arrays unless they are found to reside in the second program cache.
Another aspect of the present invention is an embedded dynamic random access memory (DRAM) coprocessor which comprises an external memory interface for transferring instructions and data in response to address and control signals received from an external bus master. The coprocessor also comprises one or more DRAM arrays, a set of local functional units, a program prefetch unit, and a program cache memory. The program cache memory only caches instructions executed by the functional units on the external memory device. Preferably, the external memory interface receives one or more attribute signals to identify certain memory read signals to be instruction fetch cycles. The attribute signals are decoded by logic embedded in the external memory so that the program cache can identify externally generated opcode fetch cycles. The coprocessor preferably includes a monitor/modify unit which intercepts opcodes in instructions transferred over the external memory interface and which passes the opcodes to the program prefetch unit to cause the program prefetch unit to fetch a sequence of program instructions for execution. The opcodes of the sequence of program instructions are fetched from the one or more DRAM arrays unless the opcodes of the sequence of program instructions are found to reside in the program cache.
Another aspect of the present invention is a computer system which comprises a central processing unit coupled to an external memory. The central processing unit comprises a first set of functional units responsive to program instructions. A first prefetch unit controls the fetching of a sequence of instructions from the external memory to be executed by the first set of functional units. The external memory comprises one or more dynamic random access memory (DRAM) arrays, a second set of local functional units, one or more external interface busses, and a second program prefetch unit. The central processing unit and the external program memory jointly execute a single program which is segmented into first and second program spaces. The first program space comprises type I, type II and optionally type III instructions. The second program space comprises type II and type III instructions. The type I instructions always execute on the first set of functional units. The type II instructions generate interface control exchanges between the central processing unit and the external memory. The type II instructions selectively are split into portions executed on the central processing unit and portions executed on the external memory. The type III instructions always execute on the second set of functional units. Preferably, the central processing unit has a first program cache, and the external memory has a second program cache. The first cache only caches the type I and the type II instructions accessed in the first program space. The second program cache only caches type II and type III instructions accessed in the second program space. Preferably, upon the execution of the type II instruction on the central processing unit, a logical address is transferred over one of the external interface busses to the external memory. The external memory passes the logical address to the second prefetch unit, which, in turn, fetches a sequence of instructions from the second program space. The sequence of instructions is executed by a second set of functional units in the external memory. Preferably, the type II instructions comprise first and second opcodes. The first opcode executes on the central processing unit, and the second opcode executes on the external memory. The first opcode comprises instruction type identifier information, opcode information to direct execution of a one of the first set of functional units, and an address field to be transferred over one of the external interface busses to reference instructions in the second program space. The second opcode comprises instruction type identifier information and opcode information to direct execution of a one of the second set of functional units. Preferably, the second opcode further comprises signaling
`
`
`
information to be passed across one of the external interface busses to the central processing unit. A stop field indicates to the second prefetch unit to stop fetching instructions from the second program space. Preferably, the type II instruction is a split branch to subroutine instruction, and upon execution of the split branch to subroutine instruction, a subroutine branch address is passed across one of the external interface busses to activate a subroutine stored in the second program space. Preferably, the type II instruction involves a first operand stored in memory and a second operand stored in a register located on the central processing unit. The type II instruction is split into a first portion and a second portion. The first portion executes on the external memory to access the first operand and to place it on one of the external interface busses. The second portion executes on the central processing unit which reads the first operand from one of the external interface busses and computes a result of the type II instruction.
Another aspect of the present invention is an embedded dynamic random access memory (DRAM) coprocessor which jointly executes a program with an external central processing unit. The embedded DRAM coprocessor comprises a DRAM array which comprises one or more DRAM banks. Each bank has an associated row pointer. Each row pointer is operative to precharge and activate a row in the respective DRAM bank. A first synchronous external memory interface accepts address and control information used to access memory locations in the DRAM array. A second synchronous external memory interface receives type II instruction information from an external source. A prefetch unit is responsive to the received type II information to execute one or more instructions referenced by the received type II information. A set of one or more functional units is responsive to instructions fetched by the prefetch unit. Preferably, the first and the second synchronous interfaces share a common bus. Also preferably, the embedded DRAM coprocessor further comprises a program cache which caches program instructions fetched under the control of the prefetch unit from the DRAM array. The embedded
unit. When the central processor executes specified instructions in an instruction stream read from a first program memory space in the embedded DRAM coprocessor, the central processor sends address information to the embedded DRAM coprocessor which references instructions in a second program memory space located in the embedded DRAM coprocessor. As a result, the central processing unit and the embedded DRAM coprocessor jointly execute a program. Preferably, the embedded DRAM coprocessor further includes a register file coupled to the DRAM array and the functional units. At least a subset of the register file contains a mirror image of a register set contained on the external central processing unit. At least a subset of the set of the one or more functional units is capable of executing a subset of the instruction set executed on the central processing unit. Also preferably, the register file further includes a set of multimedia extension (MMX) registers, and the functional units include one or more MMX functional units.
Another aspect of the present invention is a central processing unit cooperative to jointly execute programs fetched from an embedded dynamic random access memory (DRAM) coprocessor. The central processing unit comprises a prefetch unit which fetches instructions to be executed by the central processing unit, a set of internal registers, a set of one or more functional units which executes instructions, an optional program cache, a first external memory interface which transfers addresses, control signals and data to and from external memory and input/output (I/O) devices, and a second external memory interface which transfers synchronization signals and address information between the central processing unit and the embedded DRAM coprocessor. The central processing unit and the embedded DRAM coprocessor jointly execute a single program that is partitioned into first and second memory spaces. The instructions in the first memory space are executed by the central processing unit. The instructions in the second memory space are executed by the embedded DRAM coprocessor. The instructions in the first memory space include a first type of instruction and a second type of instruction. The first type of instruction is