(12) United States Patent
Dowling

(10) Patent No.: US 6,760,833 B1
(45) Date of Patent: Jul. 6, 2004

(54) SPLIT EMBEDDED DRAM PROCESSOR

(75) Inventor: Eric M. Dowling, Richardson, TX (US)

(73) Assignee: Micron Technology, Inc., Boise, ID (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 853 days.

(21) Appl. No.: 09/652,638

(22) Filed: Aug. 31, 2000

Related U.S. Application Data

(60) Continuation of application No. 09/487,639, filed on Jan. 19, 2000, now Pat. No. 6,226,738, which is a division of application No. 08/997,364, filed on Dec. 23, 1997, now Pat. No. 6,026,478.

(60) Provisional application No. 60/054,546, filed on Aug. 1, 1997.

(51) Int. Cl.⁷ .......................... G06F 19/00
(52) U.S. Cl. .......................... 712/34; 711/105
(58) Field of Search .................. 712/34; 711/105
`
(56) References Cited

U.S. PATENT DOCUMENTS

4,701,844 A   10/1987   Thompson et al.
4,796,175 A    1/1989   Matsuo et al.
5,010,477 A     /1991   Omada et al.
5,175,835 A   12/1992   Beighe et al.
5,203,002 A    4/1993   Wetzel
5,317,709 A    5/1994   Sugimoto
5,396,641 A     /1995   Jobst et al.
5,475,631 A * 12/1995   Parkinson et al.
5,485,624 A    1/1996   Steinmetz et al.
5,584,034 A   12/1996   Usami et al.
5,588,118 A   12/1996   Mandava et al.
5,594,917 A    1/1997   Palermo et al.
5,619,665 A     /1997   Emma
5,787,303 A     /1998   Ishikawa
5,805,850 A    9/1998   Luick
5,852,741 A   12/1998   Jacobs et al. ............ 712/15
5,862,396 A    1/1999   Motomura
5,923,893 A    7/1999   Moyer et al.
5,978,838 A   11/1999   Mohamed et al.
6,026,478 A    2/2000   Dowling

FOREIGN PATENT DOCUMENTS

EP   0 584 783 A2    8/1993
GB   2 244 828 A      /1990
`
OTHER PUBLICATIONS

Keeton, et al., “IRAM and SmartSIMM: Overcoming the I/O Bus Bottleneck,” Computer Science Division, University of California at Berkeley (Jun. 1997), pp. 1–9.
Papamichalis, Panos, et al., “The TMS320C30 Floating-Point Digital Signal Processor,” IEEE Micro (Dec. 1988), pp. 13–29.

* cited by examiner

Primary Examiner—William Treat
(74) Attorney, Agent, or Firm—Gazdzinski & Associates
`
(57) ABSTRACT

A processing architecture includes a first CPU core portion coupled to a second embedded dynamic random access memory (DRAM) portion. These architectural components jointly implement a single processor and instruction set. Advantageously, the embedded logic on the DRAM chip implements the memory-intensive processing tasks, thus reducing the amount of traffic that needs to be bussed back and forth between the CPU core and the embedded DRAM chips. The embedded DRAM logic monitors and manipulates the instruction stream into the CPU core. The architecture of the instruction set, data paths, addressing, control, caching, and interfaces is developed to allow the system to operate using a standard programming model. Specialized video and graphics processing systems are developed. Also, an extended very long instruction word (VLIW) architecture implemented as a primary VLIW processor coupled to an embedded DRAM VLIW extension processor efficiently deals with memory-intensive tasks. In different embodiments, standard software can be accelerated either with or without the express knowledge of the processor.

20 Claims, 11 Drawing Sheets
`
[Front-page figure: STANDARD DRAM SIMM SLOTS]

Google Exhibit 1015
Google v. Valtrus
`

[FIG. 1 (Sheet 1 of 11): CPU with a bus interface unit (BIU) coupled over a processor/memory bus and an extension control bus to an embedded DRAM containing extension logic.]
`
[FIG. 2 (Sheet 2 of 11): CPU core with instruction fetch, prefetch, and functional units coupled over a memory bus to an embedded DRAM containing a DRAM array, prefetch unit, monitor/modify unit, fork path, and functional units.]
`
[FIG. 3 (Sheet 3 of 11): CPU core with instruction fetch, prefetch cache, monitor/modify unit, fork and ready signaling, and functional units coupled to an embedded DRAM containing a prefetch unit, functional units, and registers; Type II instruction paths shown.]
`
[FIG. 4 (Sheet 4 of 11): CPU core program space and embedded DRAM program space.]
`
[Sheet 5 of 11: Type II opcode format diagrams: a logical address of the embedded DRAM instruction with a Type II indicator for the CPU core and an opcode field; and a Type II opcode with an indicator for the embedded DRAM, a microsequence command, and stop and signal fields.]
`
[FIG. 6 (Sheet 6 of 11): Embedded DRAM coprocessor detail with cache, functional units, program interface registers, control interface, Type II interface, and register file.]
`

`

`N
`SAM
`MEMORY/INTERFACE|INTERFACE
`AIS
`REGISTERS
`
`yuajed‘S'0
`
`poo‘9‘AL
`TTJ®Z948
`700.
`
`LM]|aooressADDRESS||DATA
`A0nness
`UNIT
`Td€€8°09L‘9SA
`
`
`
`PROGRAM
`CACHE
`
`
`
`_ DRAM ARRAY
`
`710
`
`PES
`
`PREFETCH
`_UNIT
`
`Vmtad
`pe GP
`
`YEO
`
`ZEO
`
`ee
`
`PIO
`
`I
`
`520
`
`TYPE Il
`
`FID
`
`FIC. 7
`
`|DISPATCH|
`
`

`

`SOO
`
`x
`
`SIO
`
`mS]
`
`BIO
`
`SIF
`
`VLIW — EXTENSION
`
`e090
`
`aN
`
`VLIW
`
`DISPATCH UNIT
`
`PREFETCH
`
`REGISTER SET A
`
`REGISTER SET B
`
`O27
`
`BRANCH
`INTERFACE
`
`SAD
`
`
`
`
`
`yuayed*S'A
`
`poo‘9‘AL
`TTJ®8299048
`Td€€8°09L‘9SA
`
`
`
`ON—CHIP DATA|___ON-CHIP_DATAMEMORY|
`
`CDRAM ARRAY
`
`-/G. &
`
`

`

`U.S. Patent
`
`Jul. 6, 2004
`
`Sheet 9 of 11
`
`US 6,760,833 B1
`
`APPLICATION
`
`FIO
`
`MESSAGE BASED
`
`
`
`G20
`API, VDI, GDI
`
`HARDWIRED
`API, VDI, GDI
`
`DRIVER
`PROGRAM
`
`G40
`
`CPU /EDRAM
`ROUTINES
`
`F/G, @
`
[FIG. 10 (Sheet 10 of 11): Profiling flow: application (MMX, etc.), embedded DRAM software profiler, CPU profile, and construction of a modification table.]
`
[FIG. 11 (Sheet 11 of 11): Standard DRAM SIMM slots.]

`

SPLIT EMBEDDED DRAM PROCESSOR

REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. application Ser. No. 09/487,639 filed Jan. 19, 2000, entitled “Split Embedded DRAM Processor”, now U.S. Pat. No. 6,226,738, which is a divisional application of U.S. application Ser. No. 08/997,364 filed Dec. 23, 1997, now U.S. Pat. No. 6,026,478 issued Feb. 15, 2000, which claims priority benefit of Provisional Application No. 60/054,546 filed Aug. 1, 1997.
`
BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to the fields of microprocessor and embedded DRAM architectures. More particularly, the invention pertains to a split processor architecture whereby a CPU portion performs standard processing and control functions, an embedded DRAM portion performs memory-intensive manipulations, and the CPU and embedded DRAM portions function in concert to execute a single program.
2. Description of the Prior Art

Microprocessor technology continues to evolve rapidly. Every few years processor circuit speeds double, and the amount of logic that can be implemented on a single chip increases similarly. In addition, RISC, superscalar, very long instruction word (VLIW), and other architectural advances enable the processor to perform more useful work per clock cycle. Meanwhile, the number of DRAM cells per chip doubles and the required refresh rate halves every few years. The fact that DRAM access times do not double every few years results in a processor-DRAM speed mismatch. If the processor is to execute a program and manipulate data stored in a DRAM, it will have to insert wait states into its bus cycles to work with the slower DRAM. To combat this, hierarchical cache structures or large on-board SRAM banks are used so that on average, much less time is spent waiting for the large but slower DRAM.
Real-time multimedia capabilities are becoming increasingly important in microcomputer systems. Especially with video and image data, it is not practical to build caches large enough to hold the requisite data structures while they are being processed. This gives rise to large amounts of data traffic between the memory and the processor and decreases cache efficiency. For example, the Intel Pentium processors employ MMX technology, which essentially provides a vector processor subsystem that can process multiple pixels in parallel. However, even with faster synchronous DRAM, the problem remains that performance is limited by the DRAM access time needed to transfer data to and from the processor.
Other applications where external DRAM presents a system bottleneck are database applications. Database processing involves such algorithms as searching, sorting, and list processing in general. A key identifying requirement is the frequent use of memory indirect addressing. In memory indirect addressing, a pointer is stored in memory. The pointer must be retrieved from memory and then used to determine the address of another pointer located in memory. This addressing mode is used extensively in linked list searching and in dealing with recursive data structures such as trees and heaps. In these situations, cache performance diminishes as the processor is burdened with having to manipulate large data structures distributed across large areas in memory. In many cases, these memory accesses are interleaved with disk accesses, further reducing system performance.
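The memory-indirect access pattern described above can be sketched in a few lines (an illustrative fragment, not from the patent; the node layout and key values are invented):

```python
# Illustrative sketch of memory-indirect addressing: each step must
# load a pointer from memory before the next address is known, so a
# cache sees little reuse when nodes are scattered across a large DRAM.

class Node:
    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node  # pointer stored "in memory"

def linked_search(head, target):
    """Follow pointers until the target key is found (or the list ends)."""
    node = head
    while node is not None:
        if node.key == target:
            return node
        # Memory-indirect step: the address of the next read depends
        # on the data just fetched.
        node = node.next
    return None

# Build a small list 0 -> 1 -> 2 and search it.
head = Node(0, Node(1, Node(2)))
print(linked_search(head, 2).key)  # -> 2
```

Each iteration's load address depends on the previous load, so neither prefetching nor caching helps much when the nodes are scattered across a large DRAM footprint.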
Several prior art approaches have been used to increase processing speed in microsystems involving a fast processor and a slower DRAM. Many of these techniques, especially cache oriented solutions, are detailed in “Computer Architecture: A Quantitative Approach, 2nd Ed.,” by John Hennessy and David Patterson (Morgan Kaufmann Publishers, 1996). This reference also discusses pipelined processing architectures together with instruction-level parallel processing techniques, as embodied in superscalar and VLIW architectures. These concepts are extended herein to provide improved performance by providing split caching and instruction-level parallel processing structures and methods that employ a CPU core and embedded DRAM logic.
The concept of using a coprocessor to extend a processor architecture is known in the art. Floating point coprocessors, such as the Intel 80x87 family, monitor the instruction stream from the memory into the processor, and, when certain coprocessor instructions are detected, the coprocessor latches and executes the coprocessor instructions. Upon completion, the coprocessor presents the results to the processor. In such systems, the processor is aware of the presence of the coprocessor, and the two work together to accelerate processing. However, the coprocessor is external from the memory, and no increase in effective memory bandwidth is realized. Rather, this solution speeds up computation by employing a faster arithmetic processor than could be integrated onto a single die at the time. Also, this solution does not provide for the important situation when the CPU involves a cache. In such situations, the coprocessor instructions cannot be intercepted, for example, when the CPU executes looped floating point code from cache. Another deficiency with this prior art is its inability to provide a solution for situations where the processor is not aware of the presence of the coprocessor. Such a situation becomes desirable in light of the present invention, whereby a standard DRAM may be replaced by an embedded DRAM to accelerate processing without modification of preexisting application software.
Motorola employed a different coprocessor interface for the MC68020 and MC68030 processors. In this protocol, when the processor executes a coprocessor instruction, a specialized sequence of bus cycles is initiated to pass the coprocessor instruction and any required operands across the coprocessor interface. If, for example, the coprocessor is a floating point processor, then the combination of the processor and the coprocessor appears as an extended processor with floating point capabilities. This interface serves as a good starting point, but does not define a protocol to fork execution threads or to jointly execute instructions on both sides of the interface. Furthermore, it does not define a protocol to allow the coprocessor to interact with the instruction sequence before it arrives at the processor. Moreover, the interface requires the processor to wait while a sequence of slow bus transactions is performed. This interface concept is not sufficient to support the features and required performance needed of the embedded DRAM coprocessors.
U.S. Pat. No. 5,485,624 discloses a coprocessor architecture for CPUs that are unaware of the presence of a coprocessor. In this architecture, the coprocessor monitors addresses generated by the CPU while fetching instructions, and when certain addresses are detected, interprets an opcode field not used by the CPU as a coprocessor instruction. In this system, the coprocessor then performs DMA transfers between memory and an interface card. This system does not involve an embedded DRAM that can speed
processing by minimizing the bottleneck between the CPU and DRAM. Moreover, the coprocessor interface is designed to monitor the address bus and to respond only to specific preprogrammed addresses. When one of these addresses is identified, then an unused portion of an opcode is needed in which to insert coprocessor instructions. This system is thus not suited to systems that use large numbers of coprocessor instructions as in the split processor architecture of the present invention. A very large content addressable memory (CAM) would be required to handle all the coprocessor instruction addresses, and this CAM would need to be flushed and loaded on each task switch. The need for a large CAM eliminates the DRAM area advantage associated with an embedded DRAM solution. Moreover, introduction of a large task switching overhead eliminates the acceleration advantages. Finally, this technique involves a CPU unaware of the coprocessor but having opcodes that include unused fields that can be used by the coprocessor. A more powerful and general solution is needed.
The concept of memory-based processors is also known in the art. The term “intelligent memories” is often used to describe such systems. For example, U.S. Pat. No. 5,396,641 discloses a memory based processor that is designed to increase processor-memory bandwidth. In this system, a set of bit serial processor elements function as a single instruction, multiple data (SIMD) parallel machine. Data is accessed in the memory based processor using normal row address and column address strobe oriented bus protocols. SIMD instructions are additionally latched in along with row addresses to control the operation of the SIMD machine under control by a host CPU. Hence, the description in U.S. Pat. No. 5,396,641 views the intelligent memory as a separate parallel processor controlled via write operations from the CPU. While this system may be useful as an attached vector processor, it does not serve to accelerate the normal software executed on a host processor. This architecture requires the CPU to execute instructions to explicitly control and route data to and from the memory based coprocessor. This architecture does not provide a tightly coupled acceleration unit that can accelerate performance with specialized instruction set extensions, and it cannot be used to accelerate existing applications software unaware of the existence of the embedded DRAM coprocessor. This architecture requires a very specialized form of programming where SIMD parallelism is expressly identified and coded into the application program.
It would be desirable to have an architecture that could accelerate the manipulation of data stored in a slower DRAM. It would also be desirable to be able to program such a system in a high-level language programming model whereby the acceleration means are transparent to the programmer. It would also be desirable to maintain the processing features and capabilities of current microprocessors, to include caching systems, instruction pipelining, superscalar or VLIW operation, and the like. It would also be desirable to have a general purpose processor core that could implement operating system and applications programs so that this core could be mixed with different embedded DRAM coprocessors to accelerate the memory intensive processing of, for example, digital signal processing, multimedia or database algorithms. Finally, it would be desirable if a standard DRAM module could be replaced by an embedded DRAM module with processor architectural extensions, whereby existing software would be accelerated by the embedded DRAM extension.
SUMMARY OF THE INVENTION

One aspect of the present invention is a processor whose architecture is partitioned into a CPU core portion and an
embedded DRAM portion. The CPU core portion handles the main processing and control functions, while the embedded DRAM portion performs memory-intensive data manipulations. In the architecture, instructions execute either on the CPU core portion of the processor, the embedded DRAM portion of the processor, or across both portions of the processor.
In another aspect of the present invention, the CPU portion is able to effectively cache instructions and data while still sharing the instruction stream with the embedded DRAM portion of the processor implemented in the embedded DRAM. A separate caching structure is employed for a different program space on the embedded DRAM. Using this system, the separation of the CPU and embedded DRAM portions of the architecture is transparent to the programmer, allowing standard high level language software to run. In one embodiment, a special compiler is used to segment the code into a plurality of instruction types. The processor architecture takes advantage of the embedded DRAM, advantageously employing multiple address spaces that are transparent to the first portion of the processor, and that minimize data bussing traffic between the processors.
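The compiler-driven segmentation into instruction types might be pictured as follows (a hedged sketch; the opcode names, type labels, and classification rule are assumptions for illustration, not the patent's actual compiler):

```python
# Sketch: partition a program into CPU-core code (type I) and
# memory-intensive code routed to the embedded DRAM (type III),
# with type II stubs in the CPU space linking the two program spaces.
# The opcode lists below are invented for illustration.

CPU_OPS = {"add", "mul", "branch"}
MEMORY_OPS = {"scan", "sort", "blit"}  # memory-intensive candidates

def segment(program):
    """Split one instruction stream into two program spaces."""
    cpu_space, edram_space = [], []
    for op in program:
        if op in MEMORY_OPS:
            # A type II stub in CPU space references the eDRAM code.
            cpu_space.append(("type_II", len(edram_space)))
            edram_space.append(("type_III", op))
        else:
            cpu_space.append(("type_I", op))
    return cpu_space, edram_space

cpu, edram = segment(["add", "scan", "mul"])
```

The point of the sketch is that the programmer writes one stream; the split into two address spaces happens behind the scenes.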
Another aspect of the present invention is an apparatus and method to execute standard available software on a split architecture. For example, in the personal computer and workstation markets there are already multi-billion dollar investments in preexisting software. In this aspect of the invention, an embedded DRAM module may be inserted into an existing single in line memory module (SIMM) slot. Thus, an accelerator may be added without needing to modify existing application software, and the upgrade can be performed effortlessly in the field. This functionality is enabled by allowing the embedded DRAM coprocessor to monitor the instruction stream and to replace certain instruction sequences with read and write commands. In one embodiment a profiler analyzes uniprocessor execution either statistically or dynamically and then constructs modification tables to reassign certain code segments to the embedded DRAM coprocessor. In another embodiment, the embedded DRAM performs the analysis in real-time. In still another embodiment, the embedded DRAM is exercised by standard software through the use of preloaded driver programs accessed via operating system calls.
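The profiling embodiment can be mocked up as below (a minimal sketch under stated assumptions; the profile format and the execution-count and memory-ratio thresholds are invented):

```python
# Sketch: a profiler counts how often each code segment runs and how
# memory-bound it is, then builds a modification table reassigning the
# hottest memory-bound segments to the embedded DRAM coprocessor.
# The count > 100 and ratio > 0.5 thresholds are invented illustrations.

def build_modification_table(profile):
    """profile maps segment name -> (exec_count, memory_access_ratio)."""
    table = {}
    for segment, (count, mem_ratio) in profile.items():
        if count > 100 and mem_ratio > 0.5:
            table[segment] = "edram"   # reassign to the embedded DRAM
        else:
            table[segment] = "cpu"     # leave on the CPU core
    return table

profile = {"memcpy_loop": (5000, 0.9), "ui_event": (40, 0.1)}
table = build_modification_table(profile)
```

A real-time variant would update the same table while the program runs instead of from a stored profile.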
Another aspect of the present invention is a computer system which comprises a central processing unit and an external memory coupled to the central processor. The external memory comprises one or more dynamic random access memory (DRAM) arrays, a set of local functional units, a local program prefetch unit, and a monitor/modify unit. The monitor/modify unit is operative to evaluate each instruction opcode as it is fetched from the DRAM array, and, in response to the opcode, to perform one of the following actions:
(i) sending the opcode to the central processing unit;
(ii) sending the opcode to the set of local functional units; and
(iii) sending the opcode to the local program prefetch unit to fork a separate execution thread for execution by the set of local functional units.
Preferably, in response to the opcode, the monitor/modify unit also performs the actions of sending the opcode to the set of local functional units, substituting at least one different opcode for the opcode, and sending the at least one different opcode to the central processing unit. Also preferably, the at least one different opcode instructs the central processing unit to read values from the external memory representative of the register contents that would have been present in the central processing unit had the central processing unit executed the original instruction stream.
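The monitor/modify choices enumerated above, including the opcode-substitution behavior, can be modeled as a small dispatch routine (a sketch; the opcode prefixes and the substituted read command are hypothetical):

```python
# Sketch of the monitor/modify unit's choices for each fetched opcode:
# (i) pass it through to the CPU, (ii) run it on the local functional
# units and substitute a read of the mirrored results for the CPU, or
# (iii) hand it to the local prefetch unit to fork a local thread.
# The opcode name prefixes used here are invented for illustration.

def monitor_modify(opcode, cpu, local_units, local_prefetch):
    if opcode.startswith("mem_fork"):
        local_prefetch.append(opcode)      # (iii) fork a local thread
    elif opcode.startswith("mem"):
        local_units.append(opcode)         # (ii) execute locally, and...
        cpu.append("read_result_regs")     # ...substitute a result read
    else:
        cpu.append(opcode)                 # (i) ordinary CPU instruction

cpu, local_units, local_prefetch = [], [], []
for op in ["add", "mem_sort", "mem_fork_scan"]:
    monitor_modify(op, cpu, local_units, local_prefetch)
```

The substitution path is what lets an unmodified CPU see register values "as if" it had executed the original stream itself.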
Another aspect of the present invention is an embedded dynamic random access memory (DRAM) coprocessor designed to be coupled to a central processing unit. The embedded DRAM coprocessor comprises one or more DRAM arrays. An external memory interface is responsive to address and control signals generated from an external source to transfer data between the DRAM arrays and the external source. A set of local functional units execute program instructions. A local program prefetch unit fetches program instructions. A monitor/modify unit evaluates each instruction opcode as it is fetched under control of the external source from the DRAM array, and, in response to the opcode, performs one of the following actions:
(i) sending the opcode to the external source;
(ii) sending the opcode to the set of local functional units; and
(iii) sending the opcode to the local program prefetch unit to fork a separate execution thread for execution by the set of local functional units.
Preferably, in response to the opcode, the monitor/modify unit also performs the actions of sending the opcode to the set of local functional units, substituting one or more different opcodes for the opcode, and sending the one or more different opcodes to the external source.
Another aspect of the present invention is a computer system which comprises a central processing unit coupled to an external memory. The central processor unit comprises a first set of functional units responsive to program instructions. A first program cache memory has at least one level of caching and provides high speed access to the program instructions. A first prefetch unit controls the fetching of a sequence of instructions to be executed by the first set of functional units. The instructions are fetched from the external memory unless the program instructions are found in the first program cache memory; in which case, the program instructions are fetched from the first program cache memory. The external memory comprises one or more dynamic random access memory (DRAM) arrays, a second set of local functional units, a second program prefetch unit, and a second program cache memory. The first program cache memory only caches instructions executed by the functional units on the central processing unit, and the second program cache memory only caches instructions executed by the second set of functional units on the external memory device. Preferably, the first program cache memory is a unified cache which also serves as a data cache. Also preferably, the central processing unit sends one or more attribute signals to identify certain memory read signals to be instruction fetch cycles. The attribute signals are decoded by logic embedded in the external memory so that the second program cache memory can identify opcode fetch cycles. In particular embodiments, the external memory further includes a monitor/modify unit which intercepts opcodes fetched by the first prefetch unit and passes the opcodes to the second prefetch unit to cause the second prefetch unit to fetch a sequence of program instructions for execution. The opcodes of the sequence of program instructions are fetched from the one or more DRAM arrays unless they are found to reside in the second program cache.
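The split caching arrangement, with an attribute signal marking instruction-fetch cycles, can be illustrated as follows (a sketch; the cache and bus model is an invented simplification):

```python
# Sketch: each side caches only the instructions it executes. The CPU
# asserts an attribute signal on instruction-fetch bus cycles so that
# logic in the external memory can tell opcode fetches from data reads
# and fill the second program cache only with eDRAM-executed code.

def fetch(addr, is_instruction_fetch, executes_on,
          cpu_cache, edram_cache, dram):
    """Return the word at addr, filling only the executing side's cache."""
    cache = cpu_cache if executes_on == "cpu" else edram_cache
    if is_instruction_fetch and addr in cache:
        return cache[addr]         # hit in the appropriate program cache
    word = dram[addr]              # otherwise fall back to the DRAM array
    if is_instruction_fetch:
        cache[addr] = word         # attribute signal gates the cache fill
    return word

dram = {0: "add", 1: "mem_sort"}
cpu_cache, edram_cache = {}, {}
fetch(0, True, "cpu", cpu_cache, edram_cache, dram)
fetch(1, True, "edram", cpu_cache, edram_cache, dram)
```

Keeping the two program caches disjoint is what lets each side cache aggressively without ever holding the other side's instructions.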
Another aspect of the present invention is an embedded dynamic random access memory (DRAM) coprocessor which comprises an external memory interface for transferring instructions and data in response to address and control signals received from an external bus master. The coprocessor also comprises one or more DRAM arrays, a set of local functional units, a program prefetch unit, and a program cache memory. The program cache memory only caches instructions executed by the functional units on the external memory device. Preferably, the external memory interface receives one or more attribute signals to identify certain memory read signals to be instruction fetch cycles. The attribute signals are decoded by logic embedded in the external memory so that the program cache can identify externally generated opcode fetch cycles. The coprocessor preferably includes a monitor/modify unit which intercepts opcodes in instructions transferred over the external memory interface and which passes the opcodes to the program prefetch unit to cause the program prefetch unit to fetch a sequence of program instructions for execution. The opcodes of the sequence of program instructions are fetched from the one or more DRAM arrays unless the opcodes of the sequence of program instructions are found to reside in the program cache.
Another aspect of the present invention is a computer system which comprises a central processing unit coupled to an external memory. The central processing unit comprises a first set of functional units responsive to program instructions. A first prefetch unit controls the fetching of a sequence of instructions from the external memory to be executed by the first set of functional units. The external memory comprises one or more dynamic random access memory (DRAM) arrays, a second set of local functional units, one or more external interface busses, and a second program prefetch unit. The central processing unit and the external program memory jointly execute a single program which is segmented into first and second program spaces. The first program space comprises type I, type II and optionally type III instructions. The second program space comprises type II and type III instructions. The type I instructions always execute on the first set of functional units. The type II instructions generate interface control exchanges between the central processing unit and the external memory. The type II instructions selectively are split into portions executed on the central processing unit and portions executed on the external memory. The type III instructions always execute on the second set of functional units. Preferably, the central processing unit has a first program cache, and the external memory has a second program cache. The first cache only caches the type I and the type II instructions accessed in the first program space. The second program cache only caches type II and type III instructions accessed in the second program space. Preferably, upon the execution of the type II instruction on the central processing unit, a logical address is transferred over one of the external interface busses to the external memory. The external memory passes the logical address to the second prefetch unit, which, in turn, fetches a sequence of instructions from the second program space. The sequence of instructions is executed by a second set of functional units in the external memory. Preferably, the type II instructions comprise first and second opcodes. The first opcode executes on the central processing unit, and the second opcode executes on the external memory. The first opcode comprises instruction type identifier information, opcode information to direct execution of a one of the first set of functional units, and an address field to be transferred over one of the external interface busses to reference instructions in the second program space. The second opcode comprises instruction type identifier information and opcode information to direct execution of a one of the second set of functional units. Preferably, the second opcode further comprises signaling
`
`

`

`US 6,760,833 B1
`
`20
`
`25
`
`30
`
`35
`
`
`
`8
`7
`unit. When the central processor executes specified instruc-
`information to be passed across one of the external interface
`tions in an instruction stream read from a first program
`busses to the central processing unit. A stop field indicates
`memory space in the embedded DRAM coprocessor, the
`to the secondprefetch unit to stop fetching instructions from
`central processor sends address information to the embedded
`the second program space. Preferably, the type IJ instruction
`DRAM coprocessor which references instructions in a sec-
`is a split branch to subroutine instruction, and upon execu-
ond program memory space located in the embedded DRAM coprocessor. As a result, the central processing unit and the embedded DRAM coprocessor jointly execute a program. Preferably, the embedded DRAM coprocessor further includes a register file coupled to the DRAM array and the functional units. At least a subset of the register file contains a mirror image of a register set contained on the external central processing unit. At least a subset of the set of the one or more functional units is capable of executing a subset of the instruction set executed on the central processing unit. Also preferably, the register file further includes a set of multimedia extension (MX) registers, and the functional units include one or more MMX functional units.

Another aspect of the present invention is an embedded dynamic random access memory (DRAM) coprocessor which jointly executes a program with an external central processing unit. The embedded DRAM coprocessor comprises a DRAM array which comprises one or more DRAM banks. Each bank has an associated row pointer. Each row pointer is operative to precharge and activate a row in the respective DRAM bank. A first synchronous external memory interface accepts address and control information used to access memory locations in the DRAM array. A second synchronous external memory interface receives type II instruction information from an external source. A prefetch unit is responsive to the received type II information to execute one or more instructions referenced by the received type II information. A set of one or more functional units is responsive to instructions fetched by the prefetch unit. Preferably, the first and the second synchronous interfaces share a common bus. Also preferably, the embedded DRAM coprocessor further comprises a program cache which caches program instructions fetched under the control of the prefetch unit from the DRAM array. The embedded

tion of the split branch to subroutine instruction, a subroutine branch address is passed across one of the external interface busses to activate a subroutine stored in the second program space. Preferably, the type II instruction involves a first operand stored in memory and a second operand stored in a register located on the central processing unit. The type II instruction is split into a first portion and a second portion. The first portion executes on the external memory to access the first operand and to place it on one of the external interface busses. The second portion executes on the central processing unit, which reads the first operand from one of the external interface busses and computes a result of the type II instruction.

Another aspect of the present invention is a central processing unit cooperative to jointly execute programs fetched from an embedded dynamic random access memory (DRAM) coprocessor. The central processing unit comprises a prefetch unit which fetches instructions to be executed by the central processing unit, a set of internal registers, a set of one or more functional units which executes instructions, an optional program cache, a first external memory interface which transfers addresses, control signals and data to and from external memory and input/output (I/O) devices, and a second external memory interface which transfers synchronization signals and address information between the central processing unit and the embedded DRAM coprocessor. The central processing unit and the embedded DRAM coprocessor jointly execute a single program that is partitioned into first and second memory spaces. The instructions in the first memory space are executed by the central processing unit. The instructions in the second memory space are executed by the embedded DRAM coprocessor. The instructions in the first memory space include a first type of instruction and a second type of instruction. The first type of instruction i
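The two-portion execution of a type II instruction described in the summary can be sketched as a small model. The sketch below is not code from the patent; the class names, method names, and the bus abstraction are illustrative assumptions about how the first portion (executed at the memory) and the second portion (executed on the central processing unit) might cooperate over one of the external interface busses:

```python
class Bus:
    """Models one of the external interface busses shared by the two units."""
    def __init__(self):
        self._value = None

    def drive(self, value):
        self._value = value

    def read(self):
        return self._value


class EmbeddedDramCoprocessor:
    """Memory-side unit: executes the first portion of a type II instruction."""
    def __init__(self, memory):
        self.memory = memory  # stands in for the embedded DRAM array

    def execute_first_portion(self, address, bus):
        # Access the first operand in memory and place it on the bus.
        bus.drive(self.memory[address])


class CentralProcessingUnit:
    """Executes the second portion: reads the operand and computes the result."""
    def __init__(self):
        self.registers = {}

    def execute_second_portion(self, bus, reg, op):
        operand_a = bus.read()           # first operand, taken from the bus
        operand_b = self.registers[reg]  # second operand, held in a CPU register
        self.registers[reg] = op(operand_a, operand_b)


# Joint execution of one type II instruction, e.g. ADD [100], R0 -> R0
memory = {100: 7}
bus = Bus()
coproc = EmbeddedDramCoprocessor(memory)
cpu = CentralProcessingUnit()
cpu.registers["R0"] = 5

coproc.execute_first_portion(100, bus)                     # first portion, at the memory
cpu.execute_second_portion(bus, "R0", lambda a, b: a + b)  # second portion, at the CPU
print(cpu.registers["R0"])  # prints 12
```

In this model the coprocessor side alone touches the memory array and drives the operand onto the bus, while the central processing unit consumes it and combines it with a register operand, mirroring the first-portion/second-portion division described above.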
