US 20020052914A1

(19) United States
(12) Patent Application Publication     (10) Pub. No.: US 2002/0052914 A1
     ZALEWSKI et al.                    (43) Pub. Date: May 2, 2002
(54) SOFTWARE PARTITIONED MULTI-PROCESSOR SYSTEM WITH FLEXIBLE RESOURCE SHARING LEVELS

(76) Inventors: STEPHEN H. ZALEWSKI, NASHUA, NH (US); ANDREW H. MASON, HOLLIS, NH (US); GREGORY H. JORDAN, HOLLIS, NH (US); KAREN L. NOEL, PEMBROKE, NH (US)

Correspondence Address:
JONATHAN M. HARRIS
CONLEY, ROSE & TAYON
P.O. BOX 3267
HOUSTON, TX 77253-3267 (US)
(*) Notice: This is a publication of a continued prosecution application (CPA) filed under 37 CFR 1.53(d).

(21) Appl. No.: 09/095,521

(22) Filed: Jun. 10, 1998

Publication Classification

(51) Int. Cl.7 ............. G06F 15/167; G06F 15/16; G06F 12/00; G06F 13/00; G06F 12/14; G06F 12/16; G06F 13/28
(52) U.S. Cl. ............. 709/203; 711/123
(57) ABSTRACT

Multiple instances of operating systems execute cooperatively in a single multiprocessor computer wherein all processors and resources are electrically connected together. The single physical machine with multiple physical processors and resources is subdivided by software into multiple partitions, each running a distinct copy, or instance, of an operating system. Each of the partitions has access to its own physical resources plus resources designated as shared. The partitioning is performed by assigning all resources within a configuration tree. None, some, or all resources may be designated as shared among multiple partitions. Each individual operating instance will generally be assigned the resources it needs to execute independently, and these resources will be designated as “private.” Other resources, particularly memory, can be assigned to more than one instance and shared. Shared memory is cache coherent so that instances may be tightly coupled and may share resources that are normally allocated to a single instance. This allows previously distributed user or operating system applications, which usually must pass messages via an external interconnect, to operate cooperatively in the shared memory without the need for either an external interconnect or message passing. Examples of applications that could take advantage of this capability include distributed lock managers and cluster interconnects. Newly-added resources, such as CPUs and memory, can be dynamically assigned to different partitions and used by instances of operating systems running within the machine by modifying the configuration.
Google Exhibit 1008 — Google v. Valtrus

[Front-page figure (FIG. 9): overview of system 900 — Instances A, B, and C, each with its own CPUs and instance-private memory (e.g., 926, 930), plus a shared memory region]
[Sheet 1 of 15 — FIG. 1: schematic block diagram of a hardware platform showing a system building block 100 with CPUs 108-114, I/O processor 118, memory 120, and port 116]
[Sheet 2 of 15 — FIG. 2: schematic diagram of computer system 200 with several partitions, each running an operating system instance, whose consoles connect to a workstation]
[Sheet 3 of 15 — FIG. 3: schematic diagram of configuration tree 300 illustrating child and sibling pointers, with root, community, partition (partition 1, partition 2), hardware, software, template, CPU, memory, and controller nodes]
[Sheet 4 of 15 — FIG. 4: the configuration tree of FIG. 3 rearranged to illustrate ownership pointers, showing a software community over partitions 1 and 2 with their CPUs, memory, and controllers]
[Sheet 5 of 15 — FIG. 5: flowchart of creating a computer system (start 500 through finish, steps 502-512); recovered step labels: probe hardware; form configuration tree; initialize each partition and start its console; start master console; boot some OS instances]
[Sheet 6 of 15 — FIG. 6: flowchart of steps followed by all nodes before joining or creating a computer system (start 600 through finish 610); recovered step labels: store APMP database VA in IP handler cell; map APMP database initial segment; reset interrupt masks for current instance; initialize heartbeat word and other instance blocks]
[Sheet 7 of 15 — FIG. 7A: flowchart of creating a computer system (steps 700-708); recovered step labels: set system and instance state to initializing; primary instance calls size routine; allocate space for APMP database; fill offsets for service system segments]
[Sheet 8 of 15 — FIG. 7B: continuation of FIG. 7A (steps 710-720); recovered step labels: call initialization routine for each service; initialize membership mask and set parameters; instance sets itself as big brother; initialize instance and system states; release APMP database lock; finish]
[Sheet 9 of 15 — FIG. 8A: flowchart of joining an already-created computer system (steps through 808); recovered step labels: check for unique name; set system and instance states to "instance joining"; map portion of APMP database into local memory; call system join routines]
[Sheet 10 of 15 — FIG. 8B: continuation of FIG. 8A (steps 812-820); recovered step labels: add to membership mask; select big brother; fill in instance state and set membership flag; release APMP database lock; finish]
[Sheet 11 of 15 — FIG. 9: overview of the inventive system 900 — instances A, B, and C with their CPUs, instance-private memories (e.g., 926, 930), and shared memory]
[Sheet 12 of 15 — FIG. 10: the system 1000 operating as a shared nothing computing system — instances A, B, and C, each with its own CPUs, private memory, and instance of the OS, communicating only over a network interconnect]
[Sheet 13 of 15 — FIG. 11: the system 1100 operating as a shared partial computing system — instances A, B, and C with CPUs and private memories plus a shared memory containing a data-only global section, connected by a network interconnect]
[Sheet 14 of 15 — FIG. 12: the system operating as a shared everything computing system — instances with CPUs joined by cluster, storage, and network interconnects]
[Sheet 15 of 15 — FIG. 13: migration of CPUs in the inventive computing system 1300 — instances with CPUs, a data-only shared region, and cluster and network interconnects]
SOFTWARE PARTITIONED MULTI-PROCESSOR SYSTEM WITH FLEXIBLE RESOURCE SHARING LEVELS

FIELD OF THE INVENTION

[0001] This invention relates to multiprocessor computer architectures in which processors and other computer hardware resources are grouped in partitions, each of which has an operating system instance and, more specifically, to methods and apparatus for sharing resources in a variety of configurations between partitions.

[0002] BACKGROUND OF THE INVENTION

[0003] The efficient operation of many applications in present computing environments depends upon fast, powerful and flexible computing systems. The configuration and design of such systems has become very complicated when such systems are to be used in an “enterprise” commercial environment where there may be many separate departments, many different problem types and continually changing computing needs. Users in such environments generally want to be able to quickly and easily change the capacity of the system, its speed and its configuration. They may also want to expand the system work capacity and change configurations to achieve better utilization of resources without stopping execution of application programs on the system. In addition, they may want to be able to configure the system in order to maximize resource availability so that each application will have an optimum computing configuration.

[0004] Traditionally, computing speed has been addressed by using a “shared nothing” computing architecture where data, business logic, and graphic user interfaces are distinct tiers and have specific computing resources dedicated to each tier. Initially, a single central processing unit was used and the power and speed of such a computing system was increased by increasing the clock rate of the single central processing unit. More recently, computing systems have been developed which use several processors working as a team instead of one massive processor working alone. In this manner, a complex application can be distributed among many processors instead of waiting to be executed by a single processor. Such systems typically consist of several central processing units (CPUs) which are controlled by a single operating system. In a variant of a multiple processor system called “symmetric multiprocessing” or SMP, the applications are distributed equally across all processors. The processors also share memory. In another variant called “asymmetric multiprocessing” or AMP, one processor acts as a “master” and all of the other processors act as “slaves.” Therefore, all operations, including the operating system, must pass through the master before being passed onto the slave processors. These multiprocessing architectures have the advantage that performance can be increased by adding additional processors, but suffer from the disadvantage that the software running on such systems must be carefully written to take advantage of the multiple processors and it is difficult to scale the software as the number of processors increases. Current commercial workloads do not scale well beyond 8-24 CPUs as a single SMP system, the exact number depending upon platform, operating system and application mix.

[0005] For increased performance, another typical answer has been to dedicate computer resources (machines) to an application in order to optimally tune the machine resources to the application. However, this approach has not been adopted by the majority of users because most sites have many applications and separate databases developed by different vendors. Therefore, it is difficult, and expensive, to dedicate resources among all of the applications, especially in environments where the application mix is constantly changing. Further, with dedicated resources, it is essentially impossible to quickly and easily migrate resources from one computer system to another, especially if different vendors are involved. Even if such a migration can be performed, it typically involves the intervention of a system administrator and requires at least some of the computer systems to be powered down and rebooted.
[0006] Alternatively, a computing system can be partitioned with hardware to make a subset of the resources on a computer available to a specific application. This approach avoids dedicating the resources permanently since the partitions can be changed, but still leaves issues concerning performance improvements by means of load balancing of resources among partitions and resource availability.

[0007] The availability and maintainability issues were addressed by a “shared everything” model in which a large centralized robust server that contains most of the resources is networked with and services many small, uncomplicated client network computers. Alternatively, “clusters” are used in which each system or “node” has its own memory and is controlled by its own operating system. The systems interact by sharing disks and passing messages among themselves via some type of communication network. A cluster system has the advantage that additional systems can easily be added to a cluster. However, networks and clusters suffer from a lack of shared memory and from limited interconnect bandwidth which places limitations on performance.

[0008] In many enterprise computing environments, it is clear that the two separate computing models must be simultaneously accommodated and each model optimized.

[0009] Further, it is highly desirable to be able to modify computer configurations “on the fly” without rebooting any of the systems. Several prior art approaches have been used to attempt this accommodation. For example, a design called a “virtual machine” or VM developed and marketed by International Business Machines Corporation, Armonk, N.Y., uses a single physical machine, with one or more physical processors, in combination with software which simulates multiple virtual machines. Each of those virtual machines has, in principle, access to all the physical resources of the underlying real computer. The assignment of resources to each virtual machine is controlled by a program called a “hypervisor”. There is only one hypervisor in the system and it is responsible for all the physical resources. Consequently, the hypervisor, not the other operating systems, deals with the allocation of physical hardware. The hypervisor intercepts requests for resources from the other operating systems and deals with the requests in a globally-correct way.
[0010] The VM architecture supports the concept of a “logical partition” or LPAR. Each LPAR contains some of the available physical CPUs and resources which are logically assigned to the partition. The same resources can be assigned to more than one partition. LPARs are set up by an administrator statically, but can respond to changes in load dynamically, and without rebooting, in several ways. For example, if two logical partitions, each containing ten CPUs, are shared on a physical system containing ten physical CPUs, and if the logical ten-CPU partitions have complementary peak loads, each partition can take over the entire physical ten-CPU system as the workload shifts without a re-boot or operator intervention.

[0011] In addition, the CPUs logically assigned to each partition can be turned “on” and “off” dynamically via normal operating system operator commands without re-boot. The only limitation is that the number of CPUs active at system initialization is the maximum number of CPUs that can be turned “on” in any partition.

[0012] Finally, in cases where the aggregate workload demand of all partitions is more than can be delivered by the physical system, LPAR “weights” can be used to define the portion of the total CPU resources which is given to each partition. These weights can be changed by system administrators, on-the-fly, with no disruption.
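By way of a hedged illustration (the structure and function names below are hypothetical, not IBM's actual interface), the proportional effect of such weights can be sketched as follows:

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical sketch of LPAR-style weighting: each partition's share
     * of the physical CPUs under full contention is its weight divided by
     * the sum of all weights. Names and numbers are illustrative only. */
    struct lpar {
        const char *name;
        unsigned    weight;  /* administrator-assigned; changeable on the fly */
    };

    int main(void) {
        struct lpar partitions[] = { { "LPAR1", 30 }, { "LPAR2", 10 } };
        const size_t n = sizeof partitions / sizeof partitions[0];
        const double physical_cpus = 10.0;
        unsigned total_weight = 0;

        for (size_t i = 0; i < n; i++)
            total_weight += partitions[i].weight;

        /* With weights 30 and 10 on a ten-CPU machine, LPAR1 is entitled
         * to 7.5 CPUs' worth of cycles and LPAR2 to 2.5 under contention;
         * when one partition is idle, the other may expand to the whole
         * machine, consistent with the behavior described above. */
        for (size_t i = 0; i < n; i++)
            printf("%s: %.1f CPUs\n", partitions[i].name,
                   physical_cpus * partitions[i].weight / total_weight);
        return 0;
    }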
[0013] Another prior art system is called a “Parallel Sysplex” and is also marketed and developed by the International Business Machines Corporation. This architecture consists of a set of computers that are clustered via a hardware entity called a “coupling facility” attached to each CPU. The coupling facilities on each node are connected via a fiber-optic link, and each node operates as a traditional SMP machine, with a maximum of 10 CPUs. Certain CPU instructions directly invoke the coupling facility. For example, a node registers a data structure with the coupling facility, then the coupling facility takes care of keeping the data structures coherent within the local memory of each node.

[0014] The Enterprise 10000 Unix server developed and marketed by Sun Microsystems, Mountain View, Calif., uses a partitioning arrangement called “Dynamic System Domains” to logically divide the resources of a single physical server into multiple partitions, or domains, each of which operates as a stand-alone server. Each of the partitions has CPUs, memory and I/O hardware. Dynamic reconfiguration allows a system administrator to create, resize, or delete domains “on the fly” and without rebooting. Every domain remains logically isolated from any other domain in the system, isolating it completely from any software error or CPU, memory, or I/O error generated by any other domain. There is no sharing of resources between any of the domains.

[0015] The Hive Project conducted at Stanford University uses an architecture which is structured as a set of cells. When the system boots, each cell is assigned a range of nodes, each having memory and I/O devices, that the cell owns throughout execution. Each cell manages the processors, memory and I/O devices on those nodes as if it were an independent operating system. The cells cooperate to present the illusion of a single system to user-level processes.

[0016] Hive cells are not responsible for deciding how to divide their resources between local and remote requests. Each cell is responsible only for maintaining its internal resources and for optimizing performance within the resources it has been allocated. Global resource allocation is carried out by a user-level process called “wax.” The Hive system attempts to prevent data corruption by using certain fault containment boundaries between the cells. In order to implement the tight sharing expected from a multiprocessor system, despite the fault containment boundaries between cells, resource sharing is implemented through the cooperation of the various cell kernels, but the policy is implemented outside the kernels in the wax process. Both memory and processors can be shared.

[0017] A system called “Cellular IRIX” developed and marketed by Silicon Graphics Inc., Mountain View, Calif., supports modular computing by extending traditional symmetric multiprocessing systems. The Cellular IRIX architecture distributes global kernel text and data into optimized SMP-sized chunks or “cells”. Cells represent a control domain consisting of one or more machine modules, where each module consists of processors, memory, and I/O. Applications running on these cells rely extensively on a full set of local operating system services, including local copies of operating system text and kernel data structures, but only one instance of the operating system exists on the entire system. Inter-cell coordination allows application images to directly and transparently utilize processing, memory and I/O resources from other cells without incurring the overhead of data copies or extra context switches.

[0018] Another existing architecture called NUMA-Q developed and marketed by Sequent Computer Systems, Inc., Beaverton, Oregon uses “quads”, or a group of four processors per portion of memory, as the basic building block for NUMA-Q SMP nodes. Adding I/O to each quad further improves performance. Therefore, the NUMA-Q architecture not only distributes physical memory but puts a predetermined number of processors and PCI slots next to each processor. The memory in each quad is not local memory in the traditional sense. Rather, it is a portion of the physical memory address space and has a specific address range. The address map is divided evenly over memory, with each quad containing a contiguous portion of address space. Only one copy of the operating system is running and, as in any SMP system, it resides in memory and runs processes without distinction and simultaneously on one or more processors.
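The even division of the address map can be made concrete with a small sketch; the quad count and sizes below are assumptions chosen for illustration, not Sequent's actual parameters:

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative only: if the physical address space is divided evenly,
     * each quad owns one contiguous slice, so the quad holding any given
     * physical address follows from a single division. */
    #define NUM_QUADS  4ull
    #define TOTAL_MEM  (4ull << 30)              /* assume 4 GB of physical space */
    #define QUAD_SIZE  (TOTAL_MEM / NUM_QUADS)   /* 1 GB contiguous per quad */

    static unsigned quad_of(uint64_t phys_addr) {
        return (unsigned)(phys_addr / QUAD_SIZE);
    }

    int main(void) {
        uint64_t addr = 0x64000000ull;  /* ~1.6 GB: falls in quad 1's slice */
        printf("address %#llx -> quad %u\n",
               (unsigned long long)addr, quad_of(addr));
        return 0;
    }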
[0019] Accordingly, while many attempts have been made at providing a flexible computer system having maximum resource availability and scalability, existing systems each have significant shortcomings. Therefore, it would be desirable to have a new computer system design which provides improved flexibility, resource availability and scalability. Specifically, it would be desirable to have a computer design which could accommodate each of the “shared nothing”, “shared partial” and “shared everything” computing models and could be reconfigured to switch between the models without major service disruptions as different needs arise.
SUMMARY OF THE INVENTION

[0020] In accordance with the principles of the present invention, multiple instances of operating systems execute cooperatively in a single multiprocessor computer wherein all processors and resources are electrically connected together. The single physical machine with multiple physical processors and resources is subdivided by software into multiple partitions, each with the ability to run a distinct copy, or instance, of an operating system. Each of the partitions has access to its own physical resources plus resources designated as shared. In accordance with one embodiment, the partitioning is performed by assigning resources using a configuration data structure such as a configuration tree.
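As a minimal sketch of what such a configuration data structure could look like (the node layout and field names here are this description's assumptions, not the patent's actual format; the tree itself is detailed with FIGS. 3 and 4):

    #include <stddef.h>

    /* Sketch of a configuration-tree node, assuming fields for the child
     * and sibling pointers of FIG. 3 and the ownership pointer of FIG. 4.
     * Names and types are illustrative, not the patent's actual layout. */
    enum node_type { NODE_ROOT, NODE_COMMUNITY, NODE_PARTITION,
                     NODE_CPU, NODE_MEMORY, NODE_IO };

    struct config_node {
        enum node_type      type;
        struct config_node *child;    /* first child (tree shape, FIG. 3) */
        struct config_node *sibling;  /* next sibling (tree shape, FIG. 3) */
        struct config_node *owner;    /* owning partition; NULL if unassigned */
        int                 shared;   /* nonzero if shared among partitions */
    };

    /* A resource is "private" when it is owned and not marked shared. */
    static int is_private(const struct config_node *n) {
        return n->owner != NULL && !n->shared;
    }

The child and sibling pointers give the tree its shape, while the owner pointer and shared flag capture the private-versus-shared designation described next.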
[0021] Since software logically partitions CPUs, memory, and I/O ports by assigning them to a partition, none, some, or all resources may be designated as shared among multiple partitions. Each individual operating instance will generally be assigned the resources it needs to execute independently and these resources will be designated as “private.” Other resources, particularly memory, can be assigned to more than one instance and shared. Shared memory is cache coherent so that instances may be tightly coupled, and may share resources that are normally allocated to a single instance such as distributed lock managers and cluster interconnects.

[0022] Newly-added resources, such as CPUs and memory, can be dynamically assigned to different partitions and used by instances of operating systems running within the machine by modifying the configuration.
BRIEF DESCRIPTION OF THE DRAWINGS

[0023] The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings, in which:

[0024] FIG. 1 is a schematic block diagram of a hardware platform illustrating several system building blocks.

[0025] FIG. 2 is a schematic diagram of a computer system constructed in accordance with the principles of the present invention illustrating several partitions.

[0026] FIG. 3 is a schematic diagram of a configuration tree illustrating child and sibling pointers.

[0027] FIG. 4 is a schematic diagram of the configuration tree shown in FIG. 3 and rearranged to illustrate ownership pointers.

[0028] FIG. 5 is a flowchart illustrating the steps in an illustrative routine for creating a computer system in accordance with the principles of the present invention.

[0029] FIG. 6 is a flowchart illustrating the steps in an illustrative routine followed by all nodes before joining or creating a computer system.

[0030] FIGS. 7A and 7B, when placed together, form a flowchart illustrating the steps in an illustrative routine followed by a node to create a computer system in accordance with the principles of the present invention.

[0031] FIGS. 8A and 8B, when placed together, form a flowchart illustrating the steps in an illustrative routine followed by a node to join a computer system which is already created.

[0032] FIG. 9 is a block schematic diagram illustrating an overview of the inventive system.

[0033] FIG. 10 is a block schematic diagram illustrating the inventive computing system operating as a shared nothing computing system.

[0034] FIG. 11 is a block schematic diagram illustrating the inventive computing system operating as a shared partial computing system.

[0035] FIG. 12 is a block schematic diagram illustrating the inventive computing system operating as a shared everything computing system.

[0036] FIG. 13 is a block schematic diagram illustrating migration of CPUs in the inventive computing system.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0037] A computer platform constructed in accordance with the principles of the present invention is a multi-processor system capable of being partitioned to allow the concurrent execution of multiple instances of operating system software. The system does not require hardware support for the partitioning of its memory, CPUs and I/O subsystems, but some hardware may be used to provide additional hardware assistance for isolating faults, and minimizing the cost of software engineering. The following specification describes the interfaces and data structures required to support the inventive software architecture. The interfaces and data structures described are not meant to imply a specific operating system must be used, or that only a single type of operating system will execute concurrently. Any operating system which implements the software requirements discussed below can participate in the inventive system operation.

[0038] System Building Blocks

[0039] The inventive software architecture operates on a hardware platform which incorporates multiple CPUs, memory and I/O hardware. Preferably, a modular architecture such as that shown in FIG. 1 is used, although those skilled in the art will understand that other architectures can also be used, which architectures need not be modular. FIG. 1 illustrates a computing system constructed of four basic system building blocks (SBBs) 100-106. In the illustrative embodiment, each building block, such as block 100, is identical and comprises several CPUs 108-114, several memory slots (illustrated collectively as memory 120), an I/O processor 118, and a port 116 which contains a switch (not shown) that can connect the system to another such system. However, in other embodiments, the building blocks need not be identical. Large multiprocessor systems can be constructed by connecting the desired number of system building blocks by means of their ports. Switch technology, rather than bus technology, is employed to connect building block components in order to both achieve the improved bandwidth and to allow for non-uniform memory architectures (NUMA).

[0040] In accordance with the principles of the invention, the hardware switches are arranged so that each CPU can address all available memory and I/O ports regardless of the number of building blocks configured, as schematically illustrated by line 122. In addition, all CPUs may communicate to any or all other CPUs in all SBBs with conventional mechanisms, such as inter-processor interrupts. Consequently, the CPUs and other hardware resources can be associated solely with software. Such a platform architecture is inherently scalable so that large amounts of processing power, memory and I/O will be available in a single computer.
[0041] An APMP computer system 200 constructed in accordance with the principles of the present invention from a software view is illustrated in FIG. 2. In this system, the hardware components have been allocated to allow concurrent execution of multiple operating system instances 208, 210, 212.

[0042] In a preferred embodiment, this allocation is performed by a software program called a “console” program, which, as will hereinafter be described in detail, is loaded into memory at power up. Console programs are shown schematically in FIG. 2 as programs 213, 215 and 217. The console program may be a modification of an existing administrative program or a separate program which interacts with an operating system to control the operation of the preferred embodiment. The console program does not virtualize the system resources, that is, it does not create any software layers between the running operating systems 208, 210 and 212 and the physical hardware, such as memory and I/O units (not shown in FIG. 2). Nor is the state of the running operating systems 208, 210 and 212 swapped to provide access to the same hardware. Instead, the inventive system logically divides the hardware into partitions. It is the responsibility of operating system instances 208, 210, and 212 to use the resources appropriately and provide coordination of resource allocation and sharing. The hardware platform may optionally provide hardware assistance for the division of resources, and may provide fault barriers to minimize the ability of an operating system to corrupt memory, or affect devices controlled by another operating system copy.

[0043] The execution environment for a single copy of an operating system, such as copy 208, is called a “partition” 202, and the executing operating system 208 in partition 202 is called “instance” 208. Each operating system instance is capable of booting and running independently of all other operating system instances in the computer system, and can cooperatively take part in sharing resources between operating system instances as described below.

[0044] In order to run an operating system instance, a partition must include a hardware restart parameter block (HWRPB), a copy of a console program, some amount of memory, one or more CPUs, and at least one I/O bus which must have a dedicated physical port for the console. The HWRPB is a configuration block which is passed between the console program and the operating system.
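A hedged sketch of checking that minimum configuration follows; the descriptor fields are hypothetical, and the real HWRPB layout is platform defined and not reproduced here:

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical partition descriptor capturing the minimum listed
     * above: an HWRPB, a console image, memory, at least one CPU, and at
     * least one I/O bus with a dedicated physical console port. Field
     * names are this sketch's assumptions, not the actual HWRPB layout. */
    struct partition_desc {
        const void *hwrpb;            /* hardware restart parameter block */
        const void *console_image;    /* copy of the console program */
        size_t      memory_bytes;
        unsigned    num_cpus;
        unsigned    num_io_buses;
        bool        has_console_port; /* dedicated physical console port */
    };

    static bool can_run_instance(const struct partition_desc *p) {
        return p->hwrpb && p->console_image && p->memory_bytes > 0 &&
               p->num_cpus >= 1 && p->num_io_buses >= 1 &&
               p->has_console_port;
    }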
[0045] Each of console programs 213, 215 and 217 is connected to a console port, shown as ports 214, 216 and 218, respectively. Console ports, such as ports 214, 216 and 218, generally come in the form of a serial line port, or attached graphics, keyboard and mouse options. For the purposes of the inventive computer system, the capability of supporting a dedicated graphics port and associated input devices is not required, although a specific operating system may require it. The base assumption is that a serial port is sufficient for each partition. While a separate terminal, or independent graphics console, could be used to display information generated by each console, preferably the serial lines 220, 222 and 224 can all be connected to a single multiplexer 226 attached to a workstation, PC, or LAT 228 for display of console information.

[0046] It is important to note that partitions are not synonymous with system building blocks. For example, partition 202 may comprise the hardware in building blocks 100 and 106 in FIG. 1, whereas partitions 204 and 206 might comprise the hardware in building blocks 102 and 104, respectively. Partitions may also include part of the hardware in a building block.

[0047] Partitions can be “initialized” or “uninitialized.” An initialized partition has sufficient resources to execute an operating system instance, has a console program image loaded, and a primary CPU available and executing. An initialized partition may be under control of a console program, or may be executing an operating system instance. In an initialized state, a partition has full ownership and control of hardware components assigned to it and only the partition itself may release its components.

[0048] In accordance with the principles of the invention, resources can be reassigned from one initialized partition to another. Reassignment of resources can only be performed by the initialized partition to which the resource is currently assigned. When a partition is in an uninitialized state, other partitions may reassign its hardware components and may delete it.

[0049] An uninitialized partition is a partition which has no primary CPU executing either under control of a console program or an operating system. For example, a partition may be uninitialized due to a lack of sufficient resources at power up to run a primary CPU, or when a system administrator is reconfiguring the computer system. When in an uninitialized state, a partition may reassign its hardware components and may be deleted by another partition. Unassigned resources may be assigned by any partition.
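The reassignment rules of paragraphs [0047]-[0049] can be summarized in a short sketch (types and names hypothetical):

    #include <stdbool.h>

    /* Sketch of the stated rules: an initialized partition's resources
     * may be released only by that partition itself; an uninitialized
     * partition's resources may be reassigned (and the partition deleted)
     * by others; unassigned resources may be claimed by any partition. */
    enum part_state { PART_UNINITIALIZED, PART_INITIALIZED };

    struct partition { enum part_state state; };
    struct resource  { struct partition *assigned_to; /* NULL if unassigned */ };

    static bool may_reassign(const struct resource *r,
                             const struct partition *requester) {
        if (r->assigned_to == NULL)
            return true;                          /* unassigned: anyone */
        if (r->assigned_to->state == PART_INITIALIZED)
            return r->assigned_to == requester;   /* only the owner itself */
        return true;    /* uninitialized owner: others may take it */
    }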
[0050] Partitions may be organized into “communities” which provide the basis for grouping separate execution contexts to allow cooperative resource sharing. Partitions in the same community can share resources. Partitions that are not within the same community cannot share resources. Resources may only be manually moved between partitions that are not in the same community by the system administrator, by de-assigning the resource (and stopping usage) and manually reconfiguring the resource. Communities can be used to create independent operating system domains, or to implement user policy for hardware usage. In FIG. 2, partitions 202 and 204 have been organized into community 230. Partition 206 may be in its own community 205. Communities can be constructed using the configuration tree described below and may be enforced by hardware.
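A minimal sketch of the community rule, assuming hypothetical types:

    #include <stdbool.h>

    /* Partitions may share a resource only when they belong to the same
     * community; otherwise the administrator must de-assign and manually
     * reconfigure the resource, as stated above. */
    struct community { int id; };
    struct partition { const struct community *community; };

    static bool may_share(const struct partition *a, const struct partition *b) {
        return a->community != NULL && a->community == b->community;
    }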
[0051] The Console Program

[0052] When a computer system constructed in accordance with the principles of the present invention is enabled on a platform, multiple HWRPBs must be created, multiple co
