US 20020052914A1

(12) Patent Application Publication    (10) Pub. No.: US 2002/0052914 A1
     ZALEWSKI et al.                   (43) Pub. Date: May 2, 2002
`
`(54) SOFTWARE PARTITIONED
`MULTI-PROCESSOR SYSTEM WITH
`FLEXIBLE RESOURCE SHARING LEVELS
`
`(76)
`
`Inventors: STEPHEN H. ZALEWSKI,
`NASHUA, NH (US); ANDREW H.
`MASON, HOTLIS, NH (US);
`GREGORY H. JORDAN, HOLLIS,
`NH (US); KAREN L. NOEL,
`PEMBROKE, NH (US)
`
`Correspondence Address:
`JONATHAN M. HARRIS
`CONLEY, ROSE & TAYON
`P.O. BOX 3267
`HOUSTON, TX 77253-3267 (US)
`
(*) Notice: This is a publication of a continued prosecution application (CPA) filed under 37 CFR 1.53(d).
`
(21) Appl. No.: 09/095,521

(22) Filed: Jun. 10, 1998
`
`Publication Classification
`
`(51)
`
`Int. Ch? oe GO06F 15/167; GO6F 15/16;
`GO6F 12/00; GO6F 13/00;
`GOGF 12/14; GOOF 12/16;
`GO6F 13/28
`(52) US. Cd. ices ees tesseesseessnes 709/203; 711/123
`
`(57)
`
`ABSTRACT
`
Multiple instances of operating systems execute cooperatively in a single multiprocessor computer wherein all processors and resources are electrically connected together. The single physical machine with multiple physical processors and resources is subdivided by software into multiple partitions, each running a distinct copy, or instance, of an operating system. Each of the partitions has access to its own physical resources plus resources designated as shared. The partitioning is performed by assigning all resources with a configuration tree. None, some, or all, resources may be designated as shared among multiple partitions. Each individual operating instance will generally be assigned the resources it needs to execute independently and these resources will be designated as “private.” Other resources, particularly memory, can be assigned to more than one instance and shared. Shared memory is cache coherent so that instances may be tightly coupled, and may share resources that are normally allocated to a single instance. This allows previously distributed user or operating system applications, which usually must pass messages via an external interconnect, to operate cooperatively in the shared memory without the need for either an external interconnect or message passing. Examples of applications that could take advantage of this capability include distributed lock managers and cluster interconnects. Newly-added resources, such as CPUs and memory, can be dynamically assigned to different partitions and used by instances of operating systems running within the machine by modifying the configuration.
`
`
`901
`
`L9120
`
`ay904\
`
`
`
`900
`
`Google Exhibit 1008
`Google Exhibit 1008
`Google v. Valtrus
`Googlev. Valtrus
`
`1s
`
`Of
`
`920
`yo 902
`
`INSTANCE C
`INSTANCE B
`INSTANCE A
`
`
`
`912
`
`CPUS
`
`1234
`
`CPUS
`
`
`789
`
`
`916
`
` 910
`
`132 a
`INSTANCEA PRIVATE MEMORY
`026
`
`Tenererenee}SHARED MEMORY
`
`|INSTANCECPRIVATEMEMORY|Cc|INSTANCECPRIVATEMEMORY|MEMORY 930
`a 924 TS
`
`
`
[Sheet 1 of 15 — FIG. 1: schematic block diagram of a hardware platform illustrating several system building blocks (CPUs 108-114, port 116, I/O processor 118, memory 120, blocks 100 and 104).]
`
[Sheet 2 of 15 — FIG. 2: schematic diagram of computer system 200 showing several partitions, each running an operating system instance, with consoles connected to a workstation.]
`
[Sheet 3 of 15 — FIG. 3: schematic diagram of a configuration tree illustrating child and sibling pointers.]
`
`
[Sheet 4 of 15 — FIG. 4: the configuration tree of FIG. 3 rearranged to illustrate ownership pointers.]
`
[Sheet 5 of 15 — FIG. 5: flowchart of an illustrative routine for creating a computer system; steps (500-512) include PROBE HARDWARE, FORM CONFIGURATION TREE, INITIALIZE EACH PARTITION AND START ITS CONSOLE, START MASTER CONSOLE, and BOOT SOME OS INSTANCES.]
`
[Sheet 6 of 15 — FIG. 6: flowchart of the routine followed by all nodes before joining or creating a system; steps (600-610) include STORE APMP DATABASE VA IN IP HANDLER CELL, MAP APMP DATABASE INITIAL SEGMENT, RESET INTERRUPT MASKS FOR CURRENT INSTANCE, and INITIALIZE HEARTBEAT WORD AND OTHER INSTANCE BLOCKS.]
`
[Sheet 7 of 15 — FIG. 7A: first part of the flowchart for creating a computer system; steps (700-708) include SET SYSTEM AND INSTANCE STATE TO INITIALIZING, PRIMARY INSTANCE CALLS SIZE ROUTINE, ALLOCATE SPACE FOR APMP DATABASE, and FILL OFFSETS FOR SERVICE SYSTEM SEGMENTS.]
`
[Sheet 8 of 15 — FIG. 7B: continuation of FIG. 7A; steps (710-720) include CALL INITIALIZATION ROUTINE FOR EACH SERVICE, INITIALIZE MEMBERSHIP MASK AND SET PARAMETERS, INSTANCE SETS ITSELF AS BIG BROTHER, INITIALIZE INSTANCE AND SYSTEM STATES, and RELEASE APMP DATABASE LOCK.]
`
[Sheet 9 of 15 — FIG. 8A: first part of the flowchart for joining an already-created computer system; steps (804-808) include CHECK FOR UNIQUE NAME, SET SYSTEM AND INSTANCE STATES TO INSTANCE JOINING, MAP PORTION OF APMP DATABASE INTO LOCAL MEMORY, and CALL SYSTEM JOIN ROUTINES.]
`
[Sheet 10 of 15 — FIG. 8B: continuation of FIG. 8A; steps (812-820) include ADD TO MEMBERSHIP MASK, SELECT BIG BROTHER, FILL IN INSTANCE STATE AND SET MEMBERSHIP FLAG, and RELEASE APMP DATABASE LOCK.]
`
[Sheet 11 of 15 — FIG. 9: overview of the inventive system: instances A, B and C with their CPUs, instance private memories, shared memory, and a network interconnect.]
`
[Sheet 12 of 15 — FIG. 10: the system 1000 operating as a shared-nothing computing system: each instance of the O/S caches separately, and instances A, B and C each have their own CPUs and private memory, communicating over a network interconnect.]
`
[Sheet 13 of 15 — FIG. 11: the system 1100 operating as a shared-partial computing system: instance private memories plus a shared memory global section (data only), with a network interconnect 1102.]
`
[Sheet 14 of 15 — FIG. 12: the system operating as a shared-everything computing system: instances joined by cluster, storage, and network interconnects.]
`
[Sheet 15 of 15 — FIG. 13: migration of CPUs in the inventive computing system 1300: instances with CPUs, data-only shared memory, and cluster and network interconnects.]
`
`
`SOFTWARE PARTITIONED MULTI-PROCESSOR
`SYSTEM WITH FLEXIBLE RESOURCE SHARING
`LEVELS
`
`FIELD OF THE INVENTION
`
[0001] This invention relates to multiprocessor computer architectures in which processors and other computer hardware resources are grouped in partitions, each of which has an operating system instance and, more specifically, to methods and apparatus for sharing resources in a variety of configurations between partitions.
`
[0002] BACKGROUND OF THE INVENTION
`
[0003] The efficient operation of many applications in present computing environments depends upon fast, powerful and flexible computing systems. The configuration and design of such systems has become very complicated when such systems are to be used in an “enterprise” commercial environment where there may be many separate departments, many different problem types and continually changing computing needs. Users in such environments generally want to be able to quickly and easily change the capacity of the system, its speed and its configuration. They may also want to expand the system work capacity and change configurations to achieve better utilization of resources without stopping execution of application programs on the system. In addition, they may want to be able to configure the system in order to maximize resource availability so that each application will have an optimum computing configuration.
`
[0004] Traditionally, computing speed has been addressed by using a “shared nothing” computing architecture where data, business logic, and graphic user interfaces are distinct tiers and have specific computing resources dedicated to each tier. Initially, a single central processing unit was used and the power and speed of such a computing system was increased by increasing the clock rate of the single central processing unit. More recently, computing systems have been developed which use several processors working as a team instead of one massive processor working alone. In this manner, a complex application can be distributed among many processors instead of waiting to be executed by a single processor. Such systems typically consist of several central processing units (CPUs) which are controlled by a single operating system. In a variant of a multiple processor system called “symmetric multiprocessing” or SMP, the applications are distributed equally across all processors. The processors also share memory. In another variant called “asymmetric multiprocessing” or AMP, one processor acts as a “master” and all of the other processors act as “slaves.” Therefore, all operations, including the operating system, must pass through the master before being passed on to the slave processors. These multiprocessing architectures have the advantage that performance can be increased by adding additional processors, but suffer from the disadvantage that the software running on such systems must be carefully written to take advantage of the multiple processors and it is difficult to scale the software as the number of processors increases. Current commercial workloads do not scale well beyond 8-24 CPUs as a single SMP system, the exact number depending upon platform, operating system and application mix.
`
[0005] For increased performance, another typical answer has been to dedicate computer resources (machines) to an application in order to optimally tune the machine resources to the application. However, this approach has not been adopted by the majority of users because most sites have many applications and separate databases developed by different vendors. Therefore, it is difficult, and expensive, to dedicate resources among all of the applications, especially in environments where the application mix is constantly changing. Further, with dedicated resources, it is essentially impossible to quickly and easily migrate resources from one computer system to another, especially if different vendors are involved. Even if such a migration can be performed, it typically involves the intervention of a system administrator and requires at least some of the computer systems to be powered down and rebooted.
`
[0006] Alternatively, a computing system can be partitioned with hardware to make a subset of the resources on a computer available to a specific application. This approach avoids dedicating the resources permanently, since the partitions can be changed, but still leaves issues concerning performance improvements by means of load balancing of resources among partitions and resource availability.
`
[0007] The availability and maintainability issues were addressed by a “shared everything” model in which a large centralized robust server that contains most of the resources is networked with and services many small, uncomplicated client network computers. Alternatively, “clusters” are used in which each system or “node” has its own memory and is controlled by its own operating system. The systems interact by sharing disks and passing messages among themselves via some type of communication network. A cluster system has the advantage that additional systems can easily be added to a cluster. However, networks and clusters suffer from a lack of shared memory and from limited interconnect bandwidth which places limitations on performance.
`
[0008] In many enterprise computing environments, it is clear that the two separate computing models must be simultaneously accommodated and each model optimized.
`
[0009] Further, it is highly desirable to be able to modify computer configurations “on the fly” without rebooting any of the systems. Several prior art approaches have been used to attempt this accommodation. For example, a design called a “virtual machine” or VM developed and marketed by International Business Machines Corporation, Armonk, N.Y., uses a single physical machine, with one or more physical processors, in combination with software which simulates multiple virtual machines. Each of those virtual machines has, in principle, access to all the physical resources of the underlying real computer. The assignment of resources to each virtual machine is controlled by a program called a “hypervisor.” There is only one hypervisor in the system and it is responsible for all the physical resources. Consequently, the hypervisor, not the other operating systems, deals with the allocation of physical hardware. The hypervisor intercepts requests for resources from the other operating systems and deals with the requests in a globally-correct way.
`
[0010] The VM architecture supports the concept of a “logical partition” or LPAR. Each LPAR contains some of the available physical CPUs and resources which are logically assigned to the partition. The same resources can be assigned to more than one partition. LPARs are set up by an administrator statically, but can respond to changes in load dynamically, and without rebooting, in several ways. For example, if two logical partitions, each containing ten CPUs, are shared on a physical system containing ten physical CPUs, and, if the logical ten CPU partitions have complementary peak loads, each partition can take over the entire physical ten CPU system as the workload shifts without a re-boot or operator intervention.
`
[0011] In addition, the CPUs logically assigned to each partition can be turned “on” and “off” dynamically via normal operating system operator commands without re-boot. The only limitation is that the number of CPUs active at system initialization is the maximum number of CPUs that can be turned “on” in any partition.
`
[0012] Finally, in cases where the aggregate workload demand of all partitions is more than can be delivered by the physical system, LPAR “weights” can be used to define the portion of the total CPU resources which is given to each partition. These weights can be changed by system administrators, on-the-fly, with no disruption.
`
[0013] Another prior art system is called a “Parallel Sysplex” and is also marketed and developed by the International Business Machines Corporation. This architecture consists of a set of computers that are clustered via a hardware entity called a “coupling facility” attached to each CPU. The coupling facilities on each node are connected via a fiber-optic link, and each node operates as a traditional SMP machine, with a maximum of 10 CPUs. Certain CPU instructions directly invoke the coupling facility. For example, a node registers a data structure with the coupling facility, then the coupling facility takes care of keeping the data structures coherent within the local memory of each node.
`
[0014] The Enterprise 10000 Unix server developed and marketed by Sun Microsystems, Mountain View, Calif., uses a partitioning arrangement called “Dynamic System Domains” to logically divide the resources of a single physical server into multiple partitions, or domains, each of which operates as a stand-alone server. Each of the partitions has CPUs, memory and I/O hardware. Dynamic reconfiguration allows a system administrator to create, resize, or delete domains “on the fly” and without rebooting. Every domain remains logically isolated from any other domain in the system, isolating it completely from any software error or CPU, memory, or I/O error generated by any other domain. There is no sharing of resources between any of the domains.
`
[0015] The Hive Project conducted at Stanford University uses an architecture which is structured as a set of cells. When the system boots, each cell is assigned a range of nodes, each having memory and I/O devices, that the cell owns throughout execution. Each cell manages the processors, memory and I/O devices on those nodes as if it were an independent operating system. The cells cooperate to present the illusion of a single system to user-level processes.
`
[0016] Hive cells are not responsible for deciding how to divide their resources between local and remote requests. Each cell is responsible only for maintaining its internal resources and for optimizing performance within the resources it has been allocated. Global resource allocation is carried out by a user-level process called “wax.” The Hive system attempts to prevent data corruption by using certain fault containment boundaries between the cells. In order to implement the tight sharing expected from a multiprocessor system, despite the fault containment boundaries between cells, resource sharing is implemented through the cooperation of the various cell kernels, but the policy is implemented outside the kernels in the wax process. Both memory and processors can be shared.
`
[0017] A system called “Cellular IRIX” developed and marketed by Silicon Graphics Inc., Mountain View, Calif., supports modular computing by extending traditional symmetric multiprocessing systems. The Cellular IRIX architecture distributes global kernel text and data into optimized SMP-sized chunks or “cells.” Cells represent a control domain consisting of one or more machine modules, where each module consists of processors, memory, and I/O. Applications running on these cells rely extensively on a full set of local operating system services, including local copies of operating system text and kernel data structures, but only one instance of the operating system exists on the entire system. Inter-cell coordination allows application images to directly and transparently utilize processing, memory and I/O resources from other cells without incurring the overhead of data copies or extra context switches.
`
[0018] Another existing architecture called NUMA-Q developed and marketed by Sequent Computer Systems, Inc., Beaverton, Oregon, uses “quads,” or a group of four processors per portion of memory, as the basic building block for NUMA-Q SMP nodes. Adding I/O to each quad further improves performance. Therefore, the NUMA-Q architecture not only distributes physical memory but puts a predetermined number of processors and PCI slots next to each processor. The memory in each quad is not local memory in the traditional sense. Rather, it is a portion of the physical memory address space and has a specific address range. The address map is divided evenly over memory, with each quad containing a contiguous portion of address space. Only one copy of the operating system is running and, as in any SMP system, it resides in memory and runs processes without distinction and simultaneously on one or more processors.
`
[0019] Accordingly, while many attempts have been made at providing a flexible computer system having maximum resource availability and scalability, existing systems each have significant shortcomings. Therefore, it would be desirable to have a new computer system design which provides improved flexibility, resource availability and scalability. Specifically, it would be desirable to have a computer design which could accommodate each of the “shared nothing,” “shared partial” and “shared everything” computing models and could be reconfigured to switch between the models without major service disruptions as different needs arise.
`
SUMMARY OF THE INVENTION
`
[0020] In accordance with the principles of the present invention, multiple instances of operating systems execute cooperatively in a single multiprocessor computer wherein all processors and resources are electrically connected together. The single physical machine with multiple physical processors and resources is subdivided by software into multiple partitions, each with the ability to run a distinct copy, or instance, of an operating system. Each of the partitions has access to its own physical resources plus resources designated as shared. In accordance with one embodiment, the partitioning is performed by assigning resources using a configuration data structure such as a configuration tree.
`
[0021] Since software logically partitions CPUs, memory, and I/O ports by assigning them to a partition, none, some, or all, resources may be designated as shared among multiple partitions. Each individual operating instance will generally be assigned the resources it needs to execute independently and these resources will be designated as “private.” Other resources, particularly memory, can be assigned to more than one instance and shared. Shared memory is cache coherent so that instances may be tightly coupled, and may share resources that are normally allocated to a single instance, such as distributed lock managers and cluster interconnects.
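
To make the private/shared designation concrete, the following is a minimal sketch in C of how a configuration-tree node might record ownership and sharing. The type, field, and function names are illustrative assumptions, not structures defined by this specification; only the child/sibling shape (compare FIG. 3) and the private-versus-shared distinction come from the text.

    #include <stdint.h>

    #define NO_OWNER (-1)

    typedef enum { RES_CPU, RES_MEMORY, RES_IOP } resource_type;

    typedef struct config_node {
        resource_type type;
        int owner;                   /* owning partition id, or NO_OWNER   */
        uint64_t shared_with;        /* bitmask of partitions sharing this */
        struct config_node *child;   /* first child in the tree            */
        struct config_node *sibling; /* next sibling at this level         */
    } config_node;

    /* A resource is "private" when owned and shared with no one else. */
    static int is_private(const config_node *n)
    {
        return n->owner != NO_OWNER && n->shared_with == 0;
    }

    /* May partition p use this resource? Yes if it owns or shares it. */
    static int may_use(const config_node *n, int p)
    {
        return n->owner == p || (n->shared_with & (UINT64_C(1) << p)) != 0;
    }

Under this assumed representation, designating memory as shared amounts to setting additional bits in shared_with, while CPUs and I/O ports would typically keep shared_with at zero.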
`
[0022] Newly-added resources, such as CPUs and memory, can be dynamically assigned to different partitions and used by instances of operating systems running within the machine by modifying the configuration.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
[0023] The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings, in which:

[0024] FIG. 1 is a schematic block diagram of a hardware platform illustrating several system building blocks.

[0025] FIG. 2 is a schematic diagram of a computer system constructed in accordance with the principles of the present invention illustrating several partitions.

[0026] FIG. 3 is a schematic diagram of a configuration tree illustrating child and sibling pointers.

[0027] FIG. 4 is a schematic diagram of the configuration tree shown in FIG. 3 and rearranged to illustrate ownership pointers.

[0028] FIG. 5 is a flowchart illustrating the steps in an illustrative routine for creating a computer system in accordance with the principles of the present invention.

[0029] FIG. 6 is a flowchart illustrating the steps in an illustrative routine followed by all nodes before joining or creating a computer system.

[0030] FIGS. 7A and 7B, when placed together, form a flowchart illustrating the steps in an illustrative routine followed by a node to create a computer system in accordance with the principles of the present invention.

[0031] FIGS. 8A and 8B, when placed together, form a flowchart illustrating the steps in an illustrative routine followed by a node to join a computer system which is already created.

[0032] FIG. 9 is a block schematic diagram illustrating an overview of the inventive system.

[0033] FIG. 10 is a block schematic diagram illustrating the inventive computing system operating as a shared nothing computing system.

[0034] FIG. 11 is a block schematic diagram illustrating the inventive computing system operating as a shared partial computing system.

[0035] FIG. 12 is a block schematic diagram illustrating the inventive computing system operating as a shared everything computing system.

[0036] FIG. 13 is a block schematic diagram illustrating migration of CPUs in the inventive computing system.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0037] A computer platform constructed in accordance with the principles of the present invention is a multiprocessor system capable of being partitioned to allow the concurrent execution of multiple instances of operating system software. The system does not require hardware support for the partitioning of its memory, CPUs and I/O subsystems, but some hardware may be used to provide additional hardware assistance for isolating faults, and minimizing the cost of software engineering. The following specification describes the interfaces and data structures required to support the inventive software architecture. The interfaces and data structures described are not meant to imply a specific operating system must be used, or that only a single type of operating system will execute concurrently. Any operating system which implements the software requirements discussed below can participate in the inventive system operation.

[0038] System Building Blocks

[0039] The inventive software architecture operates on a hardware platform which incorporates multiple CPUs, memory and I/O hardware. Preferably, a modular architecture such as that shown in FIG. 1 is used, although those skilled in the art will understand that other architectures can also be used, which architectures need not be modular. FIG. 1 illustrates a computing system constructed of four basic system building blocks (SBBs) 100-106. In the illustrative embodiment, each building block, such as block 100, is identical and comprises several CPUs 108-114, several memory slots (illustrated collectively as memory 120), an I/O processor 118, and a port 116 which contains a switch (not shown) that can connect the system to another such system. However, in other embodiments, the building blocks need not be identical. Large multiprocessor systems can be constructed by connecting the desired number of system building blocks by means of their ports. Switch technology, rather than bus technology, is employed to connect building block components in order both to achieve improved bandwidth and to allow for non-uniform memory architectures (NUMA).

[0040] In accordance with the principles of the invention, the hardware switches are arranged so that each CPU can address all available memory and I/O ports regardless of the number of building blocks configured, as schematically illustrated by line 122. In addition, all CPUs may communicate to any or all other CPUs in all SBBs with conventional mechanisms, such as inter-processor interrupts. Consequently, the CPUs and other hardware resources can be associated solely with software. Such a platform architecture is inherently scalable so that large amounts of processing power, memory and I/O will be available in a single computer.
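
As a rough, purely illustrative model of the building-block platform just described, one SBB might be represented in C as below. This is a sketch under assumptions: the four-CPU count follows the illustrative embodiment of FIG. 1, the slot count and all field names are invented for the example, and no such structure is defined by the specification.

    #include <stddef.h>

    #define CPUS_PER_SBB 4
    #define MEM_SLOTS    8   /* assumed; the figure shows only "several" slots */

    struct sbb {
        int        cpu_ids[CPUS_PER_SBB];     /* e.g. CPUs 108-114 in block 100 */
        size_t     mem_slot_bytes[MEM_SLOTS]; /* memory 120, per slot           */
        int        io_processor_id;           /* I/O processor 118              */
        struct sbb *port_peer;                /* SBB reached through the switch
                                                 in port 116, if any            */
    };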
`
`
`
`
[0041] An APMP computer system 200 constructed in accordance with the principles of the present invention from a software view is illustrated in FIG. 2. In this system, the hardware components have been allocated to allow concurrent execution of multiple operating system instances 208, 210, 212.
`
[0042] In a preferred embodiment, this allocation is performed by a software program called a “console” program, which, as will hereinafter be described in detail, is loaded into memory at power up. Console programs are shown schematically in FIG. 2 as programs 213, 215 and 217. The console program may be a modification of an existing administrative program or a separate program which interacts with an operating system to control the operation of the preferred embodiment. The console program does not virtualize the system resources, that is, it does not create any software layers between the running operating systems 208, 210 and 212 and the physical hardware, such as memory and I/O units (not shown in FIG. 2). Nor is the state of the running operating systems 208, 210 and 212 swapped to provide access to the same hardware. Instead, the inventive system logically divides the hardware into partitions. It is the responsibility of operating system instances 208, 210, and 212 to use the resources appropriately and provide coordination of resource allocation and sharing. The hardware platform may optionally provide hardware assistance for the division of resources, and may provide fault barriers to minimize the ability of an operating system to corrupt memory, or affect devices controlled by another operating system copy.
`
[0043] The execution environment for a single copy of an operating system, such as copy 208, is called a “partition” 202, and the executing operating system 208 in partition 202 is called “instance” 208. Each operating system instance is capable of booting and running independently of all other operating system instances in the computer system, and can cooperatively take part in sharing resources between operating system instances as described below.
`
[0044] In order to run an operating system instance, a partition must include a hardware restart parameter block (HWRPB), a copy of a console program, some amount of memory, one or more CPUs, and at least one I/O bus which must have a dedicated physical port for the console. The HWRPB is a configuration block which is passed between the console program and the operating system.
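
A minimal sketch, in C with assumed names, of the per-partition prerequisites just listed. The HWRPB and console image appear only as opaque pointers, since their actual layouts are not given here, and the check itself is an illustration rather than an interface from the specification.

    #include <stdbool.h>
    #include <stddef.h>

    struct partition {
        void  *hwrpb;            /* hardware restart parameter block   */
        void  *console_image;    /* loaded copy of the console program */
        size_t memory_bytes;
        int    cpu_count;
        int    io_bus_count;
        bool   has_console_port; /* dedicated physical console port    */
    };

    /* True only when the partition meets the minimum stated in [0044]. */
    static bool can_run_instance(const struct partition *p)
    {
        return p->hwrpb != NULL
            && p->console_image != NULL
            && p->memory_bytes > 0
            && p->cpu_count >= 1
            && p->io_bus_count >= 1
            && p->has_console_port;
    }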
`
[0045] Each of console programs 213, 215 and 217 is connected to a console port, shown as ports 214, 216 and 218, respectively. Console ports, such as ports 214, 216 and 218, generally come in the form of a serial line port, or attached graphics, keyboard and mouse options. For the purposes of the inventive computer system, the capability of supporting a dedicated graphics port and associated input devices is not required, although a specific operating system may require it. The base assumption is that a serial port is sufficient for each partition. While a separate terminal, or independent graphics console, could be used to display information generated by each console, preferably the serial lines 220, 222 and 224 can all be connected to a single multiplexer 226 attached to a workstation, PC, or LAT 228 for display of console information.
`
[0046] It is important to note that partitions are not synonymous with system building blocks. For example, partition 202 may comprise the hardware in building blocks 100 and 106 in FIG. 1, whereas partitions 204 and 206 might comprise the hardware in building blocks 102 and 104, respectively. Partitions may also include part of the hardware in a building block.
`
[0047] Partitions can be “initialized” or “uninitialized.” An initialized partition has sufficient resources to execute an operating system instance, has a console program image loaded, and has a primary CPU available and executing. An initialized partition may be under control of a console program, or may be executing an operating system instance. In an initialized state, a partition has full ownership and control of hardware components assigned to it, and only the partition itself may release its components.
`
[0048] In accordance with the principles of the invention, resources can be reassigned from one initialized partition to another. Reassignment of resources can only be performed by the initialized partition to which the resource is currently assigned. When a partition is in an uninitialized state, other partitions may reassign its hardware components and may delete it.
`
[0049] An uninitialized partition is a partition which has no primary CPU executing either under control of a console program or an operating system. For example, a partition may be uninitialized due to a lack of sufficient resources at power up to run a primary CPU, or when a system administrator is reconfiguring the computer system. When in an uninitialized state, a partition may have its hardware components reassigned, and may be deleted, by another partition. Unassigned resources may be assigned by any partition.
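
The reassignment rules of paragraphs [0048] and [0049] can be condensed into a short predicate. This is a hedged sketch with assumed types and names, not an interface from the specification.

    #include <stdbool.h>

    enum partition_state { UNINITIALIZED, INITIALIZED };

    struct resource {
        int assigned_to; /* partition id, or -1 if unassigned */
    };

    /* May requesting_partition reassign resource r, given each
     * partition's current state? */
    static bool may_reassign(const struct resource *r,
                             const enum partition_state states[],
                             int requesting_partition)
    {
        if (r->assigned_to < 0)
            return true; /* unassigned: any partition may claim it  */
        if (states[r->assigned_to] == INITIALIZED)
            /* only the owning partition itself may release it      */
            return r->assigned_to == requesting_partition;
        return true;     /* owner uninitialized: others may reassign */
    }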
`
[0050] Partitions may be organized into “communities” which provide the basis for grouping separate execution contexts to allow cooperative resource sharing. Partitions in the same community can share resources. Partitions that are not within the same community cannot share resources. Resources may only be manually moved between partitions that are not in the same community by the system administrator, by de-assigning the resource (and stopping usage) and manually reconfiguring the resource. Communities can be used to create independent operating system domains, or to implement user policy for hardware usage. In FIG. 2, partitions 202 and 204 have been organized into community 230. Partition 206 may be in its own community 205. Communities can be constructed using the configuration tree described below and may be enforced by hardware.
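
A one-function sketch of the sharing rule just stated, under the assumed representation that each partition maps to a community identifier (the patent instead derives communities from the configuration tree):

    #include <stdbool.h>

    /* community_of[p] is the community id assigned to partition p. */
    static bool may_share(const int community_of[],
                          int partition_a, int partition_b)
    {
        return community_of[partition_a] == community_of[partition_b];
    }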
`
`[0051] The Console Program
`
[0052] When a computer system constructed in accordance with the principles of the present invention is enabled on a platform, multiple HWRPB's must be created, multiple co