US 20020156612A1

(12) Patent Application Publication          (10) Pub. No.: US 2002/0156612 A1
     Schulter et al.                         (43) Pub. Date: Oct. 24, 2002

(54) ADDRESS RESOLUTION PROTOCOL SYSTEM AND METHOD IN A VIRTUAL NETWORK

(76) Inventors: Peter Schulter, Hampstead, NH (US); Scott Geng, Westboro, MA (US); Pete Manca, Sterling, MA (US); Paul Curtis, Sudbury, MA (US); Ewan Milne, Stow, MA (US); Max Smith, Natick, MA (US); Alan Greenspan, Northboro, MA (US); Edward Duffy, Arlington, MA (US)

Correspondence Address:
Peter M. Dichiara
Hale and Dorr LLP
60 State Street
Boston, MA 02109 (US)

(21) Appl. No.: 10/038,354

(22) Filed: Jan. 4, 2002

Related U.S. Application Data

(60) Provisional application No. 60/285,296, filed on Apr. 20, 2001.

Publication Classification

(51) Int. Cl.7 .............................. G06F 9/455
(52) U.S. Cl. ..............................................

(57) ABSTRACT

A virtual networking system and method are disclosed. Switched Ethernet local area network semantics are provided over an underlying point-to-point mesh. Computer processor nodes may directly communicate via virtual interfaces over a switch fabric or they may communicate via an Ethernet switch emulation. Address resolution protocol logic helps associate IP addresses with virtual interfaces while allowing computer processors to reply to ARP requests with MAC addresses.
[Front-page drawing, FIG. 1: platform 100 with processing nodes 105a, 105b, ..., 105m, 105n (processors 106, NICs 107, memory), interconnects 110a/110b, switch fabrics 115a/115b, control nodes (e.g., 120b) with management logic and local storage, and SAN 130.]
`
[Sheet 1 of 14, FIG. 1: system diagram of one embodiment; same drawing as the front page (processing nodes, switch fabrics, control nodes, SAN, management).]
`
[Sheet 2 of 14, FIG. 2A: exemplary network arrangement; processing nodes grouped into two subnets joined by switches 206 and 208.]
`
[Sheet 3 of 14, FIG. 2B: software communication paths; processor logic 210 on each processing node connected by virtual interfaces to virtual switch logic 214.]
`
[Sheet 4 of 14, FIG. 2C: physical connections; instances of processor logic 210 and virtual switch logic 214 attached to the switch fabric over point-to-point links.]
`
[Sheet 5 of 14, FIG. 3A: processor-side networking software; applications and operating system over an IP stack, virtual network driver 310, ARP logic 350, and Giganet drivers 320a/320b connecting to switch fabric 115a.]
`
`
[Sheet 6 of 14, FIG. 3B: control node-side networking software; Giganet drivers 325a/325b, RCLAN layer 330, virtual LAN server 335, virtual LAN proxy 340, physical LAN drivers, ARP server logic 355, and virtual cluster proxy 360.]
`
[Sheet 7 of 14, FIG. 4A: transmit flowchart; dequeue outgoing datagram (405), and if a MAC address is available send the datagram using that MAC (420).]
`
[Sheet 8 of 14, FIG. 4B: ARP request flowchart; driver creates an ARP request packet (425); driver prepends a TLV and sends it to the control node for broadcast (430); control node server logic receives the ARP request and updates the source info in the TLV header (435); the control node broadcasts the ARP request to members (440); on receipt of the ARP request, driver logic filters the ARP packet on local IP (450), creates a local MAC from the packet TLV (460), updates its ARP table and creates an ARP reply (465), and unicasts the ARP reply (470); the requester receives the ARP reply (445).]
`
`
[Sheet 9 of 14, FIG. 4C: ARP reply flowchart; the control node receives an ARP reply from an internal node, updates the source node info in the TLV of the reply (473), and unicasts the packet to the appropriate destination node (475); the ARP replier (or load balancer) receives the ARP reply (480), updates its ARP table (485), dequeues the held datagram from the ARP queue (487), selects an RVI for unicast (493), and prepends a header TLV and unicasts the datagram directly on the RVI (495).]
`
[Sheet 10 of 14, FIG. 5: service clusters; Service A and Service B (505) within a cluster.]
`
`
[Sheet 11 of 14, FIG. 6: storage software architecture; storage configuration logic 605, management interface 610, processor-side storage logic (one instance per processor), control node-side storage logic 645, and a storage data structure.]
`
`
[Sheet 12 of 14, FIG. 7: processor-side storage logic.]
`
`
[Sheet 13 of 14, FIG. 8: storage address mapping logic; host/channel identifiers mapped through a 4-dimensional matrix.]
`
[Sheet 14 of 14, FIG. 9: cluster management logic 905 with free resources data structure 910, storage data structure 915, and networking data structure 920.]
`
`
ADDRESS RESOLUTION PROTOCOL SYSTEM AND METHOD IN A VIRTUAL NETWORK
`
`BACKGROUND
`
`[0001]
`
`1. Field of the Invention
`
[0002] The present invention relates to computing systems for enterprises and application service providers and, more specifically, to processing systems having virtualized communication networks.
`
`[0003]
`
`2. Discussion of Related Art
`
[0004] In current enterprise computing and application service provider environments, personnel from multiple information technology (IT) functions (electrical, networking, etc.) must participate to deploy processing and networking resources. Consequently, because of scheduling and other difficulties in coordinating activities from multiple departments, it can take weeks or months to deploy a new computer server. This lengthy, manual process increases both human and equipment costs, and delays the launch of applications.
`
[0005] Moreover, because it is difficult to anticipate how much processing power applications will require, managers typically over-provision the amount of computational power. As a result, data-center computing resources often go unutilized or under-utilized.
`
[0006] If more processing power is eventually needed than originally provisioned, the various IT functions will again need to coordinate activities to deploy more or improved servers, connect them to the communication and storage networks and so forth. This task gets increasingly difficult as the systems become larger.
`
[0007] Deployment is also problematic. For example, when deploying 24 conventional servers, more than 100 discrete connections may be required to configure the overall system. Managing these cables is an ongoing challenge, and each represents a failure point. Attempting to mitigate the risk of failure by adding redundancy can double the cabling, exacerbating the problem while increasing complexity and costs.
`
[0008] Provisioning for high availability with today's technology is a difficult and costly proposition. Generally, a failover server must be deployed for every primary server. In addition, complex management software and professional services are usually required.
`
[0009] Generally, it is not possible to adjust the processing power or upgrade the CPUs on a legacy server. Instead, scaling processor capacity and/or migrating to a vendor's next-generation architecture often requires a "forklift upgrade," meaning more hardware/software systems are added, needing new connections and the like.
`
`[0010] Consequently, there is a need for a system and
`method of providing a platform for enterprise and ASP
`computing that addresses the above shortcomings.
`
`SUMMARY
`
[0012] According to one aspect of the invention, a method and system of implementing an address resolution protocol (ARP) are provided. A computing platform has a plurality of processors connected by an underlying physical network. Logic, executable on one of the processors, defines a topology of an Ethernet network to be emulated on the computing platform. The topology includes processor nodes and a switch node. Logic, executable on one of the processors, assigns a set of processors from the plurality to act as the processor nodes. Logic, executable on one of the processors, assigns virtual MAC addresses to each processor node of the emulated Ethernet network. Logic, executable on one of the processors, allocates virtual interfaces over the underlying physical network to provide direct software communication from each processor node to each other processor node. Each virtual interface has a corresponding identification. Each processor node has ARP request logic to communicate an ARP request to the switch node, in which the ARP request includes an IP address. The switch node includes ARP request broadcast logic to communicate the ARP request to all other processor nodes in the emulated Ethernet network. Each processor node has ARP reply logic to determine whether it is the processor node associated with the IP address in an ARP request and, if so, to issue to the switch node an ARP reply, wherein the ARP reply contains the virtual MAC address of the processor node associated with the IP address. The switch node includes ARP reply logic to receive the ARP reply and to modify the ARP reply to include a virtual interface identification for the ARP requesting node.
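
For illustration only, the exchange recited above can be sketched in a few lines of Python. The class and method names (SwitchNode, ProcessorNode, arp_reply) are assumptions made for this sketch and do not appear in the specification:

    # Minimal sketch of the emulated ARP exchange: a processor node asks
    # the switch node to resolve an IP address; the owning node replies
    # with its virtual MAC, and the switch node annotates the reply with
    # a virtual interface identification for the requesting node.

    class ProcessorNode:
        def __init__(self, name, ip, virtual_mac):
            self.name, self.ip, self.virtual_mac = name, ip, virtual_mac

        def arp_reply(self, ip):
            # Reply only if this node is associated with the requested IP.
            return (self.virtual_mac, self) if ip == self.ip else None

    class SwitchNode:
        def __init__(self):
            self.nodes = []

        def arp_request(self, requester, ip):
            # Broadcast the request to all other processor nodes.
            for node in self.nodes:
                if node is not requester:
                    reply = node.arp_reply(ip)
                    if reply:
                        mac, owner = reply
                        # Modify the reply to carry a virtual interface
                        # identification (assumed naming) for the requester.
                        return mac, f"VI({requester.name}-{owner.name})"
            return None

    switch = SwitchNode()
    pn1 = ProcessorNode("PN1", "10.0.0.1", "02:00:00:00:00:01")
    pn2 = ProcessorNode("PN2", "10.0.0.2", "02:00:00:00:00:02")
    switch.nodes += [pn1, pn2]
    print(switch.arp_request(pn1, "10.0.0.2"))
    # -> ('02:00:00:00:00:02', 'VI(PN1-PN2)')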
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`[0013]
`
`In the Drawing,
`
`a system diagram illustrating one
`is
`1
`[0014] FIG.
`embodiment of the invention;
`
`FIGS. 2A-C are diagramsillustrating the commu-
`(0015]
`nication links established according to one embodimentof
`the invention;
`
`[0016] TIGS. 3A-B are diagramsillustrating the network-
`ing software architecture of certain embodiments of the
`invention;
`
`FIGS. 4A-C are flowcharts illustrating driver logic
`[0017]
`according to certain embodiments of the invention;
`
`[0018] FIG. 5 illustrates service clusters according to
`certain embodiments of the invention;
`
`FIG.6 illustrates the storage software architecture
`[0019]
`of certain embodiments of the invention;
`
`FIG.7 illustrates the processor-side storage logic
`[0020]
`of certain embodiments of the invention;
`
`{0021] FIG. 8 illustrates the storage address mapping
`logic of certain embodiments of the invention; and
`
`FIG.9 illustrates the cluster managementlogic of
`[0022]
`certain embodiments of the invention.
`
`DETAILED DESCRIPTION
`
[0011] The present invention features a platform and method for computer processing in which virtual processing area networks may be configured and deployed.

[0023] Preferred embodiments of the invention provide a processing platform from which virtual systems may be deployed through configuration commands. The platform provides a large pool of processors from which a subset may be selected and configured through software commands to form a virtualized network of computers ("processing area network" or "processor cluster") that may be deployed to serve a given set of applications or customer. The virtualized processing area network (PAN) may then be used to execute customer-specific applications, such as web-based server applications. The virtualization may include virtualization of local area networks (LANs) or the virtualization of I/O storage. By providing such a platform, processing resources may be deployed rapidly and easily through software via configuration commands, e.g., from an administrator, rather than through physically providing servers, cabling network and storage connections, providing power to each server and so forth.
`
`
`
[0024] Overview of the Platform and its Behavior

[0025] As shown in FIG. 1, a preferred hardware platform 100 includes a set of processing nodes 105a-n connected to switch fabrics 115a,b via high-speed interconnects 110a,b. The switch fabric 115a,b is also connected to at least one control node 120a,b that is in communication with an external IP network 125 (or other data communication network), and with a storage area network (SAN) 130. A management application 135, for example, executing remotely, may access one or more of the control nodes via the IP network 125 to assist in configuring the platform 100 and deploying virtualized PANs.

[0026] Under certain embodiments, about 24 processing nodes 105a-n, two control nodes 120, and two switch fabrics 115a,b are contained in a single chassis and interconnected with a fixed, pre-wired mesh of point-to-point (PtP) links. Each processing node 105 is a board that includes one or more (e.g., 4) processors 106j-l, one or more network interface cards (NICs) 107, and local memory (e.g., greater than 4 Gbytes) that, among other things, includes some BIOS firmware for booting and initialization. There is no local disk for the processors 106; instead, all storage, including storage needed for paging, is handled by SAN storage devices 130.

[0027] Each control node 120 is a single board that includes one or more (e.g., 4) processors, local memory, and local disk storage for holding independent copies of the boot image and initial file system that is used to boot operating system software for the processing nodes 105 and for the control nodes 106. Each control node communicates with SAN 130 via 100 megabyte/second fibre channel adapter cards 128 connected to fibre channel links 122, 124 and communicates with the Internet (or any other external network) 125 via an external network interface 129 having one or more Gigabit Ethernet NICs connected to Gigabit Ethernet links 121, 123. (Many other techniques and hardware may be used for SAN and external network connectivity.) Each control node includes a low speed Ethernet port (not shown) as a dedicated management port, which may be used instead of remote, web-based management via management application 135.

[0028] The switch fabric is composed of one or more 30-port Giganet switches 115, such as the NIC-CLAN 1000 and clan 5300 switch, and the various processing and control nodes use corresponding NICs for communication with such a fabric module. Giganet switch fabrics have the semantics of a Non-Broadcast Multiple Access (NBMA) network. All inter-node communication is via a switch fabric. Each link is formed as a serial connection between a NIC 107 and a port in the switch fabric 115. Each link operates at 112 megabytes/second.

[0029] In some embodiments, multiple cabinets or chassises may be connected together to form larger platforms. And in other embodiments the configuration may differ; for example, redundant connections, switches and control nodes may be eliminated.

[0030] Under software control, the platform supports multiple, simultaneous and independent processing area networks (PANs). Each PAN, through software commands, is configured to have a corresponding subset of processors 106 that may communicate via a virtual local area network that is emulated over the PtP mesh. Each PAN is also configured to have a corresponding virtual I/O subsystem. No physical deployment or cabling is needed to establish a PAN. Under certain preferred embodiments, software logic executing on the processor nodes and/or the control nodes emulates switched Ethernet semantics; other software logic executing on the processor nodes and/or the control nodes provides virtual storage subsystem functionality that follows SCSI semantics and that provides independent I/O address spaces for each PAN.

Network Architecture

Certain preferred embodiments allow an administrator to build virtual, emulated LANs using virtual components, interfaces, and connections. Each of the virtual LANs can be internal and private to the platform 100, or multiple processors may be formed into a processor cluster externally visible as a single IP address.

[0031] Under certain embodiments, the virtual networks so created emulate a switched Ethernet network, though the physical, underlying network is a PtP mesh. The virtual network utilizes IEEE MAC addresses, and the processing nodes support IETF ARP processing to identify and associate IP addresses with MAC addresses. Consequently, a given processor node replies to an ARP request consistently whether the ARP request came from a node internal or external to the platform.

[0032] FIG. 2A shows an exemplary network arrangement that may be modeled or emulated. A first subnet 202 is formed by processing nodes PN1, PN2 and PN3 that may communicate with one another via switch 206. A second subnet 204 is formed by processing nodes PN4 and PN5 that may communicate with one another via switch 208. Under switched Ethernet semantics, one node on a subnet may communicate directly with another node on the subnet; for example, PN1 may send a message to PN2. The semantics also allow one node to communicate with a set of the other nodes; for example, PN1 may send a broadcast message to other nodes. The processing nodes PN1 and PN2 cannot directly communicate with PN4, because PN4 is on a different subnet. For PN1 and PN2 to communicate with PN4, higher layer networking software would need to be utilized, which software would have a fuller understanding of both subnets. Though not shown in the figure, a given switch may communicate via an "uplink" to another switch or the like. As will be appreciated given the description below, the need for such uplinks is different than their need when the switches are physical. Specifically, since the switches are virtual and modeled in software they may scale horizontally as wide as needed. (In contrast, physical switches have a fixed number of physical ports and sometimes the uplinks are needed to provide horizontal scalability.)
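
As an illustration of the switched Ethernet semantics just described, the following small Python sketch models which communications the emulated network permits. The node and subnet names are taken from FIG. 2A; the functions themselves are assumptions made for illustration:

    # Nodes on the same emulated subnet may exchange unicast or broadcast
    # traffic; nodes on different subnets have no direct path.
    SUBNETS = {202: {"PN1", "PN2", "PN3"}, 204: {"PN4", "PN5"}}

    def same_subnet(a, b):
        return any(a in members and b in members for members in SUBNETS.values())

    def unicast(src, dst):
        if not same_subnet(src, dst):
            raise PermissionError(f"{src} cannot reach {dst}: different subnet")
        return f"{src} -> {dst}"

    def broadcast(src):
        members = next(m for m in SUBNETS.values() if src in m)
        return [f"{src} -> {dst}" for dst in members - {src}]

    print(unicast("PN1", "PN2"))   # allowed within subnet 202
    print(broadcast("PN1"))        # reaches PN2 and PN3 only
    # unicast("PN1", "PN4") raises: PN4 is on subnet 204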
`
`
`
`
[0033] FIG. 2B shows exemplary software communication paths and logic used under certain embodiments to model the subnets 202 and 204 of FIG. 2A. The communication paths 212 connect processing nodes PN1, PN2, PN4, and PN5, specifically their corresponding processor-side network communication logic 210, and they also connect processing nodes to control nodes. (Though drawn as a single instance of logic for the purpose of clarity, PN1 may have multiple instances of the corresponding processor logic, one per subnet, for example.) Under preferred embodiments, management logic and the control node logic are responsible for establishing, managing and destroying the communication paths. The individual processing nodes are not permitted to establish such paths.
`
[0034] As will be explained in detail below, the processor logic and the control node logic together emulate switched Ethernet semantics over such communication paths. For example, the control nodes have control node-side virtual switch logic 214 to emulate some (but not necessarily all) of the semantics of an Ethernet switch, and the processor logic includes logic to emulate some (but not necessarily all) of the semantics of an Ethernet driver.
`
[0035] Within a subnet, one processor node may communicate directly with another via a corresponding virtual interface 212. Likewise, a processor node may communicate with the control node logic via a separate virtual interface. Under certain embodiments, the underlying switch fabric and associated logic (e.g., switch fabric manager logic, not shown) provides the ability to establish and manage such virtual interfaces (VIs) over the point-to-point mesh. Moreover, these virtual interfaces may be established in a reliable, redundant fashion and are referred to herein as RVIs. At points in this description, the terms virtual interface (VI) and reliable virtual interface (RVI) are used interchangeably, as the choice between a VI versus an RVI largely depends on the amount of reliability desired by the system at the expense of system resources.
`
[0036] Referring conjointly to FIGS. 2A-B, if node PN1 is to communicate with node PN2, it does so ordinarily by virtual interface 212(1-2). However, preferred embodiments allow communication between PN1 and PN2 to occur via switch emulation logic, if for example VI 212(1-2) is not operating satisfactorily. In this case a message may be sent via VI 212(1-switch 206) and via VI 212(switch 206-2). If PN1 is to broadcast or multicast a message to other nodes in the subnet 202, it does so by sending the message to control node-side logic 214 via virtual interface 212(1-switch 206). Control node-side logic 214 then emulates the broadcast or multicast functionality by cloning and sending the message to the other relevant nodes using the relevant VIs. The same or analogous VIs may be used to convey other messages requiring control node-side logic. For example, as will be described below, control node-side logic includes logic to support the address resolution protocol (ARP), and VIs are used to communicate ARP replies and requests to the control node. Though the above description suggests just one VI between processor logic and control logic, many embodiments employ several such connections. Moreover, though the figures suggest symmetry in the software communication paths, the architecture actually allows asymmetric communication. For example, as will be discussed below, for communication with clustered services the packets would be routed via the control node. However, return communication may be direct between nodes.
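
A minimal Python sketch of this send-path selection follows. The health flag, dictionary shapes, and VI names are assumptions for illustration; the paragraph above states only that a satisfactorily operating direct VI is preferred, that the switch emulation is the fallback, and that broadcast and multicast traffic goes through the control node:

    direct_vi = {("PN1", "PN2"): {"healthy": True}}
    to_switch = {"PN1": "212(1-switch 206)", "PN2": "212(2-switch 206)"}

    def send_unicast(src, dst, packet):
        vi = direct_vi.get((src, dst)) or direct_vi.get((dst, src))
        if vi and vi["healthy"]:
            return f"direct VI {src}-{dst}: {packet}"
        # Fall back to the control node-side switch emulation logic.
        return f"{to_switch[src]} then switch to {dst}: {packet}"

    def send_broadcast(src, packet, members):
        # The control node clones the packet for every other member.
        return [f"{to_switch[src]} cloned to {m}: {packet}"
                for m in members if m != src]

    print(send_unicast("PN1", "PN2", "hello"))     # uses the direct VI
    direct_vi[("PN1", "PN2")]["healthy"] = False
    print(send_unicast("PN1", "PN2", "hello"))     # relayed via the switch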
[0037] Notice that, like the network of FIG. 2A, there is no mechanism for communication between node PN4 and PN1. Moreover, by having communication paths managed and created centrally (instead of via the processing nodes) such a path is not creatable by the processing nodes, and the defined subnet connectivity cannot be violated by a processor.
`
[0038] FIG. 2C shows the exemplary physical connections of certain embodiments to realize the subnets of FIGS. 2A and B. Specifically, each instance of processing network logic 210 communicates with the switch fabric 115 via PtP links 216 of interconnect 110. Likewise, the control node has multiple instances of switch logic 214 and each communicates over a PtP connection 216 to the switch fabric. The virtual interfaces of FIG. 2B include the logic to convey information over these physical links, as will be described further below.
`
[0039] To create and configure such networks, an administrator defines the network topology of a PAN and specifies (e.g., via a utility within the management software 135) MAC address assignments of the various nodes. The MAC address is virtual, identifying a virtual interface, and not tied to any specific physical node. Under certain embodiments, MAC addresses follow the IEEE 48 bit address format, but in which the contents include a "locally administered" bit (set to 1), the serial number of the control node 120 on which the virtual interface was originally defined (more below), and a count value from a persistent sequence counter on the control node that is kept in NVRAM in the control node. These MACs will be used to identify the nodes (as is conventional) at a layer 2 level. For example, in replying to ARP requests (whether from a node internal to the PAN or on an external network) these MACs will be included in the ARP reply.
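
The MAC layout described above can be illustrated with a short Python sketch. The exact field widths are not given in the text, so the 16-bit serial number and 16-bit counter used here are assumptions:

    # Build a 48-bit, IEEE-format virtual MAC: locally administered bit
    # set, then the defining control node's serial number, then a count
    # value from the control node's persistent (NVRAM) sequence counter.
    def make_virtual_mac(control_node_serial: int, sequence: int) -> str:
        raw = bytes([
            0x02,                               # locally administered bit
            0x00,
            (control_node_serial >> 8) & 0xFF,  # control node serial number
            control_node_serial & 0xFF,
            (sequence >> 8) & 0xFF,             # persistent sequence count
            sequence & 0xFF,
        ])
        return ":".join(f"{b:02x}" for b in raw)

    print(make_virtual_mac(control_node_serial=0x0120, sequence=7))
    # -> 02:00:01:20:00:07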
[0040] The control node-side networking logic maintains data structures that contain information reflecting the connectivity of the LAN (e.g., which nodes may communicate to which other nodes). The control node logic also allocates and assigns VI (or RVI) mappings to the defined MAC addresses and allocates and assigns VIs (or RVIs) between the control nodes and between the control nodes and the processing nodes. In the example of FIG. 2A, the logic would allocate and assign VIs 212 of FIG. 2B. (The naming of the VIs and RVIs in some embodiments is a consequence of the switching fabric and the switch fabric manager logic employed.)
[0041] As each processor boots, BIOS-based boot logic initializes each processor 106 of the node 105 and, among other things, establishes a (or discovers the) VI 212 to the control node logic. The processor node then obtains from the control node relevant data link information, such as the processor node's MAC address, and the MAC identities of other devices within the same data link configuration. Each processor then registers its IP address with the control node, which then binds the IP address to the node and an RVI (e.g., the RVI on which the registration arrived). In this fashion, the control node will be able to bind IP addresses for each virtual MAC for each node on a subnet. In addition to the above, the processor node also obtains the RVI or VI-related information for its connections to other nodes or to control node networking logic.
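
A minimal sketch of the resulting control node state, assuming a simple dictionary keyed by IP address (the actual data structure is not specified):

    class ControlNode:
        def __init__(self):
            self.bindings = {}   # IP address -> (virtual MAC, RVI)

        def register(self, ip, virtual_mac, arriving_rvi):
            # Bind the registering node's IP to the RVI the registration
            # arrived on, as described in the paragraph above.
            self.bindings[ip] = (virtual_mac, arriving_rvi)

        def resolve(self, ip):
            return self.bindings.get(ip)

    cn = ControlNode()
    cn.register("10.0.0.2", "02:00:01:20:00:07", arriving_rvi="RVI-17")
    print(cn.resolve("10.0.0.2"))   # -> ('02:00:01:20:00:07', 'RVI-17')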
`
`
`
`
[0042] Thus, after boot and initialization, the various processor nodes should understand their layer 2, data link connectivity. As will be explained below, layer 3 (IP) connectivity and specifically layer 3 to layer 2 associations are determined during normal processing of the processors as a consequence of the address resolution protocol.
`
[0043] FIG. 3A details the processor-side networking logic 210 and FIG. 3B details the control node-side networking logic 310 of certain embodiments. The processor-side logic 210 includes IP stack 305, virtual network driver 310, ARP logic 350, RCLAN layer 315, and redundant Giganet drivers 320a,b. The control node-side logic 310 includes redundant Giganet drivers 325a,b, RCLAN layer 330, virtual cluster proxy logic 360, virtual LAN server 335, ARP server logic 355, virtual LAN proxy 340, and physical LAN drivers 345.
`
`IP Stack
`
[0044] The IP stack 305 is the communication protocol stack provided with the operating system (e.g., Linux) used by the processing nodes 106. The IP stack provides a layer 3 interface for the applications and operating system executing on a processor 106 to communicate with the simulated Ethernet network. The IP stack provides packets of information to the virtual Ethernet layer 310 in conjunction with providing a layer 3, IP address as a destination for that packet. The IP stack logic is conventional except that certain embodiments avoid checksum calculations and logic.
`
`Virtual Ethernet Driver
`
`(0045] The virtual Ethernet driver 310 will appearto the IP
`stack 305 like a “real” Ethernet driver. In this regard, the
`virtual Ethernet driver 310 receives IP packets or datagrams
`from the IP stack for subsequent transmission on the net-
`work, and it receives packet information from the network
`to be delivered to the stack as an IP packet.
`
[0046] The stack builds the MAC header. The "normal" Ethernet code in the stack may be used. The virtual Ethernet driver receives the packet with the MAC header already built and the correct MAC address already in the header.
`
[0047] In material part and with reference to FIGS. 4A-C, the virtual Ethernet driver 310 dequeues 405 outgoing IP datagrams so that the packet may be sent on the network. The standard IP stack ARP logic is used. The driver, as will be explained below, intercepts all ARP packets entering and leaving the system to modify them so that the proper information ends up in each node's ARP tables. The normal ARP logic places the correct MAC address in the link layer header of the outgoing packet before the packet is queued to the Ethernet driver. The driver then just examines the link layer header and destination MAC to determine how to send the packet. The driver does not directly manipulate the ARP table (except for the occasional invalidation of ARP entries).
`
[0048] The driver 310 determines 415 whether ARP logic 350 has MAC address information (more below) associated with the IP address in the dequeued packet. If the ARP logic 350 has the information, the information is used to send 420 the packet accordingly. If the ARP logic 350 does not have the information, the driver needs to determine such information, and in certain preferred embodiments, this information is obtained as a result of an implementation of the ARP protocol as discussed in connection with FIGS. 4B-C.
`
[0049] If the ARP logic 350 has the MAC address information, the driver analyzes the information returned from the ARP logic 350 to determine where and how to send the packet. Specifically, the driver looks at the address to determine whether the MAC address is in a valid format or in a particular invalid format. For example, in one embodiment, internal nodes (i.e., PAN nodes internal to the platform) are signaled through a combination of setting the locally administered bit, the multicast bit, and another predefined bit pattern in the first byte of the MAC address. The overarching pattern is one which is highly improbable of being a valid pattern.
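
The first-byte test can be sketched as follows. The text does not give the exact predefined bit pattern, so the mask used here (locally administered bit 0x02, multicast bit 0x01, plus two additional bits) is an assumption:

    INTERNAL_PATTERN = 0x0F   # assumed pattern; improbable in a valid MAC

    def is_internal_mac(mac: str) -> bool:
        # Internal nodes are signaled entirely in the first byte.
        first_byte = int(mac.split(":")[0], 16)
        return (first_byte & INTERNAL_PATTERN) == INTERNAL_PATTERN

    print(is_internal_mac("0f:00:00:00:12:34"))   # True: internal node
    print(is_internal_mac("02:00:01:20:00:07"))   # False: valid format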
`
[0050] If the MAC address returned from the ARP logic is in a valid format, the IP address associated with that MAC address is for a node external at least to the relevant subnet and in preferred embodiments is external to the platform. To deliver such a packet, the driver prepends the packet with a TLV (type-length-value) header. The logic then sends the packet to the control node over a pre-established VI. The control node then handles the rest of the transmission as appropriate.
`
[0051] If the MAC address information returned from the ARP logic 350 is in a particular invalid format, the invalid format signals that the IP-addressed node is an internal node, and the information in the MAC address information is used to help identify the VI (or RVI) directly connecting the two processing nodes. For example, the ARP table entry may hold information identifying the RVI 212 to use to send the packet, e.g., 212(1-2), to another processing node. The driver prepends the packet with a TLV header. It then places address information into the header as well as information identifying the Ethernet protocol type. The logic then selects the appropriate VI (or RVI) on which to send the encapsulated packet. If that VI (or RVI) is operating satisfactorily it is used to carry the packet; if it is operating unsatisfactorily the packet is sent to the control node switch logic (more below) so that the switch logic can send it to the appropriate node. Though the ARP table may contain information to actually specify the RVI to use, many other techniques may be employed. For example, the information in the table may indirectly provide such information, e.g., by pointing to the information of interest or otherwise identifying the information of interest though not containing it.
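
The TLV encapsulation used on both paths can be sketched as below. The field sizes and type code are assumptions; the text states only that a type-length-value header carrying addressing and Ethernet protocol-type information is prepended before the packet is sent on a VI or RVI:

    import struct

    TLV_TYPE_UNICAST = 1   # assumed type code

    def prepend_tlv(packet: bytes, ether_type: int, addr_info: bytes) -> bytes:
        # value = Ethernet protocol type + addressing information
        value = struct.pack("!H", ether_type) + addr_info
        header = struct.pack("!HH", TLV_TYPE_UNICAST, len(value))
        return header + value + packet

    frame = prepend_tlv(b"ip-datagram", 0x0800, b"RVI 212(1-2)")
    print(frame[:6])   # TLV type and length, then EtherType 0x0800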
`
[0052] For any multicast or broadcast type messages, the driver sends the message to the control node on a defined VI. The control node then clones the packet and sends it to all nodes (excluding the sending node) and the uplink accordingly.
`
[0053] If there is no ARP mapping then the upper layers would never have sent the packet to the driver. If there is no datalink layer mapping available, the packet is put aside until ARP resolution is completed. Once the ARP layer has finished ARPing, the packets held back pending ARP get their datalink headers built and the packets are then sent to the driver.
`
[0054] If the ARP logic has no mapping for an IP address of an IP packet from the IP stack and, consequently, the driver 310 is unable to determine the associated addressing information (i.e., MAC address or RVI-related information), the driver obtains such information by following the ARP protocol. Referring to FIGS. 4B-C, the driver builds 425 an ARP request packet containing the relevant IP address for which there is no MAC mapping in the local ARP table. The node then prepends 430 the ARP packet with a TLV-type header. The ARP request is then sent via a dedicated RVI to the control node-side networking logic—specifically, the virtual LAN server 335.
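
Steps 425 and 430 can be tied together in a short sketch; the function names and packet fields here are illustrative assumptions:

    def on_arp_miss(unresolved_ip, send_on_rvi):
        # Step 425: build an ARP request for the unmapped IP address.
        arp_request = {"op": "request", "target_ip": unresolved_ip}
        # Step 430: prepend a TLV-type header, then send on the dedicated
        # RVI to the control node-side virtual LAN server.
        packet = {"tlv": {"type": "ARP", "source": "PN1"},
                  "payload": arp_request}
        send_on_rvi("RVI-to-virtual-LAN-server", packet)

    on_arp_miss("10.0.0.9",
                lambda rvi, pkt: print(f"sent on {rvi}: {pkt['payload']}"))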
`
`