`(12) Patent Application Publication (10) Pub. No.: US 2002/0064126A1
`(43) Pub. Date:
`May 30, 2002
`Bhattal et al.
`
`
`(54) RECOVERY FOLLOWING PROCESS OR
`SYSTEM FAILURE
`(75) Inventors: Amardeep Singh Bhattal, Hampshire
`(GB); Morag Ann Hughson,
`Southampton (GB); Neil Kenneth
`Johnston, Hampshire (GB); Anthony
`John O'Dowd, Winchester (GB)
`Correspondence Address:
`IBM Corp, IP Law Dept T81/503
`3039 Cornwallis Road
`PO Box 12195
`Research Triangle Park, NC 27709-2195 (US)
`
(73) Assignee: International Business Machines Corporation, Armonk, NY
`(21) Appl. No.:
`09/844,002
`(22) Filed:
`Apr. 27, 2001
`
`
`
`(30)
`
`Foreign Application Priority Data
`
Nov. 24, 2000 (GB)......................................... 0028688.0
`
`Publication Classification
`
(51) Int. Cl. ....................................................... H04L 1/00
`(52) U.S. Cl. ............................................ 370/217; 370/242
`(57)
`ABSTRACT
Provided are methods, data communication apparatus and computer programs for managing communications between a remote communication manager and a set of communication managers in an associated group. The group of communication managers has shared access to resources which enable any communication manager in the group to recover from failures experienced by another communication manager in the group. In particular, recovery of failed inbound and outbound channels is achieved with the advantage of improved availability of data transmissions. Preferably, the recovery uses synchronization information to ensure that data is recovered to a consistent state, so that channel recovery is achieved without loss of data integrity.
`
`
Ex.1016 / Page 1 of 14
`
TESLA, INC.
`
`
`
Patent Application Publication  May 30, 2002  Sheet 1 of 5  US 2002/0064126 A1

[FIG. 1 (drawing)]
`
`
`
Patent Application Publication  May 30, 2002  Sheet 2 of 5  US 2002/0064126 A1

[FIG. 2 (drawing)]
`
`
`
Patent Application Publication  May 30, 2002  Sheet 3 of 5  US 2002/0064126 A1

[FIG. 3 (flowchart)]

300: START CHANNEL INSTANCE BETWEEN FIRST QM OF QSG AND REMOTE QM

310: STORE CHANNEL DEFINITION FOR ACCESS BY EACH QM IN QSG

320: PREVENT SECOND CHANNEL INSTANCE STARTING WHILE FIRST IS ACTIVE

330: RECORD CHANNEL STATE INFORMATION IN SHARED-ACCESS STORAGE

340: WHEN FIRST QM IN QSG FAILS, SECOND QM STARTS NEW OUTBOUND CHANNEL INSTANCE USING CHANNEL DEFINITION AND STATE INFORMATION, RECOVERS MESSAGES USING SHARED SYNC QUEUE AND STARTS NEW INBOUND CHANNEL INSTANCE IN RESPONSE TO REMOTE REQUEST

FIG. 3
`
`
`
`
`
Patent Application Publication  May 30, 2002  Sheet 4 of 5  US 2002/0064126 A1

[FIG. 4 (drawing)]
`
`
`
Patent Application Publication  May 30, 2002  Sheet 5 of 5  US 2002/0064126 A1

[FIG. 5 (drawing)]
`
`
`
`
`RECOVERY FOLLOWING PROCESS OR SYSTEM
`FAILURE
`
`FIELD OF INVENTION
0001 The present invention relates to recovery following process or system failures in a data communications network and in particular to recovery by a process or subsystem other than the one which experienced the failure, for improved availability.
`
`BACKGROUND
0002 Many existing messaging systems use a single messaging manager to manage the transmission, from a local system, of all messages which are destined for remote systems, and to handle receipt of all messages which are destined for the local system. An application program running on the local system which requires that a message be sent to a remote system connects to the local messaging manager and requests that it send the message to the required destination. This implies reliance on the availability of the single messaging manager for all communications. Any failure which affects that messaging manager has a significant effect on messaging throughput, since a full rollback and restart of the messaging manager is required before communications can resume.
0003 It is known from U.S. Pat. Nos. 5,797,005 and 5,887,168 to provide a system allowing messages to be processed by any of a plurality of data processing systems in a data processing environment. A shared queue is provided to store incoming messages for processing by one of the plurality of data processing systems. A common queue server receives and queues the messages onto the shared queue so that they can be retrieved by a system having available capacity to process the messages. A system having available capacity retrieves the queued message, performs the necessary processing and places an appropriate response message back on the shared queue. Thus, the shared queue stores messages sent in either direction between clients requesting processing and the data processing systems that perform the processing. Because the messages are enqueued onto the shared queue, the messages can be processed by an application running on any of a plurality of systems having access to the queue. Automatic workload sharing and processing redundancy is provided by this arrangement. If a particular application that is processing a message fails, another application can retrieve that message from the shared queue and perform the processing without the client having to wait for the original application to be restarted.
0004 U.S. patent application serial No. 60/220,685 (attorney reference GB9-2000-032), which is commonly assigned to the present application and is incorporated herein by reference, discloses improved recovery from connection failures between a queuing subsystem and a shared queue, such failure being caused either by communications link failure, or failure of the queuing subsystem. Message data in a shared queue is communicated between message queuing subsystems by means of data structures contained in a coupling facility. A connection failure to the coupling facility is notified to queuing subsystems other than the one which experienced the failure, and these queuing subsystems then share between them the recovery of active units of work of the failed subsystem.
`
0005 Although the solution of U.S. Ser. No. 60/220,685 provides significantly improved transactional recovery within a group of queuing subsystems, it does not address the problem of how to resume communications with communication managers outside the group in the event of failures affecting in-progress communications.
`SUMMARY OF INVENTION
0006 According to a first aspect of the present invention, there is provided a method of managing communications between a set of communication managers and a remote communication manager, the method comprising: starting a communication channel between a first communication manager of the set and the remote communication manager for transmitting data from a data storage repository to the remote communication manager, the data storage repository being accessible by any one of the set of communication managers; storing state information for the communication channel in a storage repository accessible by any one of the set of communication managers (which may be the same data storage repository from which data is transmitted); and, in response to a failure affecting the first communication manager, a second one of the set of communication managers using the stored channel state information to start a new channel instance and resuming transmission of data from the data storage repository to the remote communication manager via the new channel instance.
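The method of the first aspect can be sketched in outline as follows. This is a minimal illustrative sketch, not part of the patent disclosure; all class, attribute and queue-manager names ("QM1", "CH1", etc.) are assumptions introduced for illustration only.

```python
# Illustrative sketch: a set of communication managers shares a data store and
# a channel-state store, so any member can resume a failed peer's transmission.

class SharedStore:
    """Storage repository accessible by every communication manager in the set."""
    def __init__(self):
        self.messages = []        # data awaiting transmission
        self.channel_state = {}   # per-channel state, e.g. owner and status

class CommunicationManager:
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def start_channel(self, channel_name, remote):
        # Store state information where any peer in the set can read it.
        self.store.channel_state[channel_name] = {
            "owner": self.name, "remote": remote, "status": "running"}

    def recover_channel(self, channel_name):
        # A second manager uses the stored state to start a new channel
        # instance and resume transmission from the shared repository.
        state = self.store.channel_state[channel_name]
        state["owner"] = self.name
        state["status"] = "running"
        return state

store = SharedStore()
qm1 = CommunicationManager("QM1", store)
qm2 = CommunicationManager("QM2", store)
qm1.start_channel("CH1", remote="REMOTE_QM")
# ... QM1 fails; QM2 takes over using the shared channel state ...
recovered = qm2.recover_channel("CH1")
```

The key design point mirrored here is that the state lives in the shared repository, not inside the failed manager, so recovery needs no access to the failed manager's private storage.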
0007 The invention uses shared access to communication resources (stored data and communication mechanisms) to enable members of a group of associated communication managers to recover from failures of communications with remote communication managers, thereby achieving improved availability of data transmissions. This advantage of increased availability is achieved without reliance on redundant resource managers which are, in the prior art, typically required to be kept out of operational use in the absence of failures and yet to be kept consistent with their associated resource manager. Maintaining such redundancy is expensive.
0008 The stored channel state information preferably includes an identification of the communication manager which currently has control of the channel. Additionally, the state information preferably includes an indication of the status of the channel (for indicating whether it was, for example, running, attempting to run or stopped when a failure occurred). Thus, each active communication manager within the group is able to determine which channels should be recovered when a first communication manager experiences problems, and what state they should be recovered to.
0009 The information stored for a channel preferably also includes synchronisation data for data transmissions via the channel, to enable synchronised recovery by other communication managers. This synchronisation data may be part of the state information or stored separately from it.
0010 Each communication manager of the set preferably holds or has access to a copy of a channel definition for each channel which is active, and this is used together with the stored state information to enable a communication manager other than the first to start a new channel instance and resume data transmissions.
0011 A preferred method for recovery from communication failures includes: preventing a second instance of a communication channel from being started (for example, using locks) while a first instance of the channel is in active use by the first communication manager; in response to determining that the first communication channel instance has experienced a failure, starting a second instance of the channel using the channel definition and current channel state information; and transmitting data using the second channel instance. Avoiding multiple concurrent instances of a channel not only simplifies avoidance of resource-update conflicts, but may also be advantageous for avoiding the costs of multiple connections if, for example, the remote communication manager is an external service provider with significant associated connection charges.
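The single-active-instance rule above can be sketched with a simple lock table. This is an illustrative assumption, not the patent's implementation: the text names locks only as one example mechanism, and the class and manager names here are invented.

```python
# Sketch of the single-active-instance rule: starting a channel requires
# acquiring a per-channel lock; a second start attempt is refused while the
# first instance holds it, and release on failure lets a peer take over.

class ChannelLockTable:
    def __init__(self):
        self._locks = {}   # channel name -> name of owning manager

    def try_start(self, channel_name, manager):
        if channel_name in self._locks:
            return False            # first instance still active: refuse
        self._locks[channel_name] = manager
        return True

    def release(self, channel_name):
        # Called when a failure is detected, so another manager may start
        # a new instance of the channel.
        self._locks.pop(channel_name, None)

locks = ChannelLockTable()
first = locks.try_start("CH1", "QM1")    # first instance starts
second = locks.try_start("CH1", "QM2")   # refused while first is active
locks.release("CH1")                      # QM1 fails; the lock is released
retry = locks.try_start("CH1", "QM2")    # peer starts the second instance
```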
0012 The data storage repository is preferably a shared-access message queue. The plurality of communication managers are preferably a group of queue managers having shared access to one or more message queues (referred to hereafter as a "queue-sharing group") or communication manager components within or associated with such queue managers. Alternatively, the communication managers could be any computer program or data processing system component which performs communication management operations.
0013 The invention according to a preferred embodiment enables queue managers in a queue-sharing group (or their associated communication manager components) to take over message transmission from a shared queue when a first queue manager experiences a failure. A new instance of a failed channel is started using channel definition parameters and current channel state information. Such peer recovery by queue managers in a queue-sharing group provides improved availability of message transmission.
0014 According to a preferred embodiment of the invention, recovery from failures of outgoing message transmissions is achieved as follows. Each queue manager in a queue-sharing group has access to a shared outgoing-message queue. Each of these queue managers (or its communication manager component) is provided with a copy of a definition of a sender channel between the shared queue and a destination queue manager, such that each queue manager in the queue-sharing group (or its communication manager component) is able to start an instance of the channel. Only a single channel instance is allowed to be active at any one time. Certain state information for a channel is stored whenever the channel is active, and a subset of that state information is held in shared-access storage so as to be available to any queue manager within the queue-sharing group. If the queue manager which is using a channel experiences a failure, another queue manager or communication manager component in the queue-sharing group uses the state information held in shared-access storage together with its copy of the channel definition to start a new instance of the channel. Thus, a queue manager continues message transmission on behalf of the queue manager which experienced the failure.
0015 In a second aspect, the invention provides a data communications system comprising: a data storage repository accessible by any one of a set of communication managers; a set of communication managers, each adapted to start an instance of a communication channel for transmitting data from the data storage repository to a remote communication manager, and each adapted to transmit data via said communication channel; and a storage repository for storing current state information for the communication channel, the storage repository being accessible by any one of the set of communication managers; wherein the set of communication managers are responsive to a failure affecting a first communication manager of said set which has a first active instance of a communications channel, to start a second instance of the channel using the stored current channel state information and to resume transmission of data from the data storage repository to the remote communication manager via the second channel instance.
0016 In a third aspect, the invention provides a computer program comprising computer readable program code for controlling the operation of a data communication apparatus on which it runs to perform the steps of a method of managing communications between a set of communication managers and a remote communication manager, the method comprising: starting a communication channel between a first communication manager of the set and the remote communication manager for transmitting data from a data storage repository to the remote communication manager, the data storage repository being accessible by any one of the set of communication managers; storing state information for the communication channel in a storage repository accessible by any one of the set of communication managers (which may be the same data storage repository from which data is transmitted); and, in response to a failure affecting the first communication manager, a second one of the set of communication managers using the stored channel state information to start a new channel instance and resuming transmission of data from the data storage repository to the remote communication manager via the new channel instance.
0017 In a further aspect of the invention, inbound communication flows may be accepted by any one of the set of communication managers, and any one of these communication managers may automatically replace any other communication manager within the set which had been receiving messages and can no longer do so. The peer recovery of both inbound and outbound communication channels is preferably transparent to the remote communication manager, which views the set of communication managers as a single entity.
0018 A preferred embodiment according to this aspect of the invention comprises a method of managing communications between a set of communication managers and a remote communication manager, including: starting a first instance of a communication channel between a first communication manager of the set and the remote communication manager for receiving data from the remote communication manager; preventing a second instance of the communication channel from being started while the first instance of the channel is in active use by the first communication manager; and, in response to a channel start request from the remote communication manager following a failure which affects the first communication manager, starting a second instance of the channel between a second one of the set of communication managers and the remote communication manager and resuming data transmissions from the remote communication manager via the new channel instance.
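The inbound side of this recovery can be sketched as follows. The sketch is an illustrative assumption, not the patent's implementation: the remote manager addresses the group through a single generic address, so after a failure it simply issues another channel start request and whichever live member answers becomes the new receiver. All names are invented for illustration.

```python
# Sketch of inbound-channel peer recovery: the remote manager sees the set of
# communication managers as a single entity, so its retried start request may
# be accepted by any live member of the group.

class QueueSharingGroup:
    def __init__(self, members):
        self.members = members      # e.g. ["QM1", "QM2"]
        self.failed = set()

    def accept_channel_start(self):
        # Any live member may accept the remote manager's start request.
        for member in self.members:
            if member not in self.failed:
                return member
        raise RuntimeError("no live member to accept the inbound channel")

group = QueueSharingGroup(["QM1", "QM2"])
first_receiver = group.accept_channel_start()   # QM1 receives initially
group.failed.add("QM1")                          # QM1 fails
second_receiver = group.accept_channel_start()  # remote retries; QM2 answers
```

Because the remote manager only ever addresses the group, the substitution of QM2 for QM1 is transparent to it, as the aspect above requires.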
0019 In a further aspect of the invention, there is provided a data communications system comprising: a data storage repository accessible by any one of a set of communication managers; a set of communication managers, each adapted to start an instance of a communication channel for transmitting data from the data storage repository to a remote communication manager, and each adapted to transmit data via said communication channel; and a storage repository for storing synchronisation information for data transmissions via said communication channel, the storage repository being accessible by any one of the set of communication managers; wherein the set of communication managers are responsive to a failure affecting a first communication manager of said set which has a first active instance of a communications channel, to recover said first communication manager's data transmissions to a consistent state using said stored synchronisation information, thereby to enable transmission of data from the data storage repository to the remote communication manager to be resumed.
`
`BRIEF DESCRIPTION OF DRAWINGS
0020 Embodiments of the invention will now be described in more detail, by way of example, with reference to the accompanying drawings in which:

0021 FIG. 1 is a schematic representation of the sending of messages between queue managers via channels in a messaging and queuing inter-program communication environment, as is known in the art;

0022 FIG. 2 is a representation of the components involved in communication via an outbound communication channel according to an embodiment of the invention;

0023 FIG. 3 is a representation of the steps of a method of managing communications according to an embodiment of the invention;

0024 FIG. 4 shows the problem of inability to access a remote queue's synchronisation information when the remote queue manager fails; and

0025 FIG. 5 shows how shared-access resources can be used according to an embodiment of the present invention to enable synchronised recovery of channels.
`
`DETAILED DESCRIPTION OF PREFERRED
`EMBODIMENTS
0026 In distributed message queuing inter-program communication, message data is sent between application programs by means of message queue managers that interface to the application programs via interface calls invoked by the application programs. A message queue manager manages a set of resources that are used in message queuing inter-program communication. These resources typically include:

0027 Page sets that hold object definitions (including queue definitions) and message data;

0028 Logs that are used to recover messages in the event of a queue manager failure;

0029 Processor storage;

0030 Connections through which different application environments can access the message queue manager APIs;

0031 A queue manager channel initiator, which allows communication between queue managers on the same and other systems. This will be described in more detail later.
0032 Queue managers are preferably implemented in software. In certain environments, such as in a data processing system running IBM Corporation's OS/390 operating system, a queue manager can run as a named subsystem, using operating system data sets to hold information about logs, and to hold object definitions and message data (stored in page sets). Application programs can connect to the queue manager using its subsystem name.
0033 In an example distributed message queuing environment, such as implemented by IBM Corporation's MQSeries products and represented in FIG. 1, a sender application program 10 puts a message onto a local queue managed by its local queue manager 20. If the target application program is also on the local system, then the local queue is the destination queue for the message and the application retrieves the message from this queue when it is ready to process the message. If the target application program is remote from the sender, the local queue is a transmission queue 80 and a network of queue managers handles transmission of the message across the network to a remote destination queue 30 managed by a remote queue manager 40. Transmission queues are a special type of local queue on which messages are stored until they can be successfully transmitted to another queue manager and stored there. The queue managers handle the complexities of the network communications, including interoperability between heterogeneous systems and transferring messages via one or more intermediate queue managers. The remote target application program 50 retrieves the message from the destination queue 30 (its input queue) when it is ready.
0034 Messages are transmitted between queue managers on a channel 60, which is a one-way communication link between two queue managers. Software which handles the sending and receiving of messages is called a message channel agent (MCA) 70. To send a message from queue manager QM1 to queue manager QM2, a sending message channel agent 70 on queue manager QM1 sets up a communications link to queue manager QM2, using appropriate transmission queue and channel definitions. A receiving message channel agent 70' is started on queue manager QM2 to receive messages from the communication link. This one-way path consisting of the sending MCA 70, the communication link, and the receiving MCA 70' is the channel 60. The sending MCA 70 takes the messages from the transmission queue 80 and sends them down the channel to the receiving MCA 70'. The receiving MCA 70' receives the messages and puts them on to the destination queues 30, 30'.
0035 FIG. 1 shows these relationships between queue managers, transmission queues, channels and MCAs.
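The one-way channel just described can be sketched as a pair of cooperating agents. This is a minimal illustrative sketch under assumed names (the classes, queue contents and the stand-in for the communication link are not from the patent):

```python
# Sketch of a channel: a sending MCA drains a transmission queue and a
# receiving MCA places the messages on the destination queue.

from collections import deque

class ReceivingMCA:
    def __init__(self, destination_queue):
        self.destination_queue = destination_queue

    def receive(self, message):
        # Put each received message onto the destination queue.
        self.destination_queue.append(message)

class SendingMCA:
    def __init__(self, transmission_queue, receiver):
        self.transmission_queue = transmission_queue
        self.receiver = receiver   # stands in for the communication link

    def run(self):
        # Take messages from the transmission queue and send them down
        # the channel to the receiving MCA.
        while self.transmission_queue:
            self.receiver.receive(self.transmission_queue.popleft())

xmit_queue = deque(["msg1", "msg2"])
dest_queue = deque()
channel = SendingMCA(xmit_queue, ReceivingMCA(dest_queue))
channel.run()
```

Note the one-way structure: the sending MCA, the link, and the receiving MCA together constitute the channel, exactly as the paragraph above lays out.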
0036 In the preferred embodiment of the invention, the sending and receiving MCAs associated with a particular queue manager can all run inside a channel initiator component (or mover) which uses an address space under the control of the queue manager. Hence it is the channel initiator which manages communications with other queue managers. There is only a single channel initiator connected to a queue manager. There can be any number of MCA processes running concurrently inside the same channel initiator. Channels may be started dynamically by a channel initiator in response to arrival of messages on a transmission queue that satisfies some triggering criteria for that queue.
0037 Referring to FIG. 2, a queue-sharing group 100 is a number of queue managers 110, 120, 130 (for example running within a single OS/390 sysplex) that are able to access the same message queuing object definitions and message data. Queue managers elsewhere in the network can view the group as a single entity, since a single generic address can be used for connecting to any queue manager in the group. Each queue manager in the queue-sharing group 100 listens for inbound session requests on an address that is logically related to the generic address.
0038 Within a queue-sharing group, the shareable object definitions are stored in a shared database 140 (such as IBM Corporation's DB2 database) and the messages in shared queues are held in one or more coupling facilities 150 (for example, in OS/390 Coupling Facility list structures). The shared database 140 and the Coupling Facility structures 150 are resources that are shared between several queue managers 110, 120, 130. A Coupling Facility can be configured to run on a dedicated power supply and to be resilient to software failures, hardware failures and power outages, and hence enables high availability of its stored messages.
0039 The queue managers of the queue-sharing group and their resources are associated in such a way that the transmission of messages is not dependent on the availability of a single queue manager. Any of the queue managers of the queue-sharing group can automatically replace any other queue manager that had been transmitting messages but is no longer able to do so. When one queue manager experiences a failure, message transmission resumes without any operator or application intervention.
0040 Additionally, inbound message flows may also be accepted by any of the queue managers of the queue-sharing group, and any one of these queue managers may automatically replace any other queue manager within the group that had been receiving messages but can no longer do so.
0041 A more reliable outbound and inbound messaging service is thus achieved than would be possible with a stand-alone queue manager. This will now be described in more detail with reference to FIGS. 2-5.
`0042.
`In a queue-sharing group, transmission queues 160
`may be shared by the queue managers 110, 120, 130 within
`the group. Any queue-sharing group queue manager can
`access the shared transmission queue 160 to retrieve mes
`Sages and Send them to a remote queue manager 180.
0043 Sender channels are defined to send messages placed on a particular transmission queue to a remote queue manager. A shared sender channel is a channel that is defined to send messages placed on a shared transmission queue. Typically, identical shared sender channel definitions 190 will exist on all the queue managers in the queue-sharing group, allowing any of these queue managers to start an instance of the channel.
0044 Various pieces of information about the channel (state information) are stored to enable the channel to run. A subset of this state information is held in a shared repository 200 for shared channels (i.e. it is accessible to any of the queue-sharing group queue managers). This shared repository 200 is known as the shared channel status table, and the information it contains includes: the last time the information was updated; the channel name; the transmission queue name (blank in the case of an inbound channel); the remote queue manager name (the connection partner for this channel); the owning queue manager (which is running the channel instance); the channel type (for example sender or receiver); channel status (running, stopped, etc.); remote machine address; and possibly other implementation-specific state information.
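The fields listed for the shared channel status table can be sketched as a record. Field names follow the description above; the record type itself, the default values, and the example queue name "SXMITQ" are illustrative assumptions, not the patent's data layout.

```python
# Sketch of one entry in the shared channel status table described above.

from dataclasses import dataclass, field
import time

@dataclass
class SharedChannelStatusEntry:
    channel_name: str
    xmit_queue_name: str       # blank in the case of an inbound channel
    remote_qmgr_name: str      # the connection partner for this channel
    owning_qmgr_name: str      # which queue manager runs the channel instance
    channel_type: str          # for example "sender" or "receiver"
    channel_status: str        # "running", "stopped", etc.
    remote_address: str = ""
    last_updated: float = field(default_factory=time.time)

entry = SharedChannelStatusEntry(
    channel_name="CH1", xmit_queue_name="SXMITQ",
    remote_qmgr_name="REMOTE_QM", owning_qmgr_name="QM1",
    channel_type="sender", channel_status="running")
```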
0045 In the event of failure, it can be determined from the shared channel status table which channels were being run by the now-failed queue manager and channel initiator pair, and recovery can be coordinated when there are multiple parties performing the recovery process. This will be described further below.
0046 Updates to any channel status entries in this table can only occur via compare-and-swap logic. That is, as well as supplying the entry as desired after the update (the after-image), the value of the entry before the update (the before-image) must also be supplied or the update attempt will be rejected. The after-image replaces the before-image if the before-image is found in the table; otherwise no change takes place.
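The compare-and-swap rule above can be sketched as follows; the function name and the table contents are illustrative assumptions, but the before-image/after-image semantics are exactly those described.

```python
# Sketch of compare-and-swap updates to a channel status table: the swap
# happens only when the stored entry still matches the supplied before-image.

def compare_and_swap(table, key, before_image, after_image):
    """Replace table[key] with after_image only if it equals before_image."""
    if table.get(key) != before_image:
        return False    # stale before-image: the update attempt is rejected
    table[key] = after_image
    return True

status_table = {"CH1": {"owner": "QM1", "status": "running"}}

ok = compare_and_swap(
    status_table, "CH1",
    before_image={"owner": "QM1", "status": "running"},
    after_image={"owner": "QM2", "status": "running"})

stale = compare_and_swap(
    status_table, "CH1",
    before_image={"owner": "QM1", "status": "running"},   # now out of date
    after_image={"owner": "QM3", "status": "running"})
```

The second call fails because the first already changed the entry, which is how concurrent recoverers are prevented from both claiming the same channel.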
0047 High-availability message transmission from the queue-sharing group is achieved as follows.
0048 As described above, the portion of a queue manager 20 responsible for message transmission is the channel initiator 25. There is one channel initiator per queue manager, and so the queue-sharing group comprises a number of queue manager and channel initiator pairs (represented in FIG. 2 as X, Y and Z). Each of these pairs has access to a shared resource which holds information relating to the shared channels running in a queue-sharing group. Only one channel initiator manages the transmission of messages from a shared transmission queue 160 at any one time, and it uses only one sender channel instance 170 for that transmission queue 160. While that channel instance is active, the shared transmission queue is locked by the channel initiator 25 managing the channel instance (while the lock is held, no other channel initiators may retrieve messages from the queue). An entry is created in the shared channel status table 200 for the channel instance, and is updated to reflect that the channel is active, and the name of the channel initiator 25 (which is also the name of the associated queue manager 20) managing the channel is also stored. Since the transmission queue 160 is shared, any of the queue manager plus channel initiator pairs in the queue-sharing group may run a channel instance that sends messages from a shared queue which currently has no channel instance running from it.
0049 Messages are transferred from the queue-sharing group to a remote queue manager (which is not a member of the queue-sharing group) as follows. One of the queue-sharing group channel initiators, say X, runs a channel instance 170 to the remote channel initiator R of queue manager 180. The channel definition 190 details how to contact the remote channel initiator, and which shared-access transmission queue to retrieve messages from to send across the channel. So, channel 170 is a communications link between X and R, used to send messages residing on a transmission queue 160. The channel initiator X requests its associated queue manager component X to retrieve messages from the transmission queue 160 (SXMITQ in FIG. 2); these messages are then handed over to the channel initiator X, which sends them across the channel. The remote channel initiator R receives the messages and hands them over to its associated queue manager R, which then places them on some destination queue 210 accessible to it.
0050 Four failure scenarios are detected and handled:

0051 1. Shared channel status table connectivity failure (inability to update shared channel state).

0052 2. Communications subsystem failure (inability to communicate with the remote system).

0053 3. Channel initiator failure.

0054 4. Queue manager failure.
`0.055
`Failure scenarios 1 and 2 are handled by the
`channel initiator unlocking the transmission queue, and
`periodically attempting to re-start the channel on a different
`queue-sharing group queue manager by issuing start
`requests for the channel, directed to a Suitable queue-sharing
`group queue manager which is Selected using workload
`balancing techniques. Suitable workload balancing tech
`niques are well known in the art. When a queue-sharing
`group queue manager receives Such a start request, an
`attempt is made to obtain a lock on the transmission queue
`and, if Successful, a new channel instance is started, the
`shared Status table is updated as required, and message
`transmission resumes.
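The restart path for scenarios 1 and 2 can be sketched as: unlock the queue, pick a peer, and issue a start request. This is an illustrative sketch only; in particular, random shuffling stands in for the workload-balancing selection that the text leaves to known techniques, and all names are assumptions.

```python
# Sketch of scenario 1/2 handling: the failing initiator unlocks the shared
# transmission queue; start requests are then directed to peers until one
# obtains the lock and starts a new channel instance.

import random

class GroupMember:
    def __init__(self, name):
        self.name = name
        self.running_channels = []

    def handle_start_request(self, queue_name, queue_locks):
        # Attempt to obtain the transmission-queue lock; if successful,
        # start a new channel instance and resume transmission.
        if queue_locks.get(queue_name) is not None:
            return False
        queue_locks[queue_name] = self.name
        self.running_channels.append(queue_name)
        return True

def restart_channel(queue_name, peers, queue_locks, rng=random):
    queue_locks[queue_name] = None           # failing initiator unlocks
    # Crude stand-in for workload-balancing selection of a peer.
    for member in rng.sample(peers, len(peers)):
        if member.handle_start_request(queue_name, queue_locks):
            return member.name
    return None

members = [GroupMember("QM1"), GroupMember("QM2"), GroupMember("QM3")]
queue_locks = {"SXMITQ": "QM1"}              # QM1 held the lock before failing
new_owner = restart_channel("SXMITQ", members[1:], queue_locks)
```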
0056 In the preferred embodiment of the invention, channel initiator failure (scenario 3) may occur independently of queue manager failure because the channel initiator runs as a separate task (effectively a separate program) paired with its associated queue manager. In the event of channel initiator failure, a failure event is logged with the queue manager paired with the channel initiator, and this queue manager handles channel recovery:

0057 The paired queue manager, say X, queries the shared channel status table to obtain a list of all the entries for the channel instances the failed channel initiator, X, was managing (i.e. all entries with the owning queue manager and channel initiator name set to that of the failed queue manager and channel initiator, since a queue manager and its associated channel initiator share the same name).
`0.058
`It is confirmed that the failed channel initiator is
`Still inactive, So that the list of entries obtained is guaranteed
`



