`Middleware Services and Applications
`
`Lonnie R. Welch1, Binoy Ravindran2, Paul V. Werme3, Michael W. Masters3,
`Behrooz A. Shirazi1, Prashant A. Shirolkar1, Robert D. Harrison3, Wayne Mills3, Tuy Do3,
`Judy Lafratta3, Shafqat M. Anwar1, Steve Sharp4, Terry Sergeant5, George Bilowus4,
Mark Swick4, Jim Hoppel4, Joe Caruso4

1 Computer Science & Engineering Dept., University of Texas at Arlington, Box 19015, Arlington, TX 76019; {welch|shirazi}@cse.uta.edu
2 The Bradley Dept. of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061; binoy@vt.edu
3 Code B35, Naval Surface Warfare Center, Dahlgren, VA 22448; {WermePV|MastersMW}@nswc.navy.mil
4 Computer Sciences Corporation, Dahlgren, VA 22448
5 Ouachita Baptist University, 410 Ouachita Street, Arkadelphia, AR 71998-0001; sergeantt@alpha.obu.edu
`
`Abstract
Some classes of real-time systems function in environments that cannot be modeled with static approaches. In such environments, the arrival rates of events that drive transient computations may be unknown. Also, periodic computations may be required to process varying numbers of data elements per period, but the number of data elements to be processed in an arbitrary period cannot be known at the time of system engineering, nor can an upper bound be determined for the number of data items; thus, a worst-case execution time cannot be obtained for such periodics. This paper presents middleware services that support such dynamic real-time systems through adaptive resource management. The middleware services have been implemented and employed for components of the experimental Navy system described in [10]. Experimental characterizations show that the services provide timely responses, have a low degree of intrusiveness on hardware resources, and are scalable.
`
`1. Introduction
`
Many real-time systems have rigorous, multi-dimensional Quality-of-Service (QoS) objectives. They must behave in a dependable manner, respond to threats in a timely fashion, and provide continuous availability within hostile environments. Furthermore, resources should be utilized in an efficient manner, and scalability must be provided to address the ever-increasing complexity of scenarios that confront such systems, even though the worst-case scenarios of the environment may be unknown (e.g., see [6]). This paper describes innovative QoS and resource management technology for such systems.
Our approach is based on the dynamic path paradigm. A path-based real-time subsystem (see [11]) typically consists of a detection and assessment path, an action initiation path, and an action guidance path. The paths interact with the environment by evaluating streams of data from sensors and by causing actuators to respond (in a timely manner) to events detected during evaluation of the sensor data streams.
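For concreteness, a path can be modeled as an ordered chain of programs that must meet an end-to-end deadline. The following minimal sketch (Python; the class and field names are our own, not the paper's) captures this structure:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Program:
        """One application program (sub-path) within a path."""
        name: str
        host: str

    @dataclass
    class Path:
        """A dynamic path: an ordered chain of programs that must meet
        an end-to-end deadline (e.g., detection & assessment)."""
        name: str
        programs: List[Program]
        deadline_ms: float  # required end-to-end latency bound

    # Illustrative instance: sensor data flows through filter -> eval -> act.
    detect_assess = Path(
        name="detection-and-assessment",
        programs=[Program("filter", "hostA"),
                  Program("eval", "hostB"),
                  Program("act", "hostC")],
        deadline_ms=500.0,  # placeholder value
    )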
Most previous work on distributed real-time systems has focused on a lower level of abstraction than the path and has assumed that all system behavior follows a statically known pattern [8, 9]. When the previous work is applied to some applications (such as those described in [10]), problems arise with respect to the scalability of the analysis and modeling techniques; furthermore, it is sometimes impossible to obtain some of the parameters required by the models. The work described in this paper addresses these problems.
A major difference between traditional load balancing techniques [7] and dynamic QoS-based resource management services lies in their overall goals. While load balancing systems (see LoadLeveler [5], LSF [12], NQE [2], PBS [4], Globus [3], and Condor [1]) attempt to achieve system-wide performance goals such as minimized response time or maximized throughput, dynamic QoS-based resource managers strive to meet the QoS requirements of each application they manage. Another major difference between these systems is their workload models. Traditional load balancing systems assume independent jobs with known resource requirements. In a dynamic resource management system, the workload requirements of applications can vary, based on environmental conditions; additionally, applications are dependent (they communicate with each other).
The rest of the paper is organized as follows. Section 2 provides an overview of a middleware architecture for dynamic QoS management of path-based systems and describes the adaptive resource allocation approach employed by the middleware. In Section 3 we present our experiences with the QoS management middleware services, including a description of the Navy testbed in which the techniques were employed and experimental results characterizing the response times of the middleware services. Section 4 presents conclusions and future work.
`
`\
`
`Sen -
`sor
`
`filter
`
`eval
`
`act
`
`Actua -
`tor
`
`RT paths
`
`1
`
`metrics
`calculate
`
`8
`
`allocation
`enactment
`
`spec.
`file
`
`2
`
`QoS
`monitor
`
`7
`
`allocation
`analysis
`
`6
`
`resource
`discovery
`
`5
`
`H/W
`metrics
`
`3
`
`QoS
`diagnosis
`
`4
`
`action
`selection
`
`distributed
`hardware
`
`2. Dynamic resource and QoS management
`
`:70 4., ,7.90.9:70 41 90 7084:7.0
`,3/ "4$ 2,3,02039 8419,70
`
The logical architecture of the QoS management software is shown in Figure 1. It behaves as follows. The application programs of real-time control paths send time-stamped events to the metrics calculation component, which calculates path-level QoS metrics and sends them to the QoS monitor. The monitor checks observed QoS for conformance with required QoS, and notifies the QoS diagnosis component when a QoS violation occurs. QoS diagnosis notifies the action selection component of the cause(s) of poor QoS and recommends actions (e.g., move a program to a different host or LAN, shed a program, or replicate a program) to improve QoS. Action selection ranks the recommended actions, identifies redundant actions, and forwards the results to the allocation analysis component; this component consults resource discovery for host and LAN load index metrics, determines a good way to allocate the hardware resources in order to perform the actions, and requests that the actions be performed by the allocation enactment component.
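To summarize the loop, the sketch below walks one cycle through steps 1-8 for a single path. It is a minimal rendering in Python; all names, data shapes, and the naive diagnosis rule are our own assumptions, not the middleware's API:

    # A minimal, hypothetical sketch of the Figure 1 loop (steps 1-8).
    HOSTS = [{"name": "hostA", "load": 0.70},
             {"name": "hostB", "load": 0.25}]

    def qos_management_cycle(paths, required_latency_ms):
        for path in paths:
            # Step 1: compute a path-level QoS metric from timestamped events.
            latency = max(ev["out_ms"] - ev["in_ms"] for ev in path["events"])
            # Step 2: check observed QoS against required QoS.
            if latency <= required_latency_ms[path["name"]]:
                continue
            # Step 3: diagnose the cause (here, naively blame the slowest program).
            slowest = max(path["programs"], key=lambda p: p["cpu_ms"])
            # Step 4: recommend an action (move, shed, or replicate a program).
            action = ("move", slowest["name"])
            # Steps 5-6: consult resource discovery for host/LAN load indices.
            target = min(HOSTS, key=lambda h: h["load"])
            # Steps 7-8: allocation analysis chose `target`; enact the action.
            print(f"{action[0]} {action[1]} to {target['name']}")

    qos_management_cycle(
        paths=[{"name": "detect",
                "events": [{"in_ms": 0.0, "out_ms": 600.0}],
                "programs": [{"name": "filter", "cpu_ms": 50.0},
                             {"name": "eval", "cpu_ms": 400.0}]}],
        required_latency_ms={"detect": 500.0},
    )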
`
[Figure 2 near here: the Resource Manager at the center, exchanging data with the Human Computer Interface and Display, the Program Control (PC) component (start/kill, program status), the RTCS and RTCS console (path and sub-path latencies, profile information, load), software and hardware monitors (host and network data: CPU, memory, throughput, latency), the System Data Repository, and the Parser with the user Specification File.]
`
`:70 !8., ,7.90.9:70 41 90 7084:7.0
`,3/ "4$ 2,3,02039 8419,70
`
The physical QoS management architecture is shown in Figure 2. The core component of the middleware is the resource manager. It is activated when programs die and when time-constrained control paths miss their deadlines. In response to these events, it takes appropriate measures to improve the quality of service delivered to the applications. The reallocations made by the resource manager make use of information provided by the hardware and software monitors, and from a specification file that describes the QoS requirements and the structures of the software system and the hardware system. The system data repository component is responsible for collecting and maintaining all system information. The program control (PC) component consists of a central control program and a set of startup daemons. When the resource manager needs to start a program on a particular host, it informs the control program, which notifies the startup daemon on that host. The HCI provides information to the user regarding the system configuration, application status, and reallocation decisions. It also allows the operator to dynamically modify the behavioral characteristics of the resource discovery and resource manager components.
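For concreteness, the sketch below shows what such a specification might contain, rendered as a Python dictionary. The format and field names are our own invention for illustration; the paper does not show the actual spec-file syntax used by the Spec Libraries.

    # Hypothetical spec-file content (illustrative only).
    SPEC = {
        "software": {
            "paths": {
                "detection-and-assessment": {
                    "programs": ["filter", "eval", "act"],
                    "deadline_ms": 500.0,  # required QoS (end-to-end latency)
                    "survivable": True,    # restart programs that die
                    "scalable": True,      # may be replicated on overload
                },
            },
        },
        "hardware": {
            "hosts": {
                "hostA": {"os": "OSF-1", "lan": "FDDI-1"},
                "hostB": {"os": "HP-UX", "lan": "ATM-1"},
            },
        },
    }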
`
[Figure 3 near here: the Adaptive Resource Management components (Resource Manager, QoS Monitors, Hardware Analyzer, RM HCI, System Data Repository, Hardware Data Repository, Spec File and Spec Libraries, HW Monitors) together with Program Control and the RTCS, with their interactions numbered 0 through 10.]
`
`:70 /,59;0 7084:7.0 2,3,02039 8.0
`
`3,748
`
Figure 3 depicts the overall architecture of the adaptive resource management system. In its current implementation, resource management is activated in three modes: during the initial system start-up process (to start application programs), when a path becomes unhealthy (i.e., a path latency exceeds the required deadline), and when an application program is terminated (due to hardware/software faults). A description of each of these resource management modes follows.
`
The actions performed in start-up mode are as follows (a sketch of the sequence follows the list):
1. The System Data Repository loads the user spec file via the Spec Libraries, which consist of a compiler and data structures that store the compiled real-time system and application QoS specifications (step 0 of Figure 3).
2. The System Data Repository sends the system information to the Resource Management Human Computer Interface (RM HCI) for display purposes (step 1 of Figure 3).
3. Hardware (HW) Monitors continuously observe each resource's load index and pass this information to the Hardware Data Repository (step 2 of Figure 3), which in turn passes it to the System Data Repository (step 3 of Figure 3).
4. The Resource Manager receives the initial startup information, as specified by the spec file, from the System Data Repository (step 4 of Figure 3).
5. The Resource Manager informs the Program Control to start the Real-Time Control System (RTCS) application programs on the specified hosts (step 5 of Figure 3).
6. The Program Control starts the programs and informs the Resource Manager accordingly (step 5 of Figure 3).
7. The Resource Manager sends startup information to the RM HCI for display purposes (step 6 of Figure 3).
8. The RTCS continuously sends application profile information (time stamps, or program/path latencies) to the QoS Monitors (step 7 of Figure 3). Global time is made available to the RTCS via the Network Time Protocol (NTP) package.
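A compressed rendering of this sequence appears below. Every function is a stand-in of our own; the paper does not expose the middleware's programming interfaces:

    def compile_spec(path):
        """Step 0: the Spec Libraries compile the user spec file
        (contents here are placeholders)."""
        return {"initial_allocation": {"filter": "hostA", "eval": "hostB"}}

    def start_program(program, host):
        """Step 5: Program Control asks the startup daemon on `host`."""
        print(f"daemon on {host}: started {program}")

    def startup(spec_file):
        spec = compile_spec(spec_file)                               # step 0
        print("RM HCI: displaying system information")               # step 1
        print("HW Monitors: load indices flowing to repositories")   # steps 2-3
        for program, host in spec["initial_allocation"].items():     # step 4
            start_program(program, host)                             # steps 5-6
        print("RM HCI: displaying startup status")                   # step 6
        # Step 7: RTCS programs now stream NTP-synchronized timestamps
        # to the QoS Monitors.

    startup("system.spec")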
`
The sequence performed during QoS monitoring and enforcement mode is as follows (a sketch of the host-selection step follows the list):
1. If a path becomes unhealthy (misses its deadline), the QoS Monitors detect the condition, diagnose the cause of poor health, and suggest an action (such as moving or replicating an application program) to the Resource Manager (step 8 of Figure 3).
2. The Resource Manager must decide the host(s) and LAN(s) on which the unhealthy sub-path is to be replicated or moved. This decision is made by choosing the host(s) and LAN(s) with the smallest load indices. The Hardware Analyzer ranks the hosts and LANs in ascending order of their load indices and passes this information to the Resource Manager (step 9 of Figure 3).
3. Once a host is selected, the Resource Manager notifies the Program Control to make the change (step 5 of Figure 3) and updates the RM HCI accordingly (step 6 of Figure 3).
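The decision in step 2 reduces to taking the minimum of the load-index ranking. A minimal sketch follows (Python; the load-index values are placeholders, and the paper does not give the index formula):

    def pick_target(load_indices):
        """Return the host (or LAN) with the smallest load index,
        per the Hardware Analyzer's ascending ranking."""
        ranked = sorted(load_indices.items(), key=lambda kv: kv[1])
        return ranked[0][0]

    assert pick_target({"hostA": 0.71, "hostB": 0.18, "hostC": 0.42}) == "hostB"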
`
`
The actions taken in program recovery mode are described below (a sketch of the restart cap follows the list):
1. If a sub-path (RTCS program) is terminated due to a hardware/software failure, Program Control detects the condition and informs the Resource Manager accordingly (step 5 of Figure 3).
2. The Resource Manager finds the host(s) and LAN(s) with the smallest load indices by querying the Hardware Analyzer (step 9 of Figure 3).
3. It then restarts the terminated program on the selected host by informing the Program Control (step 5 of Figure 3) and the RM HCI (step 6 of Figure 3) accordingly. To avoid thrashing (repeatedly restarting a faulty piece of software), this step is repeated only an operator-determined fixed number of times.
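The thrash-avoidance rule in step 3 amounts to a per-program restart cap. A minimal sketch follows (Python; only the operator-set limit comes from the paper, the rest is our own scaffolding):

    MAX_RESTARTS = 3  # operator-determined fixed number of attempts

    restart_counts: dict = {}

    def on_program_death(program, load_indices):
        """Restart a dead program on the least-loaded host, at most
        MAX_RESTARTS times, to avoid endlessly restarting faulty software."""
        restart_counts[program] = restart_counts.get(program, 0) + 1
        if restart_counts[program] > MAX_RESTARTS:
            print(f"{program}: restart limit reached; operator attention needed")
            return
        host = min(load_indices, key=load_indices.get)  # step 9
        print(f"restarting {program} on {host}")        # steps 5-6

    on_program_death("eval", {"hostA": 0.6, "hostB": 0.2})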
`
`
`
`
`
3. Experimental results

The technology described in this paper was evaluated within the Naval Surface Warfare Center High Performance Computing (NSWC HiPer-D) testbed, which contains the experimental Navy system described in [10]. The implementation includes the following capabilities: (1) a simulated track source; (2) track correlation and filtering algorithms; (3) track data distribution services; (4) a doctrine server and three types of doctrine processing; (5) an engagement server; (6) a display subsystem, including X-Windows-based tactical displays, submode mediation, and alert routing surface operations; (7) a simulated weapons control system; and (8) identification upgrade capabilities.

The software runs on a heterogeneous network configuration that includes Myrinet, ATM, FDDI, and Ethernet, and on multiple heterogeneous host platforms, including DEC Alphas with OSF-1, a DEC Sable with OSF-1, TAC-4s with HP-UX, Sun SPARC 10s with Solaris, and Pentiums with OSF-1RT.

We performed experiments to determine the responsiveness of our QoS and resource management middleware for the survivability and scalability services.

The total survivability response time (Tu) calculation is divided into four major phases: (1) Program Death Detection time (t1) is the time taken by the Startup Daemon to inform the Program Manager of a dead program; (2) Resource Manager Notification time (t2) is the time taken by the Program Manager to inform the Resource Manager of the dead program; (3) Resource Manager Processing time (t3) is the time taken by the Resource Manager to select a good host; and (4) Restart time (Tr) is the time taken by the Program Manager and Startup Daemons to actually restart the program.

The processing time at the resource manager, t3, is further decomposed. Preprocessing time (t31) is the interval after receipt of a dead-program message from the program manager and before network discovery begins; this interval is internal to the resource manager. Network Resource Discovery time (t32) is the interval required to obtain network-level metrics from the network controller. Host Resource Discovery time (t33) is the interval required to obtain host-level metrics from all eligible host monitors. Allocation Decision time (t34) is the interval required to choose a good host; this interval is internal to the resource manager. Post-processing time (t35) is the interval after finding the best host and before sending a program restart instruction to the program manager; this interval is also internal to the resource manager.

The Restart time, Tr, consists of three phases. Program Notification time (t2') is the time required to inform the Program Manager to restart a particular program on a particular host. Program Manager to Startup Daemon data transfer time (t4) is the time required to transfer the restart data from the Program Manager to the appropriate Startup Daemon. Program Start Detection time (t1') is the time required by the startup daemon to detect the start of the program.

We repeatedly measured the response times of the survivability services and observed that the total average response time, Tu, is 3.45 seconds, with a standard deviation of 0.021435. The data also indicate that the largest share of this time is spent during host resource discovery, followed by the time taken by the startup daemons to actually start the program.

Response time measurements were also made for path overload detection and overload recovery via automatic scalability. The total scalability response time (Tc) calculation is divided into four phases: (1) Path Overload Detection time (t1) is the time taken by the overload detection heuristic in the subsystem manager to detect the overloaded condition of the path; (2) Path Overload data transfer time (t2) is the time taken for the overloaded-path information to reach the resource manager, an interval that begins immediately after detection; (3) Resource Manager processing time (t3) is the same as in Tu; and (4) Scale time (Ts) is the time taken by the Program Manager and Startup Daemons to actually start a new copy of the program. The scale time consists of three phases, identical to the restart component of Tu. In repeated experiments, the average total response time, Tc, for the scalability services was 4.51 seconds, with a standard deviation of 0.020731.
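Assuming the phases in each decomposition are disjoint and account for the whole measured interval (the text lists them sequentially but does not state this explicitly), the totals decompose as:

    Tu = t1 + t2 + t3 + Tr,  where  t3 = t31 + t32 + t33 + t34 + t35
                             and    Tr = t2' + t4 + t1'
    Tc = t1 + t2 + t3 + Ts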
`
4. Conclusions and future work

This paper describes adaptive resource management middleware that provides integrated services for fault tolerance, distributed computing, and real-time computing. The underlying system model differs significantly from that used in related work. Furthermore, the services have been applied to an experimental Navy system prototype. Experiments show that the services provide bounded response times, scalable services, and low intrusiveness.
`
5. Acknowledgements

This work was sponsored in part by DARPA/NCCOSC contract N66001-97-C-8250, and by NSWC/NCEE contracts NCEE/A303/41E-96 and NCEE/A303/50A-98.
`
`
`
`
`
`6. References
`
`[1]
`
`[2]
`
`[3]
`
`[4]
`
`[5]
`
`[6]
`
`[7]
`
`“Condor Project,”
`http://www.cs.wisc.edu/condor/, 1999.
`
`Cray Research, Document in-2153 2/97, Tech-
`nical report, Cray Research, 1997.
`
`I. Foster and C. Kesselman. “Globus Project,”
`http://www.globus.org/, 1999.
`
`R. Henderson and D. Tweten. “Portable Batch
`Systems: External Reference Specification,”
`Technical report, NASA, Ames Research Cen-
`ter, 1996.
`
`IBM. Corporation. “IBM Load Leveler: User's
`Guide,” Sept. 1993.
`
`Gary Koob, “Quorum,” Proceedings of the
`DARPA ITO General PI Meeting, pages A-59
`to A-87, October 1996.
`
`B. Shirazi, A.R. Hurson, and K. Kavi, “Sched-
`uling and Load Balancing in Parallel and Dis-
`tributed Systems,” IEEE Press, 1995.
`
`[8]
`
`[9]
`
`[10]
`
`[11]
`
`[12]
`
`S. Son, "Advances in Real-Time Systems,"
`Prentice Hall, 1995.
`
`J. Stankovic, and K. Ramamritham, "Advances
`in Real-Time Systems," IEEE Computer Soci-
`ety Press, April 1992.
`
`L. R. Welch, B. Ravindran, R. Harrison, L.
`Madden, M. Masters and W. Mills, “Challenges
`in Engineering Distributed Shipboard Control
`Systems,” The IEEE Real-Time Systems Sym-
`posium, December 1996.
`
`L. R. Welch, B. Ravindran, B. Shirazi and C.
`Bruggeman, “Specification and analysis of dy-
`namic, distributed real-time systems,” in Pro-
`ceedings of the 19th IEEE Real-Time Systems
`Symposium, 72-81, IEEE Computer Society
`Press, 1998.
`
`S. Zhou, “LSF: Load Sharing in Large-scale
`Heterogeneous Distributed Systems,” Proc.
`Workshop on Cluster Computing, 1992.
`
`
`
`



