2002 Pacific Rim International Symposium on
Dependable Computing

16-18 December 2002
Tsukuba City, Ibaraki, Japan
`
IEEE Computer Society
`
`Sponsored by
`IEEE Computer Society Technical Committee on Fault-Tolerant Computing
`IEICE Technical Group on Dependable Computing
`
`
`
`Proceedings
`
`2002 Pacific Rim International Symposium on
`Dependable Computing
`
`PRDC 2002
`
`16-18 December 2002 • Tsukuba, Japan
`
`Sponsored by
`IEEE Computer Society Technical Committee on Fault-Tolerant Computing
`IEICE Technical Group on Dependable Computing
`
`In cooperation with
`IFIP WG10.4 on Dependable Computing and Fault Tolerance
`Jet Propulsion Laboratory
`
IEEE Computer Society
Los Alamitos, California • Washington • Brussels • Tokyo
`
`
`
`
`
`Copyright © 2002 by The Institute of Electrical and Electronics Engineers, Inc.
`All rights reserved
`
`Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may
`photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume
`that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid
`through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
`
`Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE
Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
`
`The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page.
`They reflect the authors' opinions and, in the interests of timely dissemination, are published as presented
`and without change. Their inclusion in this publication does not necessarily constitute endorsement by the
`editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.
`
`IEEE Computer Society Order Number PR01852
`ISBN 0-7695-1852-4
`Library of Congress Number 2002113942
`
`Additional copies may be ordered from:
`
`IEEE Computer Society
`Customer Service Center
`10662 Los Vaqueros Circle
`P.O. Box 3014
`Los Alamitos, CA 90720-1314
`Tel: + 1 800 272 6657
`Fax: + 1 714 821 4641
`http://computer.org/
`csbooks@computer.org
`
`IEEE Service Center
`445 Hoes Lane
`P.O. Box 1331
`Piscataway, NJ 08855-1331
`Tel: + 1 732 981 0060
`Fax: + 1 732 981 9667
`http://shop.ieee.org/store/
customer-service@ieee.org
`
`IEEE Computer Society
`Asia/Pacific Office
`Watanabe Bldg., 1-4-2
`Minami-Aoyama
`Minato-ku, Tokyo 107-0062
`JAPAN
`Tel: + 81 3 3408 3118
`Fax: + 81 3 3408 3553
`tokyo.ofc@computer.org
`
`Editorial production by Anne Jacobs
`
`Cover art production by Joe Daigle/Studio Productions
`
`Printed in the United States of America by The Printing House
`
`
`
`
`Table of Contents
`2002 Pacific Rim International Symposium on Dependable Computing
`
General Chair's Message ............................................................................ ix
Program Co-Chairs' Message ..................................................................... xi
Organizing Committee ............................................................................... xii
Steering Committee ................................................................................... xiii
Program Committee ................................................................................... xiv
Reviewers

Keynote Speech (I)
Fault Tolerance in Autonomic Computing Environment .............................. 3
      Y. Tohma

Session 1A: Dependable Distributed Systems (I)
Formal Specification and Verification of a Group Membership Protocol
for an Intrusion-Tolerant Group Communication System ............................ 9
      H. V. Ramasamy, M. Cukier, and W. H. Sanders
Asynchronous Active Replication in Three-tier Distributed Systems ......... 19
      R. Baldoni, C. Marchetti, and S. Tucci Piergiovanni

Session 1B: Checkpointing
On Characteristics of DEF Communication-Induced Checkpointing
Protocols .................................................................................................... 29
      J. Tsai and J.-W. Lin
A Low Overhead Checkpointing Protocol for Mobile Computing
Systems ...................................................................................................... 37
      C.-Y. Lin, S.-C. Wang, S.-Y. Kuo, and I.-Y. Chen
The Cost of Checkpointing, Logging and Recovery for the Mobile
Agent Systems ........................................................................................... 45
      H. Kim, H. Y. Yeom, T. Park, and H. Park

Session 2A: Dependable Systems (I)
Analysis of the Effects of Real and Injected Software Faults: Linux as
a Case Study .............................................................................................. 51
      T. Jarboui, J. Arlat, Y. Crouzet, K. Kanoun, and T. Marteau
Principles of Multi-Level Reflection for Fault Tolerant Architectures ....... 59
      F. Taïani, J.-C. Fabre, and M.-O. Killijian
Hardware/Software Co-Reliability of Configurable Digital Systems ......... 67
      M. Choi, N. Park, Y. Kim, and F. Lombardi
Highly Fault-Tolerant FPGA Processor by Degrading Strategy ................. 75
      Y. Nakamura and K. Hiraki
`
`
`
Session 2B: Reliability/Dependability Analysis
Using Software Implemented Fault Inserter in Dependability Analysis ..... 81
      P. Gawkowski and J. Sosnowski
Analyzing Network Reliability with Imperfect Nodes using OBDD .......... 89
      F.-M. Yeh, H.-Y. Lin, and S.-Y. Kuo
Reliability Analysis of Grid Computing Systems ....................................... 97
      Y. S. Dai, M. Xie, and K. L. Poh
A Control Theory Approach for Analyzing the Effects of Data Errors
in Safety-Critical Control Systems ........................................................... 105
      O. Askerdal, M. Gafvert, M. Hiller, and N. Suri

Keynote Speech (II)
Caveat Emptor: Making Grid Services Dependable from the Client
Side ........................................................................................................... 117
      M. Livny and D. Thain

Session 3A: Dependable Distributed Systems (II)
An Introduction to the Renaming Problem ............................................... 121
      M. Raynal
Passive Replication Schemes in AQuA ..................................................... 125
      Y. Ren, P. Rubel, M. Seri, M. Cukier, W. H. Sanders, and T. Courtney
Detecting Feature Interactions in Telecommunication Services with
a SAT Solver ............................................................................................. 131
      T. Tsuchiya, M. Nakamura, and T. Kikuno

Session 3B: Dependable Networks
Fault-Tolerant Properties of Generalized Hierarchical
Completely-Connected Networks ............................................................. 137
      T. Takabatake, M. Kitakami, and H. Ito
Energy Efficient and Robust Multicast Protocol for Mobile Ad Hoc
Networks ................................................................................................... 145
      S. Moh, C. Yu, B. Lee, and H. Y. Youn
Configurable PC Clusters Using a Hierarchical
Complete-Connection-Based Switching Network ..................................... 153
      N. Tsuda

Keynote Speech (III)
Perspectives on Dependable Computing for Solar System
Exploration ............................................................................................... 163
      L. Alkalai
`
`
`
Session 4A: Dependable Systems (II)
Enhancing Access Control with SysGuard, a Reference Monitor
Supporting Portable and Composable Kernel Module .............................. 167
      Y. Shinjo, K. Eiraku, A. Suzuki, K. Itano, and C. Pu
An Adaptive High-Low Water Mark Destage Algorithm for Cached
RAID5 ....................................................................................................... 177
      Y. J. Nam and C. Park
A Packet-Based Caching Proxy with Loss Recovery for Video
Streaming .................................................................................................. 185
      K.-S. Hsueh and S.-D. Wang

Session 4B: Dependability Evaluation and Verification
Reliability Evaluation of Multi-State Systems Subject to Imperfect
Coverage Using OBDD ............................................................................. 193
      Y.-R. Chang, S. V. Amari, and S.-Y. Kuo
Characterization of Operating Systems Behavior in the Presence of
Faulty Drivers through Software Fault Emulation ..................................... 201
      J. Duraes and H. Madeira
Level Oriented Formal Model for Asynchronous Circuit Verification
and Its Efficient Analysis Method ............................................................. 210
      T. Kitai, Y. Oguro, T. Yoneda, E. Mercer, and C. Myers

Session 5A: Consensus in Distributed Systems
Consensus in Synchronous Systems: A Concise Guided Tour .................. 221
      M. Raynal
A Consensus Algorithm for Synchronous Distributed Systems using
Mobile Agent ............................................................................................ 229
      J. Cao, X. Wang, S. Lo, and S. K. Das

Session 5B: Coding
Evaluation of Convolutional-Code-Based FEC under Limited
Recovery Time and Its Application to Real-Time Transmission ............... 239
      M. Arai, H. Kurosu, S. Fukumoto, and K. Iwasaki
A Class of Random Multiple Bits in a Byte Error Correcting (St/bEC)
Codes for Semiconductor Memory Systems ............................................. 247
      G. Umanesan and E. Fujiwara

Session 6A: Dependable Database
Providing Persistence for Sensor Streams with Light Neighbor WAL ...... 257
      H. Kawashima, M. Toyama, M. Imai, and Y. Anzai
Definition of Faultloads Based on Operator Faults for DBMS
Recovery Benchmarking ........................................................................... 265
      M. Vieira and H. Madeira
`
`
`
Session 6B: Test
An Alternative Method of Generating Tests for Path Delay Faults
Using N-Detection Test Sets ..................................................................... 275
      H. Takahashi, K. K. Saluja, and Y. Takamatsu
Enhancing Delay Fault Testability for Iterative Logic Arrays ................... 283
      S.-K. Lu and C.-H. Yeh

Session 7A: Neural Network
Comparison with Defect Compensation Methods for Feed-Forward
Neural Networks ....................................................................................... 293
      K. Takahashi, K. Yamamori, S. Horiguchi, and I. Yoshihara
An Extreme Value Injection Approach with Reduced Learning Time
to Make MLNs Multiple-Weight-Fault Tolerant ....................................... 301
      I. Takanami and Y. Oyama

Session 7B: Practical Reports
Event Log Based Dependability Analysis of Windows NT and 2K
Systems ..................................................................................................... 311
      C. Simache, M. Kaaniche, and A. Saidane
Using VHDL-Based Fault Injection to Exercise Error Detection
Mechanisms in the Time-Triggered Architecture ...................................... 316
      J. Gracia, D. Gil, J. C. Baraza, and P. J. Gil
Formal Verification of a Pipelined Processor with New Memory
Hierarchy Using a Commercial Model Checker ....................................... 321
      H. Nakamura, T. Arai, and M. Fujita

Author Index ............................................................................................. 325
`
`
`
An Adaptive High-Low Water Mark Destage Algorithm for Cached RAID5

Young Jin Nam and Chanik Park
Department of Computer Science and Engineering/PIRL
Pohang University of Science and Technology
Kyungbuk, Republic of Korea
{yjnam,cipark}@postech.ac.kr
`
`Abstract
`
The High-Low Water Mark destage (HLWM) algorithm is widely used to enable a cached RAID5 to flush dirty data from its write cache to disks. It activates and deactivates a destaging process based on two time-invariant thresholds defined by cache occupancy levels. However, the opportunity exists to improve I/O throughput by adaptively changing the thresholds. This paper proposes an adaptive HLWM algorithm which dynamically changes its thresholds according to a varying I/O workload. The two thresholds are defined as the product of the changing rates of the cache occupancy level and the times required to fill and empty the cache. Performance evaluations with a cached RAID5 simulator reveal that the proposed algorithm outperforms the HLWM algorithm in terms of read response time, write cache hit ratio, and disk utilization.
`
`1. Introduction
`
Despite its prevalence, RAID5 [4] has suffered from the small write problem: RAID5 requires four disk accesses to write a single disk block, namely old parity/data reads and new parity/data writes. Several solutions have been proposed to overcome the small write problem in RAID5 [1, 2, 5, 6, 7]. One of them utilizes a non-volatile write cache [1]. Since the service of a write request entails only writing data into the write cache, it can be performed quickly. The written (i.e., dirty) data in the cache must be flushed to physical disks when the cache occupancy level reaches a pre-defined threshold. The operation of flushing dirty data is called a destaging process, which is usually initiated by a destage scheduler or a destage algorithm.
The desirable properties of a destage algorithm are as follows. First, it should not affect the response time of read requests; this implies that the destaging process is initiated only when no read requests are in service. Second, it should keep the cache occupancy level as high as possible to increase the write cache hit ratio. In addition, keeping the cache occupancy level high may reduce the number of disk I/Os required for destaging. Third, it should avoid cache overflow, i.e., maintain a sufficient number of clean or free cache entries. Observe that a trade-off must be made between the last two properties. In addition, two issues ought to be taken into consideration to meet these properties: how to determine the thresholds that manage the destaging process (e.g., starting or stopping destaging), and how to decide which dirty cache entries are to be destaged. This paper mainly addresses the first issue: how to efficiently determine the destaging thresholds.
Two destage algorithms that manipulate destaging thresholds have been available: the High-Low Water Mark (HLWM) algorithm [1] and the Linear Threshold (LT) algorithm [7]. The HLWM algorithm is based on two destaging thresholds: the High Water Mark (HWM) and the Low Water Mark (LWM). It starts the destaging process when the current cache occupancy level rises to the HWM, and it stops destaging when the cache occupancy level falls to the LWM. The LT algorithm defines more than two thresholds. As the current cache occupancy level reaches each higher threshold, it accelerates the destaging process by including more dirty entries in a single destaging write operation. Even though this algorithm outperforms the HLWM algorithm, it requires a complicated operation to compute the destaging costs for all the dirty entries at every disk head move. Observe that both the HLWM and LT algorithms define their destaging thresholds as time-invariant cache occupancy levels; the thresholds remain unchanged in the presence of a time-varying I/O workload. However, it is desirable to set the HWM and LWM to higher levels under a light I/O workload in order to increase the write cache hit ratio. Conversely, the HWM and LWM should be set to lower levels under a heavy I/O workload, so as to prepare a sufficient number of free or clean cache entries. In sum, we claim that the opportunity exists to improve the I/O throughput by adaptively changing the HWM and LWM according to a varying I/O workload.
In this paper, we propose an Adaptive High-Low Water Mark destage (AHLWM) algorithm which dynamically determines its thresholds according to a varying I/O workload. The thresholds are defined as the product of the changing rates of the cache occupancy level and the times required to fill and empty the cache. We show via performance evaluations on our cached RAID5 simulator that the AHLWM algorithm outperforms the HLWM algorithm in terms of read response time, write cache hit ratio, and disk utilization. This paper is organized as follows: Section 2 describes the AHLWM algorithm, Section 3 evaluates its performance by comparison with the HLWM algorithm and then discusses an issue related to a bursty I/O workload, and Section 4 concludes this paper.
`
2. Adaptive High-Low Water Mark Destage Algorithm
`
`This section describes the Adaptive High-Low Water
`Mark destage algorithm and then gives illustrative exam-
`ples to explain its operation under various workload envi-
`ronments and configuration parameters.
`
Algorithm Description: Figure 1 compares the High-Low Water Mark destage (briefly, HLWM) algorithm and the Adaptive High-Low Water Mark destage (AHLWM) algorithm in terms of managing their thresholds.

[Figure 1: Comparison of the two thresholds of the HLWM algorithm (a) and the AHLWM algorithm (b); the vertical axis is the cache occupancy level, ranging from 0 to the cache size S.]

Typically, the HLWM algorithm defines two static threshold values, the High Water Mark (HWM) and the Low Water Mark (LWM), by using cache occupancy levels, as shown in Figure 1(a). The HWM (\mathcal{H}_{static}) and the LWM (\mathcal{L}_{static}) are written as

    \mathcal{H}_{static} = \rho_H S,    (1)
    \mathcal{L}_{static} = \rho_L S,    (2)

where S is the cache size and 0 < \rho_L < \rho_H < 1. Observe that both the HWM and LWM of the HLWM algorithm remain unchanged during its operation under a time-variant I/O workload.

While the HWM and LWM are time-invariant in the HLWM algorithm, they are time-variant in the AHLWM algorithm, as shown in Figure 1(b). The HWM and LWM of the AHLWM algorithm can be written as follows:

    \mathcal{H}_{adaptive}(t) = \rho_H(t) S,    (3)
    \mathcal{L}_{adaptive}(t) = \rho_L(t) S.    (4)

Next, we derive the time-variant \rho_H(t) and \rho_L(t). Let \lambda_B and \mu_B represent the base increasing and decreasing rates of the cache occupancy level. Note that \lambda_B is related to the increasing rate of the cache occupancy level, which accounts for the write cache hit and is not directly proportional to the arrival rate of write I/O requests. Let the current increasing and decreasing rates of the cache occupancy level at time t be \lambda(t) and \mu(t). In addition, we define an effective decreasing rate of the cache occupancy level at time t, \alpha(t) = \mu(t) - \lambda(t). Similarly, \alpha_B = \mu_B - \lambda_B. Then, let us define T_H and T_L, which represent the times required to fill and empty the cache, as follows:

    T_H = (1 - \rho_H) S / \lambda_B,    (5)
    T_L = \rho_L S / \alpha_B.    (6)

Given \lambda(t) and \alpha(t), the following equations should hold:

    \lambda(t) T_H + \rho_H(t) S = S,    (7)
    \rho_L(t) S - T_L \alpha(t) = 0.    (8)

Finally, the time-variant \rho_H(t) and \rho_L(t) are written as:

    \rho_H(t) = \max\{0, 1 - (1 - \rho_H) \lambda(t) / \lambda_B\},    (9)
    \rho_L(t) = \min\{1, \max\{0, \rho_L \alpha(t) / \alpha_B\}\}.    (10)

Note that \rho_H(t) can be smaller than \rho_L(t) under a rare condition. To avoid this, the following condition has to be added in the implementation: \rho_L(t) = \rho_H(t) if \rho_H(t) < \rho_L(t).
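To make the threshold update concrete, the following is a minimal sketch of Equations (9)-(10) together with the guard condition above (our own illustration, not the authors' implementation; the function name is hypothetical and the default values are taken from Table 1):

    # Sketch of the AHLWM threshold update, following Equations (3)-(4)
    # and (9)-(10). Rates are in cache entries per second.
    def adaptive_thresholds(lam_t, mu_t, cache_size,
                            rho_h=0.7, rho_l=0.3,
                            lambda_b=49.0, mu_b=70.0):
        """Return (HWM, LWM) in cache entries for the current rates."""
        alpha_b = mu_b - lambda_b      # base effective decreasing rate
        alpha_t = mu_t - lam_t         # current effective decreasing rate

        # Equation (9): rho_H(t) = max{0, 1 - (1 - rho_H) lambda(t)/lambda_B}
        rho_h_t = max(0.0, 1.0 - (1.0 - rho_h) * lam_t / lambda_b)
        # Equation (10): rho_L(t) = min{1, max{0, rho_L alpha(t)/alpha_B}}
        rho_l_t = min(1.0, max(0.0, rho_l * alpha_t / alpha_b))

        # Guard for the rare case rho_H(t) < rho_L(t).
        if rho_h_t < rho_l_t:
            rho_l_t = rho_h_t

        return rho_h_t * cache_size, rho_l_t * cache_size

    # Example: halving lambda(t) relative to lambda_B (a light workload)
    # raises the HWM from 0.7*S to (1 - 0.3*0.5)*S = 0.85*S, so more dirty
    # data stays cached; a heavier workload lowers it symmetrically.
    hwm, lwm = adaptive_thresholds(lam_t=24.5, mu_t=70.0, cache_size=1024)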
`Next, we will look into the useful properties of the
`AHLWM algorithm associated with the current increasing
`and decreasing rates. To begin, we can easily see that the
`following lemma holds without proof.
`
`
`
`
`
Lemma 1 Given \lambda(t) = \lambda_B and \mu(t) = \mu_B at any time t, the AHLWM algorithm degenerates to the HLWM algorithm.

Note that the two thresholds of the AHLWM algorithm are dynamically determined as \lambda(t) and \mu(t) change. However, the following inequality should be met at any given time t for normal operation: \mathcal{H}_{adaptive}(t) \geq \mathcal{L}_{adaptive}(t).
`
Theorem 1 \mathcal{H}_{adaptive}(t) \geq \mathcal{L}_{adaptive}(t) at any time t if

    (1 - \rho_H) k \leq \lambda_B \leq \mu_B - \rho_L k,    (11)

for a given \mu(t) = k, where 0 < k \leq \mu_B / (1 - (\rho_H - \rho_L)).

Proof: Basically, Equation (11) can be derived from the two boundary conditions in Equations (9) and (10). It is easily seen that \mathcal{H}_{adaptive}(t) \geq \mathcal{L}_{adaptive}(t) when \lambda(t) = \lambda_B and \mu(t) = \mu_B, according to Lemma 1. Next, we consider the two boundary conditions: 1) \rho_L(t)|_{\lambda(t)=0} \leq \rho_H(t)|_{\lambda(t)=0} = 1, and 2) \rho_H(t)|_{\lambda(t)=k} \geq \rho_L(t)|_{\lambda(t)=k} = 0. We can derive the relation \lambda_B \leq \mu_B - \rho_L k from boundary condition 1. Also, the relation (1 - \rho_H) k \leq \lambda_B can be obtained from boundary condition 2. Since (1 - \rho_H) k \leq \mu_B - \rho_L k, we have k \leq \mu_B / (1 - (\rho_H - \rho_L)). This implies that the maximum value of k, i.e., of \mu(t), is confined to \mu_B / (1 - (\rho_H - \rho_L)), called \mu_B^{max}. Also, we can show that the two boundary conditions hold when k = \mu_B^{max}. Thus, this theorem follows. ∎
`
Corollary 1 \mathcal{H}_{adaptive}(t) \geq \mathcal{L}_{adaptive}(t) holds at any time t if \lambda_B = \frac{1 - \rho_H}{1 - \rho_H + \rho_L} \mu_B.

Proof: If \mu(t) = \mu_B^{max} in Equation (11) of Theorem 1, the upper and lower bounds of \lambda_B converge to the single point \lambda_B = \frac{1 - \rho_H}{1 - \rho_H + \rho_L} \mu_B, which always makes the condition hold. Thus, this corollary follows. ∎
`
Corollary 2 Given 0 < \mu(t) \leq \mu_B^{max}, \mathcal{H}_{adaptive}(t) \geq \mathcal{L}_{adaptive}(t) holds at any time t if

    (1 - \rho_H) \mu_B^{max} \leq \lambda_B \leq (1 - \rho_L) \mu_B^{max}.    (12)

Proof: Equation (12) can be obtained by replacing k with \mu_B^{max} in Equation (11) of Theorem 1. Thus, this corollary follows. ∎
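As a numeric illustration (our own worked example, not from the paper, instantiating Corollary 2 with the experimental values of Section 3, where \rho_H = 0.7, \rho_L = 0.3, and the empirically obtained rate to which Corollary 2 is applied is 70 IOPS):

    (1 - \rho_H)\,\mu_B^{max} \leq \lambda_B \leq (1 - \rho_L)\,\mu_B^{max}
    \;\Longrightarrow\;
    0.3 \times 70 = 21 \;\leq\; \lambda_B \;\leq\; 0.7 \times 70 = 49,

so the largest admissible base increasing rate is \lambda_B = 49 IOPS, which is the value adopted in the experiments of Section 3.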
`
`
Illustrative Examples: Figure 2 demonstrates the expected adaptiveness of the HWM and LWM under the AHLWM algorithm according to the variation of an I/O workload. In Figure 2(a), the HWM drops below its base level when \lambda(t) rises above \lambda_B, which is mainly attributed to a higher increasing rate of dirty data in the write cache; conversely, the HWM rises above its base level when \lambda(t) falls below \lambda_B. Similarly, as shown in Figure 2(b), the LWM rises above its base level when \alpha(t) exceeds \alpha_B, which is attributed to a higher rate of destaging data to disks or a lower increasing rate of dirty data in the write cache; conversely, the LWM drops below its base level when \alpha(t) decreases. Observe that Equations (5)-(8) remain valid under any I/O workload level.
`
[Figure 2: Variations of the HWM (a) and the LWM (b) according to different levels of I/O workload, where S represents the cache size.]
`
3. Performance Evaluations under Cached RAID5

For performance evaluations of the AHLWM algorithm, we have implemented it within a cached RAID5 simulator which employs a write cache. We then compare the performance of the proposed algorithm with that of the HLWM algorithm. We begin by describing the cached RAID5 simulator.
`
`3.1. Simulation Environments
`
The cached RAID5 simulator consists of a host I/O request queue, a host I/O scheduler, RAID I/O processes, a volatile read cache, a non-volatile write cache, disk I/O request queues, disk I/O schedulers, a destage process which includes different types of destage algorithms, and a synthetic I/O workload generator, as shown in Figure 3.
`
[Figure 3: The internal structure of the cached RAID5 simulator.]

Overall I/O Operations: Host I/O requests issued by a group of workload generators are first placed into the host I/O request queue in a FIFO manner. Next, the host I/O scheduler receives a host I/O request from the host I/O request queue. Then, it spawns either a RAID read I/O process or a RAID write I/O process, according to the request type. The RAID read I/O process searches for its data first in the write cache and then in the read cache. If valid data resides in either of the two caches (a read cache hit), the cached data is immediately transferred to the host with no disk I/O. If not (a read cache miss), the data is fetched from physical disks into the read cache and then returned to the host. The RAID write I/O process writes its data simply into the write cache and notifies its host of the completion of the current write request service. Later, the destage process flushes the newly written data, called dirty data (entries), into physical disks. The behavior of the destage process is described in detail below.
`
Each individual disk has a pair of queues: a host request queue and a destage request queue. While disk I/O requests made from a read request are placed into the host request queue of a disk, the destage read/write I/O requests issued by the destage process are lined up in the destage queue of the disk. Note that the host request queue is given a higher priority in service than the destage queue, i.e., the requests in the destage request queue are processed only when the host request queue is empty. In the event of a write cache overflow, the destage requests are placed into the host request queue. Each disk also contains its own I/O scheduler. Even though a few efficient disk scheduling algorithms are currently available, we employ a simple FIFO disk scheduling algorithm because disk scheduling performance is not our concern.
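A minimal sketch of this two-queue, host-first service discipline follows (our own illustration; the class and method names are hypothetical, and the real simulator is more elaborate):

    from collections import deque

    class DiskQueues:
        """Per-disk FIFO queues; host requests preempt destage requests."""

        def __init__(self):
            self.host_q = deque()     # disk I/Os from host read requests
            self.destage_q = deque()  # destage read/write I/Os

        def submit_host(self, req):
            self.host_q.append(req)

        def submit_destage(self, req, cache_overflow=False):
            # On a write cache overflow, destage requests are promoted
            # into the host request queue.
            (self.host_q if cache_overflow else self.destage_q).append(req)

        def next_request(self):
            # FIFO service; destage requests are served only when the
            # host request queue is empty.
            if self.host_q:
                return self.host_q.popleft()
            if self.destage_q:
                return self.destage_q.popleft()
            return None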
`
`
Write I/O Operations and Destaging Process: To begin, Figure 4 depicts the state diagram of a cache entry within the write cache. Initially, the state of each entry is NOT_USED and INVALID. When the entry is allocated for storing dirty data from the host, its state is changed to IN_USE and VALID/DIRTY. If new dirty data arrives at the same location before the destaging of the dirty data, it is overwritten to the entry; this is called a write cache hit. Conversely, if new dirty data arrives during the destaging of the dirty data, it is stored separately by allocating a different cache entry. At this point, its state is changed to IN_USE and VALID/DDIRTY. Subsequent dirty data at the same location will be overwritten, as previously. Once the destaging of a dirty data entry completes, its state is changed to NOT_USED and VALID/CLEAN. In addition, if an associated cache entry in the IN_USE and VALID/DDIRTY state exists, its state is changed to IN_USE and VALID/DIRTY.
`
[Figure 4: State diagram of cache entries within the write cache, as driven by host writes and the destage process.]
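The transitions of Figure 4 can also be summarized in code. The following is a minimal sketch reflecting our reading of the diagram (all names are hypothetical):

    from enum import Enum, auto

    class EntryState(Enum):
        NOT_USED_INVALID = auto()   # initial state
        IN_USE_DIRTY = auto()       # dirty data awaiting destage
        IN_USE_DDIRTY = auto()      # dirty data that arrived mid-destage
        NOT_USED_CLEAN = auto()     # destage completed; entry reusable

    def host_write(state, destaging_in_progress):
        """Next state of the entry that receives new dirty data."""
        if state in (EntryState.NOT_USED_INVALID, EntryState.NOT_USED_CLEAN):
            return EntryState.IN_USE_DIRTY   # write cache miss: allocate
        if state is EntryState.IN_USE_DIRTY and not destaging_in_progress:
            return EntryState.IN_USE_DIRTY   # write cache hit: overwrite
        # Data arriving during destaging is stored in a separately
        # allocated entry marked DDIRTY; later hits overwrite it in place.
        return EntryState.IN_USE_DDIRTY

    def destage_done(ddirty_sibling_exists):
        """States after destaging: (this entry, associated DDIRTY entry)."""
        sibling = EntryState.IN_USE_DIRTY if ddirty_sibling_exists else None
        return EntryState.NOT_USED_CLEAN, sibling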
`
Next, we explain the destage process, which flushes dirty write cache entries to physical disks in the background. Note that the old data and parity blocks are read into the read cache. However, those entries are not locked in the read cache until the computation of the new parity blocks completes. This implies that these blocks can be replaced with others before the new parity is calculated. Newly computed parity blocks are overwritten onto the old parity blocks in the read cache, not the write cache. Also, the write of the new parity blocks is not delayed. In the presence of a write cache overflow, implying that no available cache entries exist for newly arrived data from the host, subsequent write requests bypass the write cache, i.e., the write requests are treated as in traditional RAID5. Finally, host read requests are given a higher priority than destage requests. The destage process starts when the current occupancy reaches the HWM and stops when the current occupancy goes down to the LWM. The destage process first scans the write cache and selects a dirty cache entry not currently in destaging. Then, it decides whether the old data and parity blocks of the entry reside in the read cache. If not, it issues destage read requests for the missing blocks. After reading all the old data and parity blocks via the destage reads, it computes the new parity and generates two destage write requests for the new data and parity. Note that a possibility exists that the old blocks in the read cache are replaced while the destage read requests for the missing blocks are in progress. Thus, the destage process goes through another checking step to determine whether all the blocks are still there. When the destage write requests are completed, the state of the dirty cache entry is changed to clean, as previously described.
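In outline, the destaging procedure described above behaves like the following sketch (our own rendering against a hypothetical cache interface; it is illustrative, not the authors' code):

    def run_destage(write_cache, read_cache, hwm, lwm):
        """Start destaging at the HWM; stop once occupancy falls to the LWM."""
        if write_cache.occupancy() < hwm:
            return
        while write_cache.occupancy() > lwm:
            entry = write_cache.pick_dirty_not_in_destaging()
            if entry is None:
                break

            # Issue destage reads for old data/parity blocks missing from
            # the read cache (they are not locked and may be evicted).
            for block in entry.old_blocks():
                if not read_cache.has(block):
                    read_cache.destage_read(block)

            # Re-check: old blocks may have been replaced while the
            # destage reads were in progress.
            if not all(read_cache.has(b) for b in entry.old_blocks()):
                continue   # retry this entry later

            parity = entry.compute_new_parity(read_cache)  # overwrites old parity
            entry.issue_destage_write_data()               # new data
            entry.issue_destage_write_parity(parity)       # new parity, not delayed
            entry.mark_clean()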
`
I/O Workload and Simulation Parameters: We employ synthetically generated I/O workloads throughout our performance evaluations. Each workload generator models an I/O application which issues an I/O request and then waits for its completion in a synchronous manner. We assume that each I/O application generates 8KB read or write requests with one percent spatial locality, where the read/write ratio is set to 4:1. Using realistic I/O workloads remains future work.
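A minimal sketch of such a closed-loop generator follows (our own illustration; how the one-percent spatial locality is realized, and the volume size, are assumptions):

    import random

    def workload_generator(submit_and_wait, n_requests,
                           block_size=8192, read_fraction=0.8,
                           locality=0.01, volume_blocks=512 * 1024):
        """Issue one synchronous 8KB request at a time, then wait.

        read_fraction = 0.8 yields the 4:1 read/write ratio; with
        probability `locality` (one percent) the next request reuses the
        previous block address instead of picking a random one.
        """
        addr = 0
        for _ in range(n_requests):
            if random.random() >= locality:
                addr = random.randrange(volume_blocks)   # random block
            is_read = random.random() < read_fraction
            submit_and_wait(addr * block_size, block_size, is_read)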
`
Table 1 shows the configuration parameters used in our cached RAID5 simulator. The simulated RAID system consists of five disks, four of which are used for data and one for parity. Since the read cache is usually larger than the write cache, the size of the read cache is set to twice that of the write cache. The cache entry size of each cache is set to 8KB.
`
In the experiments, the static cache occupancy levels for the HWM and LWM are set to 0.7 and 0.3 of the whole write cache size, respectively. Note that such static threshold values are generally determined in an ad-hoc manner, independently of a time-varying I/O workload; recall that this is the main motivation of this paper. The configuration of the base increasing and decreasing rates of the cache occupancy level takes the following two steps. First, the base decreasing rate \mu_B was empirically obtained from our simulator under the condition that no intervention of host read requests exists; we obtained 70 IOPS with 8KB requests. Second, the inequality of Corollary 2 is applied to the obtained value in order to decide the value of \lambda_B. Our experiments take the largest admissible \lambda_B by multiplying 1 - \rho_L (= 0.7) by 70, i.e., 49 IOPS. Finally, the current changing rates of the cache occupancy level are monitored every 400 msec. Each individual disk features 512-byte sectors, 1962 cylinders, 19 tracks per cylinder, 72 sectors per track, and 4002 rpm.
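The 400 msec monitoring step might be realized as follows (our own sketch of how \lambda(t) and \mu(t) could be estimated; the counter names are hypothetical):

    def monitor_rates(counters, interval=0.4):
        """Estimate lambda(t) and mu(t) from per-interval entry counts.

        `counters` accumulates, over the last 400 msec, the number of
        write cache entries newly dirtied and the number destaged.
        """
        lam_t = counters.pop('dirtied', 0) / interval   # fill rate
        mu_t = counters.pop('destaged', 0) / interval   # drain rate
        return lam_t, mu_t

    # Each interval, the simulator would then recompute the thresholds,
    # e.g. with the adaptive_thresholds() sketch from Section 2:
    #   lam_t, mu_t = monitor_rates(counters)
    #   hwm, lwm = adaptive_thresholds(lam_t, mu_t, cache_size=1024)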
`
Table 1. Configuration parameters of our cached RAID5 simulator

    Parameter              Value
    -------------------    ----------------------------------
    RAID Size              4.19 GB
    Number of Disks        5 (data: 4, parity: 1)
    Read Cache Size        16 MB (2048 cache entries)
    Write Cache Size       8 MB (1024 cache entries)
    Cache Entry Size       8 KB
    Read/Write Ratio       4:1
    \rho_H                 0.7
    \rho_L                 0.3
    \mu_B                  70 IOPS
    \lambda_B              49 IOPS (0.7 \mu_B)
    Monitoring Interval    400 msec
`
`3.2. Simulation Results
`
The I/O performance of the HLWM and AHLWM algorithms is evaluated in terms of the read response time, write cache hit ratio, and disk utilization as the I/O throughput increases. The response time of read requests has become the more significant metric because fast writes are already achieved with non-volatile memory. This response time can be affected by disk utilization, especially the number of destage I/Os, and by write cache overflows, as well as by the read cache hit ratio. A high write cache occupancy level enhances the write cache hit ratio, which in turn reduces the read response time. Disk utilization depends mainly on the number of read and destage I/O requests; of these, only the destage I/O requests are directly related to the underlying destage algorithm.
`
I/O Performance with Different I/O Workloads: Figures 5-6 show the variation of the read response time, the write cache hit ratio, the number of disk I/Os issued by destage requests (briefly called destage I/Os), and the disk utilization of the HLWM algorithm and the AHLWM algorithm with the increase of I/O throughput. Throughout all experiments, the base increasing rate of the cache occupancy \lambda_B and the base decreasing rate of the cache occupancy \mu_B are set to 49 (0.7 \mu_B) IOs/sec and 70 IOs/sec, respectively. As shown in Figures 5-6, the AHLWM algorithm outperforms the HLWM algorithm. The performance differences are maximized especially under a light I/O workload, in which \lambda(t) is relatively lower than the base value. As the I/O workload increases, however, the performance of both algorithms becomes almost equal, even though the AHLWM algorithm still works slightly better than the HLWM algorithm.
`
[Figure 5: Variations of read response time (a) and write cache hit ratio (b) of the HLWM algorithm and the AHLWM algorithm with the increase of I/O throughput (70-150 IOPS), where \rho_H = 0.7, \rho_L = 0.3, \lambda_B = 49 IOs/sec, and \mu_B = 70 IOs/sec.]
`
The performance enhancements of read response time in the AHLWM algorithm are primarily due to its time-varying, adaptive thresholds, i.e., both the HWM and the LWM at each time are determined based on the current increasing and decreasing rates of the cache occupancy level. Under a light I/O workload, the AHLWM algorithm keeps the cache occupancy level higher than the HLWM algorithm does, meaning that the write cache hit ratio increases. It is easily inferred that a high write cache hit ratio improves the response time of reads, i.e., a high write cache hit ratio can reduce the number of destage I/Os, which in turn lessens the disk utilization caused by destage I/Os.
`
A Variation of the Cache Occupancy Level: Figure 7 presents the average cache occupancy levels at each I/O request arrival rate of the previous results. It is clearly shown that the AHLWM algorithm maintains the cache occupancy level adaptively to the current I/O workload, i.e., it keeps the cache occupancy level relatively high under light workloads, ranging from 82 to 140 IOs/sec. We can expect that the HWM and LWM of the HLWM algorithm become similar to those of the AHLWM algorithm as the I/O workload increases; a series of subsequent results will confirm this expectation. Also, we observed that the cache occupancy levels of both the HLWM and the AHLWM algorithms grow abruptly under heavy workloads beyond 140 IOs/sec, because the increasing rate of the cache occupancy level \lambda(t) is almost equal to the decreasing rate of the cache occupancy level \mu(t) under such a condition. Note that the write cache will be filled up shortly