PARALLEL AND
DISTRIBUTED SYSTEMS

MARCH 2002

VOLUME 13

NUMBER 3

A publication of the IEEE Computer Society
ITDSEO
(ISSN 1045-9219)

This resource is also available on the WWW. Use MadCat to launch.
`
`
`
REGULAR PAPERS

Applications and Algorithms
An Efficient Partitioning Algorithm for Distributed Virtual Environment Systems
J.C.S. Lui and M.F. Chan .......... 193
Fast Sorting Algorithms on a Linear Array with a Reconfigurable Pipelined Bus System
A. Datta, S. Soundaralakshmi, and R. Owens .......... 212

Cluster Computing
Dynamic Cluster Resource Allocations for Jobs with Known and Unknown Memory Demands
L. Xiao, S. Chen, and X. Zhang .......... 223

Compilers
An Advanced Compiler Framework for Non-Cache-Coherent Multiprocessors
Y. Paek, A. Navarro, E. Zapata, J. Hoeflinger, and D. Padua .......... 241

Heterogeneous Computing
Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing
H. Topcuoglu, S. Hariri, and M.-Y. Wu .......... 260

Interconnection Networks
HIPIQS: A High-Performance Switch Architecture Using Input Queuing
R. Sivaram, C.B. Stunkel, and D.K. Panda .......... 275

Machine Architecture
RAPID-Cache—A Reliable and Inexpensive Write Cache for High Performance Storage Systems
Y. Hu, T. Nightingale, and Q. Yang .......... 290

Mapping Applications to Machines
Matching and Scheduling Algorithms for Minimizing Execution Time and Failure Probability of Applications in Heterogeneous Computing
A. Doğan and F. Özgüner .......... 308

Theory
Fair and Efficient Packet Scheduling Using Elastic Round Robin
S.S. Kanhere, H. Sethu, and A.B. Parekh .......... 324
`
`
IEEE
COMPUTER
SOCIETY
http://computer.org
tpds@computer.org
`
`HPE, Exh. 1006, p. 1
`
`
`
`
COMPUTER SOCIETY

The IEEE Computer Society is an association of people with professional interest in the field of computers. All members of the IEEE are eligible for membership in the Computer Society, as are members of certain professional societies and other computer professionals. Computer Society members will receive this Transactions upon payment of the annual Society membership fee ($35 for IEEE members, $78 for all others) plus an annual subscription fee (paper only: $38; electronic only: $30; combination: $49). For additional membership and subscription information, visit our Web site at http://computer.org/subscribe, send email to help@computer.org, or write to IEEE Computer Society, 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1314 USA. Individual subscription copies of Transactions are for personal use only.
`
IEEE TRANSACTIONS ON
PARALLEL AND DISTRIBUTED SYSTEMS

EDITOR-IN-CHIEF
Pen-Chung Yew
University of Minnesota
Dept. of Computer Science
200 Union Street, SE
4-192 EE/CS Building
Minneapolis, MN 55455
yew@cs.umn.edu
`
Editorial Board
Nancy Amato, Texas A&M University, amato@cs.tamu.edu
Doug Blough, Georgia Institute of Technology, doug.blough@ece.gatech.edu
Alok N. Choudhary, Northwestern University, choudhar@ece.nwu.edu
Michel Cosnard, INRIA, Michel.Cosnard@inria.fr
Sandhya Dwarkadas, University of Rochester, sandhya@cs.rochester.edu
Mootaz Elnozahy, IBM Austin Research Lab, mootaz@us.ibm.com
Bruce Maggs, Carnegie Mellon University, bmm@cs.cmu.edu
Margaret Martonosi, Princeton University, martonosi@princeton.edu
Phil McKinley, Michigan State University, mckinley@cps.msu.edu
Ashwini Nanda, IBM T.J. Watson Research Center, ashwini@us.ibm.com
Esmond G. Ng, Lawrence Berkeley Nat'l Lab., EGNg@lbl.gov
Stephan Olariu, Old Dominion University, olariu@cs.odu.edu
Krishna V. Palem, Courant Inst. of Math. Sciences, palem@cs.nyu.edu
Keshav Pingali, Cornell University, pingali@cs.cornell.edu
Timothy M. Pinkston, University of Southern California, tpink@charity.usc.edu
Yves Robert, École Normale Supérieure de Lyon, Yves.Robert@ens-lyon.fr
Lui Sha, Univ. of Illinois, Urbana-Champaign, lrs@cs.uiuc.edu
Richard D. Schlichting, AT&T Shannon Laboratory, rick@research.att.com
Ramesh K. Sitaraman, Univ. of Massachusetts, ramesh@cs.umass.edu
Neeraj Suri, Chalmers University, suri@ce.chalmers.se
Jie Wu, Florida Atlantic University, jie@cse.fau.edu
Tao Yang, Univ. of California, Santa Barbara, tyang@cs.ucsb.edu
Yuanyuan Yang, SUNY, Stony Brook, yang@ece.sunysb.edu
Wei Zhao, Texas A&M Univ., zhao@cs.tamu.edu
Taieb Znati, University of Pittsburgh, znati@cs.pitt.edu
Willy Zwaenepoel, Rice University, willy@cs.rice.edu
`
MANUSCRIPT SUBMISSIONS / STATUS INQUIRIES: For information on submitting a manuscript or on a paper awaiting publication, please contact: Transactions Assistant, TPDS, IEEE Computer Society, 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1314 USA; EMAIL: tpds@computer.org; PHONE: +1 714.821.8380; FAX: +1 714.821.4010
`
`WILuIs KING, President
`‘STEPHEN L. DIAMOND,President-Elect
`BEN Wan,Past President
`RANGACHARKasturt, VP, Publications
`Jerry ENGEL, VP, Conferences & Tutorials
`
`@ IEEE COMPUTER SOCIETY
`Officers
`
`Cart CHANG, VP, Educational Activities
`JamesH. Cross, VP, Chapter Activities
`LoweJOHNsoN, Second VP, Standards Activities
`DEBORAH SCHERRER,First VP, Technical Activities
`DeboraM.Correr,Secretary
`
`Publications Board
`Vice President: RANGACHAR KASTURI
`Vice Chair: RICHARD KEMMERER
`
`WOLFGANGGILOL, Treasurer
`JamesD.Isaak, IEEE Division V Director
`THomas W. WILLIAMS, 2001-2002 IEEE Division VII! Director
`Davip HENNAGE, Executive Director
`
Magazine Operations Chair: George Cybenko
Transactions Operations Chair: Steven L. Tanimoto
Publications Operations Chair: Mark Christensen
IEEE PAB Liaison: Rangachar Kasturi
Members-at-Large: Angela Burgess (ex officio), Gabriella Sanniti di Baja, Thomas Keefe, Anand Tripathi
IEEE CS Press: Michael Williams

Magazines (Editors-in-Chief)
Annals of the History of Computing: Thomas J. Bergin
Computer: James H. Aylor
Computer Graphics & Applications: James J. Thomas
Computing in Science & Engineering: Francis Sullivan
Design & Test: Frank Ferrante
Intelligent Systems: Daniel E. O'Leary
Internet Computing: Munindar P. Singh
IT Professional: Rajesh Gupta
Micro: Ken Sakamura
Multimedia: Forouzan Golshani
Pervasive Computing: M. Satyanarayanan
Software: Steven C. McConnell

Transactions (Editors-in-Chief)
Computers: Jean-Luc Gaudiot
Knowledge & Data Engineering: Philip S. Yu
Mobile Computing: Tom La Porta
Multimedia: Tsuhan Chen
Networking: Mike Liu
Parallel & Distributed Systems: Pen Yew
Pattern Analysis & Machine Intelligence: Rama Chellappa
Software Engineering: John Knight
Very Large Scale Integration: Eby Friedman
Visualization & Computer Graphics: Hans Hagen
`
Executive Staff
David Hennage, Executive Director
Angela Burgess, Publisher
Violet S. Doan, Chief Financial Officer
Anne Marie Kelly, Director, Volunteer Services
John C. Keaton, Manager, Research and Planning
Robert Care, Director, Information Technology & Services

Transactions Department
Alicia L. Barrett, Production Manager
Suzanne Werner, Peer Review Supervisor
Pilar Hawthorne, Kathy Santa Maria, Production Editors
Yu-Tzu Tsai, Julie Turnbaugh, Electronic Media Assistants
Selina Norman, Transactions Assistant
`
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS is published monthly by the IEEE Computer Society. IEEE Corporate Office: Three Park Avenue, 17th Floor, New York, NY 10016-5997 USA. Responsibility for the content rests upon the authors and not upon the IEEE or the IEEE Computer Society. IEEE Computer Society Publications Office: 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1314 USA. IEEE Computer Society Headquarters: 1730 Massachusetts Ave. NW, Washington, DC 20036-1992 USA. Back issues: IEEE members $10.00 (first copy only), nonmembers $20.00 per copy. (Note: Add $4.00 postage and handling charge to any order from $1.00 to $50.00, including prepaid orders.) Complete price information available on request. Also available in microfiche. Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923 USA. For all other copying, reprint, or republication permission, write to: Copyrights and Permissions Department, IEEE Publications Administration, 445 Hoes Lane, PO Box 1331, Piscataway, NJ 08855-1331. Copyright © 2002 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Periodicals postage paid at New York, NY, and at additional mailing offices. Postmaster: Send address changes to IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, IEEE Service Center, 445 Hoes Lane, PO Box 1331, Piscataway, NJ 08855-1331 USA. GST Registration No. 125634188. Canada Post Publications Mail Agreement Number 0487783. Printed in USA.
`
`HPE, Exh. 1006, p. 2
`
`
`
`
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 13, NO. 3, MARCH 2002

193

An Efficient Partitioning Algorithm for
Distributed Virtual Environment Systems
John C.S. Lui, Member, IEEE, and M.F. Chan, Student Member, IEEE
`
Abstract—Distributed virtual environment (DVE) systems model and simulate the activities of thousands of entities interacting in a virtual world over a wide area network. Possible applications for DVE systems are multiplayer video games, military and industrial trainings, and collaborative engineering. In general, a DVE system is composed of many servers and each server is responsible to manage multiple clients who want to participate in the virtual world. Each server receives updates from different clients (such as the current position and orientation of each client) and then delivers this information to other clients in the virtual world. The server also needs to perform other tasks, such as object collision detection and synchronization control. A large scale DVE system needs to support many clients and this imposes a heavy requirement on networking resources and computational resources. Therefore, how to meet the growing requirement of bandwidth and computational resources is one of the major challenges in designing a scalable and cost-effective DVE system. In this paper, we propose an efficient partitioning algorithm that addresses the scalability issue of designing a large scale DVE system. The main idea is to dynamically divide the virtual world into different partitions and then efficiently assign these partitions to different servers. This way, each server will process approximately the same amount of workload. Another objective of the partitioning algorithm is to reduce the server-to-server communication overhead. The theoretical foundation of our dynamic partitioning algorithm is based on the linear optimization principle. We also illustrate how one can parallelize the proposed partitioning algorithm so that it can efficiently partition a very large scale DVE system. Lastly, experiments are carried out to illustrate the effectiveness of the proposed partitioning algorithm under various settings of the virtual world.

Index Terms—Distributed virtual environment, scalability issue, partitioning algorithm, load balancing, communication reduction, linear optimization.
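The load-balancing objective described in the abstract can be illustrated with a toy sketch. The function name and the greedy least-loaded heuristic below are hypothetical illustrations only; the paper's actual algorithm is based on linear optimization and also accounts for server-to-server communication cost:

```python
# Toy illustration of the partitioning goal: assign virtual-world
# partitions to servers so per-server workload is roughly equal.
# A greedy "least-loaded server" heuristic, NOT the paper's algorithm.

def assign_partitions(workloads, num_servers):
    """workloads: list of per-partition workload estimates."""
    loads = [0.0] * num_servers
    assignment = {}
    # Placing heavy partitions first keeps the final loads closer.
    for pid in sorted(range(len(workloads)), key=lambda p: -workloads[p]):
        target = min(range(num_servers), key=lambda s: loads[s])
        assignment[pid] = target
        loads[target] += workloads[pid]
    return assignment, loads

assignment, loads = assign_partitions([5, 3, 3, 2, 2, 1], num_servers=2)
print(loads)  # both servers end up with a load of 8.0
```

Even this crude heuristic balances the example perfectly; the paper's linear-optimization formulation additionally minimizes the traffic between partitions placed on different servers.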
`
`
`
1 INTRODUCTION

ADVANCES in multimedia systems, parallel/distributed database systems, and high speed networking technologies enable system designers to build a distributed system which allows many users to explore and interact under a three dimensional virtual environment. In general, a 3D virtual environment is a virtual world which consists of many high-resolution 3D graphics sceneries to represent a real-life world. For example, we can have a 3D virtual world to represent a lecture hall so that hundreds of students and scientists can listen to a seminar presented by Professor Daniel C. Tsui,¹ or we can have a large 3D virtual world to represent the latest COMDEX show which has many customers reviewing the latest softwares and electronic gadgets. This type of shared, computer-resident virtual world is called a distributed virtual environment (DVE) [4]. Like other ground-breaking computer technologies, DVE will change the way we learn, work, and interact with other people in the society.

1. A 1998 Nobel Prize winner in Physics for the discovery of a new form of quantum fluid with fractionally charged excitations.
`
J.C.S. Lui is with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong. E-mail: cslui@cse.cuhk.edu.hk.
M.F. Chan is with Poly-Asia Computer Inc., Kowloon, Hong Kong. E-mail: fei@alumni.cse.cuhk.edu.hk.
Manuscript received 11 Jan. 2000; revised 21 Apr. 2001; accepted 13 Aug. 2001.
For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number 111201.
1045-9219/02/$17.00 © 2002 IEEE
`
`
`
`
To illustrate how a DVE system can change our lifestyles and the way we handle our business operation, let us consider the following situation. Let's say an architect from New York, a civil and a structural engineer from Paris, a financial planner from Hong Kong, and an interior designer from Tokyo all need to have a business meeting to discuss the designing and financing issues of a new high-rise office complex. Under a DVE setting, these people can convene a meeting in a virtual world without leaving their respective homes and offices. Their meeting can be carried out in a DVE system. These participants can interact with each other in a virtual world of the new high-rise office complex that they are proposing to build. Each participant in this business meeting can virtually walk around in the proposed high-rise office building, interact with each other and carry out the discussion. For example, in this virtual high-rise office complex, each participant in the meeting is represented by a 3D object, which is known as an avatar. Each participant can walk around in this virtual office building and, in the process, rearrange any 3D object in the environment (e.g., rearrange paintings and furniture or select different kinds of carpet). Any change to a 3D object in this virtual world will be visible to all participants. Participants in this meeting are able to interact with each other in real time, as well as to inquire and to receive any relevant information of the virtual world. For example, participants can query about the credit history of a manufacturer who is responsible to produce the office furniture.
`
HPE, Exh. 1006, p. 3
`
`
`
`
`
`
`
`
`
`RAPID-Cache—A Reliable and
`Inexpensive Write Cache for
`High Performance Storage Systems
`
`Yiming Hu, Senior Member, IEEE, Tycho Nightingale, and
`Qing Yang, Senior Member, IEEE
`
Abstract—Modern high performance disk systems make extensive use of nonvolatile RAM (NVRAM) write caches. A single-copy NVRAM cache creates a single point of failure while a dual-copy NVRAM cache is very expensive because of the high cost of NVRAM. This paper presents a new cache architecture called RAPID-Cache for Redundant, Asymmetrically Parallel, and Inexpensive Disk Cache. A typical RAPID-Cache consists of two redundant write buffers on top of a disk system. One of the buffers is a primary cache made of RAM or NVRAM and the other is a backup cache containing a two-level hierarchy: a small NVRAM buffer on top of a log disk. The small NVRAM buffer combines small write data and writes them into the log disk in large sizes. By exploiting the locality property of I/O accesses and taking advantage of well-known Log-structured File Systems, the backup cache has nearly equivalent write performance as the primary RAM cache. The read performance of the backup cache is not as critical because normal read operations are performed through the primary RAM cache and reads from the backup cache happen only during error recovery periods. The RAPID-Cache presents an asymmetric architecture with a fast-write-fast-read RAM being a primary cache and a fast-write-slow-read NVRAM-disk hierarchy being a backup cache. The asymmetrically parallel architecture and an algorithm that separates actively accessed data from inactive data in the cache virtually eliminate the garbage collection overhead, which are the major problems associated with previous solutions such as Log-structured File Systems and Disk Caching Disk. The asymmetric cache allows cost-effective designs for very large write caches for high-end parallel disk systems that would otherwise have to use dual-copy, costly NVRAM caches. It also makes it possible to implement reliable write caching for low-end disk I/O systems since the RAPID-Cache makes use of inexpensive disks to perform reliable caching. Our analysis and trace-driven simulation results show that the RAPID-Cache has significant reliability/cost advantages over conventional single NVRAM write caches and has great cost advantages over dual-copy NVRAM caches. The RAPID-Cache architecture opens a new dimension for disk system designers to exercise trade-offs among performance, reliability, and cost.

Index Terms—Disks, storage systems, performance, reliability, fault-tolerance.
`
1 INTRODUCTION
`
MODERN disk I/O systems make extensive use of nonvolatile RAM (NVRAM) write caches to facilitate asynchronous write [2], [3], [4], i.e., a write request is acknowledged before the write goes to disk. Such write caches significantly reduce response times of disk I/O systems seen by users, particularly in RAID systems. Large write caches can also improve system throughput by taking advantage of both temporal and spatial localities [2], [5], as data may be overwritten several times or combined together before being written to the disk. I/O requests are very bursty [6]; requests often come together, with long intervals of relatively inactive periods in between. Large write caches also benefit from the burstiness of write workloads since data coming from bursts can be quickly stored in the cache and written back to the disk later when the system is less busy. Treiber and Menon reported that write caches could reduce disk utilization for writes by an order of magnitude when compared to basic RAID-5 systems [3]. However, the use of write caches introduces two problems: poor reliability and high cost.

Disks are impressively reliable today, with a Mean Time To Failure (MTTF) of up to one million hours. Such a low failure rate, coupled with possible redundancy such as RAID, gives a Mean Time To Data Loss (MTTDL) of several hundreds of millions of hours in a typical RAID-5 system [7]. Adding a single cache in front of a disk system creates a single point of failure, which is vulnerable to data loss. Savage and Wilkes pointed out in [7] that because typical NVRAM technology (battery backed RAM) has a quite low MTTF of 15K hours, a single-copy NVRAM cache suffers significantly higher risk of data loss than results from disk failures. To overcome the reliability problem, some high-end RAID systems use dual-copy caches so that a failure in one cache leaves the other cache intact [2]. When a write request comes, the controller writes two copies of the data independently into the two caches, a primary cache and a backup cache. Besides the reliability problem, NVRAM is also known to be very costly [7], [8], [9] so the size of the

Y. Hu is with the Department of Electrical & Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221-0030. E-mail: yhu@ececs.uc.edu.
T. Nightingale is with Sun Microsystems, 4150 Network Circle, USAN-208, Santa Clara, CA 95054. E-mail: tycho.nightingale@sun.com.
Q. Yang is with the Department of Electrical and Computer Engineering, University of Rhode Island, Kingston, RI 02881. E-mail: qyang@ele.uri.edu.
Manuscript received 4 Dec. 2000; accepted 20 Sept. 2001.
For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number 113249.
1045-9219/02/$17.00 © 2002 IEEE
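The gap between a 15K-hour NVRAM MTTF and a one-million-hour disk MTTF can be made concrete with a rough calculation. The exponential (memoryless) failure model used here is an illustrative assumption, not one made by the paper:

```python
# Rough annual failure probability from MTTF, assuming an exponential
# failure model: P(fail within t) = 1 - exp(-t / MTTF).
import math

HOURS_PER_YEAR = 8760
mttf_nvram = 15_000      # hours, battery-backed NVRAM [7]
mttf_disk = 1_000_000    # hours, a modern disk

p_nvram = 1 - math.exp(-HOURS_PER_YEAR / mttf_nvram)
p_disk = 1 - math.exp(-HOURS_PER_YEAR / mttf_disk)
print(f"NVRAM: {p_nvram:.1%}/yr, disk: {p_disk:.2%}/yr")
# Under this model a single-copy NVRAM cache is far more likely to
# fail within a year than any one of the disks it protects.
```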
`
`HPE, Exh. 1006, p. 4
`
`
`
HU ET AL.: RAPID-CACHE—A RELIABLE AND INEXPENSIVE WRITE CACHE FOR HIGH PERFORMANCE STORAGE SYSTEMS

NVRAM cache is often limited. For example, a major NVRAM manufacturer quoted the price of NVRAM with embedded lithium-cell batteries for $55/MB in quantity as of December 2000. The cost of disks, on the other hand, is about 0.5 cents/MB, which is a difference of four orders of magnitude. Moreover, the cost difference is widening (the difference was three orders of magnitudes two years ago) because prices of disks are falling very rapidly. For a disk system with a reasonably sized write cache, the NVRAM may dominate the cost of the entire system. For example, in a system with 16 disks (40 GB per disk) and an NVRAM write cache of 256 MB, at $55/MB, the NVRAM costs about $14,080, while the total cost of 16 disks is only $3,200 (assuming each 40-GB disk costs $200). If we use dual-copy caches to ease the reliability problem of the single-copy cache, the cost becomes prohibitively high, particularly for large caches. As a result, it is only suitable for the upper echelon of the market [10].

The standard dual-copy write cache system has a symmetric structure, where both the primary write cache and the backup write cache have the same size and the same access characteristics—fast read speed and fast write speed. However, the backup cache does not provide any performance benefit to the system during normal operations. Therefore, it is wasteful to use a backup cache identical to the primary cache. What is needed is only a backup cache that can be written to very quickly while its read operations are not as critical since reads from the backup cache occur only during error-recovering periods.

Based on these observations, we propose a new disk cache architecture called Redundant, Asymmetrically Parallel, Inexpensive Disk Cache, or RAPID-Cache for short, to provide fault-tolerant caching for disk I/O systems inexpensively. The main idea of the RAPID-Cache is to use a conventional, fast-write-fast-read primary cache and a nonvolatile, fast-write-slow-read backup cache. The primary cache is made of normal NVRAM or DRAM, while the backup cache consists of a small NVRAM cache and a log disk (cache disk). In the backup cache, small and random writes are first buffered in the small NVRAM buffer to form large logs that are written into the cache disk later in large transfers, similar to log structured file systems [11], [12], [13], [14]. Because large writes eliminate many expensive small writes, the buffer is quickly made available for additional requests so that the two-level cache appears to the host as a large NVRAM. As a result, the backup cache can achieve the same write speed as the primary cache. The slow-read performance of the backup cache does not affect the system performance since every data block in the backup cache has a copy in the primary cache which can be read at the speed of RAM. The dual cache system here is asymmetric since the primary cache and the backup cache have different sizes and structures. The reliability of the RAPID-Cache is expected to be high since disk is very reliable. The system is also inexpensive because the NVRAM in the backup cache can be very small, ranging from hundreds of KB to several MB and the cost of the disk space is significantly less than that of a large NVRAM. We will show that RAPID-Caches provide much higher reliability compared to single-copy NVRAM caches and much lower cost compared to dual-copy NVRAM caches, without sacrificing performance. On the other hand, because of its low cost, with the same budget, RAPID-Caches can have significantly higher performance compared to conventional NVRAM cache architectures by affording much larger primary cache sizes, while still maintaining good reliability.

While the idea of RAPID-Cache can be used in any I/O system, it is particularly suitable for parallel disk systems such as RAID because RAID systems are most likely to be used in environments which require high performance and high reliability. Therefore, we concentrate our study on RAPID-Caches on top of RAID-5 systems in this paper. We have carried out trace-driven simulation experiments as well as analytical studies to evaluate the performance and reliability of the RAPID-Cache. Using real-world traces as well as synthetic traces generated based on realistic workloads [6], [15], we analyze the performance of the RAPID-Cache architecture and compare it with existing disk cache architectures. Numerical results show that the RAPID-Cache has significant performance/cost and reliability advantages over the existing architectures.

The paper is organized as follows: The next section presents the detailed architecture and operations of the RAPID-Cache. Section 3 presents our experimental methodology. Simulation results will be presented in Section 4, followed by an approximate reliability and cost analysis in Section 5. We discuss related work in Section 6 and conclude the paper in Section 7.

2 ARCHITECTURE AND OPERATIONS

Fig. 1 shows the basic structure of a RAPID-Cache. It consists of a conventional primary RAM cache and a backup cache. The backup cache is a two-level hierarchy with a small NVRAM on top of a cache disk, similar to DCD [16]. In a RAPID-Cache, every I/O write operation is sent to both the primary cache and the backup cache while read operations are performed using the primary cache only. For very high overall reliability, the primary cache can be NVRAM to provide redundant protection during a power failure. On the other hand, for low cost systems, the primary cache can be DRAM. During normal operations, the DRAM primary cache and the backup cache contain redundant data. If any one of the two caches fails, data can be reconstructed from the other. During a power failure, data are retained in the backup NVRAM and the cache disk. If both the read cache and the primary write cache are made of DRAM, we can use a unified read/write cache structure, as shown in Fig. 2a, for better cache utilization. A RAPID-Cache with a large unified DRAM primary cache has higher throughput, lower cost, and better reliability than that of a single-copy conventional NVRAM cache. For many applications that require redundant protection during a power failure, a Triple RAPID-Cache (shown in Fig. 2b) can be used to build a highly reliable, very large cache system. The idea is to use two low-cost backup caches to support one large primary DRAM cache. During normal operations, the primary cache and the two backup caches provide triple redundancy protection. The two backup caches provide dual-redundancy protection during a power failure. Triple RAPID-Caches are especially suitable for high-end systems
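The NVRAM-versus-disk cost figures quoted in the introduction can be reproduced with quick arithmetic (a sketch; the prices are the paper's December 2000 quotes):

```python
# Reproducing the cost comparison from the introduction.
nvram_cost = 256 * 55        # 256 MB of NVRAM at $55/MB
disk_cost = 16 * 200         # 16 disks at $200 per 40-GB disk
print(nvram_cost, disk_cost)  # 14080 3200

# Per-MB: $55/MB for NVRAM vs. $200 / (40 * 1024) MB for disk,
# i.e., roughly half a cent per MB -- about four orders of
# magnitude apart.
ratio = 55 / (200 / (40 * 1024))
print(round(ratio))  # 11264
```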
`
`HPE, Exh. 1006, p. 5
`
`
`
`
`
Fig. 1. RAPID-Cache on top of a disk system.
`
that need very large and very reliable write caches. For large cache sizes, a Triple RAPID-Cache has lower cost and better reliability than both a RAPID-Cache with a large NVRAM primary cache and a conventional dual-copy NVRAM cache.
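The write path of the architecture (every write mirrored to the primary cache and to the backup cache's small NVRAM buffer, which is flushed to the cache disk in large log segments) can be sketched in a few lines. The class and method names below are hypothetical illustrations, not from the paper:

```python
# Minimal sketch of the RAPID-Cache write path: writes go to both
# caches; the backup's NVRAM buffer batches them into large log
# segments; reads never touch the slow backup path.

class RapidCacheSketch:
    def __init__(self, nvram_slots=4):
        self.primary = {}         # fast-write-fast-read RAM cache
        self.nvram_buffer = {}    # small NVRAM in the backup cache
        self.cache_disk_log = []  # log segments on the cache disk
        self.nvram_slots = nvram_slots

    def write(self, lba, block):
        self.primary[lba] = block       # serves all reads
        self.nvram_buffer[lba] = block  # redundant copy
        if len(self.nvram_buffer) >= self.nvram_slots:
            # One large sequential write instead of many small ones.
            self.cache_disk_log.append(dict(self.nvram_buffer))
            self.nvram_buffer.clear()

    def read(self, lba):
        # Normal reads are served at RAM speed from the primary cache.
        return self.primary[lba]

cache = RapidCacheSketch()
for lba in range(8):
    cache.write(lba, b"x")
print(len(cache.cache_disk_log))  # 2 flushed log segments
```

The batching step is what lets the two-level backup appear to the host as a large NVRAM, as the text describes.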
`
2.1 Structures of the Backup Cache
Fig. 3 shows the detailed structures of the backup NVRAM cache and the cache disk. The NVRAM cache consists of an LRU Cache, two to four Segment Buffers, and a Hash Table. Another related data structure, called the Disk Segment Table, is located in a DRAM buffer.

The actively-accessed data in the backup cache reside in the LRU cache. The less actively-accessed data are kept in the cache disk. Data in the cache disk are organized in the format of Segments similar to that in a Log-structured File System such as the Sprite LFS and the BSD LFS [12], [13]. A segment contains a number of slots each of which can hold one data block. Data blocks stored in segments are addressed by their Segment IDs and Slot IDs. Data blocks stored in the LRU cache are addressed by their Logical Block Addresses (LBAs). The Hash Table contains location information for each of the valid data blocks in the backup cache. It describes whether a block is in the NVRAM LRU cache or in the cache disk, as well as the data address in the LRU cache or the cache disk. In our current design, the size of the hash entry is 16 bytes. Since data in the backup cache is the exact image of the data in the primary write cache, the total number of valid data blocks in the backup cache is the same as in the primary write cache, regardless of the sizes of the backup NVRAM and the cache disk. If the data block size is 8 KB, then, for a 32 MB write cache, there are 4,096 blocks in total. Since each valid block has a corresponding hash entry of 16 bytes, the total hash table size is 64 KB, which is compact enough to be placed in the NVRAM.
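The hash-table sizing above can be checked directly (a quick sketch using the figures from the text):

```python
# Checking the hash-table sizing from Section 2.1.
block_size = 8 * 1024          # 8 KB data blocks
cache_size = 32 * 1024 * 1024  # 32 MB write cache
hash_entry = 16                # bytes per hash entry

num_blocks = cache_size // block_size
table_bytes = num_blocks * hash_entry
print(num_blocks, table_bytes // 1024)  # 4096 64
```

That is, 4,096 blocks and a 64 KB table, matching the text.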
For the purpose of speeding up garbage collection, we also keep track of a data structure called the Disk Segment Table. Since this table contains redundant information that
`
Fig. 2. Unified RAPID-Cache and triple RAPID-Cache. (a) A RAPID-Cache with a unified read/write cache. (b) A triple RAPID-Cache with two cache disks.
`
`HPE, Exh. 1006, p. 6
`
`
`
`
`
Fig. 3. The detailed structure of the backup cache and the cache disk.
Fig. 4. Physical RAPID-Cache and logical RAPID-Cache. (a) A RAPID-Cache with a dedicated physical cache disk. (b) A RAPID-Cache with a distributed logical cache disk.
`
can be quickly and completely reconstructed from the Hash Table in case of a crash, it is stored in the DRAM. We will discuss the structure of the Disk Segment Table in detail later in this section.

The cache disk in the backup cache can be a dedicated physical disk, as shown in Fig. 4a. It can also be distributed among the data disks of the RAID system, each data disk having a small partition acting as a part of a large distributed logical cache disk, as shown in Fig. 4b. In modern RAID systems, the physical cache disk can often be implemented without extra cost since many modern RAID systems include one or several spare disks that can be put into service when an active disk fails [17]. However, as pointed out by Wilkes et al. [17], during normal operations, the spare disks are not used in many systems and contribute nothing to the performance of the system. It is also hard to tell if the spare disks are still working