IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS

MARCH 2002, VOLUME 13, NUMBER 3

A publication of the IEEE Computer Society
ITDSEO (ISSN 1045-9219)

This resource is also available on the WWW.
`
REGULAR PAPERS

Applications and Algorithms
An Efficient Partitioning Algorithm for Distributed Virtual Environment Systems
J.C.S. Lui and M.F. Chan .......... 193

Fast Sorting Algorithms on a Linear Array with a Reconfigurable Pipelined Bus System
A. Datta, S. Soundaralakshmi, and R. Owens .......... 212

Cluster Computing
Dynamic Cluster Resource Allocations for Jobs with Known and Unknown Memory Demands
L. Xiao, S. Chen, and X. Zhang .......... 223

Compilers
An Advanced Compiler Framework for Non-Cache-Coherent Multiprocessors
Y. Paek, A. Navarro, E. Zapata, J. Hoeflinger, and D. Padua .......... 241

Heterogeneous Computing
Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing
H. Topcuoglu, S. Hariri, and M.-Y. Wu .......... 260

Interconnection Networks
HIPIQS: A High-Performance Switch Architecture Using Input Queuing
R. Sivaram, C.B. Stunkel, and D.K. Panda .......... 275

Machine Architecture
RAPID-Cache—A Reliable and Inexpensive Write Cache for High Performance Storage Systems
Y. Hu, T. Nightingale, and Q. Yang .......... 290

Mapping Applications to Machines
Matching and Scheduling Algorithms for Minimizing Execution Time and Failure Probability of Applications in Heterogeneous Computing
A. Dogan and F. Özgüner .......... 308

Theory
Fair and Efficient Packet Scheduling Using Elastic Round Robin
S.S. Kanhere, H. Sethu, and A.B. Parekh .......... 324
`
IEEE COMPUTER SOCIETY
http://computer.org
tpds@computer.org
`HPE, Exh. 1006, p. 1
`
`
`

`

IEEE COMPUTER SOCIETY
`
The IEEE Computer Society is an association of people with professional interest in the field of computers. All members of the IEEE are eligible for membership in the Computer Society, as are members of certain professional societies and other computer professionals. Computer Society members will receive this Transactions upon payment of the annual Society membership fee ($35 for IEEE members, $78 for all others) plus an annual subscription fee (paper only: $38; electronic only: $30; combination: $49). For additional membership and subscription information, visit our Web site at http://computer.org/subscribe, send email to help@computer.org, or write to IEEE Computer Society, 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1314 USA. Individual subscription copies of Transactions are for personal use only.
`
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS

EDITOR-IN-CHIEF
Pen-Chung Yew
University of Minnesota
Dept. of Computer Science
200 Union Street, SE
4-192 EE/CS Building
Minneapolis, MN 55455
yew@cs.umn.edu
`
Editorial Board

Nancy Amato, Texas A&M University, amato@cs.tamu.edu
Doug Blough, Georgia Institute of Technology, doug.blough@ece.gatech.edu
Alok N. Choudhary, Northwestern University, choudhar@ece.nwu.edu
Michel Cosnard, INRIA, Michel.Cosnard@inria.fr
Sandhya Dwarkadas, University of Rochester, sandhya@cs.rochester.edu
Mootaz Elnozahy, IBM Austin Research Lab, mootaz@us.ibm.com
Bruce Maggs, Carnegie Mellon University, bmm@cs.cmu.edu
Margaret Martonosi, Princeton University, martonosi@princeton.edu
Phil McKinley, Michigan State University, mckinley@cps.msu.edu
Ashwini Nanda, IBM T.J. Watson Research Center, ashwini@us.ibm.com
Esmond G. Ng, Lawrence Berkeley Nat'l Lab., EGNg@lbl.gov
Stephan Olariu, Old Dominion University, olariu@cs.odu.edu
Krishna V. Palem, Courant Inst. of Math. Sciences, palem@cs.nyu.edu
Keshav Pingali, Cornell University, pingali@cs.cornell.edu
Timothy M. Pinkston, University of Southern California, tpink@charity.usc.edu
Yves Robert, Ecole Normale Supérieure de Lyon, Yves.Robert@ens-lyon.fr
Lui Sha, Univ. of Illinois, Urbana-Champaign, lrs@cs.uiuc.edu
Richard D. Schlichting, AT&T Shannon Laboratory, rick@research.att.com
Ramesh K. Sitaraman, Univ. of Massachusetts, ramesh@cs.umass.edu
Neeraj Suri, Chalmers University, suri@ce.chalmers.se
Jie Wu, Florida Atlantic University, jie@cse.fau.edu
Tao Yang, Univ. of California, Santa Barbara, tyang@cs.ucsb.edu
Yuanyuan Yang, SUNY, Stony Brook, yang@ece.sunysb.edu
Wei Zhao, Texas A&M Univ., zhao@cs.tamu.edu
Taieb Znati, University of Pittsburgh, znati@cs.pitt.edu
Willy Zwaenepoel, Rice University, willy@cs.rice.edu
`
MANUSCRIPT SUBMISSIONS / STATUS INQUIRIES: For information on submitting a manuscript or on a paper awaiting publication, please contact: Transactions Assistant, TPDS, IEEE Computer Society, 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1314 USA; Email: tpds@computer.org; Phone: +1 714.821.8380; Fax: +1 714.821.4010.
`
IEEE COMPUTER SOCIETY

Officers
Willis King, President
Stephen L. Diamond, President-Elect
Benjamin Wah, Past President
Rangachar Kasturi, VP, Publications
Gerald Engel, VP, Conferences & Tutorials
Carl Chang, VP, Educational Activities
James H. Cross, VP, Chapter Activities
Lowell Johnson, Second VP, Standards Activities
Deborah Scherrer, First VP, Technical Activities
Deborah M. Cooper, Secretary
Wolfgang Giloi, Treasurer
James D. Isaak, IEEE Division V Director
Thomas W. Williams, 2001-2002 IEEE Division VIII Director
David Hennage, Executive Director

Publications Board
Vice President: Rangachar Kasturi
Vice Chair: Richard Kemmerer
`
Members-at-Large: Angela Burgess (ex officio), Gabriella Sanniti di Baja, Thomas Keefe, Anand Tripathi
Magazine Operations Chair: George Cybenko
Transactions Operations Chair: Steven L. Tanimoto
Publications Operations Chair: Mark Christensen
IEEE PAB Liaison: Rangachar Kasturi
IEEE CS Press: Michael Williams

Transactions Editors-in-Chief
Computers: Jean-Luc Gaudiot
Knowledge & Data Engineering: Philip S. Yu
Mobile Computing: Tom La Porta
Multimedia: Tsuhan Chen
Networking: Mike Liu
Parallel & Distributed Systems: Pen Yew
Pattern Analysis & Machine Intelligence: Rama Chellappa
Software Engineering: John Knight
Very Large Scale Integration: Eby Friedman
Visualization & Computer Graphics: Hans Hagen

Magazine Editors-in-Chief
Annals of the History of Computing: Thomas J. Bergin
Computer: James H. Aylor
Computing in Science & Engineering: Francis Sullivan
Computer Graphics & Applications: James J. Thomas
Design & Test: Frank Ferrante
Intelligent Systems: Daniel E. O'Leary
Internet Computing: Munindar P. Singh
IT Professional: Rajesh Gupta
Micro: Ken Sakamura
Multimedia: Forouzan Golshani
Pervasive Computing: M. Satyanarayanan
Software: Steven C. McConnell
`
Executive Staff
David Hennage, Executive Director
Angela Burgess, Publisher
Violet S. Doan, Chief Financial Officer
Anne Marie Kelly, Director, Volunteer Services
John C. Keaton, Manager, Research and Planning
Robert Care, Director, Information Technology & Services

Transactions Department
Alicia L. Barrett, Production Manager
Suzanne Werner, Peer Review Supervisor
Pilar Hawthorne, Kathy Santa Maria, Production Editors
Yu-Tzu Tsai, Julie Turnbaugh, Electronic Media Assistants
Selina Norman, Transactions Assistant
`
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS is published monthly by the IEEE Computer Society. IEEE Corporate Office: Three Park Avenue, 17th Floor, New York, NY 10016-5997 USA. Responsibility for the content rests upon the authors and not upon the IEEE or the IEEE Computer Society. IEEE Computer Society Publications Office: 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1314 USA. IEEE Computer Society Headquarters: 1730 Massachusetts Ave. NW, Washington, DC 20036-1992 USA. Back issues: IEEE members $10.00 (first copy only), nonmembers $20.00 per copy. (Note: Add $4.00 postage and handling charge to any order from $1.00 to $50.00, including prepaid orders.) Complete price information available on request. Also available in microfiche. Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923 USA. For all other copying, reprint, or republication permission, write to: Copyrights and Permissions Department, IEEE Publications Administration, 445 Hoes Lane, PO Box 1331, Piscataway, NJ 08855-1331. Copyright © 2002 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Periodicals postage paid at New York, NY, and at additional mailing offices. Postmaster: Send address changes to IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, IEEE Service Center, 445 Hoes Lane, PO Box 1331, Piscataway, NJ 08855-1331 USA. GST Registration No. 125634188. Canada Post Publications Mail Agreement Number 0487783. Printed in USA.
`
`HPE, Exh. 1006, p. 2
`
`
`

`

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 13, NO. 3, MARCH 2002

193
`
An Efficient Partitioning Algorithm for Distributed Virtual Environment Systems

John C.S. Lui, Member, IEEE, and M.F. Chan, Student Member, IEEE
`
Abstract—Distributed virtual environment (DVE) systems model and simulate the activities of thousands of entities interacting in a virtual world over a wide area network. Possible applications for DVE systems are multiplayer video games, military and industrial trainings, and collaborative engineering. In general, a DVE system is composed of many servers and each server is responsible to manage multiple clients who want to participate in the virtual world. Each server receives updates from different clients (such as the current position and orientation of each client) and then delivers this information to other clients in the virtual world. The server also needs to perform other tasks, such as object collision detection and synchronization control. A large scale DVE system needs to support many clients and this imposes a heavy requirement on networking resources and computational resources. Therefore, how to meet the growing requirement of bandwidth and computational resources is one of the major challenges in designing a scalable and cost-effective DVE system. In this paper, we propose an efficient partitioning algorithm that addresses the scalability issue of designing a large scale DVE system. The main idea is to dynamically divide the virtual world into different partitions and then efficiently assign these partitions to different servers. This way, each server will process approximately the same amount of workload. Another objective of the partitioning algorithm is to reduce the server-to-server communication overhead. The theoretical foundation of our dynamic partitioning algorithm is based on the linear optimization principle. We also illustrate how one can parallelize the proposed partitioning algorithm so that it can efficiently partition a very large scale DVE system. Lastly, experiments are carried out to illustrate the effectiveness of the proposed partitioning algorithm under various settings of the virtual world.
Index Terms—Distributed virtual environment, scalability issue, partitioning algorithm, load balancing, communication reduction, linear optimization.
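The abstract's balancing objective (assign partitions so that every server handles roughly the same workload) can be illustrated with a simple greedy heuristic. This sketch is hypothetical and is not the paper's linear-optimization method; all names are invented for illustration:

```python
# Hypothetical sketch of the load-balancing goal described in the abstract:
# place each partition of the virtual world on a server so that per-server
# workloads come out roughly equal. The paper formulates this with linear
# optimization; the greedy heuristic below only illustrates the objective.
import heapq

def assign_partitions(workloads, num_servers):
    """Greedily place each partition (heaviest first) on the least-loaded server."""
    heap = [(0.0, s, []) for s in range(num_servers)]  # (load, server id, partitions)
    heapq.heapify(heap)
    for pid, w in sorted(enumerate(workloads), key=lambda p: -p[1]):
        load, sid, parts = heapq.heappop(heap)   # current least-loaded server
        parts.append(pid)
        heapq.heappush(heap, (load + w, sid, parts))
    return {sid: (load, parts) for load, sid, parts in heap}

servers = assign_partitions([5, 3, 3, 2, 2, 1], num_servers=2)
loads = sorted(load for load, _ in servers.values())
# total workload is 16; the greedy split here is 8 and 8
```

The real algorithm must also minimize server-to-server communication, which this sketch ignores entirely.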
`
`
`
1 INTRODUCTION
`
ADVANCES in multimedia systems, parallel/distributed database systems, and high speed networking technologies enable system designers to build a distributed system which allows many users to explore and interact under a three dimensional virtual environment. In general, a 3D virtual environment is a virtual world which consists of many high-resolution 3D graphics sceneries to represent a real-life world. For example, we can have a 3D virtual world to represent a lecture hall so that hundreds of students and scientists can listen to a seminar presented by Professor Daniel C. Tsui,¹ or we can have a large 3D virtual world to represent the latest COMDEX show which has many customers reviewing the latest softwares and electronic gadgets. This type of shared, computer-resident virtual world is called a distributed virtual environment (DVE) [4]. Like other ground-breaking computer technologies, DVE will change the way we learn, work, and interact with other people in the society.

1. A 1998 Nobel Prize winner in Physics for the discovery of a new form of quantum fluid with fractionally charged excitations.
`
J.C.S. Lui is with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong. E-mail: cslui@cse.cuhk.edu.hk.
M.F. Chan is with Poly-Asia Computer Inc., Kowloon, Hong Kong. E-mail: fei@alumni.cse.cuhk.edu.hk.

Manuscript received 11 Jan. 2000; revised 21 Apr. 2001; accepted 13 Aug. 2001.
For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number 111201.

1045-9219/02/$17.00 © 2002 IEEE
`
`
`
`
To illustrate how a DVE system can change our lifestyles and the way we handle our business operation, let us consider the following situation. Let's say an architect from New York, a civil and a structural engineer from Paris, a financial planner from Hong Kong, and an interior designer from Tokyo all need to have a business meeting to discuss the designing and financing issues of a new high-rise office complex. Under a DVE setting, these people can convene a meeting in a virtual world without leaving their respective homes and offices. Their meeting can be carried out in a DVE system. These participants can interact with each other in a virtual world of the new high-rise office complex that they are proposing to build. Each participant in this business meeting can virtually walk around in the proposed high-rise office building, interact with each other and carry out the discussion. For example, in this virtual high-rise office complex, each participant in the meeting is represented by a 3D object, which is known as an avatar. Each participant can walk around in this virtual office building and, in the process, rearrange any 3D object in the environment (e.g., rearrange paintings and furniture or select different kinds of carpet). Any change to a 3D object in this virtual world will be visible to all participants. Participants in this meeting are able to interact with each other in real time, as well as to inquire and to receive any relevant information of the virtual world. For example, participants can query about the credit history of a manufacturer who is responsible to produce the office furniture.
`
HPE, Exh. 1006, p. 3
`
`

`

`
This material may be protected by Copyright law (Title 17 U.S. Code)
`
`
`
`
RAPID-Cache—A Reliable and Inexpensive Write Cache for High Performance Storage Systems

Yiming Hu, Senior Member, IEEE, Tycho Nightingale, and Qing Yang, Senior Member, IEEE
`
Abstract—Modern high performance disk systems make extensive use of nonvolatile RAM (NVRAM) write caches. A single-copy NVRAM cache creates a single point of failure while a dual-copy NVRAM cache is very expensive because of the high cost of NVRAM. This paper presents a new cache architecture called RAPID-Cache for Redundant, Asymmetrically Parallel, and Inexpensive Disk Cache. A typical RAPID-Cache consists of two redundant write buffers on top of a disk system. One of the buffers is a primary cache made of RAM or NVRAM and the other is a backup cache containing a two-level hierarchy: a small NVRAM buffer on top of a log disk. The small NVRAM buffer combines small write data and writes them into the log disk in large sizes. By exploiting the locality property of I/O accesses and taking advantage of well-known Log-structured File Systems, the backup cache has nearly equivalent write performance as the primary RAM cache. The read performance of the backup cache is not as critical because normal read operations are performed through the primary RAM cache and reads from the backup cache happen only during error recovery periods. The RAPID-Cache presents an asymmetric architecture with a fast-write-fast-read RAM being a primary cache and a fast-write-slow-read NVRAM-disk hierarchy being a backup cache. The asymmetrically parallel architecture and an algorithm that separates actively accessed data from inactive data in the cache virtually eliminate the garbage collection overhead, which are the major problems associated with previous solutions such as Log-structured File Systems and Disk Caching Disk. The asymmetric cache allows cost-effective designs for very large write caches for high-end parallel disk systems that would otherwise have to use dual-copy, costly NVRAM caches. It also makes it possible to implement reliable write caching for low-end disk I/O systems since the RAPID-Cache makes use of inexpensive disks to perform reliable caching. Our analysis and trace-driven simulation results show that the RAPID-Cache has significant reliability/cost advantages over conventional single NVRAM write caches and has great cost advantages over dual-copy NVRAM caches. The RAPID-Cache architecture opens a new dimension for disk system designers to exercise trade-offs among performance, reliability, and cost.

Index Terms—Disks, storage systems, performance, reliability, fault-tolerance.
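The asymmetric write and read paths the abstract describes (every write mirrored to both caches, reads served from the fast primary, the backup consulted only for recovery) can be sketched as follows. This is a minimal illustration with invented names, not the paper's implementation:

```python
# Illustrative sketch of the asymmetric dual-cache idea from the abstract:
# writes go to both the primary RAM cache and the backup cache, reads are
# served from the fast primary, and the backup is read only during recovery.
class RapidCacheSketch:
    def __init__(self):
        self.primary = {}   # fast-write-fast-read RAM cache
        self.backup = {}    # fast-write-slow-read NVRAM + log-disk hierarchy

    def write(self, lba, data):
        self.primary[lba] = data
        self.backup[lba] = data   # redundant copy for fault tolerance

    def read(self, lba):
        return self.primary[lba]  # normal reads never touch the slow backup

    def recover(self):
        """After losing the primary cache, rebuild it from the backup."""
        self.primary = dict(self.backup)

c = RapidCacheSketch()
c.write(42, b"block")
c.primary.clear()      # simulate a primary-cache failure
c.recover()            # data survives via the backup copy
```

Because the backup is only read during recovery, its slow read path never shows up in normal operation, which is the core of the asymmetry argument.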
`
1 INTRODUCTION
`
MODERN disk I/O systems make extensive use of nonvolatile RAM (NVRAM) write caches to facilitate asynchronous writes [2], [3], [4], i.e., a write request is acknowledged before the write goes to disk. Such write caches significantly reduce response times of disk I/O systems seen by users, particularly in RAID systems. Large write caches can also improve system throughput by taking advantage of both temporal and spatial localities [2], [5], as data may be overwritten several times or combined together before being written to the disk. I/O requests are very bursty [6]; requests often come together, with long intervals of relatively inactive periods in between. Large write caches also benefit from the burstiness of write workloads since data coming from bursts can be quickly stored in the cache and written back to the disk later when the system is less busy. Treiber and Menon reported that write caches could reduce disk utilization for writes by an order of magnitude when compared to basic RAID-5 systems [3]. However, the use of write caches introduces two problems: poor reliability and high cost.
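The asynchronous-write behavior described above can be sketched in a few lines: the controller acknowledges a write as soon as it lands in the cache, repeated writes to the same block are absorbed before reaching disk, and dirty blocks are destaged later. This is a generic illustration of write-back caching, not the paper's controller logic:

```python
# Minimal sketch of asynchronous (write-back) caching: acknowledge the write
# once it is in the cache, absorb overwrites in place, and destage dirty
# blocks to disk later when the system is idle.
class WriteBackCache:
    def __init__(self):
        self.dirty = {}      # block address -> latest data, not yet on disk
        self.disk = {}       # simulated backing disk

    def write(self, addr, data):
        self.dirty[addr] = data   # overwrites coalesce before reaching disk
        return "ack"              # acknowledged before any disk I/O happens

    def flush(self):
        """Destage all dirty blocks during an idle period; returns disk writes issued."""
        writes = len(self.dirty)
        self.disk.update(self.dirty)
        self.dirty.clear()
        return writes

cache = WriteBackCache()
for i in range(3):
    cache.write(7, f"version-{i}")   # three logical writes to the same block
# only one physical disk write is needed at destage time
```

The single destage write for three logical writes is exactly the temporal-locality benefit the paragraph attributes to large write caches.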
Disks are impressively reliable today, with a Mean Time To Failure (MTTF) of up to one million hours. Such a low failure rate, coupled with possible redundancy such as RAID, gives a Mean Time To Data Loss (MTTDL) of several hundreds of millions of hours in a typical RAID-5 system [7]. Adding a single cache in front of a disk system creates a single point of failure, which is vulnerable to data loss. Savage and Wilkes pointed out in [7] that because typical NVRAM technology (battery backed RAM) has a quite low MTTF of 15K hours, a single-copy NVRAM cache suffers
significantly higher risk of data loss than results from disk failures. To overcome the reliability problem, some high-end RAID systems use dual-copy caches so that a failure in one cache leaves the other cache intact [2]. When a write request comes, the controller writes two copies of the data independently into the two caches, a primary cache and a backup cache. Besides the reliability problem, NVRAM is also known to be very costly [7], [8], [9] so the size of the

Y. Hu is with the Department of Electrical & Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221-0030. E-mail: yhu@ececs.uc.edu.
T. Nightingale is with Sun Microsystems, 4150 Network Circle, USAN-208, Santa Clara, CA 95054. E-mail: tycho.nightingale@sun.com.
Q. Yang is with the Department of Electrical and Computer Engineering, University of Rhode Island, Kingston, RI 02881. E-mail: qyang@ele.uri.edu.

Manuscript received 4 Dec. 2000; accepted 20 Sept. 2001.
For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number 113249.

1045-9219/02/$17.00 © 2002 IEEE
`
HPE, Exh. 1006, p. 4
`
`

`

`
NVRAM cache is often limited. For example, a major NVRAM manufacturer quoted the price of NVRAM with embedded lithium-cell batteries for $55/MB in quantity as of December 2000. The cost of disks, on the other hand, is about 0.5 cents/MB, which is a difference of four orders of magnitude. Moreover, the cost difference is widening (the difference was three orders of magnitude two years ago) because prices of disks are falling very rapidly. For a disk system with a reasonably sized write cache, the NVRAM may dominate the cost of the entire system. For example, in a system with 16 disks (40 GB per disk) and an NVRAM write cache of 256 MB, at $55/MB, the NVRAM costs about $14,080, while the total cost of 16 disks is only $3,200 (assuming each 40-GB disk costs $200). If we use dual-copy caches to ease the reliability problem of the single-copy cache, the cost becomes prohibitively high, particularly for large caches. As a result, it is only suitable for the upper echelon of the market [10].

The standard dual-copy write cache system has a symmetric structure, where both the primary write cache and the backup write cache have the same size and the same access characteristics: fast read speed and fast write speed. However, the backup cache does not provide any performance benefit to the system during normal operations. Therefore, it is wasteful to use a backup cache identical to the primary cache. What is needed is only a backup cache that can be written to very quickly while its read operations are not as critical since reads from the backup cache occur only during error-recovering periods.

Based on these observations, we propose a new disk cache architecture called Redundant, Asymmetrically Parallel, Inexpensive Disk Cache, or RAPID-Cache for short, to provide fault-tolerant caching for disk I/O systems inexpensively. The main idea of the RAPID-Cache is to use a conventional, fast-write-fast-read primary cache and a nonvolatile, fast-write-slow-read backup cache. The primary cache is made of normal NVRAM or DRAM, while the backup cache consists of a small NVRAM cache and a log disk (cache disk). In the backup cache, small and random writes are first buffered in the small NVRAM buffer to form large logs that are written into the cache disk later in large transfers, similar to log structured file systems [11], [12], [13], [14]. Because large writes eliminate many expensive small writes, the buffer is quickly made available for additional requests so that the two-level cache appears to the host as a large NVRAM. As a result, the backup cache can achieve the same write speed as the primary cache. The slow-read performance of the backup cache does not affect the system performance since every data block in the backup cache has a copy in the primary cache which can be read at the speed of RAM. The dual cache system here is asymmetric since the primary cache and the backup cache have different sizes and structures. The reliability of the RAPID-Cache is expected to be high since disk is very reliable. The system is also inexpensive because the NVRAM in the backup cache can be very small, ranging from hundreds of KB to several MB and the cost of the disk space is significantly less than that of a large NVRAM. We will show that RAPID-Caches provide much higher reliability compared to single-copy NVRAM caches and much lower cost compared to dual-copy NVRAM caches, without sacrificing performance. On the other hand, because of its low cost, with the same budget, RAPID-Caches can have significantly higher performance compared to conventional NVRAM cache architectures by affording much larger primary cache sizes, while still maintaining good reliability.

While the idea of RAPID-Cache can be used in any I/O system, it is particularly suitable for parallel disk systems such as RAID because RAID systems are most likely to be used in environments which require high performance and high reliability. Therefore, we concentrate our study on RAPID-Caches on top of RAID-5 systems in this paper. We have carried out trace-driven simulation experiments as well as analytical studies to evaluate the performance and reliability of the RAPID-Cache. Using real-world traces as well as synthetic traces generated based on realistic workloads [6], [15], we analyze the performance of the RAPID-Cache architecture and compare it with existing disk cache architectures. Numerical results show that the RAPID-Cache has significant performance/cost and reliability advantages over the existing architectures.

The paper is organized as follows: The next section presents the detailed architecture and operations of the RAPID-Cache. Section 3 presents our experimental methodology. Simulation results will be presented in Section 4, followed by an approximate reliability and cost analysis in Section 5. We discuss related work in Section 6 and conclude the paper in Section 7.
`
2 ARCHITECTURE AND OPERATIONS

Fig. 1 shows the basic structure of a RAPID-Cache. It consists of a conventional primary RAM cache and a backup cache. The backup cache is a two-level hierarchy with a small NVRAM on top of a cache disk, similar to DCD [16]. In a RAPID-Cache, every I/O write operation is sent to both the primary cache and the backup cache while read operations are performed using the primary cache only.

For very high overall reliability, the primary cache can be NVRAM to provide redundant protection during a power failure. On the other hand, for low cost systems, the primary cache can be DRAM. During normal operations, the DRAM primary cache and the backup cache contain redundant data. If any one of the two caches fails, data can be reconstructed from the other. During a power failure, data are retained in the backup NVRAM and the cache disk. If both the read cache and the primary write cache are made of DRAM, we can use a unified read/write cache structure, as shown in Fig. 2a, for better cache utilization. A RAPID-Cache with a large unified DRAM primary cache has higher throughput, lower cost, and better reliability than that of a single-copy conventional NVRAM cache. For many applications that require redundant protection during a power failure, a Triple RAPID-Cache (shown in Fig. 2b) can be used to build a highly reliable, very large cache system. The idea is to use two low-cost backup caches to support one large primary DRAM cache. During normal operations, the primary cache and the two backup caches provide triple redundancy protection. The two backup caches provide dual-redundancy protection during a power failure. Triple RAPID-Caches are especially suitable for high-end systems
`HPE, Exh. 1006, p. 5
`
`
`

`

`
Fig. 1. RAPID-Cache on top of a disk system.
`
that need very large and very reliable write caches. For large cache sizes, a Triple RAPID-Cache has lower cost and better reliability than both a RAPID-Cache with a large NVRAM primary cache and a conventional dual-copy NVRAM cache.
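The backup-cache write path described in the introduction (small writes buffered in a small NVRAM until a full segment can be written to the cache disk in one large log transfer) can be sketched as follows. Names, segment size, and data layout are invented for illustration; the paper's actual structures are given in Section 2.1:

```python
# Sketch of the log-structured backup write path: small writes accumulate in
# a small NVRAM buffer; once a segment's worth is collected, it is appended
# to the cache disk as one large sequential log write, freeing the buffer.
SEGMENT_SLOTS = 4  # hypothetical number of block slots per disk segment

class BackupCache:
    def __init__(self):
        self.nvram_buffer = []       # pending (LBA, data) pairs in NVRAM
        self.cache_disk = []         # the on-disk log: a list of segments
        self.hash_table = {}         # LBA -> ("nvram", idx) or ("disk", seg, slot)

    def write(self, lba, data):
        self.nvram_buffer.append((lba, data))
        self.hash_table[lba] = ("nvram", len(self.nvram_buffer) - 1)
        if len(self.nvram_buffer) == SEGMENT_SLOTS:
            seg_id = len(self.cache_disk)
            self.cache_disk.append(self.nvram_buffer)  # one large log write
            for slot, (a, _) in enumerate(self.nvram_buffer):
                self.hash_table[a] = ("disk", seg_id, slot)
            self.nvram_buffer = []   # buffer immediately free for new requests

bc = BackupCache()
for lba in range(6):
    bc.write(lba, b"x")
# 6 small writes -> one full segment on disk, 2 blocks still in the NVRAM buffer
```

One sequential segment write replaces four random small writes, which is why the two-level backup can keep up with the primary cache's write speed.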
`
2.1 Structures of the Backup Cache

Fig. 3 shows the detailed structures of the backup NVRAM cache and the cache disk. The NVRAM cache consists of an LRU Cache, two to four Segment Buffers, and a Hash Table. Another related data structure, called the Disk Segment Table, is located in a DRAM buffer.

The actively-accessed data in the backup cache reside in the LRU cache. The less actively-accessed data are kept in the cache disk. Data in the cache disk are organized in the format of Segments similar to that in a Log-structured File System such as the Sprite LFS and the BSD LFS [12], [13]. A segment contains a number of slots each of which can hold one data block. Data blocks stored in segments are addressed by their Segment IDs and Slot IDs. Data blocks stored in the LRU cache are addressed by their Logical Block Addresses (LBAs). The Hash Table contains location information for each of the valid data blocks in the backup cache. It describes whether a block is in the NVRAM LRU cache or in the cache disk, as well as the data address in the LRU cache or the cache disk. In our current design, the size of the hash entry is 16 bytes. Since data in the backup cache is the exact image of the data in the primary write cache, the total number of valid data blocks in the backup cache is the same as in the primary write cache, regardless of the sizes of the backup NVRAM and the cache disk. If the data block size is 8 KB, then, for a 32 MB write cache, there are 4,096 blocks in total. Since each valid block has a corresponding hash entry of 16 bytes, the total hash table size is 64 KB, which is compact enough to be placed in the NVRAM.
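The sizing argument above can be worked through numerically with the values given in the text (32 MB write cache, 8 KB blocks, 16-byte hash entries):

```python
# The hash-table sizing arithmetic from the text, stated explicitly:
# one 16-byte hash entry per valid block keeps the whole table small
# enough to live in NVRAM alongside the LRU cache.
write_cache_bytes = 32 * 2**20   # 32 MB primary write cache
block_bytes = 8 * 2**10          # 8 KB data blocks
entry_bytes = 16                 # bytes per hash entry

blocks = write_cache_bytes // block_bytes   # 4,096 valid blocks at most
hash_table_bytes = blocks * entry_bytes     # 64 KB in total
```

Note that the table size scales with the primary cache size, not with the (much larger) cache-disk capacity, which is what makes the NVRAM placement cheap.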
For the purpose of speeding up garbage collection, we also keep track of a data structure called the Disk Segment Table. Since this table contains redundant information that
`
Fig. 2. Unified RAPID-Cache and triple RAPID-Cache. (a) A RAPID-Cache with a unified read/write cache. (b) A triple RAPID-Cache with two cache disks.
`
HPE, Exh. 1006, p. 6
`
`
`

`

`
Fig. 3. The detailed structure of the backup cache and the cache disk.
Fig. 4. Physical RAPID-Cache and logical RAPID-Cache. (a) A RAPID-Cache with a dedicated physical cache disk. (b) A RAPID-Cache with a distributed logical cache disk.
`
can be quickly and completely reconstructed from the Hash Table in case of a crash, it is stored in the DRAM. We will discuss the structure of the Disk Segment Table in detail later in this section.

The cache disk in the backup cache can be a dedicated physical disk, as shown in Fig. 4a. It can also be distributed among the data disks of the RAID system, each data disk having a small partition acting as a part of a large distributed logical cache disk, as shown in Fig. 4b. In modern RAID systems, the physical cache disk can often be implemented without extra cost since many modern RAID systems include one or several spare disks that can be put into service when an active disk fails [17]. However, as pointed out by Wilkes et al. [17], during normal operations, the spare disks are not used in many systems and contribute nothing to the performance of the system. It is also hard to tell if the spare disks are still working
