IEEE TRANSACTIONS ON
PARALLEL AND
DISTRIBUTED SYSTEMS

A publication of the IEEE Computer Society

VOLUME 13    NUMBER 3    MARCH 2002    ITDSEO    (ISSN 1045-9219)
REGULAR PAPERS

Applications and Algorithms
An Efficient Partitioning Algorithm for Distributed Virtual Environment Systems
J.C.S. Lui and M.F. Chan ... 193
Fast Sorting Algorithms on a Linear Array with a Reconfigurable Pipelined Bus System
A. Datta, S. Soundaralakshmi, and R. Owens ... 212

Cluster Computing
Dynamic Cluster Resource Allocations for Jobs with Known and Unknown Memory Demands
L. Xiao, S. Chen, and X. Zhang ... 223

Compilers
An Advanced Compiler Framework for Non-Cache-Coherent Multiprocessors
Y. Paek, A. Navarro, E. Zapata, J. Hoeflinger, and D. Padua ... 241

Heterogeneous Computing
Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing
H. Topcuoglu, S. Hariri, and M.-Y. Wu ... 260

Interconnection Networks
HIPIQS: A High-Performance Switch Architecture Using Input Queuing
R. Sivaram, C.B. Stunkel, and D.K. Panda ... 275

Machine Architecture
RAPID-Cache—A Reliable and Inexpensive Write Cache for High Performance Storage Systems
Y. Hu, T. Nightingale, and Q. Yang ... 290

Mapping Applications to Machines
Matching and Scheduling Algorithms for Minimizing Execution Time and Failure Probability of Applications in Heterogeneous Computing
A. Dogan and F. Ozguner ... 308

Theory
Fair and Efficient Packet Scheduling Using Elastic Round Robin
S.S. Kanhere, H. Sethu, and A.B. Parekh ... 324
IEEE COMPUTER SOCIETY
http://computer.org
tpds@computer.org
IEEE COMPUTER SOCIETY

The IEEE Computer Society is an association of people with professional interest in the field of computers. All members of the IEEE are eligible for membership in the Computer Society, as are members of certain professional societies and other computer professionals. Computer Society members will receive this Transactions upon payment of the annual Society membership fee ($35 for IEEE members, $78 for all others) plus an annual subscription fee (paper only: $38; electronic only: $30; combination: $49). For additional membership and subscription information, visit our Web site at http://computer.org/subscribe, send email to help@computer.org, or write to IEEE Computer Society, 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1314 USA. Individual subscription copies of Transactions are for personal use only.
IEEE TRANSACTIONS ON
PARALLEL AND DISTRIBUTED SYSTEMS

EDITOR-IN-CHIEF
PEN-CHUNG YEW
UNIVERSITY OF MINNESOTA
DEPT. OF COMPUTER SCIENCE
200 UNION STREET SE
4-192 EE/CS BUILDING
MINNEAPOLIS, MN 55455
YEW@CS.UMN.EDU
Editorial Board

ESMOND G. NG
Lawrence Berkeley Nat'l Lab.
egng@lbl.gov

STEPHAN OLARIU
Old Dominion University
olariu@cs.odu.edu

KRISHNA V. PALEM
Courant Inst. of Math. Sciences
palem@cs.nyu.edu

KESHAV PINGALI
Cornell University
pingali@cs.cornell.edu

YVES ROBERT
École Normale Supérieure de Lyon
Yves.Robert@ens-lyon.fr

LUI SHA
Univ. of Illinois, Urbana-Champaign
lrs@cs.uiuc.edu

RICHARD D. SCHLICHTING
AT&T Shannon Laboratory
rick@research.att.com

RAMESH K. SITARAMAN
Univ. of Massachusetts
ramesh@cs.umass.edu

MOOTAZ ELNOZAHY
IBM Austin Research Lab
mootaz@us.ibm.com

BRUCE MAGGS
Carnegie Mellon University
bmm@cs.cmu.edu

MARGARET MARTONOSI
Princeton University
martonosi@princeton.edu

PHILIP MCKINLEY
Michigan State University
mckinley@cps.msu.edu

ASHWINI NANDA
IBM T.J. Watson Research Center
ashwini@us.ibm.com

TIMOTHY M. PINKSTON
University of Southern California
tpink@charity.usc.edu

NEERAJ SURI
Chalmers University
suri@ce.chalmers.se

JIE WU
Florida Atlantic University
jie@cse.fau.edu

TAO YANG
Univ. of California, Santa Barbara
tyang@cs.ucsb.edu

YUANYUAN YANG
SUNY, Stony Brook
yang@ece.sunysb.edu

WEI ZHAO
Texas A&M Univ.
zhao@cs.tamu.edu

TAIEB ZNATI
University of Pittsburgh
znati@cs.pitt.edu

WILLY ZWAENEPOEL
Rice University
willy@cs.rice.edu

NANCY AMATO
Texas A&M University
amato@cs.tamu.edu

DOUG BLOUGH
Georgia Institute of Technology
doug.blough@ece.gatech.edu

ALOK N. CHOUDHARY
Northwestern University
choudhar@ece.nwu.edu

MICHEL COSNARD
INRIA
Michel.Cosnard@inria.fr

SANDHYA DWARKADAS
University of Rochester
sandhya@cs.rochester.edu
MANUSCRIPT SUBMISSIONS/STATUS INQUIRIES: For information on submitting a manuscript or on a paper awaiting publication, please contact: Transactions Assistant, TPDS, IEEE Computer Society, 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1314 USA; EMAIL: tpds@computer.org; PHONE: +1 714.821.8380; FAX: +1 714.821.4010
IEEE COMPUTER SOCIETY

Officers

WILLIS KING, President
STEPHEN L. DIAMOND, President-Elect
BEN WAH, Past President
RANGACHAR KASTURI, VP, Publications
GERALD ENGEL, VP, Conferences & Tutorials
CARL CHANG, VP, Educational Activities
JAMES H. CROSS, VP, Chapter Activities
LOWELL JOHNSON, Second VP, Standards Activities
DEBORAH SCHERRER, First VP, Technical Activities
DEBORAH M. COOPER, Secretary
WOLFGANG GILOI, Treasurer
JAMES D. ISAAK, IEEE Division V Director
THOMAS W. WILLIAMS, 2001-2002 IEEE Division VIII Director
DAVID HENNAGE, Executive Director

Members-at-Large
ANGELA BURGESS (ex officio)
THOMAS KEEFE
GABRIELLA SANNITI DI BAJA
ANAND TRIPATHI

Publications Board
Vice President: RANGACHAR KASTURI
Vice Chair: RICHARD ffieuiTERER
Magazine Operations Chair: GEORGE CYBENKO
Transactions Operations Chair: STEVEN L. TANIMOTO
Publications Operations Chair: MARK CHRISTENSEN
IEEE PAB Liaison: RANGACHAR KASTURI
Magazines
Annals of the History of Computing
Computing in Science & Engineering
Computer
Computer Graphics & Applications
Design & Test
Intelligent Systems
Internet Computing
IT Professional
Micro
Multimedia
Pervasive Computing
Software

Editors-in-Chief
THOMAS J. BERGIN
FRANCIS SULLIVAN
JAMES H. AYLOR
JAMES J. THOMAS
FRANK FERRANTE
DANIEL E. O1,ss.
MUNINDAR P. SINGH
RAJESH GUPTA
KEN SAKAMURA
FOROUZAN GOLSHANI
M. SATYANARAYANAN
STEVEN C. MCCONNELL
Transactions
Computers
Knowledge & Data Engineering
Mobile Computing
Multimedia
Networking
Parallel & Distributed Systems
Pattern Analysis & Machine Intelligence
Software Engineering
Very Large Scale Integration
Visualization & Computer Graphics

Editors-in-Chief
JEAN-LUC GAUDIOT
PHILIP S. YU
TOM LA PORTA
TSUHAN CHEN
MIKE LIU
PEN YEW
RAMA CHELLAPPA
JOHN KNIGHT
EBY FRIEDMAN
HANS HAGEN

IEEE CS Press: MICHAEL R. WILLIAMS
Executive Staff
DAVID HENNAGE, Executive Director
ANGELA BURGESS, Publisher
VIOLET S. DOAN, Chief Financial Officer
ANNE MARIE KELLY, Director, Volunteer Services
JOHN C. KEATON, Manager, Research and Planning
ROBERT CAM, Director, Information Technology & Services

Transactions Department
ALICIA L. BARRETT, Production Manager
SUZANNE WERNER, Peer Review Supervisor
HEATHER HAWTHORNE, KATHY SANTA MARIA, Production Editors
YU-TZU TSAI, JULIE TURNBAUGH, Electronic Media Assistants
SELNA NORMAN, Transactions Assistant
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS is published monthly by the IEEE Computer Society. IEEE Corporate Office: Three Park Avenue, 17th Floor, New York, NY 10016-5997 USA. Responsibility for the content rests upon the authors and not upon the IEEE or the IEEE Computer Society. IEEE Computer Society Publications Office: 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1314 USA. IEEE Computer Society Headquarters: 1730 Massachusetts Ave. NW, Washington, DC 20036-1992 USA. Back issues: IEEE members $10.00 (first copy only), nonmembers $20.00 per copy. (Note: Add $4.00 postage and handling charge to any order from $1.00 to $50.00, including prepaid orders.) Complete price information available on request. Also available in microfiche. Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923 USA. For all other copying, reprint, or republication permission, write to: Copyrights and Permissions Department, IEEE Publications Administration, 445 Hoes Lane, PO Box 1331, Piscataway, NJ 08855-1331. Copyright © 2002 by The Institute of Electrical and Electronics Engineers, Inc. All rights reserved. Periodicals postage paid at New York, NY, and at additional mailing offices. Postmaster: Send address changes to IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, IEEE Service Center, 445 Hoes Lane, PO Box 1331, Piscataway, NJ 08855-1331 USA. GST Registration No. 125634188. Canada Post Publications Mail Agreement Number 0487783. Printed in USA.
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 13, NO. 3, MARCH 2002    193
An Efficient Partitioning Algorithm for Distributed Virtual Environment Systems

John C.S. Lui, Member, IEEE, and M.F. Chan, Student Member, IEEE
Abstract—Distributed virtual environment (DVE) systems model and simulate the activities of thousands of entities interacting in a virtual world over a wide area network. Possible applications for DVE systems are multiplayer video games, military and industrial training, and collaborative engineering. In general, a DVE system is composed of many servers and each server is responsible for managing multiple clients who want to participate in the virtual world. Each server receives updates from different clients (such as the current position and orientation of each client) and then delivers this information to other clients in the virtual world. The server also needs to perform other tasks, such as object collision detection and synchronization control. A large scale DVE system needs to support many clients and this imposes a heavy requirement on networking resources and computational resources. Therefore, how to meet the growing requirement of bandwidth and computational resources is one of the major challenges in designing a scalable and cost-effective DVE system. In this paper, we propose an efficient partitioning algorithm that addresses the scalability issue of designing a large scale DVE system. The main idea is to dynamically divide the virtual world into different partitions and then efficiently assign these partitions to different servers. This way, each server will process approximately the same amount of workload. Another objective of the partitioning algorithm is to reduce the server-to-server communication overhead. The theoretical foundation of our dynamic partitioning algorithm is based on the linear optimization principle. We also illustrate how one can parallelize the proposed partitioning algorithm so that it can efficiently partition a very large scale DVE system. Lastly, experiments are carried out to illustrate the effectiveness of the proposed partitioning algorithm under various settings of the virtual world.
Index Terms—Distributed virtual environment, scalability issue, partitioning algorithm, load balancing, communication reduction, linear optimization.
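The load-balancing objective stated in the abstract, dividing the virtual world so that every server carries roughly the same workload, can be made concrete with a small sketch. The C program below is not the authors' linear-optimization algorithm; it is a hedged toy substitute (a greedy longest-processing-time assignment) with invented partition workloads and names, shown only to fix intuition for what "each server will process approximately the same amount of workload" means.

#include <stdio.h>
#include <stdlib.h>

/* Toy greedy assignment of virtual-world partitions to servers.
 * The paper's real algorithm is based on linear optimization; this
 * sketch only conveys the goal of equalizing per-server workload.
 * All names and workload figures are invented for illustration. */

#define NPART 8
#define NSERV 3

typedef struct { int id; double workload; } Partition;

static int by_workload_desc(const void *a, const void *b) {
    double wa = ((const Partition *)a)->workload;
    double wb = ((const Partition *)b)->workload;
    return (wa < wb) - (wa > wb);   /* sort heaviest first */
}

int main(void) {
    Partition parts[NPART] = {
        {0, 4.0}, {1, 1.5}, {2, 3.2}, {3, 0.7},
        {4, 2.9}, {5, 1.1}, {6, 5.3}, {7, 2.0}
    };
    double load[NSERV] = {0.0};

    /* Heaviest partitions first, each to the least-loaded server. */
    qsort(parts, NPART, sizeof parts[0], by_workload_desc);
    for (int i = 0; i < NPART; i++) {
        int best = 0;
        for (int s = 1; s < NSERV; s++)
            if (load[s] < load[best]) best = s;
        load[best] += parts[i].workload;
        printf("partition %d -> server %d\n", parts[i].id, best);
    }
    for (int s = 0; s < NSERV; s++)
        printf("server %d load = %.1f\n", s, load[s]);
    return 0;
}

A real DVE partitioner must also account for the server-to-server traffic created when adjacent partitions land on different servers; that communication term is precisely what the paper's linear-optimization formulation adds on top of the balance objective.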
1 INTRODUCTION

ADVANCES in multimedia systems, parallel/distributed database systems, and high speed networking technologies enable system designers to build a distributed system which allows many users to explore and interact under a three dimensional virtual environment. In general, a 3D virtual environment is a virtual world which consists of many high-resolution 3D graphics sceneries to represent a real-life world. For example, we can have a 3D virtual world to represent a lecture hall so that hundreds of students and scientists can listen to a seminar presented by Professor Daniel C. Tsui,1 or we can have a large 3D virtual world to represent the latest COMDEX show which has many customers reviewing the latest softwares and electronic gadgets. This type of shared, computer-resident virtual world is called a distributed virtual environment (DVE) [4]. Like other ground-breaking computer technologies, DVE will change the way we learn, work, and interact with other people in the society.

1. A 1998 Nobel Prize winner in Physics for the discovery of a new form of quantum fluid with fractionally charged excitations.
J.C.S. Lui is with the Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong. E-mail: cslui@cse.cuhk.edu.hk.
M.F. Chan is with Poly-Asia Computer Inc., Kowloon, Hong Kong. E-mail: fei@alumni.cse.cuhk.edu.hk.

Manuscript received 11 Jan. 2000; revised 21 Apr. 2001; accepted 13 Aug. 2001.
For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number 111201.
To illustrate how a DVE system can change our lifestyles and the way we handle our business operation, let us consider the following situation. Let's say an architect from New York, a civil and a structural engineer from Paris, a financial planner from Hong Kong, and an interior designer from Tokyo all need to have a business meeting to discuss the designing and financing issues of a new high-rise office complex. Under a DVE setting, these people can convene a meeting in a virtual world without leaving their respective homes and offices. Their meeting can be carried out in a DVE system. These participants can interact with each other in a virtual world of the new high-rise office complex that they are proposing to build. Each participant in this business meeting can virtually walk around in the proposed high-rise office building, interact with each other and carry out the discussion. For example, in this virtual high-rise office complex, each participant in the meeting is represented by a 3D object, which is known as an avatar. Each participant can walk around in this virtual office building and, in the process, rearrange any 3D object in the environment (e.g., rearrange paintings and furniture or select different kinds of carpet). Any change to a 3D object in this virtual world will be visible to all participants. Participants in this meeting are able to interact with each other in real time, as well as to inquire and to receive any relevant information of the virtual world. For example, participants can query about the credit history of a manufacturer who is responsible for producing the office furniture.

1045-9219/02/$17.00 © 2002 IEEE
290    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 13, NO. 3, MARCH 2002

RAPID-Cache—A Reliable and Inexpensive Write Cache for High Performance Storage Systems

Yiming Hu, Senior Member, IEEE, Tycho Nightingale, and Qing Yang, Senior Member, IEEE
Abstract—Modern high performance disk systems make extensive use of nonvolatile RAM (NVRAM) write caches. A single-copy NVRAM cache creates a single point of failure while a dual-copy NVRAM cache is very expensive because of the high cost of NVRAM. This paper presents a new cache architecture called RAPID-Cache for Redundant, Asymmetrically Parallel, and Inexpensive Disk Cache. A typical RAPID-Cache consists of two redundant write buffers on top of a disk system. One of the buffers is a primary cache made of RAM or NVRAM and the other is a backup cache containing a two-level hierarchy: a small NVRAM buffer on top of a log disk. The small NVRAM buffer combines small write data and writes them into the log disk in large sizes. By exploiting the locality property of I/O accesses and taking advantage of well-known Log-structured File Systems, the backup cache has write performance nearly equivalent to that of the primary RAM cache. The read performance of the backup cache is not as critical because normal read operations are performed through the primary RAM cache and reads from the backup cache happen only during error recovery periods. The RAPID-Cache presents an asymmetric architecture with a fast-write-fast-read RAM being a primary cache and a fast-write-slow-read NVRAM-disk hierarchy being a backup cache. The asymmetrically parallel architecture and an algorithm that separates actively accessed data from inactive data in the cache virtually eliminate the garbage collection overhead, which is the major problem associated with previous solutions such as Log-structured File Systems and Disk Caching Disk. The asymmetric cache allows cost-effective designs for very large write caches for high-end parallel disk systems that would otherwise have to use dual-copy, costly NVRAM caches. It also makes it possible to implement reliable write caching for low-end disk I/O systems since the RAPID-Cache makes use of inexpensive disks to perform reliable caching. Our analysis and trace-driven simulation results show that the RAPID-Cache has significant reliability/cost advantages over conventional single NVRAM write caches and has great cost advantages over dual-copy NVRAM caches. The RAPID-Cache architecture opens a new dimension for disk system designers to exercise trade-offs among performance, reliability, and cost.

Index Terms—Disks, storage systems, performance, reliability, fault-tolerance.
1 INTRODUCTION
MODERN disk I/O systems make extensive use of nonvolatile RAM (NVRAM) write caches to facilitate asynchronous writes [2], [3], [4], i.e., a write request is acknowledged before the write goes to disk. Such write caches significantly reduce the response times of disk I/O systems seen by users, particularly in RAID systems. Large write caches can also improve system throughput by taking advantage of both temporal and spatial localities [2], [5], as data may be overwritten several times or combined together before being written to the disk. I/O requests are very bursty [6]: requests often arrive together, with long intervals of relative inactivity in between. Large write caches also benefit from the burstiness of write workloads since data coming from bursts can be quickly stored in the cache
• Y. Hu is with the Department of Electrical & Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221-0030. E-mail: yhu@ececs.uc.edu.
• T. Nightingale is with Sun Microsystems, 4150 Network Circle, USAN-208, Santa Clara, CA 95054. E-mail: tycho.nightingale@sun.com.
• Q. Yang is with the Department of Electrical and Computer Engineering, University of Rhode Island, Kingston, RI 02881. E-mail: qyang@ele.uri.edu.

Manuscript received 4 Dec. 2000; accepted 20 Sept. 2001.
For information on obtaining reprints of this article, please send e-mail to: tpds@computer.org, and reference IEEECS Log Number 113249.
and written back to the disk later when the system is less busy. Treiber and Menon reported that write caches could reduce disk utilization for writes by an order of magnitude when compared to basic RAID-5 systems [3]. However, the use of write caches introduces two problems: poor reliability and high cost.

Disks are impressively reliable today, with a Mean Time To Failure (MTTF) of up to one million hours. Such a low failure rate, coupled with possible redundancy such as RAID, gives a Mean Time To Data Loss (MTTDL) of several hundreds of millions of hours in a typical RAID-5 system [7]. Adding a single cache in front of a disk system creates a single point of failure, which is vulnerable to data loss. Savage and Wilkes pointed out in [7] that because typical NVRAM technology (battery backed RAM) has a quite low MTTF of 15K hours, a single-copy NVRAM cache suffers significantly higher risk of data loss than results from disk failures. To overcome the reliability problem, some high-end RAID systems use dual-copy caches so that a failure in one cache leaves the other cache intact [2]. When a write request comes, the controller writes two copies of the data independently into the two caches, a primary cache and a backup cache. Besides the reliability problem, NVRAM is also known to be very costly [7], [8], [9], so the size of the
NVRAM cache is often limited. For example, a major NVRAM manufacturer quoted the price of NVRAM with embedded lithium-cell batteries at $55/MB in quantity as of December 2000. The cost of disks, on the other hand, is about 0.5 cents/MB, which is a difference of four orders of magnitude. Moreover, the cost difference is widening (the difference was three orders of magnitude two years ago) because prices of disks are falling very rapidly. For a disk system with a reasonably sized write cache, the NVRAM may dominate the cost of the entire system. For example, in a system with 16 disks (40 GB per disk) and an NVRAM write cache of 256 MB, at $55/MB, the NVRAM costs about $14,080, while the total cost of 16 disks is only $3,200 (assuming each 40-GB disk costs $200). If we use dual-copy caches to ease the reliability problem of the single-copy cache, the cost becomes prohibitively high, particularly for large caches. As a result, it is only suitable for the upper echelon of the market [10].
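Since the cost comparison above is purely arithmetic, it can be double-checked mechanically. The short C program below simply recomputes the quoted figures; the unit prices and sizes come from the text, while the variable names are ours.

#include <stdio.h>

/* Recompute the cost comparison quoted in the text:
 * 256 MB of NVRAM at $55/MB versus 16 disks of 40 GB at $200 each. */
int main(void) {
    const double nvram_mb = 256.0;
    const double nvram_price_per_mb = 55.0;
    const int    ndisks = 16;
    const double disk_price = 200.0;

    double nvram_cost = nvram_mb * nvram_price_per_mb;  /* $14,080 */
    double disk_cost  = ndisks * disk_price;            /* $3,200  */

    printf("NVRAM cache cost : $%.0f\n", nvram_cost);
    printf("16 x 40-GB disks : $%.0f\n", disk_cost);
    printf("cache/disk ratio : %.1fx\n", nvram_cost / disk_cost);
    return 0;
}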
The standard dual-copy write cache system has a symmetric structure, where both the primary write cache and the backup write cache have the same size and the same access characteristics: fast read speed and fast write speed. However, the backup cache does not provide any performance benefit to the system during normal operations. Therefore, it is wasteful to use a backup cache identical to the primary cache. What is needed is only a backup cache that can be written to very quickly, while its read operations are not as critical since reads from the backup cache occur only during error-recovering periods.
Based on these observations, we propose a new disk cache architecture called Redundant, Asymmetrically Parallel, Inexpensive Disk Cache, or RAPID-Cache for short, to provide fault-tolerant caching for disk I/O systems inexpensively. The main idea of the RAPID-Cache is to use a conventional, fast-write-fast-read primary cache and a nonvolatile, fast-write-slow-read backup cache. The primary cache is made of normal NVRAM or DRAM, while the backup cache consists of a small NVRAM cache and a log disk (cache disk). In the backup cache, small and random writes are first buffered in the small NVRAM buffer to form large logs that are written into the cache disk later in large transfers, similar to log structured file systems [11], [12], [13], [14]. Because large writes eliminate many expensive small writes, the buffer is quickly made available for additional requests so that the two-level cache appears to the host as a large NVRAM. As a result, the backup cache can achieve the same write speed as the primary cache. The slow-read performance of the backup cache does not affect the system performance since every data block in the backup cache has a copy in the primary cache which can be read at the speed of RAM. The dual cache system here is asymmetric since the primary cache and the backup cache have different sizes and structures. The reliability of the RAPID-Cache is expected to be high since disk is very reliable. The system is also inexpensive because the NVRAM in the backup cache can be very small, ranging from hundreds of KB to several MB, and the cost of the disk space is significantly less than that of a large NVRAM. We will show that RAPID-Caches provide much higher reliability compared to single-copy NVRAM caches and much lower cost compared to dual-copy NVRAM caches, without sacrificing performance. On the other hand, because of its low cost, with the same budget, RAPID-Caches can have significantly higher performance compared to conventional NVRAM cache architectures by affording much larger primary cache sizes, while still maintaining good reliability.
While the idea of RAPID-Cache can be used in any I/O system, it is particularly suitable for parallel disk systems such as RAID because RAID systems are most likely to be used in environments which require high performance and high reliability. Therefore, we concentrate our study on RAPID-Caches on top of RAID-5 systems in this paper. We have carried out trace-driven simulation experiments as well as analytical studies to evaluate the performance and reliability of the RAPID-Cache. Using real-world traces as well as synthetic traces generated based on realistic workloads [6], [15], we analyze the performance of the RAPID-Cache architecture and compare it with existing disk cache architectures. Numerical results show that the RAPID-Cache has significant performance/cost and reliability advantages over the existing architectures.

The paper is organized as follows: The next section presents the detailed architecture and operations of the RAPID-Cache. Section 3 presents our experimental methodology. Simulation results will be presented in Section 4, followed by an approximate reliability and cost analysis in Section 5. We discuss related work in Section 6 and conclude the paper in Section 7.
2 ARCHITECTURE AND OPERATIONS

Fig. 1 shows the basic structure of a RAPID-Cache. It consists of a conventional primary RAM cache and a backup cache. The backup cache is a two-level hierarchy with a small NVRAM on top of a cache disk, similar to DCD [16]. In a RAPID-Cache, every I/O write operation is sent to both the primary cache and the backup cache, while read operations are performed using the primary cache only.
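The write and read disciplines just described (writes mirrored to both caches, reads served by the primary alone) can be sketched in a few lines of C. This is our own toy illustration, with small in-memory arrays standing in for the NVRAM, the log disk, and the data disks; none of the helper names come from the paper.

#include <stdio.h>
#include <string.h>

#define NBLK  16
#define BLKSZ 8

static char primary_cache[NBLK][BLKSZ];
static int  primary_valid[NBLK];
static char backup_store[NBLK][BLKSZ];   /* stands in for the NVRAM + log disk */
static int  backup_valid[NBLK];

/* Every write goes to BOTH caches before it is acknowledged. */
static void cache_write(int lba, const char *buf) {
    memcpy(primary_cache[lba], buf, BLKSZ);  /* fast-write-fast-read copy */
    primary_valid[lba] = 1;
    memcpy(backup_store[lba], buf, BLKSZ);   /* fast-write-slow-read copy */
    backup_valid[lba] = 1;
}

/* Reads are served from the primary cache only; the slow-read backup
 * is consulted only during error recovery, never on the normal path. */
static int cache_read(int lba, char *buf) {
    if (!primary_valid[lba])
        return 0;          /* a real system would fall through to the disks */
    memcpy(buf, primary_cache[lba], BLKSZ);
    return 1;
}

int main(void) {
    char out[BLKSZ];
    cache_write(3, "blkdata!");
    if (cache_read(3, out))
        printf("read LBA 3: %.8s\n", out);
    return 0;
}

Note that the backup's slow read speed never appears in cache_read; keeping the backup entirely off the normal read path is exactly what makes the asymmetric design viable.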
For very high overall reliability, the primary cache can be NVRAM to provide redundant protection during a power failure. On the other hand, for low cost systems, the primary cache can be DRAM. During normal operations, the DRAM primary cache and the backup cache contain redundant data. If any one of the two caches fails, data can be reconstructed from the other. During a power failure, data are retained in the backup NVRAM and the cache disk. If both the read cache and the primary write cache are made of DRAM, we can use a unified read/write cache structure, as shown in Fig. 2a, for better cache utilization. A RAPID-Cache with a large unified DRAM primary cache has higher throughput, lower cost, and better reliability than that of a single-copy conventional NVRAM cache. For many applications that require redundant protection during a power failure, a Triple RAPID-Cache (shown in Fig. 2b) can be used to build a highly reliable, very large cache system. The idea is to use two low-cost backup caches to support one large primary DRAM cache. During normal operations, the primary cache and the two backup caches provide triple redundancy protection. The two backup caches provide dual-redundancy protection during a power failure. Triple RAPID-Caches are especially suitable for high-end systems that need very large and very reliable write caches. For large cache sizes, a Triple RAPID-Cache has lower cost and better reliability than both a RAPID-Cache with a large NVRAM primary cache and a conventional dual-copy NVRAM cache.

[Fig. 1. RAPID-Cache on top of a disk system.]
2.1 Structures of the Backup Cache

Fig. 3 shows the detailed structures of the backup NVRAM cache and the cache disk. The NVRAM cache consists of an LRU Cache, two to four Segment Buffers, and a Hash Table. Another related data structure, called the Disk Segment Table, is located in a DRAM buffer.

The actively-accessed data in the backup cache reside in the LRU cache. The less actively-accessed data are kept in the cache disk. Data in the cache disk are organized in the format of Segments, similar to that in a Log-structured File System such as the Sprite LFS and the BSD LFS [12], [13]. A segment contains a number of slots, each of which can hold one data block. Data blocks stored in segments are addressed by their Segment IDs and Slot IDs. Data blocks stored in the LRU cache are addressed by their Logical Block Addresses (LBAs). The Hash Table contains location information for each of the valid data blocks in the backup cache. It describes whether a block is in the NVRAM LRU cache or in the cache disk, as well as the data address in the LRU cache or the cache disk. In our current design, the size of the hash entry is 16 bytes. Since data in the backup cache is the exact image of the data in the primary write cache, the total number of valid data blocks in the backup cache is the same as in the primary write cache, regardless of the sizes of the backup NVRAM and the cache disk. If the data block size is 8 KB, then, for a 32 MB write cache, there are 4,096 blocks in total. Since each valid block has a corresponding hash entry of 16 bytes, the total hash table size is 64 KB, which is compact enough to be placed in the NVRAM.
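As a concrete illustration of this bookkeeping, the sketch below shows one plausible 16-byte hash-entry layout and rechecks the sizing arithmetic. Only the 16-byte entry size, the 8 KB block size, and the 32 MB cache size come from the text; the field names and widths are our assumptions, since the paper does not give the exact entry format.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout of one 16-byte Hash Table entry. Each valid
 * block is located either in the NVRAM LRU cache or on the cache
 * disk (by Segment ID and Slot ID); field widths are invented, only
 * the 16-byte total comes from the text. */
typedef struct {
    uint64_t lba;         /* Logical Block Address of the cached block */
    uint32_t location;    /* 0 = NVRAM LRU cache, 1 = cache disk */
    uint16_t segment_id;  /* meaningful when location == cache disk */
    uint16_t slot_id;
} hash_entry_t;

int main(void) {
    /* Sizing example from the text: 32 MB write cache, 8 KB blocks. */
    const uint64_t cache_bytes = 32ull << 20;
    const uint64_t block_bytes = 8ull << 10;
    uint64_t nblocks = cache_bytes / block_bytes;           /* 4,096 */
    uint64_t table_bytes = nblocks * sizeof(hash_entry_t);  /* 64 KB */

    printf("entry size   : %zu bytes\n", sizeof(hash_entry_t));
    printf("valid blocks : %llu\n", (unsigned long long)nblocks);
    printf("hash table   : %llu KB\n", (unsigned long long)(table_bytes >> 10));
    return 0;
}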
[Fig. 2. Unified RAPID-Cache and triple RAPID-Cache. (a) A RAPID-Cache with a unified read/write cache. (b) A triple RAPID-Cache with two cache disks.]

[Fig. 3. The detailed structure of the backup cache and the cache disk.]

[Fig. 4. Physical RAPID-Cache and logical RAPID-Cache. (a) A RAPID-Cache with a dedicated physical cache disk. (b) A RAPID-Cache with a distributed logical cache disk.]

For the purpose of speeding up garbage collection, we also keep track of a data structure called the Disk Segment Table. Since this table contains redundant information that
can be quickly and completely reconstructed from the Hash Table in case of a crash, it is stored in the DRAM. We will discuss the structure of the Disk Segment Table in detail later in this section.
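That reconstruction can be a single pass over the Hash Table. The sketch below, reusing the invented hash_entry_t layout from the previous example, rebuilds a per-segment count of valid slots, which is the kind of bookkeeping a garbage collector needs to pick nearly empty segments; the paper does not give this code, so treat it as an assumption-laden illustration.

#include <stdint.h>
#include <stdio.h>

/* hash_entry_t as sketched in the previous example (layout invented). */
typedef struct {
    uint64_t lba;
    uint32_t location;    /* 0 = NVRAM LRU cache, 1 = cache disk */
    uint16_t segment_id;
    uint16_t slot_id;
} hash_entry_t;

#define NSEGMENTS 1024

/* One pass over the Hash Table recounts the valid slots in each
 * cache-disk segment, rebuilding the (DRAM-resident) Disk Segment
 * Table after a crash. */
static void rebuild_segment_table(const hash_entry_t *table, int nentries,
                                  int valid_slots[NSEGMENTS]) {
    for (int s = 0; s < NSEGMENTS; s++)
        valid_slots[s] = 0;
    for (int i = 0; i < nentries; i++)
        if (table[i].location == 1)        /* block lives on the cache disk */
            valid_slots[table[i].segment_id]++;
}

int main(void) {
    hash_entry_t table[3] = {
        {100, 1, 7, 0}, {101, 0, 0, 0}, {102, 1, 7, 3}
    };
    static int valid_slots[NSEGMENTS];
    rebuild_segment_table(table, 3, valid_slots);
    printf("segment 7 valid slots: %d\n", valid_slots[7]);  /* prints 2 */
    return 0;
}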
The cache disk in the backup cache can be a dedicated physical disk, as shown in Fig. 4a. It can also be distributed among the data disks of the RAID system, each data disk having a small partition acting as a part of a large distributed logical cache disk, as shown in Fig. 4b. In modern RAID systems, the physical cache disk can often be implemented without extra cost since many modern RAID systems include one or several spare disks that can be put into service when an active disk fails [17]. However, as pointed out by Wilkes et al. [17], during normal operations, the spare disks are not used in many systems1 and contribute nothing to the performance of the system. It is also hard to tell if the spare disks are still working since they are not in use. Such a spare disk can therefore be used as a physical cache disk of a RAPID-Cache. A secondary benefit here is that now we are aware of whether the spare disk is in working condition or not, and we are able to replace the failed one before it is too late. When a spare disk becomes an active disk to replace a failed one, the RAPID-Cache can be degraded to the logical cache disk mode by using a partition residing on the spare disk until the failed disk is replaced and a new spa

1. In a system using distributed sparing, the spare disk is utilized during normal operation.