ACM Transactions on Computer Systems

Editor-in-Chief
Kenneth P. Birman, Department of Computer Science, Cornell University, Ithaca, NY 14853; 607-255-9199; Internet address: ken@cs.cornell.edu

Electronic Editor
Sape Mullender, Faculty of Computer Science, University of Twente, P.O. Box 217, 7500 AE Enschede, Netherlands; mullender@cs.utwente.nl

Associate Editors
Ozalp Babaoglu, Department of Computer Science, University of Bologna, Piazza di Porta S. Donato 5, 40127 Bologna, Italy
Maurice Herlihy, Brown University, Computer Science Department, Box 1910, Providence, RI 02912; 401-863-7646; herlihy@cs.brown.edu
Frans Kaashoek, Laboratory for Computer Science, MIT, 545 Technology Square, Cambridge, MA 02139; kaashoek@lcs.mit.edu
Monica Lam, CIS 210, Stanford University, Stanford, CA 94305; 415-725-3714
Henry Levy, Department of Computer Science and Engineering, FR-35, University of Washington, Seattle, WA 98195; 206-543-9204; levy@cs.washington.edu
Larry L. Peterson, Department of Computer Science, University of Arizona, Tucson, AZ 85721; 602-621-4231; llp@cs.arizona.edu

Headquarters Quarterlies Staff
Mark Mandelbaum, Director of Publications
Nhora Cortes-Comerer, Associate Director of Publications
Roma Simon, Managing Editor, ACM Quarterlies
George Criscione, Associate Managing Editor, ACM Quarterlies

ACM European Service Center: Avenue Marcel Thiry 204, 1200 Brussels, Belgium; Tel. 32 2 774 9602; Fax: 32 2 774 9690; Email: acm_europe@acm.org

ACM Transactions on Computer Systems (ISSN 0734-2071) is published quarterly by ACM, 1515 Broadway, New York, NY 10036; (212) 869-7440. Second-class postage paid at New York, NY 10001, and at additional mailing offices. Printed in the U.S.A. Postmaster: Send address changes to Transactions on Computer Systems, ACM, 1515 Broadway, New York, NY 10036.

Copyright © 1995 by the Association for Computing Machinery, Inc. (ACM). Copying without fee is permitted provided that copies are not made or distributed for profit or commercial advantage and credit to the source is given. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to distribute to lists requires prior specific permission and/or a fee. For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. Request permission to republish from: Publications Dept., ACM, Inc. Fax: +1 (212) 869-0481 or permission@acm.org.

For subscription information, see inside back cover.
The Zebra Striped Network File System

JOHN H. HARTMAN
University of Arizona
and
JOHN K. OUSTERHOUT
Sun Microsystems Laboratories, Inc.
Zebra is a network file system that increases throughput by striping file data across multiple servers. Rather than striping each file separately, Zebra forms all the new data from each client into a single stream, which it then stripes using an approach similar to a log-structured file system. This provides high performance for writes of small files as well as for reads and writes of large files. Zebra also writes parity information in each stripe in the style of RAID disk arrays; this increases storage costs slightly, but allows the system to continue operation while a single storage server is unavailable. A prototype implementation of Zebra, built in the Sprite operating system, provides 4-5 times the throughput of the standard Sprite file system or NFS for large files and a 15-300% improvement for writing small files.
Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management—allocation/deallocation strategies; secondary storage; D.4.3 [Operating Systems]: File Systems Management—access methods; distributed file systems; file organization; D.4.5 [Operating Systems]: Reliability—fault-tolerance; D.4.7 [Operating Systems]: Organization and Design—distributed systems; D.4.8 [Operating Systems]: Performance—measurements; E.5 [Data]: Files—organization/structure

General Terms: Design, Measurement, Performance, Reliability

Additional Key Words and Phrases: Log-based striping, log-structured file system, parity computation, RAID
This work was supported in part by NSF grant CCR-89-00029, NASA/ARPA grant NAG2-591, NSF grant MIP-87-15235, ARPA contract N00600-93-C-2481, and the California MICRO Program.

Authors' addresses: J. H. Hartman, Department of Computer Science, Gould-Simpson Building, The University of Arizona, Tucson, AZ 85721; email: jhh@cs.arizona.edu; J. K. Ousterhout, Sun Microsystems Laboratories, Inc., 2550 Garcia Avenue, MS UMTV29-232, Mountain View, CA 94043-1100; email: john.ousterhout@eng.sun.com.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

© 1995 ACM 0734-2071/95/0800-0274 $03.50

ACM Transactions on Computer Systems, Vol. 13, No. 3, August 1995, Pages 274-310.

1. INTRODUCTION

Zebra is a network file system that uses multiple file servers to provide greater throughput and availability than can be achieved with a single server. Clients stripe file data across servers so that different pieces of data
are stored on different servers. Striping makes it possible for a single client to keep several servers busy, and it distributes the load among the servers to reduce the likelihood of hot spots. Zebra also stores parity information in each stripe, allowing it to continue operation while any one server is unavailable.

In current network file systems the read and write bandwidth for a single file is limited by the performance of a single server, including its memory bandwidth and the speed of its processor, network interface, I/O buses, and disks. It is possible to split a file system among multiple servers, but each file must reside on a single server, and it is difficult to balance the loads of the different servers. For example, system directories often lie on a single server, making that server a hot spot.

In the future, new styles of computing such as multimedia and parallel computation are likely to demand much greater throughput than today's applications, making the limitations of a single server even more severe. For example, a single video playback can consume a substantial fraction of a file server's bandwidth even when the video is compressed. A cluster of workstations can easily exceed the bandwidth of a file server if they all run video applications simultaneously, and the problems will become much worse when video resolution increases with the arrival of HDTV. Another example is parallel applications. Several research groups are exploring the possibility of using collections of workstations connected by high-speed low-latency networks to run massively parallel applications [Anderson et al. 1995; Freeh et al. 1994]. These "distributed supercomputers" are likely to present I/O loads equivalent to traditional supercomputers, which cannot be handled by today's network file servers.

A striping file system offers the potential to achieve very high performance using collections of inexpensive computers and disks. Several striping file systems have already been built, such as Swift [Cabrera and Long 1991] and Bridge [Dibble et al. 1988]. These systems are similar in that they stripe data within individual files, so that only large files benefit from the striping. Zebra uses a different approach borrowed from log-structured file systems (LFS) [Rosenblum and Ousterhout 1991]. Each client forms its new data for all files into a sequential log that it stripes across the storage servers. This not only improves large-file performance through striping, but it also improves small-file writes by batching them together and writing them to the servers in large, efficient transfers. It also reduces network overhead, simplifies the storage servers, and spreads write traffic uniformly across the servers.

Zebra's style of striping also makes it easy to use redundancy techniques from RAID disk arrays to improve availability and data integrity [Patterson et al. 1988]. One of the fragments of each stripe stores parity for the rest of the stripe, allowing the stripe's data to be reconstructed in the event of a disk or server failure. Zebra can continue operation while a server is unavailable. Even if a server is totally destroyed, Zebra can reconstruct the lost data.

We have constructed a prototype implementation of Zebra as part of the Sprite operating system [Ousterhout et al. 1988]. Although it does not incorporate all the reliability and recovery aspects of the Zebra architecture, it does demonstrate the performance benefits. For reads and writes of large
files the prototype achieves up to 2.6 MB/second for a single client with five servers, which is 4-5 times the throughput of either NFS or the standard Sprite file system on the same hardware. For small files the Zebra prototype improves performance by more than a factor of 3 over NFS. The improvement over Sprite is only about 15%, however. This is because both Zebra and Sprite require the client to notify the file server of file opens and closes, and when writing small files these notifications dominate the running time. With the addition of file name caching to both systems, Zebra is expected to have more of an advantage over Sprite.

The rest of the article is organized as follows. Section 2 describes the computing environment for which Zebra is intended, and the types of failures it is designed to withstand. Section 3 describes the RAID and log-structured file system technologies used in Zebra and introduces Zebra's logging approach. Section 4 describes the structure of Zebra, which consists of clients, storage servers, a file manager, and a stripe cleaner. Section 5 shows how the components of the system work together in normal operation; communication between the components is based on deltas, which describe file block creations, updates, and deletions. Section 6 describes how Zebra restores consistency to its data structures after crashes. Section 7 shows how the system provides service while components are down. Section 8 gives the status of the Zebra prototype and presents some performance measurements. Section 9 discusses related work. Section 10 concludes.

2. ZEBRA APPLICABILITY

Zebra makes several assumptions concerning its computing environment and the types of failures that it will withstand. Zebra is designed to support UNIX workloads as found in office and engineering environments. These workloads are characterized by short file lifetimes, sequential file accesses, infrequent write-sharing of files by different clients, and many small files [Baker et al. 1991]. This environment is also notable because of the behavior it does not exhibit: namely, random accesses to files. Zebra is, therefore, designed to handle sequential file accesses well, perhaps at the expense of random file accesses. In particular, this means that Zebra may not be suitable for running database applications, which tend to update and read large files randomly. This is not to say that the Zebra design precludes good performance on such a workload, but that the current design has not been tuned to improve random-access performance.

Zebra is also targeted at high-speed local area networks; it assumes that in a data transfer between a client and server the point-to-point bandwidth of the network is not a bottleneck. Zebra is also not designed to handle network partitions. New point-to-point network architectures, such as ATM, typically include redundant links that reduce the probability of a network partition and make partitions less of a concern in the design of a network file system for use on a local area network.

Zebra also assumes that clients and servers will have large main-memory caches to store file data. These caches serve two purposes: to allow frequently
used data to be buffered and accessed in memory, without requiring an access to the server or the disk, and to buffer newly written file data prior to writing it to the server or the disk. The former filters out accesses to data that are frequently read, whereas the latter filters out short-lived data and allows Zebra to batch together many small writes by application programs into large writes to the servers.

Zebra is designed to provide file service despite the loss of any single machine in the system. Multiple server failures are not handled; the loss of a second server causes the system to cease functioning, and data may be lost if disks fail catastrophically on two servers at the same time. Any number of clients may fail, however, without affecting the availability of file data. A client crash may lose newly written data cached on that client, but it cannot lose data older than a time limit, nor can it lose data written by another client. This is analogous to losing the data stored in a UNIX file system cache when the machine crashes.

3. STRIPING IN ZEBRA

Zebra distributes file data over several file servers while ensuring that the loss of a single server does not affect the availability of the data. To do this, Zebra borrows from two recent innovations in the management of disk storage systems: RAID technology (Redundant Arrays of Inexpensive Disks) [Patterson et al. 1988] and log-structured file systems (LFS) [Rosenblum and Ousterhout 1991]. RAID technology allows Zebra to provide scalable file access performance while using parity instead of redundant copies to guard against server failures. The log-structured approach simplifies the parity implementation, reduces the impact of managing and storing parity, and allows clients to batch together small writes to improve the efficiency of writing to the servers.

3.1 RAID

RAID is a storage system architecture in which many small disks work together to provide increased performance and data availability. A RAID appears to higher-level software as a single very large and fast disk. Transfers to or from the disk array are divided into blocks called striping units. Consecutive striping units are assigned to different disks in the array as shown in Figure 1 and can be transferred in parallel. A group of consecutive striping units that spans the array is called a stripe. Large transfers can proceed at the aggregate bandwidth of all the disks in the array, or multiple small transfers can be serviced concurrently by different disks.

Because a RAID has more disks than a traditional disk storage system, disk failures will occur more often. Furthermore, a disk failure anywhere in a RAID can potentially make the entire disk array unusable. To improve data integrity, a RAID reserves one of the striping units within each stripe for parity instead of data (see Figure 1); each bit of the parity striping unit contains the exclusive OR of the corresponding bits of the other striping units in the stripe. If a disk fails, each of its striping units can be recovered using
the data and parity from the other striping units of the stripe. The file system can continue operation during recovery by reconstructing data on-the-fly.

Fig. 1. Striping with parity. The storage space of a RAID disk array is divided into stripes, each stripe containing a striping unit on each disk of the array. All but one of the striping units hold data; the remaining striping unit holds parity information that can be used to recover after a disk failure.
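To make the parity mechanism concrete, the following minimal sketch computes a stripe's parity unit as the bitwise exclusive OR of its data units and rebuilds a lost unit from the parity and the survivors. It is illustrative only; the Go types, function names, and tiny unit sizes are assumptions, not code from the paper.

    package main

    import "fmt"

    // xorInto accumulates src into dst (dst ^= src), byte by byte.
    func xorInto(dst, src []byte) {
        for i := range dst {
            dst[i] ^= src[i]
        }
    }

    // parity returns the bitwise XOR of all data striping units in a stripe.
    func parity(units [][]byte) []byte {
        p := make([]byte, len(units[0]))
        for _, u := range units {
            xorInto(p, u)
        }
        return p
    }

    // reconstruct rebuilds the striping unit lost on a failed disk by
    // XOR-ing the parity unit with the surviving data units.
    func reconstruct(parityUnit []byte, survivors [][]byte) []byte {
        lost := append([]byte(nil), parityUnit...)
        for _, u := range survivors {
            xorInto(lost, u)
        }
        return lost
    }

    func main() {
        // Three equal-sized data units on three disks, parity on a fourth.
        units := [][]byte{[]byte("unit0"), []byte("unit1"), []byte("unit2")}
        p := parity(units)

        // Suppose the disk holding units[1] fails: recover it on the fly.
        recovered := reconstruct(p, [][]byte{units[0], units[2]})
        fmt.Printf("%s\n", recovered) // prints "unit1"
    }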
A RAID offers large improvements in throughput, data integrity, and availability, but it presents potential problems. First, the parity mechanism makes small writes expensive. If all write operations are in units of whole stripes, then it is easy to compute the new parity for each stripe and write it along with the data. This increases the cost of writes by only 1/(N - 1) relative to a system without parity, where N is the number of disks in the array. However, the overhead of small writes is much higher. In order to keep the stripe's parity consistent with its data, it is necessary to read the current value of the data block that is being updated, read the current value of the corresponding parity block, use this information to compute a new parity block, then rewrite both parity and data. This makes small writes in a RAID about four times as expensive as they would be in a disk array without parity, since they require two reads and two writes to complete. Unfortunately, the best size for a striping unit appears to be tens of kilobytes or more [Chen and Patterson 1990], which is larger than the average file size in many environments [Baker et al. 1991; Hartman and Ousterhout 1993]; therefore, writes will often be smaller than a full stripe.
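The four-I/O cost of such a partial-stripe write follows from the parity identity it relies on: the new parity is the old parity XOR the old data XOR the new data, so the old data and old parity must both be read before the new data and new parity can be written. A hedged sketch of just that step; names and types are illustrative, not the paper's code.

    package raid

    // smallWriteParity captures the partial-stripe update described above.
    // The controller reads the old data block and the old parity block,
    // derives the new parity as (old parity XOR old data XOR new data),
    // and then writes back both the new data and the new parity:
    // two reads plus two writes for a single small write.
    func smallWriteParity(oldData, oldParity, newData []byte) []byte {
        newParity := make([]byte, len(oldParity))
        for i := range newParity {
            newParity[i] = oldParity[i] ^ oldData[i] ^ newData[i]
        }
        return newParity
    }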
The second problem with a disk array is that all the disks are attached to a single machine, so its memory and I/O system are likely to be a performance bottleneck. For example, it is possible to attach multiple disks, each with a bandwidth of 1-2 MB/second, to a single SCSI I/O bus, but the SCSI bus has a total bandwidth of only 2-10 MB/second. Additional SCSI buses can be added, but data must be copied from the SCSI channel into memory and from there to a network interface. On the DECstation 5000/200 machines used for the Zebra prototype, these copies to and from the SCSI and network controllers can only proceed at about 6-8 MB/second. The Berkeley RAID project has built a special-purpose memory system with a dedicated high-bandwidth path between the network and the disks [Drapeau et al. 1994], but even this system can support only a few dozen disks at full speed.

The fundamental problem with using a disk array to improve server bandwidth is that the server itself becomes a performance bottleneck. In
order to eliminate the bottlenecks presented by centralized resources, multiple paths must exist between the source or sink of data and the disks so that different paths can be used to reach different disks. For example, this might be done by spreading the disks among different machines on a single very high speed network, or even by using different networks to reach different disks. Unfortunately, this turns the disk array into a distributed system and introduces issues such as who should allocate disk space or compute parity. Nonetheless, this distribution is necessary to avoid the bottleneck presented by having multiple data paths share the same resource. One of our goals for Zebra was to solve these distributed-system problems in a simple and efficient way.

Fig. 2. File-based striping for a large file. The file is divided into stripe fragments that are distributed among the servers. One fragment of each stripe contains the parity of the stripe's contents.
3.2 File-Based Striping in a Network File System

A striped network file system is one that distributes file data over multiple file servers in the same way that a RAID distributes data over multiple disks. This allows several servers to participate in the transfer of a single file. The terminology we use to describe a striped network file system is similar to RAID's: a collection of file data that spans the servers is called a stripe, and the portion of a stripe stored on a single server is called a stripe fragment.

The most obvious way to organize a striped network file system is to stripe each file separately, as shown in Figure 2. We refer to this method as file-based striping. Each file is stored in its own set of stripes. As a result, parity is computed for each file because each stripe contains data from only one file. Although conceptually simple, file-based striping has drawbacks. First, small files are difficult to handle efficiently. If a small file is striped across all the servers, as in Figure 3(a), then each server will only store a very small piece of the file. This provides little performance benefit, since most of the access cost is due to network and disk latency; yet it incurs overhead on every server for every file access. Thus it seems better to handle small files differently than large files and to store each small file on a single
server, as in Figure 3(b). This leads to problems in parity management, however. If a small file is stored on a single server, then its parity will consume as much space as the file itself, resulting in high storage overhead. In addition, the approach in Figure 3(b) can result in unbalanced disk utilization and server loading.

Fig. 3. File-based striping for a small file. In (a) the file is striped evenly across the servers, resulting in small fragments on each server. In (b) the entire file is placed on one server, but the parity requires as much space as the file data.

The second problem with file-based striping is that it requires a parity fragment to be updated each time an existing file block is modified. As with RAIDs, small updates such as this require two reads (the old data and the old parity) followed by two writes (the new data and the new parity). Furthermore, the two writes must be carried out atomically. If one write should complete but not the other (e.g., because a client or server crashed), then the parity will be inconsistent with the data; if this parity is used later for reconstructing lost data, incorrect results will be produced. There exist protocols for ensuring that two writes to two different file servers are carried out atomically [Bernstein and Goodman 1981], but they are complex and expensive.
3.3 Log-Structured File Systems and Log-Based Striping

Zebra uses techniques from log-structured file systems (LFS) [Rosenblum and Ousterhout 1991] to avoid the problems of file-based striping. LFS is a disk management technique that treats the disk like an append-only log. When new file blocks are created or existing file blocks modified, the new data are batched together and written to the end of the log in large sequential transfers. The metadata for the affected files are also updated to reflect the new locations of the file blocks. LFS is particularly effective for writing small files since it can write many files in a single transfer. In contrast, traditional file systems require at least two independent disk transfers for each file. Rosenblum and Ousterhout reported a tenfold speedup over traditional file systems for writing small files. LFS is also well suited for RAIDs because it
batches small writes together into large sequential transfers and thus avoids the expensive parity updates associated with small random writes.

Fig. 4. Log-based striping in Zebra. Each client forms its new file data into a single append-only log and stripes this log across the servers. In this example, File A spans several servers whereas File B is stored entirely on a single server. Parity is computed for the log, not for individual files.

Zebra can be thought of as a log-structured network file system: whereas LFS uses the logging approach at the interface between a file server and its disks, Zebra uses the logging approach at the interface between a client and its servers. Figure 4 illustrates this approach, which we call log-based striping. Each Zebra client organizes its new file data into an append-only log, which it then stripes across the servers. The client computes parity for the log, not for individual files. Each client creates its own log, so that each stripe in the file system contains data written by a single client.

Log-based striping has several advantages over file-based striping. The first is that the servers are used efficiently regardless of file sizes: large writes are striped, allowing them to be completed in parallel, and small writes are batched together and written to the servers in large transfers; no special handling is needed for either case. Second, the parity mechanism is simplified. Each client computes parity for its own log without fear of interactions with other clients. Small files do not have excessive parity overhead because parity is computed for the logs, not individual files. Furthermore, once a stripe is complete, its parity is never updated because file data are not overwritten in place.
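As a rough illustration of log-based striping, the sketch below shows a client appending new file data to a single append-only log, cutting the log into fixed-size fragments, and writing each complete stripe as N-1 data fragments plus one parity fragment. It is a simplified sketch under stated assumptions, not Zebra's implementation: the server call, the fixed parity position, and the omission of partial stripes, deltas, and metadata are all inventions for clarity.

    package zebra

    // Illustrative sketch of log-based striping (not the Zebra sources).

    const fragmentSize = 512 * 1024 // the Zebra prototype uses 512KB fragments

    // storageServer is an assumed interface; Zebra's actual server
    // operations are described in Section 4.
    type storageServer interface {
        storeFragment(stripeSeq, position int, data []byte) error
    }

    type client struct {
        servers   []storageServer // one fragment per server in each stripe
        log       []byte          // the client's append-only log of new file data
        stripeSeq int             // sequence number of the next stripe to write
    }

    // appendData adds newly written file data to the log and flushes every
    // complete stripe to the storage servers.
    func (c *client) appendData(data []byte) error {
        c.log = append(c.log, data...)
        dataPerStripe := fragmentSize * (len(c.servers) - 1)
        for len(c.log) >= dataPerStripe {
            if err := c.writeStripe(c.log[:dataPerStripe]); err != nil {
                return err
            }
            c.log = c.log[dataPerStripe:]
            c.stripeSeq++
        }
        return nil
    }

    // writeStripe sends N-1 data fragments and one parity fragment; the
    // parity is the XOR of the data fragments and, for simplicity, always
    // lands on the last server.
    func (c *client) writeStripe(stripe []byte) error {
        parity := make([]byte, fragmentSize)
        for i := 0; i < len(c.servers)-1; i++ {
            frag := stripe[i*fragmentSize : (i+1)*fragmentSize]
            for j, b := range frag {
                parity[j] ^= b
            }
            if err := c.servers[i].storeFragment(c.stripeSeq, i, frag); err != nil {
                return err
            }
        }
        last := len(c.servers) - 1
        return c.servers[last].storeFragment(c.stripeSeq, last, parity)
    }

The sketch flushes only whole stripes; as described in Section 4, the real system can also write out a fragment in pieces with an append operation when a client must force data to the servers before a full stripe has accumulated.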
The preceding description of log-based striping leaves several questions unanswered. For example, how can files be shared among client workstations if each client is writing its own log? Zebra solves this problem by introducing a central file manager, separate from the storage servers, that manages metadata such as directories and file attributes and supervises interactions between clients. Also, how is free space reclaimed from the logs? Zebra solves this problem with a stripe cleaner, which is analogous to the cleaner in a log-structured file system. Section 4 provides a more detailed discussion of these issues and several others.
4. ZEBRA COMPONENTS

The Zebra file system contains four main components as shown in Figure 5: clients, which are the machines that run application programs; storage servers, which store file data; a file manager, which manages the file and directory structure of the file system; and a stripe cleaner, which reclaims unused space on the storage servers. There may be any number of clients and storage servers but only a single file manager and stripe cleaner. More than one of these components may share a single physical machine; for example, it is possible for one machine to be both a storage server and a client. If care is not taken, the single file manager and stripe cleaner may both be potential single points of failure and performance bottlenecks. Section 7 describes how the system is able to continue operation even if the file manager or stripe cleaner crashes, and the bottleneck issue is addressed in Section 8, which provides performance measurements of the prototype. The remainder of this section describes each of the components in isolation. Section 5 shows how the components work together to implement operations such as reading and writing files, and Sections 6 and 7 describe how Zebra deals with crashes.

We will describe Zebra under the assumption that there are several storage servers, each with a single disk. However, this need not be the case. For example, storage servers could each contain several disks managed as a RAID, thereby giving the appearance to clients of a single disk with higher capacity and throughput. Doing so would also provide additional redundancy: the parity maintained in the RAID would protect against disk failures, whereas the parity maintained by Zebra would protect against server failures as well. It is also possible to put all the disks on a single server; clients would treat it as several logical servers, all implemented by the same physical machine. This approach would still provide many of Zebra's benefits: clients would still batch small files for transfer over the network, and it would still
4.2 Storage Servers

for stripe fragments. As far as a storage server is concerned, a stripe fragment is a large block of bytes with a unique identifier. The identifier for a fragment consists of an identifier for the client that wrote the fragment, a sequence number that identifies the stripe uniquely among all those written by the client, and an offset for the fragment within its stripe. All fragments in Zebra are the same size, which should be chosen large enough to minimize the network and disk overheads of transferring data among the clients and the storage servers. The Zebra prototype uses 512KB fragments.
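For concreteness, the fragment identifier just described might be rendered as the following structure; the field names and types are assumptions for illustration, not Zebra's actual declarations.

    package zebra

    // FragmentSize is the size of every fragment; the prototype uses 512KB.
    const FragmentSize = 512 * 1024

    // FragmentID names a stripe fragment uniquely across the whole system.
    type FragmentID struct {
        Client    uint32 // the client that wrote the fragment
        StripeSeq uint64 // identifies the stripe among that client's stripes
        Position  uint8  // the fragment's offset (position) within its stripe
    }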
Storage servers provide five operations (a code sketch of this interface follows the list):
—Store a fragment: This operation allocates space for the fragment, writes the fragment to disk, and records on disk the fragment identifier and disk location for use in subsequent accesses. The operation is synchronous: it does not complete until the fragment has been safely stored. The fragment must not already exist unless it is a parity fragment, in which case the new copy of the fragment replaces the old. This is done in a nonoverwrite manner to avoid corruption of a parity fragment in the event of a crash.

—Append to an existing fragment: This operation is similar to storing a fragment except that it allows a client to write out a fragment in pieces if it does not have enough data to fill the entire fragment at once (this can happen, for example, if an application invokes the fsync system call to force data to disk). Appends are implemented atomically so that a crash during an append cannot cause the previous contents of the fragment to be lost.

—Retrieve a fragment: This operation returns part or all of the data from a fragment. It is not necessary to read the entire fragment; a fragment identifier, offset, and length specify the desired range of bytes.

—Delete a fragment: This operation is invoked by the stripe cleaner when the fragment no longer contains any useful data. It makes the fragment's disk space available for new fragments.
—Identify fragments: This operation provides information about the fragments stored by the server, such as the most recent fragment written by a client. It is used to find the ends of the clients' logs after a crash.
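Taken together, these five operations form a small storage server interface. The sketch below restates them in code, reusing the FragmentID type sketched earlier; the method signatures are assumptions, since the paper specifies the operations only in prose.

    package zebra

    // StorageServer is an assumed rendering of the five operations above.
    type StorageServer interface {
        // Store synchronously writes a complete fragment and records its
        // identifier; only parity fragments may replace an existing fragment.
        Store(id FragmentID, data []byte) error

        // Append atomically adds data to a partially written fragment, so a
        // crash cannot lose the fragment's previous contents.
        Append(id FragmentID, data []byte) error

        // Retrieve returns length bytes starting at offset within a fragment.
        Retrieve(id FragmentID, offset, length int) ([]byte, error)

        // Delete frees the fragment's disk space; it is invoked by the
        // stripe cleaner once the fragment holds no useful data.
        Delete(id FragmentID) error

        // Identify reports which fragments the server holds, e.g. the most
        // recent fragment written by a client, to find log ends after a crash.
        Identify(client uint32) ([]FragmentID, error)
    }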
Stripes are immutable once they are complete. A stripe may be created with a sequence of append operations, but nonparity fragments are never overwritten, and once the stripe is complete it is never modified except to delete the entire stripe. A parity fragment, however, can be overwritten if data are appended to a partial stripe (see Section 5.2).
4.3 File Manager

The file manager is responsible for all the information in the file system except for file data. We refer to this information as metadata: it includes file attributes such as protection information, block pointers that tell where file data are stored, directories, symbolic links, and special files for I/O devices. The file manager performs all the usual functions of a file server in a network file system, such as name lookup and maintaining the consistency of client file caches. However, the Zebra file manager does not store any file data; where a traditional file server would manipulate data, the Zebra file manager manipulates block pointers. For example, consider a read operation. In a traditional file system, the client requests the data from the file server; in Zebra, the client requests block pointers from the file manager, then it reads the data from the storage servers.
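A rough sketch of this indirection, reusing the FragmentID and StorageServer types sketched earlier: the client asks the file manager for block pointers and then fetches the blocks from the storage servers. The BlockPtr and FileManager types, and the way a fragment is mapped to its server, are assumptions made for illustration.

    package zebra

    // BlockPtr locates one file block inside a client's striped log.
    type BlockPtr struct {
        Fragment FragmentID // which fragment holds the block
        Offset   int        // byte offset of the block within the fragment
        Length   int        // length of the block in bytes
    }

    // FileManager is assumed to resolve a byte range of a file into pointers.
    type FileManager interface {
        BlockPointers(file string, offset, length int) ([]BlockPtr, error)
    }

    // readFile shows the two-step read path: block pointers come from the
    // file manager, file data come from the storage servers. The locate
    // function stands in for whatever policy maps a fragment to its server.
    func readFile(fm FileManager, locate func(FragmentID) StorageServer,
        file string, offset, length int) ([]byte, error) {
        ptrs, err := fm.BlockPointers(file, offset, length)
        if err != nil {
            return nil, err
        }
        var data []byte
        for _, p := range ptrs {
            block, err := locate(p.Fragment).Retrieve(p.Fragment, p.Offset, p.Length)
            if err != nil {
                return nil, err
            }
            data = append(data, block...)
        }
        return data, nil
    }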
In the Zebra prototype, we implemented the file manager using a Sprite file server with a log-structured file system. For each Zebra file there is one file in the file manager's file system, and the "data" in this file are an array of block pointers that indicate where the blocks of data for the Zebra file are stored. This allows Zebra to use almost all the existing Sprite network file protocols without modification. Clients open, read, and ca
