The Zebra Striped Network File System

JOHN H. HARTMAN
University of Arizona
and
JOHN K. OUSTERHOUT
Sun Microsystems Laboratories, Inc.
`
`
Zebra is a network file system that increases throughput by striping the file data across multiple servers. Rather than striping each file separately, Zebra forms all the new data from each client into a single stream, which it then stripes using an approach similar to a log-structured file system. This provides high performance for writes of small files as well as for reads and writes of large files. Zebra also writes parity information in each stripe in the style of RAID disk arrays; this increases storage costs slightly, but allows the system to continue operation while a single storage server is unavailable. A prototype implementation of Zebra, built in the Sprite operating system, provides 4-5 times the throughput of the standard Sprite file system or NFS for large files and a 15-300% improvement for writing small files.
`
Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management—allocation/deallocation strategies; secondary storage; D.4.3 [Operating Systems]: File Systems Management—access methods; distributed file systems; file organization; D.4.5 [Operating Systems]: Reliability—fault-tolerance; D.4.7 [Operating Systems]: Organization and Design—distributed systems; D.4.8 [Operating Systems]: Performance—measurements; E.5 [Data]: Files—organization/structure
General Terms: Design, Measurement, Performance, Reliability
Additional Key Words and Phrases: Log-based striping, log-structured file system, parity computation, RAID
`
`
`1. INTRODUCTION
`
`Zebra is a network file system that uses multiple file servers to provide
`greater throughput and availability than can be achieved with a single
`server. Clients stripe file data across servers so that different pieces of data
are stored on different servers. Striping makes it possible for a single client to keep several servers busy, and it distributes the load among the servers to reduce the likelihood of hot spots. Zebra also stores parity information in each stripe, allowing it to continue operation while any one server is unavailable.

This work was supported in part by NSF grant CCR-89-00029, NASA/ARPA grant NAG2-591, NSF grant MIP-87-15235, ARPA contract N00600-93-C-2481, and the California MICRO Program.
Authors' addresses: J. H. Hartman, Department of Computer Science, Gould-Simpson Building, The University of Arizona, Tucson, AZ 85721; email: jhh@cs.arizona.edu; J. K. Ousterhout, Sun Microsystems Laboratories, Inc., 2550 Garcia Avenue, MS UMTV29-232, Mountain View, CA 94043-1100; email: john.ousterhout@eng.sun.com.
Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.
© 1995 ACM 0734-2071/95/0800-0274 $03.50

In current network file systems the read and write bandwidth for a single file is limited by the performance of a single server, including its memory bandwidth and the speed of its processor, network interface, I/O buses, and disks. It is possible to split a file system among multiple servers, but each file must reside on a single server, and it is difficult to balance the loads of the different servers. For example, system directories often lie on a single server, making that server a hot spot.

In the future, new styles of computing such as multimedia and parallel computation are likely to demand much greater throughput than today's applications, making the limitations of a single server even more severe. For example, a single video playback can consume a substantial fraction of a file server's bandwidth even when the video is compressed. A cluster of workstations can easily exceed the bandwidth of a file server if they all run video applications simultaneously, and the problems will become much worse when video resolution increases with the arrival of HDTV. Another example is parallel applications. Several research groups are exploring the possibility of using collections of workstations connected by high-speed, low-latency networks to run massively parallel applications [Anderson et al. 1995; Freeh et al. 1994]. These "distributed supercomputers" are likely to present I/O loads equivalent to traditional supercomputers, which cannot be handled by today's network file servers.
A striping file system offers the potential to achieve very high performance using collections of inexpensive computers and disks. Several striping file systems have already been built, such as Swift [Cabrera and Long 1991] and Bridge [Dibble et al. 1988]. These systems are similar in that they stripe data within individual files, so that only large files benefit from the striping. Zebra uses a different approach borrowed from log-structured file systems (LFS) [Rosenblum and Ousterhout 1991]. Each client forms its new data for all files into a sequential log that it stripes across the storage servers. This not only improves large-file performance through striping, but it also improves small-file writes by batching them together and writing them to the servers in large, efficient transfers. It also reduces network overhead, simplifies the storage servers, and spreads write traffic uniformly across the servers.

Zebra's style of striping also makes it easy to use redundancy techniques from RAID disk arrays to improve availability and data integrity [Patterson et al. 1988]. One of the fragments of each stripe stores parity for the rest of the stripe, allowing the stripe's data to be reconstructed in the event of a disk or server failure. Zebra can continue operation while a server is unavailable. Even if a server is totally destroyed, Zebra can reconstruct the lost data.

We have constructed a prototype implementation of Zebra as part of the Sprite operating system [Ousterhout et al. 1988]. Although it does not incorporate all the reliability and recovery aspects of the Zebra architecture, it does demonstrate the performance benefits. For reads and writes of large
files the prototype achieves up to 2.6 MB/second for a single client with five servers, which is 4-5 times the throughput of either NFS or the standard Sprite file system on the same hardware. For small files the Zebra prototype improves performance by more than a factor of 3 over NFS. The improvement over Sprite is only about 15%, however. This is because both Zebra and Sprite require the client to notify the file server of file opens and closes, and when writing small files these notifications dominate the running time. With the addition of file name caching to both systems, Zebra is expected to have more of an advantage over Sprite.
The rest of the article is organized as follows. Section 2 describes the computing environment for which Zebra is intended, and the types of failures it is designed to withstand. Section 3 describes the RAID and log-structured file system technologies used in Zebra and introduces Zebra's logging approach. Section 4 describes the structure of Zebra, which consists of clients, storage servers, a file manager, and a stripe cleaner. Section 5 shows how the components of the system work together in normal operation; communication between the components is based on deltas, which describe file block creations, updates, and deletions. Section 6 describes how Zebra restores consistency to its data structures after crashes. Section 7 shows how the system provides service while components are down. Section 8 gives the status of the Zebra prototype and presents some performance measurements. Section 9 discusses related work. Section 10 concludes.
`
`2. ZEBRA APPLICABILITY
`
Zebra makes several assumptions concerning its computing environment and the types of failures that it will withstand. Zebra is designed to support UNIX workloads as found in office and engineering environments. These workloads are characterized by short file lifetimes, sequential file accesses, infrequent write-sharing of files by different clients, and many small files [Baker et al. 1991]. This environment is also notable because of the behavior it does not exhibit: namely, random accesses to files. Zebra is, therefore, designed to handle sequential file accesses well, perhaps at the expense of random file accesses. In particular, this means that Zebra may not be suitable for running database applications, which tend to update and read large files randomly. This is not to say that the Zebra design precludes good performance on such a workload, but that the current design has not been tuned to improve random-access performance.
Zebra is also targeted at high-speed local area networks; it assumes that in a data transfer between a client and server the point-to-point bandwidth of the network is not a bottleneck. Zebra is also not designed to handle network partitions. New point-to-point network architectures, such as ATM, typically include redundant links that reduce the probability of a network partition and make partitions less of a concern in the design of a network file system for use on a local area network.

Zebra also assumes that clients and servers will have large main-memory caches to store file data. These caches serve two purposes: to allow frequently
used data to be buffered and accessed in memory, without requiring an access to the server or the disk, and to buffer newly written file data prior to writing it to the server or the disk. The former filters out accesses to data that are frequently read, whereas the latter filters out short-lived data and allows Zebra to batch together many small writes by application programs into large writes to the servers.

Zebra is designed to provide file service despite the loss of any single machine in the system. Multiple server failures are not handled; the loss of a second server causes the system to cease functioning, and data may be lost if disks fail catastrophically on two servers at the same time. Any number of clients may fail, however, without affecting the availability of file data. A client crash may lose newly written data cached on that client; but it cannot lose data older than a time limit, nor can it lose data written by another client. This is analogous to losing the data stored in a UNIX file system cache when the machine crashes.
`
`3. STRIPING IN ZEBRA

Zebra distributes file data over several file servers while ensuring that the loss of a single server does not affect the availability of the data. To do this, Zebra borrows from two recent innovations in the management of disk storage systems: RAID technology (Redundant Arrays of Inexpensive Disks) [Patterson et al. 1988] and log-structured file systems (LFS) [Rosenblum and Ousterhout 1991]. RAID technology allows Zebra to provide scalable file access performance while using parity instead of redundant copies to guard against server failures. The log-structured approach simplifies the parity implementation, reduces the impact of managing and storing parity, and allows clients to batch together small writes to improve the efficiency of writing to the servers.
`
`3.1 RAID
`
`RAID is a storage system architecture in which many small disks work
`together to provide increased performance and data availability. A RAID
`appears to higher-level software as a single very large and fast disk. Trans-
`fers to or from the disk array are divided into blocks called striping units.
`Consecutive striping units are assigned to different disks in the array as
shown in Figure 1 and can be transferred in parallel. A group of consecutive striping units that spans the array is called a stripe. Large transfers can proceed at the aggregate bandwidth of all the disks in the array, or multiple small transfers can be serviced concurrently by different disks.
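
To make this layout concrete, the short Python sketch below (ours, not the paper's; the four-disk example is arbitrary) maps a logical striping-unit index to a disk and a stripe under round-robin placement.

```python
def locate_striping_unit(unit_index: int, num_disks: int):
    """Round-robin placement: consecutive striping units go to consecutive
    disks, and one unit per disk makes up a stripe."""
    disk = unit_index % num_disks
    stripe = unit_index // num_disks
    return disk, stripe


# With 4 disks, units 0-3 form stripe 0, units 4-7 form stripe 1, and so on.
for unit in range(8):
    disk, stripe = locate_striping_unit(unit, num_disks=4)
    print(f"striping unit {unit} -> disk {disk}, stripe {stripe}")
```
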
Because a RAID has more disks than a traditional disk storage system, disk failures will occur more often. Furthermore, a disk failure anywhere in a RAID can potentially make the entire disk array unusable. To improve data integrity, a RAID reserves one of the striping units within each stripe for parity instead of data (see Figure 1); each bit of the parity striping unit contains the exclusive OR of the corresponding bits of the other striping units in the stripe. If a disk fails, each of its striping units can be recovered using the data and parity from the other striping units of the stripe. The file system can continue operation during recovery by reconstructing data on-the-fly.

[Fig. 1. Striping with parity. The storage space of a RAID disk array is divided into stripes, each stripe containing a striping unit on each disk of the array. All but one of the striping units hold data; the remaining striping unit holds parity information that can be used to recover after a disk failure.]
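
The parity computation and recovery just described reduce to bytewise exclusive OR. The following sketch is illustrative rather than taken from any RAID implementation; it computes a stripe's parity unit and then reconstructs a lost data unit from the survivors.

```python
def xor_blocks(blocks):
    """Bytewise XOR of equal-sized blocks. The same operation computes a
    stripe's parity and reconstructs a missing block from the survivors."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


# A stripe with three equal-sized data striping units plus one parity unit.
data_units = [b"unit-0..", b"unit-1..", b"unit-2.."]
parity = xor_blocks(data_units)

# If the disk holding unit 1 fails, XOR of the parity and the surviving data
# units yields the lost contents.
recovered = xor_blocks([data_units[0], data_units[2], parity])
assert recovered == data_units[1]
```
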
`A RAID offers large improvements in throughput, data integrity, and
`availability, but it presents potential problems. First, the parity mechanism
`makes small writes expensive. If all write operations are in units of whole
`stripes, then it is easy to compute the new parity for each stripe and write it
along with the data. This increases the cost of writes by only 1/(N - 1) relative to a system without parity, where N is the number of disks in the array. However, the overhead of small writes is much higher. In order to keep the stripe's parity consistent with its data, it is necessary to read the current value of the data block that is being updated, read the current value of the corresponding parity block, use this information to compute a new parity block, then rewrite both parity and data. This makes small writes in a RAID about four times as expensive as they would be in a disk array without parity, since they require two reads and two writes to complete. Unfortunately, the best size for a striping unit appears to be tens of kilobytes or more [Chen and Patterson 1990], which is larger than the average file size in many environments [Baker et al. 1991; Hartman and Ousterhout 1993]; therefore, writes will often be smaller than a full stripe.
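
A sketch of that read-modify-write sequence follows. It is our illustration, not RAID code: the in-memory `disks` structure stands in for real disk I/O, and the four commented steps are the two reads and two writes that make small writes expensive.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(disks, data_disk, parity_disk, unit, new_data):
    """Update one striping unit while keeping the stripe's parity consistent."""
    old_data = disks[data_disk][unit]                      # read 1: old data
    old_parity = disks[parity_disk][unit]                  # read 2: old parity
    new_parity = xor(xor(old_parity, old_data), new_data)  # fold the change into parity
    disks[data_disk][unit] = new_data                      # write 1: new data
    disks[parity_disk][unit] = new_parity                  # write 2: new parity


# Three data disks plus a parity disk, each holding one 4-byte unit of stripe 0.
disks = [{0: bytes(4)} for _ in range(4)]
small_write(disks, data_disk=1, parity_disk=3, unit=0, new_data=b"\x0f\x00\x00\x00")
```
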
The second problem with a disk array is that all the disks are attached to a single machine, so its memory and I/O system are likely to be a performance bottleneck. For example, it is possible to attach multiple disks, each with a bandwidth of 1-2 MB/second, to a single SCSI I/O bus, but the SCSI bus has a total bandwidth of only 2-10 MB/second. Additional SCSI buses can be added, but data must be copied from the SCSI channel into memory and from there to a network interface. On the DECstation 5000/200 machines used for the Zebra prototype, these copies to and from the SCSI and network controllers can only proceed at about 6-8 MB/second. The Berkeley RAID project has built a special-purpose memory system with a dedicated high-bandwidth path between the network and the disks [Drapeau et al. 1994], but even this system can support only a few dozen disks at full speed.
The fundamental problem with using a disk array to improve server bandwidth is that the server itself becomes a performance bottleneck. In order to eliminate the bottlenecks presented by centralized resources, multiple paths must exist between the source or sink of data and the disks so that different paths can be used to reach different disks. For example, this might be done by spreading the disks among different machines on a single very high speed network, or even by using different networks to reach different disks. Unfortunately, this turns the disk array into a distributed system and introduces issues such as who should allocate disk space or compute parity. Nonetheless, this distribution is necessary to avoid the bottleneck presented by having multiple data paths share the same resource. One of our goals for Zebra was to solve these distributed-system problems in a simple and efficient way.

[Fig. 2. File-based striping for a large file. The file is divided into stripe fragments that are distributed among the servers. One fragment of each stripe contains the parity of the stripe's contents.]
`
`3.2 File-Based Striping in a Network File System

A striped network file system is one that distributes file data over multiple file servers in the same way that a RAID distributes data over multiple disks. This allows several servers to participate in the transfer of a single file. The terminology we use to describe a striped network file system is similar to RAID's: a collection of file data that spans the servers is called a stripe, and the portion of a stripe stored on a single server is called a stripe fragment.

The most obvious way to organize a striped network file system is to stripe each file separately, as shown in Figure 2. We refer to this method as file-based striping. Each file is stored in its own set of stripes. As a result, parity is computed for each file because each stripe contains data from only one file. Although conceptually simple, file-based striping has drawbacks. First, small files are difficult to handle efficiently. If a small file is striped across all the servers, as in Figure 3(a), then each server will only store a very small piece of the file. This provides little performance benefit, since most of the access cost is due to network and disk latency; yet it incurs overhead on every server for every file access. Thus it seems better to handle
small files differently than large files and to store each small file on a single server, as in Figure 3(b). This leads to problems in parity management, however. If a small file is stored on a single server, then its parity will consume as much space as the file itself, resulting in high storage overhead. In addition, the approach in Figure 3(b) can result in unbalanced disk utilization and server loading.

[Fig. 3. File-based striping for a small file. In (a) the file is striped evenly across the servers, resulting in small fragments on each server. In (b) the entire file is placed on one server, but the parity requires as much space as the file data.]
The second problem with file-based striping is that it requires a parity fragment to be updated each time an existing file block is modified. As with RAIDs, small updates such as this require two reads (the old data and the old parity) followed by two writes (the new data and the new parity). Furthermore, the two writes must be carried out atomically. If one write should complete but not the other (e.g., because a client or server crashed), then the parity will be inconsistent with the data; if this parity is used later for reconstructing lost data, incorrect results will be produced. There exist protocols for ensuring that two writes to two different file servers are carried out atomically [Bernstein and Goodman 1981], but they are complex and expensive.
`
3.3 Log-Structured File Systems and Log-Based Striping
`
Zebra uses techniques from log-structured file systems (LFS) [Rosenblum and Ousterhout 1991] to avoid the problems of file-based striping. LFS is a disk management technique that treats the disk like an append-only log. When new file blocks are created or existing file blocks modified, the new data are batched together and written to the end of the log in large sequential transfers. The metadata for the affected files are also updated to reflect the new locations of the file blocks. LFS is particularly effective for writing small files since it can write many files in a single transfer. In contrast, traditional file systems require at least two independent disk transfers for each file. Rosenblum and Ousterhout reported a tenfold speedup over traditional file systems for writing small files. LFS is also well suited for RAIDs because it batches small writes together into large sequential transfers and thus avoids the expensive parity updates associated with small random writes.

[Fig. 4. Log-based striping in Zebra. Each client forms its new file data into a single append-only log and stripes this log across the servers. In this example, File A spans several servers whereas File B is stored entirely on a single server. Parity is computed for the log, not for individual files.]
Zebra can be thought of as a log-structured network file system: whereas LFS uses the logging approach at the interface between a file server and its disks, Zebra uses the logging approach at the interface between a client and its servers. Figure 4 illustrates this approach, which we call log-based striping. Each Zebra client organizes its new file data into an append-only log, which it then stripes across the servers. The client computes parity for the log, not for individual files. Each client creates its own log, so that each stripe in the file system contains data written by a single client.

Log-based striping has several advantages over file-based striping. The first is that the servers are used efficiently regardless of file sizes: large writes are striped, allowing them to be completed in parallel, and small writes are batched together and written to the servers in large transfers; no special handling is needed for either case. Second, the parity mechanism is simplified. Each client computes parity for its own log without fear of interactions with other clients. Small files do not have excessive parity overhead because parity is computed for the logs, not individual files. Furthermore, once a stripe is complete, its parity is never updated because file data are not overwritten in place.
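
A minimal sketch of this client-side behavior appears below. It is our illustration, not Zebra's code: `store_fragment()` is a hypothetical server call, parity is written to a fixed server for simplicity, and metadata, partial stripes, and crash recovery are ignored.

```python
class LogStripingClient:
    """Batches new file data into an append-only log, stripes full fragments
    across the storage servers, and writes one parity fragment per stripe."""

    def __init__(self, servers, fragment_size=512 * 1024):
        self.servers = servers            # last server holds parity (a simplification)
        self.fragment_size = fragment_size
        self.log_tail = bytearray()       # not-yet-full tail of the client's log
        self.stripe = []                  # data fragments of the stripe in progress
        self.stripe_seq = 0

    def write(self, data: bytes):
        """Append new file data to the log, flushing fragments as they fill."""
        self.log_tail.extend(data)
        while len(self.log_tail) >= self.fragment_size:
            fragment = bytes(self.log_tail[:self.fragment_size])
            del self.log_tail[:self.fragment_size]
            self._flush(fragment)

    def _flush(self, fragment: bytes):
        position = len(self.stripe)                    # which server gets this fragment
        self.servers[position].store_fragment((self.stripe_seq, position), fragment)
        self.stripe.append(fragment)
        if len(self.stripe) == len(self.servers) - 1:  # stripe full: compute parity
            parity = bytearray(self.fragment_size)
            for frag in self.stripe:
                for i, byte in enumerate(frag):
                    parity[i] ^= byte
            self.servers[-1].store_fragment((self.stripe_seq, "parity"), bytes(parity))
            self.stripe = []
            self.stripe_seq += 1
```
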
The preceding description of log-based striping leaves several questions unanswered. For example, how can files be shared among client workstations if each client is writing its own log? Zebra solves this problem by introducing a central file manager, separate from the storage servers, that manages metadata such as directories and file attributes and supervises interactions between clients. Also, how is free space reclaimed from the logs? Zebra solves this problem with a stripe cleaner, which is analogous to the cleaner in a log-structured file system. Section 4 provides a more detailed discussion of these issues and several others.
`
`
`4. ZEBRA COMPONENTS
`
The Zebra file system contains four main components as shown in Figure 5: clients, which are the machines that run application programs; storage servers, which store file data; a file manager, which manages the file and directory structure of the file system; and a stripe cleaner, which reclaims unused space on the storage servers. There may be any number of clients and storage servers but only a single file manager and stripe cleaner. More than one of these components may share a single physical machine; for example, it is possible for one machine to be both a storage server and a client. If care is not taken, the single file manager and stripe cleaner may both be potential single points of failure and performance bottlenecks. Section 7 describes how the system is able to continue operation even if the file manager or stripe cleaner crashes, and the bottleneck issue is addressed in Section 8, which provides performance measurements of the prototype. The remainder of this section describes each of the components in isolation. Section 5 shows how the components work together to implement operations such as reading and writing files, and Sections 6 and 7 describe how Zebra deals with crashes.

We will describe Zebra under the assumption that there are several storage servers, each with a single disk. However, this need not be the case. For example, storage servers could each contain several disks managed as a RAID, thereby giving the appearance to clients of a single disk with higher capacity and throughput. Doing so would also provide additional redundancy: the parity maintained in the RAID would protect against disk failures, whereas the parity maintained by Zebra would protect against server failures as well. It is also possible to put all the disks on a single server; clients would treat it as several logical servers, all implemented by the same physical machine. This approach would still provide many of Zebra's benefits: clients would still batch small files for transfer over the network, and it would still
`
`
`
for stripe fragments. As far as a storage server is concerned, a stripe fragment is a large block of bytes with a unique identifier. The identifier for a fragment consists of an identifier for the client that wrote the fragment, a sequence number that identifies the stripe uniquely among all those written by the client, and an offset for the fragment within its stripe. All fragments in Zebra are the same size, which should be chosen large enough to minimize the network and disk overheads of transferring data among the clients and the storage servers. The Zebra prototype uses 512KB fragments.
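
Modeled as a data structure, the identifier might look like the sketch below; the field names are ours, not Zebra's.

```python
from dataclasses import dataclass

FRAGMENT_SIZE = 512 * 1024  # 512KB, as in the Zebra prototype


@dataclass(frozen=True)
class FragmentID:
    """Uniquely names a stripe fragment."""
    client_id: int   # the client that wrote the fragment
    stripe_seq: int  # identifies the stripe among all stripes written by that client
    offset: int      # the fragment's position within its stripe


# Example: the third fragment of the 42nd stripe written by client 7.
frag = FragmentID(client_id=7, stripe_seq=42, offset=2)
```
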
`Storage servers provide five operations:
`
—Store a fragment: This operation allocates space for the fragment, writes the fragment to disk, and records on disk the fragment identifier and disk location for use in subsequent accesses. The operation is synchronous: it does not complete until the fragment has been safely stored. The fragment must not already exist unless it is a parity fragment, in which case the new copy of the fragment replaces the old. This is done in a nonoverwrite manner to avoid corruption of a parity fragment in the event of a crash.

—Append to an existing fragment: This operation is similar to storing a fragment except that it allows a client to write out a fragment in pieces if it does not have enough data to fill the entire fragment at once (this can happen, for example, if an application invokes the fsync system call to force data to disk). Appends are implemented atomically so that a crash during an append cannot cause the previous contents of the fragment to be lost.

—Retrieve a fragment: This operation returns part or all of the data from a fragment. It is not necessary to read the entire fragment; a fragment identifier, offset, and length specify the desired range of bytes.

—Delete a fragment: This operation is invoked by the stripe cleaner when the fragment no longer contains any useful data. It makes the fragment's disk space available for new fragments.
`
—Identify fragments: This operation provides information about the fragments stored by the server, such as the most recent fragment written by a client. It is used to find the ends of the clients' logs after a crash.
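
Taken together, the five operations amount to the following interface; the Python signatures are an illustrative paraphrase of the list above, not Zebra's actual RPC definitions.

```python
from abc import ABC, abstractmethod


class StorageServer(ABC):
    """Abstract view of a Zebra storage server (illustrative signatures only)."""

    @abstractmethod
    def store(self, frag_id, data: bytes) -> None:
        """Allocate space and synchronously write a new fragment, recording its
        identifier and disk location; only parity fragments may be replaced."""

    @abstractmethod
    def append(self, frag_id, data: bytes) -> None:
        """Atomically add data to a partially written fragment (e.g. on fsync)."""

    @abstractmethod
    def retrieve(self, frag_id, offset: int, length: int) -> bytes:
        """Return the requested range of bytes from a fragment."""

    @abstractmethod
    def delete(self, frag_id) -> None:
        """Free the fragment's disk space (invoked by the stripe cleaner)."""

    @abstractmethod
    def identify(self, client_id):
        """Describe stored fragments, e.g. the most recent one written by the
        given client, so the ends of client logs can be found after a crash."""
```
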
`
`Stripes are immutable once they are complete. A stripe may be created
`with a sequence of append operations, but nonparity fragments are never
`overwritten, and once the stripe is complete it is never modified except to
`delete the entire stripe. A parity fragment, however, can be overwritten if
data are appended to a partial stripe (see Section 5.2).
`
`
`4.3 File Manager

The file manager is responsible for all the information in the file system except for file data. We refer to this information as metadata: it includes file attributes such as protection information, block pointers that tell where file data are stored, directories, symbolic links, and special files for I/O devices. The file manager performs all the usual functions of a file server in a network file system, such as name lookup and maintaining the consistency of client file caches. However, the Zebra file manager does not store any file data; where a traditional file server would manipulate data, the Zebra file manager manipulates block pointers. For example, consider a read operation. In a traditional file system, the client requests the data from the file server; in Zebra, the client requests block pointers from the file manager, then it reads the data from the storage servers.
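
The read path just described might look like the sketch below; `get_block_pointer()` and `retrieve()` are hypothetical stand-ins for Zebra's actual protocols, and returning the server index along with the block pointer is a simplification.

```python
def read_file_block(file_manager, storage_servers, file_id, block_num, block_size=4096):
    """Read one block of a Zebra file: ask the file manager where the block
    lives, then fetch the bytes from the storage server that holds it."""
    # The file manager holds only metadata: the block pointer names the
    # fragment containing the block and the block's offset within it.
    server_index, frag_id, offset = file_manager.get_block_pointer(file_id, block_num)
    # The data themselves come from a storage server, not the file manager.
    return storage_servers[server_index].retrieve(frag_id, offset, block_size)
```
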
In the Zebra prototype, we implemented the file manager using a Sprite file server with a log-structured file system. For each Zebra file there is one file in the file manager's file system, and the "data" in this file are an array of block pointers that indicate where the blocks of data for the Zebra file are stored. This allows Zebra to use almost all the existing Sprite network file protocols without modification. Clients open, read, and ca