by

Reagan Moore, Chaitan Baru, Amarnath Gupta, Bertram Ludaescher,
Richard Marciano, Arcot Rajasekar

San Diego Supercomputer Center
San Diego, California

Submitted to
National Archives and Records Administration

June 1999

DISTRIBUTION STATEMENT A
Approved for Public Release
Distribution Unlimited

Any opinions, findings, and conclusions or recommendations expressed in this
publication are those of the authors and do not necessarily reflect the views of the
National Archives and Records Administration or others supporting the San Diego
Supercomputer Center.

Table of Contents

1. INTRODUCTION

2. TECHNICAL ISSUES

2.1 MANAGING CONTEXT
2.2 MANAGING PERSISTENCE
2.3 MANAGING SCALABILITY
2.4 MANAGING HETEROGENEITY OF DATA RESOURCES

3. IMPLEMENTATION STRATEGY

3.1 GENERAL ARCHITECTURE
3.1.1 Archive
3.1.2 Data Handling System
3.1.3 Collection Management
3.2 PERSISTENT ARCHIVE ASSESSMENT
3.2.1 Usage Models
3.2.2 Operational Systems

4. COLLECTION SUPPORT, GENERAL REQUIREMENTS

4.1 COLLECTION PROCESS DEFINITION
4.2 SUMMARY INFORMATION ACROSS ALL COLLECTIONS

5. COLLECTION SUPPORT - E-MAIL POSTINGS

5.1 "LONG-TERM PRESERVATION" INFORMATION MODEL
5.2 INGESTION PROCESS
5.3 STORAGE REQUIREMENTS
5.4 DATA ACCESS REQUIREMENTS
5.5 LONG TERM PRESERVATION REQUIREMENTS

6. COLLECTION SUPPORT - TIGER/LINE '92 (CENSUS BUREAU)

6.1 INFORMATION MODEL
6.2 INGESTION PROCESS

7. COLLECTION SUPPORT - 104TH CONGRESS

7.1 INFORMATION MODEL
7.2 INGESTION PROCESS
7.3 STORAGE REQUIREMENTS

8. COLLECTION SUPPORT - VOTE ARCHIVE DEMO 1997 (VAD97)

8.1 INFORMATION MODEL
8.2 INGESTION PROCESS
8.3 LONG TERM PRESERVATION REQUIREMENTS

9. COLLECTION SUPPORT - ELECTRONIC ARCHIVING PROJECT (EAP)

9.1 INFORMATION MODEL
9.2 INGESTION PROCESS

10. COLLECTION SUPPORT - COMBAT AREA CASUALTIES CURRENT FILE (CACCF)

10.1 INFORMATION MODEL
10.2 INGESTION PROCESS
10.3 STORAGE REQUIREMENTS

11. COLLECTION SUPPORT - PATENT DATA (USPTO)

12. COLLECTION SUPPORT - IMAGE COLLECTION (AMICO)

12.1 INFORMATION MODEL
12.2 INGESTION PROCESS
12.3 DATA ACCESS REQUIREMENTS

13. COLLECTION SUPPORT - JITC COLLECTION

13.1 INFORMATION MODEL
13.2 INGESTION PROCESS

14. REMAINING TECHNICAL ISSUES

14.1 RESEARCH OPPORTUNITIES
14.2 RESEARCH AND DEVELOPMENT TASKS
14.3 SUMMARY

REFERENCES

APPENDIX A: E-MAIL POSTINGS

APPENDIX B: TIGER/LINE '92 ADDITIONAL INFORMATION

APPENDIX C: 104TH CONGRESS

APPENDIX D: VAD97

APPENDIX E: EAP

APPENDIX F: VIETNAM

APPENDIX G: AMICO

APPENDIX H: JITC

Abstract

The preservation of digital information for long periods of time is becoming feasible
through the integration of archival storage technology from supercomputer centers,
information models from the digital library community, and preservation models from the
archival community. The supercomputer centers provide the technology needed to store
the immense amounts of digital data that are being created, while the digital library
community provides the mechanisms to define the context needed to interpret the data.
The coordination of these technologies with preservation and management policies
defines the infrastructure for a collection-based persistent archive [1]. This report
demonstrates the feasibility of maintaining digital data for hundreds of years through
detailed prototyping of persistent archives for nine different data collections.

1. Introduction

Supercomputer centers, digital libraries, and archival storage communities have common
persistent archival storage requirements. Each of these communities is building software
infrastructure to organize and store large collections of data. An emerging common
requirement is the ability to maintain data collections for long periods of time. The
challenge is to maintain the ability to discover, access, and display digital objects that are
stored within the archive, while the technology used to manage the archive evolves. We
have implemented an approach based upon the storage of the digital objects that comprise
the collection, augmented with the meta-data attributes needed to dynamically recreate
the data collection. This approach builds upon the technology needed to support
extensible database schema, which in turn enables the creation of data handling systems
that interconnect legacy storage systems.

The long-term storage and access of digital information is a major challenge for federal
agencies. The rapid change of technology, which results in obsolescence of storage
media, coupled with the very large volumes of data (terabytes to petabytes in size),
appears to make the problem intractable. The concern is that when the data storage
technology becomes obsolete, the time needed to migrate to new technology may exceed
the lifetime of the hardware and software systems that are being used. This is
exacerbated by the need to be able to retrieve information from the archived data. The
organization of the data into collections must also be preserved in the face of rapidly
changing technology. Thus each collection must be migrated forward in time onto new
management systems, simultaneously with the migration of the individual data objects
onto new media. The ultimate goal is to maintain not only the bits associated with the
original data, but also the context that permits the data to be interpreted. In this paper we
present a scalable architecture for managing media migration, and an information model
for managing migration of the structure of the context. For relational databases, the
information model includes the schema for organizing attributes and the data dictionary
for defining semantics. For hierarchical databases, the information model includes a
representation of the hierarchical structure along with the data dictionary.

We rely on the use of collections to define the context to associate with digital data. The
context is defined through the creation of hierarchical representations for both the digital
objects and the associated data collection. Each digital object is maintained as a tagged
structure that includes the original bytes of data, as well as attributes that have been
defined as relevant for the data collection. The collection context is defined through use
of both hierarchical and relational representations for organizing the collection attributes.
By using infrastructure independent representations, the original context for the archived
data can be maintained. A collection-based persistent archive is therefore one in which
the organization of the collection is archived simultaneously with the digital objects that
comprise the collection [1].
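
As an illustration of what such a tagged structure might look like, the sketch
below wraps a hypothetical e-mail message, together with collection-defined
attributes, into a single tagged digital object. This is a minimal sketch for
illustration only; the element names are invented, and the actual information
models used for each collection are described in the collection sections and
appendices.

    # Illustrative sketch: a tagged digital object that preserves the original
    # bytes of an e-mail message alongside collection-defined attributes.
    # Element names are hypothetical.
    import xml.etree.ElementTree as ET

    def tag_email(raw_message: bytes, sender: str, date: str, subject: str) -> bytes:
        obj = ET.Element("DigitalObject")
        attrs = ET.SubElement(obj, "Attributes")
        ET.SubElement(attrs, "Sender").text = sender
        ET.SubElement(attrs, "Date").text = date
        ET.SubElement(attrs, "Subject").text = subject
        # The original bytes are carried verbatim inside the tagged structure.
        # (A real ingestion script would base64-encode arbitrary binary data.)
        ET.SubElement(obj, "OriginalData").text = raw_message.decode("latin-1")
        return ET.tostring(obj)

    print(tag_email(b"Hello, world", "moore@sdsc.edu", "1999-06-01", "test").decode())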

A persistent collection requires the ability to dynamically recreate the collection on new
technology. For a solution, we consider the integration of scalable archival storage
technology from supercomputer centers, infrastructure independent information models
from the digital library community, and preservation models from the archival
community. An infrastructure that supports the continuous migration of both the digital
objects and the data collections is needed. Scalable archival storage systems are used to
ensure that sufficient resources are available for continual migration of digital objects to
new media. The software systems that interpret the infrastructure independent
representation for the collections are based upon generic digital library systems, and are
migrated explicitly to new platforms. In this system, the original representation of the
digital objects and of the collections does not change. The maintenance of the persistent
archive is then achieved through application of archivist policies that govern the rate of
migration of the objects and the collection instantiation software.

The goal is to preserve digital information for at least 400 years. This report examines
the technical issues that must be addressed, evaluates possible implementations, assigns
metrics for success, and examines business models for managing a collection-based
persistent archive. The applicability of the results is demonstrated by examination of
nine different data collections, provided by the USPTO and other federal/state agencies.
The report is organized into sections to provide a description of the scaling issues, a
generic description of the technology, a comparative synopsis of the nine collections, and
detailed descriptions of the approaches that were used for each collection.
2. Technical Issues

The preservation of the context to associate with digital objects is the dominant issue for
collection-based persistent archives. The context is traditionally defined through
specification of attributes that are associated with each digital object. The context is also
defined through the implied relationships that exist between the attributes, and the
preferred organization of the attributes in user interfaces for accessing the data collection.
We identify three levels of context that must be preserved:

• Digital object representation. For semi-structured data, the organization of the
  components must be specified, as well as the elements that are used to define the
  attributes to associate with the collection. An example is a multi-media digital
  object that has associated text, images, and video. A hierarchical representation is
  needed to define the relationship between the components, including elements
  consisting of tagged meta-data attributes.
• Data collection representation. The collection also has an implied organization,
  which today may be specified through a relational schema or a hierarchical
  structure. A schema is used to support relational queries of the attributes or
  meta-data. It is possible to reorganize a collection into multiple tables to improve
  access by building new indexes, and in the more general case, by adding
  attributes. The structure used to define the collection attributes can be different
  from the structure used to specify a digital object within the collection. While
  relational representations can be used today, in the future, alternate
  representations may be based upon hierarchical or multi-dimensional algorithms.
• Presentation representation. The user interface to the collection can present an
  organization of the collection attributes that is tuned to meet the needs of a
  particular community. Researchers may need access to all of the meta-data
  attributes, while students are interested in a subset. The structure used to define
  the user interface again can be different from the schema used for the collection
  organization. Each of these presentations represents a different view of the
  collection. Re-creation of the original view of a collection may or may not be
  possible.

Digital objects are used to encapsulate each data set. Collections are used to organize the
context for the digital objects. Presentation interfaces are the structure through which
collection interactions are defined. The challenge is to preserve all three levels of context
for each collection.

2.1 Managing Context

Management of the collection context is made difficult by the rapid change of
technology. Software systems used to manage collections are changing on a five- to
ten-year time scale. It is possible to make a copy of a database through a vendor-specific
dump or backup routine. The copy can then be written into an archive for long-term
storage. This approach fails when the database is retrieved from storage, as the database
software may no longer exist. The archivist is then faced with migrating the data
collection onto a new database system. Since this can happen for every data collection,
the archivist will have to continually transform the entire archive. A better approach is
needed.

An infrastructure independent representation is required for the collection that can be
maintained for the life of the collection. If possible, a common information model should
be used to define the hierarchical structures associated with the digital objects, the
collection organization, and the presentation interface. An emerging standard is the
Extensible Markup Language (XML) [2]. XML provides an information model for
describing hierarchical structures through use of nested elements. Elements are
essentially tagged pieces of data. A Document Type Definition (DTD) provides the
particular hierarchical organization that is associated with a given document or digital
object. The XML Style Sheet Language (XSL) can be used to define the presentation
style to associate with a DTD. It is possible to use multiple style sheets for a given DTD.
This provides the flexibility needed to represent the context for the user interface into a
collection, as well as the structure of the digital objects within the collection.

Although DTDs were originally applied to documents, they are now being applied to
arbitrary digital objects, including the collections themselves. XML DTDs can be used to
define the structure of digital objects, specify inheritance properties of digital objects,
and define the collection organization and user interface structure. DTDs can also be
used to define the structure of highly regular data or semi-structured data. Thus DTDs
are a strong candidate for a uniform information model.

While XML DTDs provide a tagged structure for organizing information, the semantic
meaning of the tags used within a DTD is arbitrary, and depends upon the collection. A
data dictionary is needed for each collection to define the semantics. A persistent
collection therefore needs the following components to define the context:

• Data dictionary for collection semantics,
• DTD for digital object hierarchical structure,
• DTD for collection hierarchical structure,
• DTD for user interface structure,
• XSL style sheets for presentation of each DTD.
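
To make these components concrete, the sketch below gives a fragment of a
hypothetical DTD for a digital object structure, a small data dictionary for
the collection semantics, and a conformance check of a tagged object against
the DTD. The element names and descriptions are invented for illustration,
and the validation step uses the third-party lxml package rather than any
software named in this report.

    # Illustrative sketch: DTD, data dictionary, and a conformance check.
    from io import StringIO
    from lxml import etree

    DTD_TEXT = """
    <!ELEMENT DigitalObject (Attributes, OriginalData)>
    <!ELEMENT Attributes (Sender, Date, Subject)>
    <!ELEMENT Sender (#PCDATA)>
    <!ELEMENT Date (#PCDATA)>
    <!ELEMENT Subject (#PCDATA)>
    <!ELEMENT OriginalData (#PCDATA)>
    """

    # The data dictionary defines the semantics of each tag for this collection.
    DATA_DICTIONARY = {
        "Sender": "e-mail address of the message originator",
        "Date": "date the message was posted (ISO 8601)",
        "Subject": "subject line of the message",
    }

    RECORD = """
    <DigitalObject>
      <Attributes>
        <Sender>moore@sdsc.edu</Sender>
        <Date>1999-06-01</Date>
        <Subject>archive test</Subject>
      </Attributes>
      <OriginalData>Hello, world</OriginalData>
    </DigitalObject>
    """

    dtd = etree.DTD(StringIO(DTD_TEXT))
    print(dtd.validate(etree.fromstring(RECORD)))  # True if the record conforms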

2.2 Managing Persistence

Persistence is achieved by providing the ability to dynamically reconstruct a data
collection on new technology. While the software tools that do the reconstruction have to
be ported to work with each new hardware platform or database, the collection can
remain in its original format within an archive. The choice of the appropriate standard
for the information model is vital for minimizing the support requirements for a
collection-based persistent archive. The goal is to store the digital objects comprising the
collection and the collection context in an archive a single time. This is possible if any
changes to the information model standard contain a superset of the prior information
model. The knowledge required to manipulate a prior version of the information model
can then be encapsulated in the software system that is used to reconstruct the collection.
With this caveat, the persistent collection never needs to be modified, and can be held as
infrastructure independent bit-files in an archive.

The re-creation or instantiation of the data collection is done with a software program that
uses the DTDs that define the digital object and collection structure to generate the
collection. While the current prototypes rely on a different collection instantiation
program for each collection, the goal is to build a generic program that works with any
DTD. This will reduce the effort required to support dynamic reconstruction of a
persistent data collection to the maintenance of a single software system.
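
A generic instantiation program is beyond the scope of a short sketch, but the
core step can be pictured as follows: parse each archived, tagged record and
load its attributes into whatever database technology is current. The sketch
below is illustrative only; it uses an in-memory SQLite database as a
stand-in for the actual target database, and the record layout and table
names are invented.

    # Illustrative sketch: instantiating a collection from archived XML records.
    import sqlite3
    import xml.etree.ElementTree as ET

    RECORDS = [
        "<DigitalObject><Attributes><Sender>moore@sdsc.edu</Sender>"
        "<Date>1999-06-01</Date><Subject>archive test</Subject></Attributes>"
        "<OriginalData>Hello, world</OriginalData></DigitalObject>",
    ]

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE email (sender TEXT, date TEXT, subject TEXT, body TEXT)")
    for text in RECORDS:
        obj = ET.fromstring(text)
        attrs = obj.find("Attributes")
        db.execute(
            "INSERT INTO email VALUES (?, ?, ?, ?)",
            (attrs.findtext("Sender"), attrs.findtext("Date"),
             attrs.findtext("Subject"), obj.findtext("OriginalData")),
        )

    # The re-created collection now supports relational queries over attributes.
    print(db.execute("SELECT sender, subject FROM email").fetchall())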

Maintaining persistent digital objects requires the ability to migrate data to new media.
The reasons for continuing to refresh the media on which the collection is maintained are:

• Avoid loss of data because of the finite lifetime and resulting degradation of the
  media.
• Minimize storage costs. New media typically store at least twice as much data as
  the prior version, usually at the same cost per cartridge. Thus migration to new
  media results in the need for half as many cartridges, decreased floor space, and
  decreased operating costs for managing the cartridges. Note that for this scenario,
  the media costs for a continued migration will remain bounded, and will be less
  than twice the original media cost (see the short derivation after this list). The
  dominant cost to support a continued migration onto new media is the operational
  support needed to handle the media.
• Maximize the ability to handle exponentially increasing data growth. Many data
  collections are doubling in size in time periods shorter than a year. This means
  the effort to read the entire collection for migration to new media will be less than
  the effort to store the new data that is being collected within that year. Migration
  to higher density media is the only way to keep the number of cartridges to a
  manageable level.
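
The boundedness claim in the second item follows from a geometric series. A
short derivation, assuming each media generation doubles in density at a
constant cost per cartridge:

    % If the original media copy of the collection costs C, and each migration
    % moves it onto media with twice the density at the same cost per cartridge,
    % then the k-th migration costs C / 2^k. After n migrations the total media
    % cost is
    \sum_{k=0}^{n} \frac{C}{2^{k}} \;=\; C\,(2 - 2^{-n}) \;<\; 2C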

To facilitate migration and access, supercomputer centers keep all data in tape robots.
For currently available tape (cartridges holding 20 GB to 50 GB of data), a single tape
robot is able to store 120 terabytes to 300 terabytes of uncompressed data. By year 2003,
a single tape robot is expected to hold 6000 terabytes, using 1-terabyte capacity
cartridges. The storage of petabytes (thousands of terabytes) of data is now feasible.
Given that the collection context and the digital objects can be migrated to new media,
the remaining system that must be migrated is the archival storage system itself. The
software that controls the tape archive is composed of databases to store the storage
location and name of each data set, logging systems to track the completion of
transactions, and bitfile movers for accessing the storage peripherals. Of these
components, the most critical resource is the database or nameserver directory that is
used to manage the names and locations of the data sets. At the San Diego
Supercomputer Center, the migration of the nameserver directory to a new system has
been done twice, from the DataTree archival storage system to the UniTree archival
storage system, and from UniTree to the IBM High Performance Storage System [3].
Each migration required the read of the old directory, and the ingestion of each data set
into the new system. In Table 2-1, the times and number of data sets migrated are listed.
Note that even though the number of files dramatically increased, the time required for
the migration decreased. This reflects advances in vendor supplied systems for managing
the name space. Based on this experience, it is possible to migrate to new archival
storage systems without loss of data.

System Migration        Number of files   Time (days)
DataTree to UniTree     4 million         4
UniTree to HPSS         7 million         1

Table 2-1. Migration of archival storage system nameserver directory

One advantage of archival storage systems is their ability to manage the data movement
independently from the use of the data. Each time the archival storage system was
upgraded, the new version of the archive was built with a driver that allowed tapes to be
read from the old system. Thus migration of data between the archival storage systems
could be combined with migration onto new media, minimizing the number of times a
tape had to be read.

The creation of a persistent collection can be viewed as the design of a system that
supports the independent migration of each internal hardware and software component to
new technology. Management of the migration process then becomes one of the major
tasks for the archivist.

2.3 Managing Scalability

A persistent archive can be expected to increase in size through either addition of new
collections, or extensions to existing collections. Hence the architecture must be
scalable, supporting growth in the total amount of archived data, the number of archived
data sets, the number of digital objects, the number of collections, and the number of
accesses per day. These requirements are similar to the demands that are placed on
supercomputer center archival storage systems. We propose a scalable solution that uses
supercomputer technology, based on the use of parallel applications running on parallel
computers.

Archival storage systems are used to manage the storage media and the migration to new
media. Database management systems are used to manage the collections. Web servers
are used to manage access to the system. A scalable system is built by identifying both
the capabilities that are best provided by each component, and the constraints that are
implicit within each technology. Interfaces are then constructed between the components
to match the data flow through the architecture to the available capabilities. Table 2-2
lists the major constraints that the architecture must manage for a scalable system to be
possible.

Component    Capability                Constraints
Archive      Massive storage           Latency of access
                                       Number of data sets
Database     Large number of objects   Access optimization
                                       Query language
Web Server   Ubiquitous access         Presentation format
                                       Storage capacity

Table 2-2. Architecture Components

Archival storage systems excel at storing large amounts of data on tape, but at the cost of
relatively slow access times. The time to retrieve a tape from within a tape silo, mount
the tape into a tape drive, and ready the tape for reading is on the order of 15-20 seconds
for current tape silos. The time required to spin the tape forward to the position of the
desired file is on the order of 1-2 minutes. The total time can be doubled if the tape drive
is already in use. Thus the access time to data on tape can be 2-4 minutes. To overcome
this high latency, data is transferred in large blocks, such that the time it takes to transfer
the data set over a communication channel is comparable to the access latency time. For
current tape peripherals, which read at rates from 10 MB/sec to 15 MB/sec, the average
data set size in an archive should be on the order of 500 MB to 1 GB. Since digital
objects can be of arbitrary size, containers are used to aggregate digital objects before
storage into the archive.
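
As a rough consistency check on the 500 MB to 1 GB figure, equate the transfer
time with the access latency; the target data set size is then the transfer
rate multiplied by the latency budget:

    % With a transfer budget on the order of one minute, of the same order as
    % the 1-2 minute tape positioning time quoted above:
    \text{size} \approx \text{rate} \times \text{latency}:
    \quad 10~\text{MB/s} \times 60~\text{s} = 600~\text{MB},
    \qquad 15~\text{MB/s} \times 60~\text{s} = 900~\text{MB}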

The second constraint that must be managed for archives is the minimization of the
number of data sets that are seen by the archive. Current archival storage nameservers
are able to manage on the order of 10-40 million data sets. If each data set size is on the
order of 500 MB, the archive can manage about 10 petabytes of data (10,000 TBs, or 10
million GBs). Archival storage systems provide a scalable solution only if containers are
used to aggregate digital objects into large data sets. The total number of digital objects
that can be managed is on the order of 40 billion, if one thousand digital objects are
aggregated into each container.

Databases excel at supporting large numbers of records. Note that the Transaction
Processing Performance Council TPC-D benchmark [4] measures performance of
relational databases on decision support queries for database sizes ranging from 1
gigabyte up to 3 terabytes and from 6 million to 18 billion rows. Each row can represent
a separate digital object. With object relational database systems, a binary large object or
BLOB can be associated with each row. The BLOBs can reside either internally within
the database, or within an external file system. In the latter case, handles are used to
point to the location of the BLOB. The use of handles makes it feasible to aggregate
digital objects within containers. Multiple types of container technology are available for
aggregating digital objects. Aggregation can be done at the file level, using utilities such
as the TAR program, at the database level through database tablespaces, or at an
intermediate data handling level through use of software caches. All three approaches are
demonstrated in the persistent collection prototyping efforts described in section 4. The
database maintains the information needed to describe each object, as well as the location
of the object within a container and the location of the container within the storage
system. A data handling system is used to support database access to archival storage.
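
The file-level form of this aggregation can be pictured with a short sketch:
digital objects are packed into a container in the standard tar format, while a
catalog records a handle (container name plus member name) for each object.
The names and the in-memory catalog are invented for illustration; in the
actual architecture the catalog role is played by the collection database.

    # Illustrative sketch: container aggregation with handles.
    import io
    import tarfile

    catalog = {}  # object name -> (container, member) handle

    def build_container(container_name: str, objects: dict) -> None:
        with tarfile.open(container_name, "w") as tar:
            for name, data in objects.items():
                info = tarfile.TarInfo(name=name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
                catalog[name] = (container_name, name)

    def fetch(object_name: str) -> bytes:
        container, member = catalog[object_name]
        # In the full system, the container would first be staged from the
        # archive to a disk cache by the data handling system.
        with tarfile.open(container) as tar:
            return tar.extractfile(member).read()

    build_container("container-0001.tar", {"email/0001": b"Hello, world"})
    print(fetch("email/0001"))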

Queries are done across the attributes stored within each record. The time needed to
respond to a query is optimized by constructing indexes across the database tables. This
can reduce the time needed to do a query by a factor of a thousand, at the cost of the
storage space for the index, and the time spent in assembling the index. Persistent
collections may be maintained on disk to support interactive access, or they may be
stored in the archive, and rebuilt on disk when a need arises. If the collection is
reassembled out of the archive, the dominant time needed for the process may be the
time spent creating a new index. Since archival storage space is cheap, it may be
preferable to keep both infrastructure independent and infrastructure dependent
representations of a collection. The time needed to load a pre-indexed database snapshot
is a small fraction of the time that it would take to reassemble and index a collection.
The database snapshot, of course, assumes that the database software technology is still
available for interpreting the database snapshot. For data collections that are frequently
accessed, the database snapshot may be worth maintaining.

The presentation of information for frequently accessed collections requires Web servers
to handle the user load. Servers function well for data sets that are stored on local disk.
In order to access data that reside within an archive, a data handling system is needed to
transfer data from the archive to the Web server. Otherwise the size of the accessible
collection may be limited to the size of the Web server disk cache. Web servers are
available that distribute their load across multiple CPUs of a parallel computer, with
parallel servers managing over 10 million accesses per day.

Web servers provide a variety of user interfaces to support queries and information
discovery. The preservation of the user interface requires a way to capture an
infrastructure independent representation for the query construction and information
presentation. Web servers are available that retrieve information from databases for
presentation. What is needed is the software that provides the ability to reconstruct the
original view of the collection, based upon a description of the collection attributes. Such
technology is demonstrated as part of the collection instantiation process.
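
One way to picture such reconstruction: given only an infrastructure
independent list of collection attributes, a small program can regenerate a
query interface on whatever presentation technology is current. The sketch
below is illustrative only and is not the instantiation software used in the
prototypes; the attribute names are invented.

    # Illustrative sketch: regenerating an HTML query form from a description
    # of the collection attributes.
    ATTRIBUTES = [
        ("sender", "Sender of the message"),
        ("date", "Date the message was posted"),
        ("subject", "Subject line"),
    ]

    def query_form(attributes):
        rows = "\n".join(
            '  <label>%s: <input name="%s"></label><br>' % (label, name)
            for name, label in attributes
        )
        return '<form action="/query">\n%s\n  <input type="submit">\n</form>' % rows

    print(query_form(ATTRIBUTES))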

2.4 Managing Heterogeneity of Data Resources

A persistent archive is inherently composed of heterogeneous resources. As technology
evolves, both old and new versions of the software and hardware infrastructure will be
present at the same time. An issue that must be managed is the ability to access data that
is present on multiple storage systems, each with possibly different access protocols. A
variant of this requirement is the ability to access data within an archive from a database
that may expect data to reside on a local disk file system. Data handling systems provide
the ability to interconnect archives with databases and with Web servers. Thus the more
general form of the persistent archive architecture uses a data handling system to tie each
component together. At the San Diego Supercomputer Center, a particular
implementation of a data handling system has been developed, called the Storage
Resource Broker (SRB) [5]. A picture of the SRB architecture is shown in Figure 2-1 to
illustrate the required components.

[Figure 2-1. SDSC Storage Resource Broker Architecture. Applications connect
to the SRB server, which brokers access to distributed storage resources:
database systems (DB2, Oracle, Illustra, ObjectStore), archival storage
systems (HPSS, UniTree), UNIX file systems, and ftp.]

The SRB supports the protocol conversion needed for an application to access data within
either a database, file system, or archive. The heterogeneous nature of the data storage
systems is hidden by the uniform access API provided by the SRB. This makes it
possible for any component of the architecture to be modified, whether archive, database,
or Web server. The SRB server uses a different driver for each type of storage resource.
The information about which driver to use for access to a particular data set is maintained
in the associated Meta-data Catalog (MCAT) [6,7]. The MCAT system is a database
containing information about each data set that is stored in the data storage systems.
New versions of a storage system are accessed by a new driver written for the SRB.
Thus the application is able to use a persistent interface, even while the storage
technology changes over time.
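
The driver pattern behind such a data handling system can be sketched as
follows. Applications call one uniform read interface; a catalog maps each
data set to the driver for the storage system that holds it, which is the role
the MCAT plays for the SRB. The class and method names below are invented for
illustration and are not the actual SRB interfaces.

    # Illustrative sketch: a uniform API over heterogeneous storage drivers.
    from abc import ABC, abstractmethod

    class StorageDriver(ABC):
        @abstractmethod
        def read(self, name: str) -> bytes: ...

    class UnixFileDriver(StorageDriver):
        def read(self, name: str) -> bytes:
            with open(name, "rb") as f:
                return f.read()

    class ArchiveDriver(StorageDriver):
        def read(self, name: str) -> bytes:
            # A real driver would stage the data set from tape through the
            # archive's own access protocol before returning it.
            raise NotImplementedError("archive protocol goes here")

    class DataHandlingSystem:
        def __init__(self, catalog):
            self.catalog = catalog  # data set name -> driver (the MCAT role)

        def read(self, name: str) -> bytes:
            return self.catalog[name].read(name)

    # Supporting a new storage system requires only a new driver; the
    # application keeps the same persistent interface while technology changes.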

3. Implementation Strategy

A collection-based persistent archive can be assembled using a scalable architecture. The
scalable architecture relies upon parallel hardware and software technology that is
commercially available. The persistent archive requires the integration of three separate
components (archival storage, collection management, and access servers) through the
use of a data handling system. The result is a system that can be modified to build upon
new technology on an incremental basis. For a persistent archive to work within this
migration environment, the data context must be maintained in an infrastructure
independent representation. The technology to instantiate the collection will have to be
migrated forward in time, along with the data handling system. The collection can be
kept as bit-files within the archive, while the supporting hardware and software systems
evolve.

3.1 General Architecture

The implementation of a persistent archive at SDSC is based upon use of commercially
available software systems, augmented by application level software developed at the San
Diego Supercomputer Center. The architecture software components are:

• Archival storage system - IBM High Performance Storage System (HPSS) [3]
• Data handling system - SDSC Storage Resource Broker (SRB) [5]
• Object relational database - Oracle version 7.3, IBM DB2 Universal Database
• Collection management software - SDSC Meta-data Catalog (MCAT) [6,7]
• Collection instantiation software - SDSC scripts
• Collection ingestion software - SDSC scripts
• Hierarchical data model - Extensible Markup Language Document Type
  Definition [2]
• Relational data model - ANSI SQL Data Definition Language [8]
• DTD manipulation software - UCSD XML Matching and Structuring language
  (XMAS) [9]
• Web server - Apache Web server
• Presentation system - Internet Explorer version 5

The hardware components are:

• Archival storage system - IBM SP 8-node, 32-processor parallel computer, 180
  TB of tape storage, three Storage Technology tape robots, and 1.6 TB of RAID
  disk cache
• Data management system - Sun Enterprise 4-processor parallel computer
• Data ingestion platform - SGI workstation
• Network interconnect - Ethernet, FDDI, and HiPPI

Each of these systems is scalable, and can be implemented using parallel computing
technology. The efficiency of the archival storage system is critically dependent upon the
use of containers for aggregating data before storage. Three different mechanisms have
been tried at SDSC:

• Unix utilities. The TAR utility can be used to aggregate files. For container
  sizes of 100 MB, the additional disk space required is minimal. The
  disadvantages are that the container must be read from the archive and
  unpacked before data sets are accessed.
• Database tablespace. At SDSC, a prototype version of the DB2 UDB [10]
  parallel object-relational database has been used to support large data
  collections. The prototype database stores the digital objects internally within
  tablespaces. The tablespaces can be stored within the HPSS archival storage
  system, and retrieved to a disk cache on demand. This effectively increases
  the database storage capacity to the size of the archive, while simultaneously
  aggregating digital objects into containers before storage in the archive.
• Data handling so



