Collection-Based Long-Term Preservation

by

Reagan Moore, Chaitan Baru, Amarnath Gupta, Bertram Ludaescher, Richard Marciano, Arcot Rajasekar

San Diego Supercomputer Center
San Diego, California

Submitted to
National Archives and Records Administration

June 1999

DISTRIBUTION STATEMENT A
Approved for Public Release
Distribution Unlimited

Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Archives and Records Administration or others supporting the San Diego Supercomputer Center.

Table of Contents

1. INTRODUCTION

2. TECHNICAL ISSUES
2.1 MANAGING CONTEXT
2.2 MANAGING PERSISTENCE
2.3 MANAGING SCALABILITY
2.4 MANAGING HETEROGENEITY OF DATA RESOURCES

3. IMPLEMENTATION STRATEGY
3.1 GENERAL ARCHITECTURE
3.1.1 Archive
3.1.2 Data Handling System
3.1.3 Collection Management
3.2 PERSISTENT ARCHIVE ASSESSMENT
3.2.1 Usage Models
3.2.2 Operational Systems

4. COLLECTION SUPPORT, GENERAL REQUIREMENTS
4.1 COLLECTION PROCESS DEFINITION
4.2 SUMMARY INFORMATION ACROSS ALL COLLECTIONS

5. COLLECTION SUPPORT - E-MAIL POSTINGS
5.1 "LONG-TERM PRESERVATION" INFORMATION MODEL
5.2 INGESTION PROCESS
5.3 STORAGE REQUIREMENTS
5.4 DATA ACCESS REQUIREMENTS
5.5 LONG TERM PRESERVATION REQUIREMENTS

6. COLLECTION SUPPORT - TIGER/LINE '92 (CENSUS BUREAU)
6.1 INFORMATION MODEL
6.2 INGESTION PROCESS

7. COLLECTION SUPPORT - 104TH CONGRESS
7.1 INFORMATION MODEL
7.2 INGESTION PROCESS
7.3 STORAGE REQUIREMENTS

8. COLLECTION SUPPORT - VOTE ARCHIVE DEMO 1997 (VAD97)
8.1 INFORMATION MODEL
8.2 INGESTION PROCESS
8.3 LONG TERM PRESERVATION REQUIREMENTS

9. COLLECTION SUPPORT - ELECTRONIC ARCHIVING PROJECT (EAP)
9.1 INFORMATION MODEL
9.2 INGESTION PROCESS

10. COLLECTION SUPPORT - COMBAT AREA CASUALTIES CURRENT FILE (CACCF)
10.1 INFORMATION MODEL
10.2 INGESTION PROCESS
10.3 STORAGE REQUIREMENTS

11. COLLECTION SUPPORT - PATENT DATA (USPTO)

12. COLLECTION SUPPORT - IMAGE COLLECTION (AMICO)
12.1 INFORMATION MODEL
12.2 INGESTION PROCESS
12.3 DATA ACCESS REQUIREMENTS

13. COLLECTION SUPPORT - JITC COLLECTION
13.1 INFORMATION MODEL
13.2 INGESTION PROCESS

14. REMAINING TECHNICAL ISSUES
14.1 RESEARCH OPPORTUNITIES
14.2 RESEARCH AND DEVELOPMENT TASKS
14.3 SUMMARY

REFERENCES

APPENDIX A: E-MAIL POSTINGS
APPENDIX B: TIGER/LINE '92 ADDITIONAL INFORMATION
APPENDIX C: 104TH CONGRESS
APPENDIX D: VAD97
APPENDIX E: EAP
APPENDIX F: VIETNAM
APPENDIX G: AMICO
APPENDIX H: JITC

Abstract

The preservation of digital information for long periods of time is becoming feasible through the integration of archival storage technology from supercomputer centers, information models from the digital library community, and preservation models from the archivist's community. The supercomputer centers provide the technology needed to store the immense amounts of digital data that are being created, while the digital library community provides the mechanisms to define the context needed to interpret the data. The coordination of these technologies with preservation and management policies defines the infrastructure for a collection based persistent archive [1]. This report demonstrates the feasibility of maintaining digital data for hundreds of years through detailed prototyping of persistent archives for nine different data collections.

1. Introduction

Supercomputer centers, digital libraries, and archival storage communities have common persistent archival storage requirements. Each of these communities is building software infrastructure to organize and store large collections of data. An emerging common requirement is the ability to maintain data collections for long periods of time. The challenge is to maintain the ability to discover, access, and display digital objects that are stored within the archive, while the technology used to manage the archive evolves. We have implemented an approach based upon the storage of the digital objects that comprise the collection, augmented with the meta-data attributes needed to dynamically recreate the data collection. This approach builds upon the technology needed to support extensible database schema, which in turn enables the creation of data handling systems that interconnect legacy storage systems.

The long-term storage and access of digital information is a major challenge for federal agencies. The rapid change of technology, resulting in obsolescence of storage media, coupled with the very large volumes of data (terabytes to petabytes in size), appears to make the problem intractable. The concern is that when the data storage technology becomes obsolete, the time needed to migrate to new technology may exceed the lifetime of the hardware and software systems that are being used. This is exacerbated by the need to be able to retrieve information from the archived data. The organization of the data into collections must also be preserved in the face of rapidly changing technology. Thus each collection must be migrated forward in time onto new management systems, simultaneously with the migration of the individual data objects onto new media. The ultimate goal is to maintain not only the bits associated with the original data, but also the context that permits the data to be interpreted. In this report we present a scalable architecture for managing media migration, and an information model for managing migration of the structure of the context. For relational databases, the information model includes the schema for organizing attributes and the data dictionary for defining semantics. For hierarchical databases, the information model includes a representation of the hierarchical structure along with the data dictionary.

We rely on the use of collections to define the context to associate with digital data. The context is defined through the creation of hierarchical representations for both the digital objects and the associated data collection. Each digital object is maintained as a tagged structure that includes the original bytes of data, as well as attributes that have been defined as relevant for the data collection. The collection context is defined through use of both hierarchical and relational representations for organizing the collection attributes. By using infrastructure independent representations, the original context for the archived data can be maintained. A collection-based persistent archive is therefore one in which the organization of the collection is archived simultaneously with the digital objects that comprise the collection [1].

A persistent collection requires the ability to dynamically recreate the collection on new technology. For a solution, we consider the integration of scalable archival storage technology from supercomputer centers, infrastructure independent information models from the digital library community, and preservation models from the archivist's community. An infrastructure that supports the continuous migration of both the digital objects and the data collections is needed. Scalable archival storage systems are used to ensure that sufficient resources are available for continual migration of digital objects to new media. The software systems that interpret the infrastructure independent representation for the collections are based upon generic digital library systems, and are migrated explicitly to new platforms. In this system, the original representation of the digital objects and of the collections does not change. The maintenance of the persistent archive is then achieved through application of archivist policies that govern the rate of migration of the objects and the collection instantiation software.

The goal is to preserve digital information for at least 400 years. This report examines the technical issues that must be addressed, evaluates possible implementations, assigns metrics for success, and examines business models for managing a collection-based persistent archive. The applicability of the results is demonstrated by examination of nine different data collections, provided by the USPTO and other federal/state agencies. The report is organized into sections to provide a description of the scaling issues, a generic description of the technology, a comparative synopsis of the nine collections, and detailed descriptions of the approaches that were used for each collection.

2. Technical Issues

The preservation of the context to associate with digital objects is the dominant issue for collection-based persistent archives. The context is traditionally defined through specification of attributes that are associated with each digital object. The context is also defined through the implied relationships that exist between the attributes, and the preferred organization of the attributes in user interfaces for accessing the data collection. We identify three levels of context that must be preserved:

• Digital object representation. For semi-structured data, the organization of the components must be specified, as well as the elements that are used to define the attributes to associate with the collection. An example is a multi-media digital object that has associated text, images, and video. A hierarchical representation is needed to define the relationship between the components, including elements consisting of tagged meta-data attributes.
• Data collection representation. The collection also has an implied organization, which today may be specified through a relational schema or a hierarchical structure. A schema is used to support relational queries of the attributes or meta-data. It is possible to reorganize a collection into multiple tables to improve access by building new indexes, and in the more general case, by adding attributes. The structure used to define the collection attributes can be different from the structure used to specify a digital object within the collection. While relational representations can be used today, in the future, alternate representations may be based upon hierarchical or multi-dimensional algorithms.
• Presentation representation. The user interface to the collection can present an organization of the collection attributes that is tuned to meet the needs of a particular community. Researchers may need access to all of the meta-data attributes, while students are interested in a subset. The structure used to define the user interface again can be different from the schema used for the collection organization. Each of these presentations represents a different view of the collection. Re-creation of the original view of a collection may or may not be possible.

Digital objects are used to encapsulate each data set. Collections are used to organize the context for the digital objects. Presentation interfaces are the structure through which collection interactions are defined. The challenge is to preserve all three levels of context for each collection.

2.1 Managing Context

Management of the collection context is made difficult by the rapid change of technology. Software systems used to manage collections are changing on a five- to ten-year time scale. It is possible to make a copy of a database through a vendor-specific dump or backup routine. The copy can then be written into an archive for long-term storage. This approach fails when the database is retrieved from storage, as the database software may no longer exist. The archivist is then faced with migrating the data collection onto a new database system. Since this can happen for every data collection, the archivist will have to continually transform the entire archive. A better approach is needed.

An infrastructure independent representation is required for the collection that can be maintained for the life of the collection. If possible, a common information model should be used to define the hierarchical structures associated with the digital objects, the collection organization, and the presentation interface. An emerging standard is the Extensible Markup Language (XML) [2]. XML provides an information model for describing hierarchical structures through use of nested elements. Elements are essentially tagged pieces of data. A Document Type Definition (DTD) provides the particular hierarchical organization that is associated with a given document or digital object. The XML Style Sheet Language (XSL) can be used to define the presentation style to associate with a DTD. It is possible to use multiple style sheets for a given DTD. This provides the flexibility needed to represent the context for the user interface into a collection, as well as the structure of the digital objects within the collection.

Although DTDs were originally applied to documents, they are now being applied to arbitrary digital objects, including the collections themselves. XML DTDs can be used to define the structure of digital objects, specify inheritance properties of digital objects, and define the collection organization and user interface structure. DTDs can also be used to define the structure of highly regular data or semi-structured data. Thus DTDs are a strong candidate for a uniform information model.

While XML DTDs provide a tagged structure for organizing information, the semantic meaning of the tags used within a DTD is arbitrary, and depends upon the collection. A data dictionary is needed for each collection to define the semantics. A persistent collection therefore needs the following components to define the context (a sketch follows the list):
• Data dictionary for collection semantics,
• DTD for digital object hierarchical structure,
• DTD for collection hierarchical structure,
• DTD for user interface structure,
• XSL style sheets for presentation of each DTD.

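To make these components concrete, the following is a minimal sketch for a hypothetical e-mail collection: a DTD fragment describing the digital object structure, and a short Python routine that extracts the collection attributes from a tagged record. The element names (message, sender, date, subject, body) are illustrative assumptions, not the DTDs used in the prototypes described later in this report.

    import xml.etree.ElementTree as ET

    # Illustrative DTD for the digital object structure (hypothetical tags).
    EMAIL_DTD = """
    <!ELEMENT message (sender, date, subject, body)>
    <!ELEMENT sender  (#PCDATA)>
    <!ELEMENT date    (#PCDATA)>
    <!ELEMENT subject (#PCDATA)>
    <!ELEMENT body    (#PCDATA)>
    """

    # A tagged digital object: the original bytes (body) plus the attributes
    # that have been defined as relevant for the collection.
    record = """
    <message>
      <sender>jdoe@example.gov</sender>
      <date>1999-06-01</date>
      <subject>Quarterly report</subject>
      <body>The original text of the posting is preserved here.</body>
    </message>
    """

    root = ET.fromstring(record)
    # Pull out the attributes that the collection schema will organize.
    attributes = {child.tag: child.text for child in root}
    print(attributes["sender"], attributes["date"])

The data dictionary would then record, for example, that "sender" holds the originating mail address; the tag names alone carry no fixed semantics.
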
2.2 Managing Persistence

Persistence is achieved by providing the ability to dynamically reconstruct a data collection on new technology. While the software tools that do the reconstruction have to be ported to work with each new hardware platform or database, the collection can remain in its original format within an archive. The choice of the appropriate standard for the information model is vital for minimizing the support requirements for a collection-based persistent archive. The goal is to store the digital objects comprising the collection and the collection context in an archive a single time. This is possible if each new version of the information model standard contains a superset of the prior information model. The knowledge required to manipulate a prior version of the information model can then be encapsulated in the software system that is used to reconstruct the collection. With this caveat, the persistent collection never needs to be modified, and can be held as infrastructure independent bit-files in an archive.

The re-creation or instantiation of the data collection is done with a software program that uses the DTDs that define the digital object and collection structure to generate the collection. While the current prototypes rely on a different collection instantiation program for each collection, the goal is to build a generic program that works with any DTD. This will reduce the effort required to support dynamic reconstruction of a persistent data collection to the maintenance of a single software system.

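As a rough illustration of what such a generic program does, the sketch below reads infrastructure-independent tagged records and rebuilds a queryable table on whatever database is current. SQLite stands in for the target database, and the record format is hypothetical; the point is that the column set is derived from the records rather than hand-coded per collection.

    import sqlite3
    import xml.etree.ElementTree as ET

    RECORDS = """
    <collection>
      <message><sender>a@example.gov</sender><date>1999-06-01</date></message>
      <message><sender>b@example.gov</sender><date>1999-06-02</date></message>
    </collection>
    """

    root = ET.fromstring(RECORDS)
    # Derive the column set from the tags present in the records, standing in
    # for a generic program driven by the collection DTD.
    columns = sorted({child.tag for rec in root for child in rec})

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE message (%s)" % ", ".join(c + " TEXT" for c in columns))
    for rec in root:
        db.execute("INSERT INTO message VALUES (%s)" % ",".join("?" * len(columns)),
                   [rec.findtext(c) for c in columns])

    print(db.execute("SELECT sender FROM message WHERE date > '1999-06-01'").fetchall())
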
Maintaining persistent digital objects requires the ability to migrate data to new media. The reasons for continuing to refresh the media on which the collection is maintained are:
• Avoid loss of data because of the finite lifetime and resulting degradation of the media.
• Minimize storage costs. New media typically store at least twice as much data as the prior version, usually at the same cost per cartridge. Thus migration to new media results in the need for half as many cartridges, decreased floor space, and decreased operating costs for managing the cartridges. Note that for this scenario, the media costs for a continued migration will remain bounded, and will be less than twice the original media cost (a worked example follows this list). The dominant cost to support a continued migration onto new media is the operational support needed to handle the media.
• Maximize the ability to handle exponentially increasing data growth. Many data collections are doubling in size in time periods shorter than a year. This means the effort to read the entire collection for migration to new media will be less than the effort to store the new data that is being collected within that year. Migration to higher density media is the only way to keep the number of cartridges to a manageable level.

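The bound on media costs follows from a geometric series: if each media generation doubles in density at the same per-cartridge price, the cartridge cost of repeated migration sums to C + C/2 + C/4 + ... < 2C. A quick check, with an illustrative starting cost:

    # Cartridge cost of repeated migration when each generation doubles density
    # at the same per-cartridge price. The starting cost is illustrative.
    original_cost = 100_000.0
    total, cost = 0.0, original_cost
    for generation in range(10):
        total += cost
        cost /= 2.0  # next generation needs half as many cartridges
    print(f"cumulative cost after 10 migrations: {total:,.0f}")
    print(f"bound (twice the original cost):     {2 * original_cost:,.0f}")
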
To facilitate migration and access, supercomputer centers keep all data in tape robots. For currently available tape (cartridges holding 20 GB to 50 GB of data), a single tape robot is able to store 120 terabytes to 300 terabytes of uncompressed data. By year 2003, a single tape robot is expected to hold 6000 terabytes, using 1-terabyte capacity cartridges. The storage of petabytes (thousands of terabytes) of data is now feasible. Given that the collection context and the digital objects can be migrated to new media, the remaining system that must be migrated is the archival storage system itself. The software that controls the tape archive is composed of databases to store the storage location and name of each data set, logging systems to track the completion of transactions, and bitfile movers for accessing the storage peripherals. Of these components, the most critical resource is the database or nameserver directory that is used to manage the names and locations of the data sets. At the San Diego Supercomputer Center, the migration of the nameserver directory to a new system has been done twice, from the DataTree archival storage system to the UniTree archival storage system, and from UniTree to the IBM High Performance Storage System [3]. Each migration required the read of the old directory, and the ingestion of each data set into the new system. In Table 2-1, the times and number of data sets migrated are listed. Note that even though the number of files dramatically increases, the time required for the migration decreased. This reflects advances in vendor-supplied systems for managing the name space. Based on this experience, it is possible to migrate to new archival storage systems without loss of data.

System Migration       Number of files   Time (days)
DataTree to UniTree    4 million         4
UniTree to HPSS        7 million         1

Table 2-1. Migration of archival storage system nameserver directory

One advantage of archival storage systems is their ability to manage the data movement independently from the use of the data. Each time the archival storage system was upgraded, the new version of the archive was built with a driver that allowed tapes to be read from the old system. Thus migration of data between the archival storage systems could be combined with migration onto new media, minimizing the number of times a tape had to be read.

The creation of a persistent collection can be viewed as the design of a system that supports the independent migration of each internal hardware and software component to new technology. Management of the migration process then becomes one of the major tasks for the archivist.

2.3 Managing Scalability

A persistent archive can be expected to increase in size through either addition of new collections, or extensions to existing collections. Hence the architecture must be scalable, supporting growth in the total amount of archived data, the number of archived data sets, the number of digital objects, the number of collections, and the number of accesses per day. These requirements are similar to the demands that are placed on supercomputer center archival storage systems. We propose a scalable solution that uses supercomputer technology, based on the use of parallel applications running on parallel computers.

Archival storage systems are used to manage the storage media and the migration to new media. Database management systems are used to manage the collections. Web servers are used to manage access to the system. A scalable system is built by identifying both the capabilities that are best provided by each component, and the constraints that are implicit within each technology. Interfaces are then constructed between the components to match the data flow through the architecture to the available capabilities. Table 2-2 lists the major constraints that the architecture must manage for a scalable system to be possible.

Component    Capability                 Constraints
Archive      Massive storage            Latency of access; number of data sets
Database     Large number of objects    Access optimization; query language
Web Server   Ubiquitous access          Presentation format; storage capacity

Table 2-2. Architecture Components

Archival storage systems excel at storing large amounts of data on tape, but at the cost of relatively slow access times. The time to retrieve a tape from within a tape silo, mount the tape into a tape drive, and ready the tape for reading is on the order of 15-20 seconds for current tape silos. The time required to spin the tape forward to the position of the desired file is on the order of 1-2 minutes. The total time can be doubled if the tape drive is already in use. Thus the access time to data on tape can be 2-4 minutes. To overcome this high latency, data is transferred in large blocks, such that the time it takes to transfer the data set over a communication channel is comparable to the access latency time. For current tape peripherals, which read at rates from 10 MB/sec to 15 MB/sec, the average data set size in an archive should be on the order of 500 MB to 1 GB. Since digital objects can be of arbitrary size, containers are used to aggregate digital objects before storage into the archive.

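The 500 MB to 1 GB figure follows directly from matching transfer time to access latency; a quick check of the arithmetic:

    # Container sizing rule of thumb: transfer time should be comparable to the
    # 2-4 minute tape access latency quoted above.
    for size_mb in (500, 1000):
        for rate_mb_s in (10, 15):
            t = size_mb / rate_mb_s
            print(f"{size_mb} MB at {rate_mb_s} MB/sec -> {t:.0f} sec transfer")

At these sizes the transfer takes roughly 30-100 seconds, the same order as the positioning latency, so the drive spends a reasonable fraction of its time actually moving data.
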
The second constraint that must be managed for archives is the minimization of the number of data sets that are seen by the archive. Current archival storage nameservers are able to manage on the order of 10-40 million data sets. If each data set size is on the order of 500 MB, the archive can manage about 10 petabytes of data (10,000 TBs, or 10 million GBs). Archival storage systems provide a scalable solution only if containers are used to aggregate digital objects into large data sets. The total number of digital objects that can be managed is on the order of 40 billion, if one thousand digital objects are aggregated into each container.

Databases excel at supporting large numbers of records. Note that the Transaction Processing Performance Council's TPC-D benchmark [4] measures performance of relational databases on decision support queries for database sizes ranging from 1 gigabyte up to 3 terabytes and from 6 million to 18 billion rows. Each row can represent a separate digital object. With object-relational database systems, a binary large object or BLOB can be associated with each row. The BLOBs can reside either internally within the database, or within an external file system. In the latter case, handles are used to point to the location of the BLOB. The use of handles makes it feasible to aggregate digital objects within containers. Multiple types of container technology are available for aggregating digital objects. Aggregation can be done at the file level, using utilities such as the TAR program, at the database level through database tablespaces, or at an intermediate data handling level through use of software caches. All three approaches are demonstrated in the persistent collection prototyping efforts described in section 4. The database maintains the information needed to describe each object, as well as the location of the object within a container and the location of the container within the storage system. A data handling system is used to support database access to archival storage.

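The following is a minimal sketch of the file-level aggregation approach, assuming TAR containers and a simple handle table; the container naming and handle layout are illustrative, not the SDSC container format.

    import io
    import tarfile

    # Pack small digital objects into one container sized for the archive.
    objects = {f"object_{i:04d}.txt": b"digital object %d" % i for i in range(3)}

    container = io.BytesIO()
    with tarfile.open(fileobj=container, mode="w") as tar:
        for name, data in objects.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

    # Handle table: what an object-relational row stores instead of the BLOB
    # itself -- the container name and the object's location within it.
    handles = {name: ("container_0001.tar", name) for name in objects}

    # Retrieval: fetch the container from the archive once, then read members.
    container.seek(0)
    with tarfile.open(fileobj=container, mode="r") as tar:
        _, member_name = handles["object_0002.txt"]
        print(tar.extractfile(member_name).read())
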
Queries are done across the attributes stored within each record. The time needed to respond to a query is optimized by constructing indexes across the database tables. This can reduce the time needed to do a query by a factor of a thousand, at the cost of the storage space for the index, and the time spent in assembling the index. Persistent collections may be maintained on disk to support interactive access, or they may be stored in the archive, and rebuilt on disk when a need arises. If the collection is reassembled from out of the archive, the dominant time needed for the process may be the time spent creating a new index. Since archival storage space is cheap, it may be preferable to keep both infrastructure independent and infrastructure dependent representations of a collection. The time needed to load a pre-indexed database snapshot is a small fraction of the time that it would take to reassemble and index a collection. The database snapshot, of course, assumes that the database software technology is still available for interpreting the database snapshot. For data collections that are frequently accessed, the database snapshot may be worth maintaining.

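A small demonstration of the indexing trade-off, using SQLite as a stand-in for the collection database (the table and attribute names are hypothetical):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, sender TEXT, date TEXT)")
    db.executemany("INSERT INTO objects (sender, date) VALUES (?, ?)",
                   [("user%d@example.gov" % (i % 100), "1999-06-%02d" % (i % 28 + 1))
                    for i in range(10_000)])

    # Building the index costs time and space up front; this is the dominant
    # cost when a collection is reassembled out of the archive. A pre-indexed
    # snapshot skips this step.
    db.execute("CREATE INDEX idx_sender ON objects (sender)")

    # The attribute query now uses the index instead of scanning every row.
    plan = db.execute("EXPLAIN QUERY PLAN SELECT * FROM objects "
                      "WHERE sender = 'user7@example.gov'").fetchall()
    print(plan)
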
The presentation of information for frequently accessed collections requires Web servers to handle the user load. Servers function well for data sets that are stored on local disk. In order to access data that reside within an archive, a data handling system is needed to transfer data from the archive to the Web server. Otherwise the size of the accessible collection may be limited to the size of the Web server disk cache. Web servers are available that distribute their load across multiple CPUs of a parallel computer, with parallel servers managing over 10 million accesses per day.

Web servers provide a variety of user interfaces to support queries and information discovery. The preservation of the user interface requires a way to capture an infrastructure independent representation for the query construction and information presentation. Web servers are available that retrieve information from databases for presentation. What is needed is the software that provides the ability to reconstruct the original view of the collection, based upon a description of the collection attributes. Such technology is demonstrated as part of the collection instantiation process.

2.4 Managing Heterogeneity of Data Resources

A persistent archive is inherently composed of heterogeneous resources. As technology evolves, both old and new versions of the software and hardware infrastructure will be present at the same time. An issue that must be managed is the ability to access data that is present on multiple storage systems, each with possibly different access protocols. A variant of this requirement is the ability to access data within an archive from a database that may expect data to reside on a local disk file system. Data handling systems provide the ability to interconnect archives with databases and with Web servers. Thus the more general form of the persistent archive architecture uses a data handling system to tie each component together. At the San Diego Supercomputer Center, a particular implementation of a data handling system has been developed, called the Storage Resource Broker (SRB) [5]. A picture of the SRB architecture is shown in Figure 2-1 to illustrate the required components.

[Figure 2-1. SDSC Storage Resource Broker Architecture. Applications access the SRB Server, which brokers requests to distributed storage resources: database systems (DB2, Oracle, Illustra, ObjectStore), archival storage systems (HPSS, UniTree), UNIX file systems, and ftp.]

The SRB supports the protocol conversion needed for an application to access data within either a database, file system, or archive. The heterogeneous nature of the data storage systems is hidden by the uniform access API provided by the SRB. This makes it possible for any component of the architecture to be modified, whether archive, database, or Web server. The SRB Server uses a different driver for each type of storage resource. Information about which driver to use for access to a particular data set is maintained in the associated Meta-data Catalog (MCAT) [6,7]. The MCAT system is a database containing information about each data set that is stored in the data storage systems. New versions of a storage system are accessed by a new driver written for the SRB. Thus the application is able to use a persistent interface, even while the storage technology changes over time.

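A minimal sketch of the driver idea follows, with a catalog lookup choosing the driver per data set. The class and catalog layout are hypothetical simplifications; the real SRB and MCAT interfaces are considerably richer.

    from abc import ABC, abstractmethod

    class StorageDriver(ABC):
        """Uniform access API; one concrete driver per storage resource type."""
        @abstractmethod
        def read(self, path: str) -> bytes: ...

    class FileSystemDriver(StorageDriver):
        def read(self, path: str) -> bytes:
            with open(path, "rb") as f:
                return f.read()

    class ArchiveDriver(StorageDriver):
        def read(self, path: str) -> bytes:
            # Stand-in for staging a file from HPSS or UniTree before reading.
            raise NotImplementedError("stage from tape, then read")

    DRIVERS = {"unix": FileSystemDriver(), "hpss": ArchiveDriver()}

    # MCAT-like catalog: data set name -> (resource type, physical location).
    CATALOG = {"mail/msg_0001": ("unix", "/data/mail/msg_0001")}

    def srb_read(data_set: str) -> bytes:
        resource, path = CATALOG[data_set]   # which driver, and where
        return DRIVERS[resource].read(path)  # caller never sees the difference

Supporting a new storage system then means adding one driver and new catalog entries; applications keep calling the same read interface.
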
3. Implementation Strategy

A collection based persistent archive can be assembled using a scalable architecture. The scalable architecture relies upon parallel hardware and software technology that is commercially available. The persistent archive requires the integration of three separate components: archival storage, collection management, and access servers, through the use of a data handling system. The result is a system that can be modified to build upon new technology on an incremental basis. For a persistent archive to work within this migration environment, the data context must be maintained in an infrastructure independent representation. The technology to instantiate the collection will have to be migrated forward in time, along with the data handling system. The collection can be kept as bit-files within the archive, while the supporting hardware and software systems evolve.

3.1 General Architecture

The implementation of a persistent archive at SDSC is based upon use of commercially available software systems, augmented by application-level software developed at the San Diego Supercomputer Center. The architecture software components are:
• Archival storage system - IBM High Performance Storage System (HPSS) [3]
• Data handling system - SDSC Storage Resource Broker (SRB) [5]
• Object relational database - Oracle version 7.3, IBM DB2 Universal Database
• Collection management software - SDSC Meta-data Catalog (MCAT) [6,7]
• Collection instantiation software - SDSC scripts
• Collection ingestion software - SDSC scripts
• Hierarchical data model - Extensible Markup Language Document Type Definition [2]
• Relational data model - ANSI SQL Data Definition Language [8]
• DTD manipulation software - UCSD XML Matching and Structuring language (XMAS) [9]
• Web server - Apache Web server
• Presentation system - Internet Explorer version 5

The hardware components are:
• Archival storage system - IBM SP 8-node, 32-processor parallel computer, 180 TB of tape storage, three Storage Technology tape robots, and 1.6 TB of RAID disk cache
• Data management system - Sun Enterprise 4-processor parallel computer
• Data ingestion platform - SGI workstation
• Network interconnect - Ethernet, FDDI, and HiPPI

Each of these systems is scalable, and can be implemented using parallel computing technology. The efficiency of the archival storage system is critically dependent upon the use of containers for aggregating data before storage. Three different mechanisms have been tried at SDSC:

• Unix utilities. The TAR utility can be used to aggregate files. For container sizes of 100 MB, the additional disk space required is minimal. The disadvantage is that the container must be read from the archive and unpacked before data sets are accessed.
• Database tablespace. At SDSC, a prototype version of the DB2 UDB [10] parallel object-relational database has been used to support large data collections. The prototype database stores the digital objects internally within tablespaces. The tablespaces can be stored within the HPSS archival storage system, and retrieved to a disk cache on demand. This effectively increases the database storage capacity to the size of the archive, while simultaneously aggregating digital objects into containers before storage in the archive.
• Data handling so
