Conference Proceedings
of the
1995 International Conference
on
SUPERCOMPUTING

Barcelona, Spain
July 3-7, 1995

Sponsored by the
Association for Computing Machinery
The First Society in Computing

Special Interest Group on Computer Architecture
(SIGARCH)
The Association for Computing Machinery
1515 Broadway
New York, N.Y. 10036
`
Copyright © 1995 by the Association for Computing Machinery, Inc. (ACM). Copying without fee is permitted provided that the copies are not made or distributed for profit or commercial advantage and credit to the source is given. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to distribute to lists, requires prior specific permission and/or a fee. For other copying of articles that carry a code at the bottom of the first or last page, copying is permitted provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. Request permission to republish from: Publications Dept. ACM, Inc. Fax +1 (212) 869-0481 or <permissions@acm.org>
`
ACM ISBN: 0-89791-728-6

Additional copies may be ordered prepaid from:

ACM Order Department
P.O. Box 12114
Church Street Station
New York, N.Y. 10257
`
`
Phone: 1-800-342-6626
(U.S.A. and Canada)
1-212-626-0500
(All other countries)
Fax: 1-212-944-1318
E-mail: acmhelp@acm.org
acm_europe@acm.org (in Europe)

ACM Order Number: 415951

Printed in the U.S.A.
`
9th ACM International Conference on Supercomputing

Sponsored by ACM, The First Society in Computing / SIGARCH

GENERAL CHAIR
Mateo Valero, Universitat Politècnica de Catalunya

PROGRAM CHAIR
Michael Wolfe, Oregon Graduate Institute

VICE PROGRAM CHAIRS

ALGORITHMS
Denis Nicole
University of Southampton

ARCHITECTURE
Pen-Chung Yew
University of Minnesota

SOFTWARE
Makoto Amamiya
Kyushu University

CONFERENCE COMMITTEE

Finance Chair: Miquel Huguet, Convex Supercomputer, S.A.E.
Local Arrangements Chair: Eduard Ayguadé, Universitat Politècnica de Catalunya
Exhibition Chair: Josep Lluis Larriba-Pey, Universitat Politècnica de Catalunya
Publicity Chair: John R. Sopka, Sequent Computer Corp.
`
PROGRAM COMMITTEE

Daniel A. Reed, University of Illinois
Thomas Fahringer, University of Vienna
P. Sadayappan, The Ohio State University
Guang R. Gao, McGill University
Mitsuhisa Sato, Electrotechnical Laboratory
Hiroki Honda, Yamanashi University
Satoshi Sekiguchi, Electrotechnical Laboratory
Chua-Huang Huang, The Ohio State University
Tatsuya Shindo, Fujitsu, Ltd.
Mitsuru Ikei, Intel SSD Japan
Teruo Tanaka, Hitachi, Ltd.
Hironori Kasahara, Waseda University
Peiyi Tang, University of Southern Queensland
Zhiyuan Li, University of Minnesota
Olivier Temam, Leiden University
David J. Lilja, University of Minnesota
Kathryn S. McKinley, University of Massachusetts
Nian-Feng Tzeng, University of SW Louisiana
Kleanthis Psarris, University of Texas
David Wood, University of Wisconsin
J. Ramanujam, Louisiana State University
`
`
`
Supported by
CICYT (Spanish Government)
CIRIT (Catalan Government)
`
`
`
`
`
TABLE OF CONTENTS

Session 1a: Synchronization
Decoupling Synchronization and Data Transfer in Message Passing
Systems of Parallel Computers .............................................. 1
T. Stricker, J. Stichnoth, D. O'Hallaron, S. Hinrichs, and T. Gross, Carnegie Mellon University
Techniques for Reducing Overheads of Shared-Memory Multiprocessing ........ 11
A. Kagi, N. Aboulenein, D. C. Burger, and J. R. Goodman, University of Wisconsin
at Madison (best paper award)

Session 1b: Compiler Optimizations
Compiler Cache Optimizations for Banded Matrix Problems ................... 21
W. Li, University of Rochester
Optimum Modulo Schedules for Minimum Register Requirements ................ 31
A. E. Eichenberger, E. S. Davidson, University of Michigan, and S. G. Abraham,
Hewlett-Packard Laboratories

Session 2a: Programming Models
Upper Time Bounds for Executing PRAM-Programs on the LogP-Machine ......... 41
W. Löwe and W. Zimmermann, Universität Karlsruhe
On the Utility of Threads for Data Parallel Programming ................... 51
T. Fahringer, University of Vienna, M. Haines, and P. Mehrotra, NASA ICASE
The DMBC: Architecture and Fundamental Operations ......................... 60
S. Sahni, University of Florida

Session 2b: Distributed Memory Compilers
Unified Compilation Techniques for Shared and Distributed Address
Space Machines ............................................................ 67
C.-W. Tseng, J. M. Anderson, S. P. Amarasinghe, and M. S. Lam, Stanford University
A Compiler-Directed Distributed Shared Memory System ...................... 77
T.-C. Chiueh and M. Verma, State University of New York at Stony Brook
Reducing Communication by Honoring Multiple Alignments .................... 87
D. A. Garza-Salazar and W. Bohm, Colorado State University

Session 3a: Performance Analysis
Multiprocessor Scalability Predictions Through Detailed Program
Execution Analysis ........................................................ 97
X. Zhang and Z. Xu, University of Texas at San Antonio (best paper award)
Petri Net Modeling of Interconnection Networks for Massively
Parallel Architectures ................................................... 107
J. A. Gregorio, F. Vallejo, R. Beivide, and C. Carrión, Universidad de Cantabria
`
`
`
`
Session 3b: Compiling Unstructured Code
Efficient Resolution of Sparse Indirections in Data-Parallel Compilers ... 117
M. Ujaldon and E. L. Zapata, University of Malaga
Extending High Performance Fortran for the Support of Unstructured
Computations ............................................................. 127
A. Müller and R. Rühl, ETH/CSCS
Run-Time Methods for Parallelizing Partially Parallel Loops .............. 137
L. Rauchwerger, University of Illinois at Urbana, N. M. Amato, Texas A&M University,
and D. A. Padua, University of Illinois at Urbana

Session 4a: Interconnection Networks
A Near-Optimal Broadcasting Algorithm in All-Port Wormhole-Routed
Hypercubes ............................................................... 147
C.-M. Wang and C.-Y. Ku, Academia Sinica
A Comparative Study of Single Hop WDM Interconnections for Multiprocessors ... 154
K. R. Desai and K. Ghose, State University of New York at Binghamton

Session 4b: Dynamic Load Balancing
Dynamic Load Balancing for the Simulation of Granular Materials .......... 164
R. Knecht and G. A. Kohring, Research Center Jülich (KFA)
Dynamic Load Balancing of Data Parallel Applications on a Distributed Network ... 170
M. Hamdi and C.-K. Lee, Hong Kong University of Science and Technology

Session 5a: Compiling High Performance Fortran
Efficient Address Generation for Block-Cyclic Distributions .............. 180
K. Kennedy, N. Nedeljković, and A. Sethi, Rice University
Using Asynchronous and Bulk Communications to Construct an Optimizing
Compiler for Distributed Memory Machines with Consideration Given
to Communication Costs ................................................... 185
H. Sato, T. Nanri, and M. Shimasaki, Kyushu University
HPF Compiler for the AP1000 .............................................. 190
T. Shindo, H. Iwashita, T. Doi, J. Hagiwara, and S. Kaneshiro, Fujitsu, Ltd.
The Threaded Communication Library: Preliminary Experiences on a
Multiprocessor with Dual-Processor Nodes ................................. 195
N. Elmasri, McGill University, H. H. J. Hum, Concordia University, and G. R. Gao,
McGill University
`
`
`
`
`
`
Session 5b: Parallel Applications
Amon: A Parallel Slice Algorithm for Wire Routing ........................ 200
H. Keshk, S.-I. Mori, H. Nakashima, and S. Tomita, Kyoto University
Implementation of sparta, a Highly Parallel Circuit Simulator by the
Preconditioned Jacobi Method, on a Distributed Memory Machine ............ 209
R. Suda and Y. Oyanagi, University of Tokyo
Integral Knapsack Problems: Parallel Algorithms and their Implementations
on Distributed Systems ................................................... 218
D. Morales, J. Roda, F. Almeida, C. Rodriguez, and F. Garcia, Universidad de La Laguna
Optimization and Parallelization of a Commodity Trade Model for
the IBM SP1/2 Using Parallel Programming Tools ........................... 227
D. Bergmark, Cornell Theory Center

Session 6: IBM Deep Blue Chess Machine Session
Deep Blue: Computer Chess and Massively Parallel Systems ................. 237
C. J. Tan, IBM T. J. Watson Research Center
Deep Blue System Overview ................................................ 240
F.-H. Hsu, M. S. Campbell, and A. J. Hoane, Jr., IBM T. J. Watson Research Center

Session 7a: Data Caching
Hardware Implementation Issues of Data Prefetching ....................... 245
N. Drach, Université de Paris XI
Data Forwarding in Scalable Shared-Memory Multiprocessors ................ 255
D. A. Koufaty, X. Chen, D. K. Poulsen, and J. Torrellas, University of Illinois at Urbana

Session 7b: Parallelization I
Enhancing Array Dataflow Dependence Analysis with On-Demand Global
Value Propagation ........................................................ 265
V. Maslov
Optimal Tile Size Adjustment in Compiling General DOACROSS Loop Nests ... 270
H. Ohta, Y. Saito, M. Kainaga, and H. Ono, Hitachi, Ltd.

Session 8a: Data Mapping
Efficient Embeddings into the Hypercube Using Matrix Transformations ... 280
M. Hamdi, Hong Kong University of Science and Technology, and S. W. Song, University
of Sao Paulo
Architecture-Independent Locality-Improving Transformations of Computational
Graphs Embedded in k-Dimensions .......................................... 289
C.-W. Ou, M. Gunwani, and S. Ranka, Syracuse University
Semi-Linear and Bi-Base Storage Schemes Classes: General Overview
and Case Study ........................................................... 299
J. Jorda, A. Mzoughi, and D. Litaize, IRIT/UPS
`
`
`
`
Session 8b: Advanced Compilers
A Dataflow Language with Object-based Extension and Its Implementation
on a Commercially Available Parallel Machine ............................. 308
S. Kusakabe, T. Nagai, Y. Yamashita, R.-I. Taniguchi, and M. Amamiya, Kyushu University
Compiler Reduction of Synchronisation in Shared Virtual Memory Systems ... 318
M. O'Boyle, University of Manchester, and F. Bodin, IRISA-INRIA
A Macrotask-level Unlimited Speculative Execution on Multiprocessors ... 328
H. Yamana, M. Sato, Y. Kodama, H. Sakane, Electrotechnical Laboratory, S. Sakai,
Real World Computing Partnership, and Y. Yamaguchi, Electrotechnical Laboratory

Session 9a: Processors
A Data Cache with Multiple Caching Strategies Tuned to Different
Types of Locality ........................................................ 338
A. Gonzalez, C. Aliagas, and M. Valero, Universitat Politècnica de Catalunya
Scalar Processor of the VPP500 Parallel Supercomputer .................... 348
Y. Nakashima, T. Kitamura, H. Tamura, M. Takiuchi, Fujitsu Limited, and K. Miura,
Fujitsu America, Inc.

Session 9b: Using Higher Level Languages
A Comparison of Parallel Programming Paradigms and Data Distributions
for a Limited Area Numerical Weather Forecast Routine .................... 357
R. van Engelen and L. Wolters, Leiden University
The Portable Parallel Implementation of Two Novel Mathematical Biology
Algorithms in ZPL ........................................................ 365
M. D. Dikaiakos, C. Lin, D. Manoussaki, University of Washington, and D. E. Woodward,
Southern Methodist University

Session 10a: Input/Output
Performance Optimization for Parallel Tape Arrays ........................ 375
T.-C. Chiueh, State University of New York at Stony Brook
PPFS: A High Performance Portable Parallel File System ................... 385
J. V. Huber, Jr., C. L. Elford, D. A. Reed, A. A. Chien, and D. S. Blumenthal,
University of Illinois at Urbana
Communication Strategies for Out-of-core Programs on Distributed
Memory Machines .......................................................... 395
R. Bordawekar and A. Choudhary, Syracuse University
Performance Evaluation of a Parallel I/O Architecture .................... 404
S. J. Baylor, C. Benveniste, and Y. Hsu, IBM T. J. Watson Research Center
`
`
`
`
`
`
`
`
Session 10b: Parallelization II
Gated SSA-Based Demand-Driven Symbolic Analysis for Parallelizing Compilers ... 414
P. Tu and D. Padua, University of Illinois at Urbana (best paper award)
Advanced Compilation Techniques in the PARADIGM Compiler for
Distributed-Memory Multicomputers ........................................ 424
E. Su, A. Lain, S. Ramaswamy, D. J. Palermo, E. W. Hodges IV, and P. Banerjee,
University of Illinois at Urbana
Vectorization beyond Data Dependences .................................... 434
P. Tang, University of Southern Queensland, and N. Gao, Chinese Academy of Sciences
Idiom Recognition in the Polaris Parallelizing Compiler .................. 444
B. Pottenger and R. Eigenmann, University of Illinois at Urbana
`
`
`
`
`
`
PPFS: A High Performance Portable Parallel File System

James V. Huber, Jr.*    Christopher L. Elford*    Daniel A. Reed*
Andrew A. Chien*    David S. Blumenthal*

Department of Computer Science
University of Illinois
Urbana, Illinois 61801
`
Abstract

Rapid increases in processor performance over the past decade have outstripped performance improvements in input/output devices, increasing the importance of input/output performance to overall system performance. Further, experience has shown that the performance of parallel input/output systems is particularly sensitive to data placement and data management policies, making good choices critical. To explore this vast design space, we have developed a user-level library, the Portable Parallel File System (PPFS), which supports rapid experimentation and exploration. PPFS includes a rich application interface, allowing the application to advertise access patterns, control caching and prefetching, and even control data placement. PPFS is both extensible and portable, making possible a wide range of experiments on a broad variety of platforms and configurations. Our initial experiments, based on simple benchmarks and two application programs, show that tailoring policies to input/output access patterns yields significant performance benefits, often improving performance by nearly an order of magnitude.
`
1 Introduction

The widespread acceptance of massively parallel systems as the vehicle of choice for high-performance computing has produced a wide variety of machines and an even wider variety of potential input/output configurations, most with inadequate input/output capacity and performance. This disparity between computation and input/output rates reflects similar limitations on sequential systems. Processor performance has increased by several orders of magnitude in the last decade, but improvements in secondary storage access times have not kept pace (e.g., disk seek times have decreased by less than a factor of two even as data density has increased dramatically).

*Supported in part by the National Science Foundation under grant NSF ASC 92-12369, by the National Aeronautics and Space Administration under NASA Contracts NGT-51023, NAG-1-613, and USRA 5555-22 and by the Advanced Research Projects Agency under ARPA contracts DAVT63-91-C-0029 and DABT63-93-C-0040. Andrew Chien is supported in part by NSF Young Investigator Award CDA-94-01124.

Permission to make digital/hard copies of all or part of this material without fee is granted provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the Association for Computing Machinery, Inc. (ACM). To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific permission and/or a fee.
ICS '95 Barcelona, Spain © 1995 ACM 0-89791-728-6/95/0007...$3.50
In an effort to balance impressive peak processing rates with sufficient input/output bandwidth, most parallel systems support multiple, redundant arrays of inexpensive disks (RAIDs) [21, 12]. Although multiple RAIDs provide peak input/output bandwidth equal to the product of the number of RAIDs and their individual bandwidths, little is known about the effective input/output bandwidth of an array of RAIDs in parallel scientific computing environments.

Developing models for how such massively parallel systems should be configured to be balanced, and how to manage the input/output resources to achieve high sustained performance, requires exploration of a broad space of choices (e.g., disk data striping factors, file and disk block prefetch policies, and caching policies). Simply put, we must determine the most effective combinations of buffering, caching, data distribution and prefetching policies that allow a parallel file system to reduce the number of physical input/output operations and to overlap physical input/output with computation.
To explore these issues, we have developed a Portable Parallel File System (PPFS) to study the interaction of application access patterns, file caching and prefetching algorithms, and application file data distributions. PPFS consists of a user-level input/output library that is portable across both parallel systems and workstation clusters, permitting a wide range of experimentation with modest human resources. PPFS allows application control of a variety of file caching, prefetching, data layout, and coherence policies. This enables rapid exploration of a wide range of possibilities, facilitating rapid convergence on good policies for each file in the application. The PPFS application interface supports a number of predefined policies, but can also extend its repertoire by accepting user-defined layouts, access pattern declarations, and prefetching patterns. Interposing the library between the application and the system software has allowed us to experiment with a variety of data distribution and data management algorithms more quickly than would be possible via system software modifications.
In a preliminary study, we have measured the performance of PPFS on simple input/output benchmarks and on two input/output intensive programs, a gene sequence matching application and a low temperature plasma electron scattering code. All of these experiments were conducted on a variety of Intel Paragon XP/S hardware configurations, using the Intel parallel file system (PFS) and PPFS atop multiple Intel Paragon Unix file systems (UFS).
The simple benchmarks show that PPFS performance is less dependent than PFS on changes in request size or the number of concurrently active input/output operations. The two larger application codes have dramatically different input/output behavior (the genome code is read-intensive and the electron scattering code is write-intensive), but PPFS demonstrates significant performance benefits for each, improving performance by nearly an order of magnitude in some cases.

With this context, the remainder of the paper is organized as follows. In §2 we elaborate on the motivations for a portable parallel file system. This is followed in §3-§4 by a detailed description of the PPFS design philosophy, design overview, examples of its use, and current implementation status. In §5 we analyze the performance of our current implementation, based on simple input/output benchmarks. In §6, we describe two large application codes, their input/output behavior, and the performance improvements possible with PPFS. Finally, §7 describes related work, summarizes our results and discusses current research directions.
`
`
2 Motivations

Input/output performance depends on the distribution of file data across storage devices, the file caching and prefetching algorithms, and parallel, spatio-temporal file access patterns. Simply put, the potential design space of input/output software systems is enormous, and our current knowledge is extremely limited. Hence, it is critical that we develop experimental methodologies that allow us to quickly explore large portions of the software design space.

In turn, rapid exploration requires broad control over file management policies. Few extant file systems provide such malleable interfaces, and the effort to build a low-level parallel file system is substantial. Moreover, to be credible, building such a system requires access to commercial operating system source code; this is rarely possible or practical.

We believe that many of the important adaptivity issues are more easily explored via a portable parallel file system (PPFS). A PPFS implementation interposes a portable input/output library between the application and a vendor's basic system software. In this model, a parallel file consists of a collection of standard vendor-supported files and a set of file metadata that describes the mapping from these files to a single PPFS parallel file.

This approach allows one to explore a variety of caching and prefetching policies and their performance tradeoffs on disparate architectures, without becoming mired in costly and time-consuming modifications to system software. Although this compromise does sacrifice some low-level control (e.g., disk block placement and disk arm scheduling), we have found that it is repaid in increased experimental flexibility and coverage.
Based on the premise of a portable file system infrastructure, the goals of the PPFS project are threefold.

• First, we are developing a flexible application programming interface (API) for specifying access pattern hints and for controlling file system behavior.

• Second, we are using PPFS to explore a variety of data distribution, distributed caching and prefetching strategies via both direct execution of input/output-intensive applications and via trace-driven simulation.

• Third, we are exploring distributed techniques for dynamically classifying file access patterns and automatically adapting data management algorithms to minimize access latencies for those patterns.
`
3 PPFS Design Principles

The portable parallel file system (PPFS) is a tool for exploring the design space of parallel input/output systems. To support experimentation and optimization of input/output performance, PPFS has an open application interface, and it is portable across a number of parallel platforms. An open application interface supports the advertising of access pattern information, control of data management policies, and extension of the parallel file system. Portability enables experimentation on a range of platforms.

To maximize experimental flexibility, PPFS is implemented as a user-level input/output library designed to be portable across several parallel systems. Building portable user-level libraries involves a number of challenges; when building atop an extant system, one can dependably assume only that the system software provides basic Unix file service. The basic challenge is building a coordination framework amongst multiple input/output servers that allows them to synchronize and exchange data. PPFS implements parallel file operations, providing global naming and location for the components of a parallel file and a rich application interface. PPFS's API includes library calls for control of file striping, distribution, and caching, enabling the application to export policy and access pattern information to the input/output system.
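For illustration, a minimal sketch of how an application might use such calls follows. All interface names here (ppfs::open, set_access_pattern, set_caching) are hypothetical stand-ins invented for this sketch; the actual PPFS calls are documented in [11].

    // Hypothetical sketch of a PPFS-style open sequence.  All names are
    // invented for illustration; the real API is documented in [11].
    #include <cstddef>

    namespace ppfs {
      enum class Pattern { Sequential, Random, Strided };
      enum class Cache   { Client, Server, Global, Off };

      struct File {
        // Stubs standing in for calls that would message the metadata
        // server and input/output servers.
        void set_access_pattern(Pattern, std::size_t /*stride*/ = 0) {}
        void set_caching(Cache) {}
        std::size_t read(void*, std::size_t n) { return n; }
      };

      inline File* open(const char* /*name*/, int /*striping_factor*/) {
        static File f;
        return &f;
      }
    }

    int main() {
      // Open a parallel file striped over four servers, advertise a
      // strided access pattern, and keep caching on the client side.
      ppfs::File* f = ppfs::open("foo", 4);
      f->set_access_pattern(ppfs::Pattern::Strided, 1024);
      f->set_caching(ppfs::Cache::Client);

      char buf[1024];
      f->read(buf, sizeof buf);
      return 0;
    }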
`
3.1 Malleable Access

In any parallel file system, caching, prefetching, data placement, and data sharing policies constrain the range of file access patterns where high performance is possible. In traditional file systems, these policies are frozen during implementation. Given our current, limited understanding of the range of possible parallel access patterns, we believe the file system must provide application interfaces for specifying these and other policies. A portable, extensible file system must provide interfaces for extending, changing, and controlling these policies.

As described in §4 and documented in [11], PPFS incorporates a rich set of input/output primitives that allow a parallel application to specify the data distribution across disks, prefetching and caching algorithms, and the anticipated file access characteristics (e.g., sequential, random, or block-sequential). This infrastructure is far more flexible than current vendor parallel systems; both Thinking Machines' SFS [17] and Intel's CFS/PFS [9, 12] support only a small number of access modes and an even smaller number of automatic file data distributions among the disks. This means that PPFS can be used to explore a far broader range of input/output system designs than is possible with incremental modification of existing parallel file systems. For example, with PPFS one can distribute a file across disks by groups of N blocks, specify that every k-th block will be accessed, and request that blocks be prefetched into distributed client caches based on access pattern.
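Rendered as code, that example might look like the sketch below; the type and call names are again invented stand-ins, not the documented PPFS interface.

    // The text's example as a sketch: block-group layout, a strided
    // access declaration, and prefetching into distributed client
    // caches.  Hypothetical stubs, not real PPFS calls.
    #include <cstddef>

    struct ParallelFile {
      void layout_block_groups(std::size_t) {}   // groups of N blocks per disk
      void declare_block_stride(std::size_t) {}  // every k-th block will be read
      void prefetch_to_client_caches() {}        // prefetch on the declared pattern
    };

    void configure(ParallelFile& f, std::size_t N, std::size_t k) {
      f.layout_block_groups(N);
      f.declare_block_stride(k);
      f.prefetch_to_client_caches();
    }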
`
`
The design of an experimental tool for exploring parallel input/output systems must include support for controlling the following key policies (a schematic sketch follows the list).

• Caching: global, client side, server side, with particular control over aggregation (granularity) and write-back policies.

• Prefetching: client level, server level, and aggregate prefetching to hide access latency and share work. Prefetching must be coordinated with caching.

• Data Distribution: controlling data layout across input/output devices.

• File Sharing: coordinated control over metadata and file data consistency.
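One compact way to picture these four dimensions is as a per-file policy descriptor. The structure below is a schematic summary of the list above, not a type taken from the PPFS source.

    // Schematic per-file policy descriptor covering the four policy
    // dimensions above; an illustrative summary, not a PPFS structure.
    #include <cstddef>

    enum class CacheSite    { Global, Client, Server, Off };
    enum class WritePolicy  { WriteThrough, WriteBack };
    enum class PrefetchSite { Client, Server, Aggregate, Off };
    enum class Layout       { Block, Cyclic, BlockGroups, UserDefined };
    enum class Sharing      { StrictConsistency, RelaxedConsistency };

    struct FilePolicy {
      CacheSite    cache       = CacheSite::Client;
      std::size_t  granularity = 4096;  // cache aggregation unit in bytes
      WritePolicy  writes      = WritePolicy::WriteBack;
      PrefetchSite prefetch    = PrefetchSite::Client;
      Layout       layout      = Layout::Block;
      Sharing      sharing     = Sharing::RelaxedConsistency;
    };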
`
3.2 Portability

Software portability is always predicated on some design assumptions, and no software is painlessly portable to all systems. However, to increase the range of experiments possible with PPFS we must maximize its portability, enabling experiments on as many systems as possible. Hence, the only system requirements to support a port of PPFS are:

• an underlying Unix file system for each input/output node,

• a typed message passing library (currently NXLIB [26] and MPI [19]), and

• a C++ language implementation.

We view these porting requirements as minimal. Unix, the sequential file system base, is a de facto standard. The NXLIB message passing library provides Intel Paragon XP/S message passing semantics on a workstation network and compatibility with the native XP/S software environment; however, it can be and has been easily replaced with other message passing substrates (e.g., MPI). Finally, C++ has allowed us to isolate implementation details in objects and to inherit implementation features when porting the software to new environments. By minimizing system dependencies and isolating system dependencies behind object firewalls, PPFS is portable to any system that meets our three software assumptions, including both parallel systems and workstation clusters.¹

¹The current implementation of PPFS is available from http://www-pablo.cs.uiuc.edu/Projects/PPFS/.
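The object firewall idea can be illustrated with a small sketch: an abstract message-port class whose concrete subclasses are the only code that touches a particular substrate such as NXLIB or MPI. This rendering is illustrative only, not PPFS source code.

    // Illustrative "object firewall": system-dependent message passing
    // hidden behind an abstract interface; only concrete subclasses
    // know about a particular substrate.  Not PPFS source code.
    #include <cstddef>

    class MessagePort {  // portable interface seen by the rest of the library
    public:
      virtual ~MessagePort() = default;
      virtual void send(int node, int type, const void* buf, std::size_t n) = 0;
      virtual void recv(int type, void* buf, std::size_t n) = 0;
    };

    class MpiPort : public MessagePort {  // one subclass per substrate
    public:
      void send(int, int, const void*, std::size_t) override {
        // would call MPI_Send(buf, n, MPI_BYTE, node, type, comm) here
      }
      void recv(int, void*, std::size_t) override {
        // would call MPI_Recv(...) here
      }
    };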
`
4 PPFS Software Design

Our PPFS design is based on a client/server model. Clients are application processes that include both the user program and a PPFS client. Clients issue requests to one or more PPFS input/output servers that manage logical input/output devices. Together the PPFS client and PPFS servers provide parallel file service to the application program. Below we describe the components of PPFS and their interactions with other components, and illustrate their use with an example.
`
4.1 PPFS Components

Figure 1: Portable Parallel File System Design

As Figure 1 shows, the key software elements in PPFS are the clients, servers, the metadata server, and caching agents. These components are briefly described below, coupled with a discussion of their interactions. Dashed arrows in Figure 1 represent local library or system calls, and solid arrows (e.g., those between clients and servers) correspond to message passing interactions. A more detailed description of the file system can be found in [7, 6, 8].
`
Clients  In an SPMD computation, a client consists of an instance of the user application code and the local caching, prefetching, and bookkeeping software that permits an application to use PPFS. The client code communicates with the metadata server to determine which data servers and caching agents will satisfy input/output requests on behalf of the application code for a given file. Application-specific file system configuration and policies originate from the application code, and can affect all elements in the PPFS. For example, the PPFS client may maintain a local cache to hold recently used and prefetched data. Such a client cache can be configured, controlled (with respect to replacement policy, for example), or disabled (to ensure data coherence) by the application program.
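A client-cache control interface of this kind might be exposed roughly as follows; the names are hypothetical, invented for illustration.

    // Hypothetical client-cache controls: size the cache, choose a
    // replacement policy, or disable caching to guarantee coherence.
    #include <cstddef>

    enum class Replacement { LRU, FIFO, Random };

    struct ClientCache {
      void configure(std::size_t /*bytes*/, Replacement) {}  // stub
      void disable() {}  // stub: fall back to uncached, coherent access
    };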
`
Servers  An input/output server is an abstraction of an input/output device in PPFS. Input/output servers are the ultimate resolvers of requests; all physical input/output flows through them. Each server consists of a cache of file data, a prefetch engine that fetches data from the underlying sequential file system based on one of a variety of prefetch algorithms, storage for file metadata associated with open files, server message passing support, and an underlying Unix file system for storing data.

The number of input/output servers is configurable at the time PPFS is initialized and may be larger or smaller than the number of physical input/output devices. Server caching and prefetch policies may also be affected by application calls and can be changed during application execution.
`
4.2 File Control Interfaces

The PPFS file control interfaces allow the application to:

• declare per client or aggregate access patterns for the file (random, sequential, strided, etc.),

• control data layout of files over servers (block, cyclic, etc.),

• control caching policies at servers and clients, and

• control prefetching policies at servers and clients.

[Figure: PPFS file metadata example — persistent metadata (file name "foo"; a clustering table of server, segment, and size entries; file size; record size; distribution; index scheme) and transient metadata (file handle, file size, access pattern, caching agent info).]
`
Together, the file control interfaces of PPFS provide a wide range of information advertisement and policy control capabilities to the application. They provide a range of policies beyond what is available from commercial file systems, and PPFS's portability enables experimentation with such policies on a number of platforms. These file control interfaces can be used to advertise application information (helping the file system to make good policy choices), to control policies (explicitly optimizing input/output performance), and to develop new policies (rapid evaluation).
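The persistent/transient split pictured in the metadata figure can be summarized schematically as below; this is one reading of the figure, not a structure from the PPFS source.

    // Schematic rendering of the metadata figure: persistent fields
    // describe the file's on-disk layout and survive across opens;
    // transient fields exist only while the file is open.
    #include <cstdint>
    #include <string>
    #include <vector>

    struct ClusterEntry {         // one row of the clustering table
      int           server;       // input/output server holding the segment
      int           segment;      // segment identifier on that server
      std::uint32_t size;         // segment size
    };

    struct PersistentMetadata {   // e.g., file "foo" in the figure
      std::string               name;
      std::vector<ClusterEntry> clustering;
      std::uint64_t             file_size;
      std::uint32_t             record_size;   // 1024 in the example
      int                       distribution;  // layout scheme
      int                       index_scheme;
    };

    struct TransientMetadata {    // valid only for an open file
      int           handle;
      std::uint64_t file_size;    // current size while open
      int           access_pattern;  // declared or observed pattern
      int           agent_info;      // caching agent assignment
    };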
`
4.3 Performance Instrumentation

We have exploited our existing Pablo performance instrumentation and analysis suite [24, 23] to instrument PPFS. In addition to capturing client input/output requests and message passing among PPFS clients and servers, client and server cache hit ratios, cache utilizations, request counts, and access latencies, we also record the times, durations, and sizes of all input/output requests that PPFS servers generate for the underlying file system. Because all performance data is written in the Pablo self-defining data format (SDDF) [2], extensive, off-line analysis and behavioral comparisons are possible via the Pablo performance analysis environment.