`Design, Automation and Test
in Europe
`Conference and Exhibition 2000
`
`
`AMAZON 1006
`Page 1 of 12
`
`
`
`PROCEEDINGS
`Design, Automation and Test
`in Europe
`Conference and Exhibition 2000
`
`Paris, France
March 27-30, 2000
`
`Sponsors
`EDAA, EDAC, IEEE Computer Society - TTTC, IEEE Computer Society - DATC,
`ECSI, IFIP 10.5, RAS Russian Academy of Sciences, IPPM
`
`In cooperation with
`
`ACM-SIGDA, AEIA, ATI, CLRC, CNR, Estonian E SOC, GI,
`GMM, HTE, ITG, KVIV, VDE
`
`Editors
`
`Peter Marwedel, University of Dortmund, Germany
`Ivo Bolsens, IMEC, Belgium
`
IEEE Computer Society
Los Alamitos, California • Washington • Brussels • Tokyo
`
`
`
`
Copyright © 2000 by The Institute of Electrical and Electronics Engineers, Inc.
`All rights reserved
`
`Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries
`may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in
`this volume that carry a code at the bottom of the first page, provided that the per-copy fee
`indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive,
`Danvers, MA 01923.
`
`Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights
Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
`
`The papers in this book comprise the proceedings of the meeting mentioned on the cover and title
`page. They reflect the authors’ opinions and, in the interests of timely dissemination, are
`published as presented and without change. Their inclusion in this publication does not
`necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of
`Electrical and Electronics Engineers, Inc.
`
IEEE Computer Society Order Number PR00537
`ISBN 0-7695-0537-6
`ISBN 0-7695-0538-4 (case)
`ISBN 0-7695-0539-2 (microfiche)
`Library of Congress Number 99-69380
`
`Additional copies may be ordered from:
`
`IEEE Computer Society
`Customer Service Center
`10662 Los Vaqueros Circle
`P.O. Box 3014
`Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
`E-mail: cs.books@computer.org
`
`IEEE Service Center
`445 Hoes Lane
`P.O. Box 1331
`Piscataway, NJ 08855-1331
Tel: +1-732-981-0060
Fax: +1-732-981-9667
`mis.custserv@computer.org
`
`IEEE Computer Society
Asia/Pacific Office
Watanabe Bldg., 1-4-2
Minami-Aoyama
Minato-ku, Tokyo 107-0062
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
`
`Editorial production by Bob Werner
`Cover art production by Joe Daigle/Studio Productions
`Printed in the United States of America by The Printing House
`
`
`
`
`
`Table of Contents
`Design, Automation and Test in Europe - DATE 2000
DATE Sponsor Committee .................................................................................................................... xix
DATE Executive Committee ................................................................................................................. xix to xx
Technical Program Chairs ..................................................................................................................... xxi to xxii
Vendors Committee ............................................................................................................................... xxii
Technical Program Committee .............................................................................................................. xxiii to xxvi
Reviewers ............................................................................................................................................... xxvii to xxviii
Welcome to DATE 2000 ........................................................................................................................ xxix to xxx
Best Paper Awards ................................................................................................................................. xxxi
Tutorials ................................................................................................................................................. xxxii to xxxiv
Call for Papers for DATE 2001 ............................................................................................................. 771
`
`Plenary - Keynote Session: Connected, Smart Devices - Computing Beyond the Desktop
`
`Moderator: I. Bolsens, IMEC, B
`Speakers: Jerry Fiddler, Chairman and Co-founder of Wind River Systems, USA
`Wim Roelandts, CEO Xilinx, USA
`
`1A: Embedded Software Generation
`Moderators: L. Thiele, TU Zurich, CH
`J.C. Lopez, Castilla-La Mancha U, ES
Code Selection for Media Processors with SIMD Instructions ............................................................... 4
R. Leupers
Analysis of High-Level Address Code Transformations for Programmable Processors ........................ 9
S. Gupta, R. Gupta, M. Miranda, F. Catthoor
Free MDD-based Software Optimization Techniques for Embedded Systems .................................... 14
C. Kim, A. Sangiovanni-Vincentelli, L. Lavagno
`
`1B: Low-Power Issues in System-Level Design
`Moderators: A.M. Trullemans-Anckaert, UC Louvain, B
`C. Guardini, PDF Solutions, USA
`
Quantitative Comparison of Power Management Algorithms ............................................................... 20
Y. Lu, E. Chung, T. Simunic, G. De Micheli, L. Benini
Efficient Power Co-Estimation Techniques for System-On-Chip Design ............................................ 27
M. Lajolo, A. Raghunathan, S. Dey, L. Lavagno
A Discrete-Time Battery Model for High-Level Power Estimation ..................................................... 35
L. Benini, G. Castelli, A. Macii, E. Macii, M. Poncino, R. Scarsi
`
`
`
`
Automatic Lighthouse Generation for Directed State Space Search ................................................... 237
P. Yalagandula, A. Aziz, V. Singhal
Analyzing Real-Time Systems ............................................................................................................ 243
J. Ruf, T. Kropf
`
`4B: Multi-Processor Architectures and Design Methods
`Moderators: L. Nachtergaele, IMEC, B
`M. Bolle, Bosch Telecom, B
`
A Generic Architecture for On-Chip Packet-Switched Interconnections ........................................... 250
P. Guerrier, A. Greiner
Memory Arbitration and Cache Management in Stream-based Systems ............................................ 257
F. Harmsze, A. Timmer, J. van Meerbergen
HW/SW Codesign of an Engine Management System ....................................................................... 263
M. Baleani, A. Ferrari, A. Sangiovanni-Vincentelli, C. Turchetti
`
`4C: Logic Synthesis: Performance Optimization
`Moderators: L. Stok, IBM, USA
`J. Monteiro, INESC, PT
`
Wave Steered FSMs ............................................................................................................................ 270
L. Macchiarulo, S. Shu, M. Marek-Sadowska
Delay Minimization and Technology Mapping of Two-Level Structures and Implementation
using Clock-Delayed Domino Logic .................................................................................................. 277
J. Ciric, G. Yee, C. Sechen
Gate Sizing using a Statistical Delay Model ....................................................................................... 283
E. Jacobs, M. Berkelaar
`
`4D: TPG and Diagnosis in BIST
`Moderators: B. Becker, Freiburg U, D
`K. Kinoshita, Osaka U, JP
Optimal Hardware Pattern Generation for Functional BIST .............................................................. 292
S. Cataldo, S. Chiusano, P. Prinetto, H. Wunderlich
Built-in Generation of Weighted Test Sequences for Synchronous Sequential Circuits .................... 298
I. Pomeranz, S. Reddy
Diagnostic Testing of Embedded Memories using BIST .................................................................... 305
T. Bergfeld, E. Rudnick, D. Niggemeyer
`
`5A: Architectural-Level Synthesis
`Moderators: P. Eles, Linkoping U, SE
`R. Hermida, U Complutense Madrid, ES
`
Resolution of Dynamic Memory Allocation and Pointers for the Behavioral Synthesis from C ........ 312
L. Séméria, K. Sato, G. De Micheli
`
`
`
`
An Integrated Temporal Partitioning and Partial Reconfiguration Technique for
Design Latency Improvement ............................................................................................................. 320
S. Ganesan, R. Vemuri
Target Architecture Oriented High-Level Synthesis for Multi-FPGA based Emulation .................... 326
O. Bringmann, C. Menn, W. Rosenstiel
Fast Cache and Bus Power Estimation for Parameterized System-On-A-Chip Design ...................... 333
T. Givargis, F. Vahid, J. Henkel

5B: Analysis of Communication Circuits
Moderators: A. Konczykowska, France Telecom CNET, F
R. Schwencker, TU Munich, D

Stochastic Modeling and Performance Evaluation for Digital Clock and Data Recovery Circuits .... 340
A. Demir, P. Feldmann
A New Approach for Computation of Timing Jitter in Phase Locked Loops ..................................... 345
M. Gourary, S. Rusakov, S. Ulyanov, M. Zharov, K. Gullapalli, B. Mulvaney
Compact Modeling of Nonlinear Distortion in Analog Communication Circuits .............................. 350
P. Wambacq, P. Dobrovolny, S. Donnay, M. Engels, I. Bolsens

5C: Logic Synthesis: Covering and PTL Circuits
Moderators: M. Berkelaar, TU Eindhoven, NL
L. Stok, IBM, USA

On using Satisfiability-based Pruning Techniques in Covering Algorithms ...................................... 356
V. Manquinho, J. Marques-Silva
An Efficient Heuristic Approach to Solve the Unate Covering Problem ............................................ 364
R. Cordone, F. Ferrandi, D. Sciuto, R. Calvo
`
`On the Generation of Multiplexer Circuits for Pass Transistor Logic .................................................................... 372
`C. Scholl, B. Becker
`
`5D: Delay and Functional Testing
`Moderators: G. Kosonocky, Intel, USA
`S. Pravossoudovitch, LIRMM, F
`
`On Applying Incremental Satisfiability to Delay Fault Testing .............................................................................. 380
`J. Kim, J. Whittemore, K. Sakallah, J. Silva
`
`Automatic Test Bench Generation for Validation of RT-Level Descriptions: An Industrial Experience ............... 385
`F. Corno, M. Sonza Reorda, G. Squillero, A. Manzone, A. Pincetti
`
A VHDL Error Simulator for Functional Test Generation .................................................................. 390
A. Fin, F. Fummi
Functional Test Generation for Full Scan Circuits .............................................................................. 396
I. Pomeranz, S. Reddy
`
`
`
`
`Fast Cache and Bus Power Estimation for Parameterized System-on-a-Chip Design
`
`Tony D. Givargis, Frank Vahid
`Department of Computer Science and Engineering
`University of California, Riverside, CA 92521
`{givargis, vahid}@cs.ucr.edu
`
`Jörg Henkel
`C&C Research Laboratories, NEC USA
`4 Independence Way, Princeton, NJ 08540
`henkel@ccrl.nj.nec.com
`
`Abstract
We present a technique for fast estimation of the power consumed by the cache and bus sub-system of a parameterized system-on-a-chip design for a given application. The technique uses a two-step approach of first collecting intermediate data about an application using simulation, and then using equations to rapidly predict the performance and power consumption for each of thousands of possible configurations of system parameters, such as cache size and associativity and bus size and encoding. The estimations display good absolute as well as relative accuracy for various examples, and are obtained in dramatically less time than other techniques, making possible the future use of powerful search heuristics.
`Keywords
`System-on-a-chip, low power, estimation, intellectual property,
`cache, on-chip bus.
`1. Introduction
Silicon capacity continues to increase faster than the ability for designers to use that silicon, resulting in the well-known productivity gap [18]. Many researchers propose extensive reuse of pre-designed intellectual property cores to reduce this gap [8], where typical cores include microprocessors, microcontrollers, digital signal processors, encoders/decoders, bus interfaces, and numerous other common peripheral components. Two complementary core-based design approaches are emerging. One approach, based on a traditional capture-and-simulate [5] paradigm, assumes that a designer pieces together many cores obtained from various sources [24] (adding some custom logic), simulates extensively, and then generates new silicon implementing the system-on-a-chip. The other approach, which this paper addresses and which we refer to as configure-and-execute, assumes the designer starts with a pre-designed system-on-a-chip¹, and then configures that system (including adding and deleting some cores) before generating new silicon [16][20][21][22]. The configure-and-execute approach has an advantage of enabling software development on real silicon, reducing the need for lengthy hardware/software co-simulations. Several commercial products now support such an approach for various application domains [14][23], such as networks and communications.
`
¹ Such pre-designed silicon has been referred to as a reference design, cfg chip (configurable chip), and silicon platform by various authors.
`
A key to the success of a configure-and-execute approach is that the pre-designed system's architecture be heavily parameterized, so that design metrics like power, performance and size can be optimized for a particular application's design constraints, by selecting particular parameter values before generating new silicon. We focus in this paper on parameters of the system cache and its associated on-chip buses, the CPU-to-cache bus and the cache-to-main-memory bus, as cache and bus have been shown to contribute a significant percentage of system power. The main contribution of this paper is the creation of a fast cache power/performance estimation method and its coupling with a fast bus estimation method, enabling future heuristics that could simultaneously explore the large design space defined by cache and bus parameters. Such simultaneous exploration was recently shown to be crucial to optimizing deep-submicron designs [7], in which bus power consumption begins to surpass that of cache, and in which the cache and bus parameters must therefore be carefully tuned to one another.
`Section 2 highlights the basic idea of parameterized system
`design. Section 3 describes related work in cache and bus
`power estimation and optimization. Section 4 describes our
`overall estimation approach. Section 5 describes the cache
`model used. Section 6 shows how to couple the cache model
`with our previously developed bus model. Section 7 describes
`our experimental results showing the speed and excellent
`accuracy of our approach. Section 8 provides conclusions.
`2. Parameterized system design
`Our long-term goal is to develop an environment supporting
`
Figure 1: Steps in parameterized system design. [Flow: a reference design undergoes application development via a characterizing simulation, guided by optimization criteria, producing intermediate data; parameter exploration then couples search heuristics with estimation equations; parameter optimization follows, leading to new silicon generation.]
`
`
`
`
Figure 2: System-on-a-chip reference design's basic structure. [Microprocessor, cache, DMA and memory on a system bus; a bridge connects to a peripheral bus carrying peripherals and reconfigurable logic.]
`
a parameterized system design approach. Such an approach consists of three main steps, as illustrated in Figure 1.
1. Application development begins with a commercially available "reference design," implemented on a configurable prototyping system-on-a-chip ("cfg chip"). Figure 2 illustrates a typical reference design system-on-a-chip [24] consisting of a microprocessor core, cache, main memory, and direct-memory access (DMA) controller, all connected via a system bus. Also on that bus is a bridge to a set of peripheral cores, which differ depending on the class of intended applications (e.g., networking), and to reconfigurable logic or to add-on chips. The desired application is developed on this cfg chip, which supports in-circuit emulation and hence at-speed application execution, overcoming the problem of prohibitively long simulation time for systems-on-a-chip. Some additional cores could be added (using the reconfigurable logic) and unneeded ones shut off. Numerous system-on-a-chip developers have begun to emphasize the importance of starting with such a reference design rather than composing cores from scratch [16][20][21][22].
2. Parameter optimization occurs once the application has been developed with the aid of the cfg chip. The architecture's parameters are optimized for that application and its accompanying power, performance and size optimization criteria. Critical architectural parameters may include bus parameters like data size, address and data encoding techniques, multiplexing, etc.; cache parameters like cache size, associativity, write-back techniques, block size/line size, etc.; DMA parameters; and parameters relating to specific peripheral cores, like buffer sizes, resolutions, compression levels, etc. While the first step above has already had manifestations in commercial products (e.g., [23]), this second step still requires extensive research, and is beginning to receive some attention (such as work in [2], which optimizes a parameterized virtual memory system). Because the parameters are highly interdependent (for example, cache size impacts bus traffic), as well as strongly dependent on each application's features, this is a hard problem.
3. New silicon generation then results in a new chip implementing the optimized architecture, including any added cores and excluding any shut-off ones. For example, while the reference design may have had a 32K cache and 32-bit unencoded bus, the optimized architecture for a particular application may have a 4K cache and an 8-bit bus using bus-invert encoding to reduce power. Ideally, the silicon is correct on the first pass due to the extensive in-circuit emulation already performed.
Note the difference between a parameterized system design approach and the traditional system-level synthesis approach. In the traditional approach, an application is first described behaviorally, and then an architecture is synthesized for that application (processors, memories and buses are instantiated), and the behavior is then mapped to the architecture. In the parameterized system design approach, the basic architecture is pre-defined, and the application is developed on that architecture. The "system synthesis" going on is really a fine-tuning of the original architecture through selection of values for the architecture's parameters.
`
`3. Related work
Numerous techniques for high-level power estimation and optimization have evolved recently; an overview can be found in Raghunathan et al. [17].
Much attention has been given to developing detailed models of cache internals to accurately predict a cache's latency [25] as well as power consumption [3] for a given parameter configuration. Such detailed models would be used to estimate power in cache simulators, which we use as described later. Attention has also been given to exploring various cache configurations in terms of power, performance and size. Su and Despain [19] evaluate several cache design techniques with respect to power and performance. Henkel [11] used exhaustive trace-driven cache simulations to show that the best cache configuration, in terms of power, performance and size, differed greatly for different applications.
Noticing that large tradeoffs are possible by configuring a cache for an application, but that cache simulations are slow, many researchers have focused on speeding up cache simulations. Kirovski et al. [9] reduce the number of trace-driven cache simulations necessary for exploring different cache configurations by establishing bounds and hence pruning numerous inferior configurations without having to simulate them. Wu and Wolf [26] order the search of different cache configurations such that, after each cache simulation, they can reduce the size of a given input trace by removing redundant information ("trace stripping"), thus speeding up subsequent simulations of other configurations. Our work differs in that we couple cache parameters with bus parameters (and possibly other parameters in the future), resulting in an enormous design space and thus seemingly excluding any approach based on repeated simulation. While one-pass cache simulation [12] is a common technique, in which numerous cache configurations are evaluated simultaneously during one simulation, incorporating the myriad of other parameters that we wish to consider (bus, DMA, peripheral cores, etc.) into such an approach would likely become prohibitively complex.
`
`
`
`
(referred to hereafter as a trace-file), we are to compute the number of cache misses², denoted N, for all different caches. Two caches are different if they differ in their total cache size, line size (block size) or degree of associativity. We limit each of these three distinguishing parameters to a finite range:
`
`{
`i
`:2
`{
`i
`:2
`{
`i
`:2
`
`=
`=
`=
`
`S
`L
`
`A
`
`Min
`
`=
`Si
`=
`Li
`Min
`=
`Ai
`
`Min
`
`
`
`}
`}
`}.
`
`S
`Max
`L
`
`Max
`A
`Max
`
`Note that, for practical purposes, we only consider values that
`are powers of two for each of these parameters. Given a trace-
`file, we must define a function:
`(
`)
`fi··
`:
`ALS
`
`N
`
`.
`
`f
`
to compute the number of cache misses for any cache configuration. We assume that, with the aid of a cache simulator, we are able to compute the above function, for any value from the sets S, L and A, in linear time with respect to the size of the trace-file. Intuitively, our approach works as follows. We know that at low cache sizes, higher line size and associativity have a greater positive effect than they do at high cache sizes. For example, doubling the line size when cache size is 512B may reduce cache miss rate by 30%; however, when the cache size is 8K, it may not reduce the miss rate at all. Thus, we are interested in finding these improvement ratios at both low and high cache sizes, so that, by line fitting, the improvement ratio for any cache size can be estimated. This assumes a smooth design space between these points. We next describe our approach for estimating this function for all range values.
Our approach consists of three steps. First we simulate the trace-file for some selected S, L and A values and obtain the corresponding cache misses. Then we calculate a linear equation, using the least square approximation method. Last we use our linear equations to compute N for all cache parameters. We first simulate the following points in our domain space:
`(
`)
`Sf
`(
`)
`Sf
`(
`)
`Sf
`(
`)
`Sf
`(
`)
`Sf
`(
`)
`Sf
`
`,
`,
`,
`,
`,
`,
`
`Min
`
`Max
`
`Min
`
`Min
`
`Max
`
`Max
`
`Min
`
`Max
`
`L
`Min
`L
`L
`L
`Min
`L
`L
`
`N
`1
`N
`N
`N
`N
`N
`
`65432
`
`======
`
`.
`
`,
`A
`Min
`,
`A
`Min
`,
`A
`Min
`,
`A
`Max
`,
`A
`Min
`,
`A
`Max
`
`Max
`
`Min
`
`Then we compute the following ratios:
`=
`=
`
` / , RNN
`
`
` / NN
`R
`1
`4
`2
`1
`3
`1
`=
`=
`
`,
`
` / NN
` / NN
`R
`R
`2
`2
`
`3
`
`5
`
`4
`
`.
`
`6
`
`
² Other metrics, e.g., number of write-backs, can be estimated using our approach in a similar manner.
`
With the advent of deep-submicron technology and the accompanying increase in the bus' contribution to system power consumption, researchers have recently begun to focus on reducing bus power [17] and, more closely related to our work, on the inter-relationship of cache and bus power consumption. Compiler-level approaches, like that by Panda and Dutt [15], seek to generate executable code that minimizes power on a given cache/bus architecture. An architectural approach was presented by Fornaciari et al. [4], who investigated the power consumption of different bus encodings for various cache configurations. Givargis et al. [6] developed a fast bus power model and then coupled [7] this model with Li and Henkel's cache simulations to show that, in deep-submicron technologies, the best cache and best bus configurations are tightly interdependent and thus should be sought simultaneously. Our work is an improvement to this work in that the long cache simulations can be replaced by the fast models in this paper, reducing the time to evaluate all cache/bus configurations for a given application from days/weeks to seconds/minutes.
`4. Approach overview
Because a parameterized system design approach will have numerous interdependent parameters, an approach requiring simulation for each configuration would be computationally infeasible due to the exponential number of configurations. Thus, we instead use a two-step approach to parameter optimization, as shown in Figure 1. Characterizing simulation involves simulating the application with typical input vectors once, or a small number of times, just enough to provide the intermediate data needed to characterize the application for the second step. The second step, parameter exploration, uses heuristics to traverse the design space of possible parameter configurations, coupled with fast estimation equations that use the intermediate data to provide power, performance and size values for a given configuration. These equations evaluate in constant time, so they can deal with huge numbers of possible configurations.
We have chosen to focus initially on developing parameter optimization for a system's cache and bus sub-systems, because these typically consume a significant percentage of system power (we plan to soon extend our approach to also consider DMA). We have already developed an approach for buses [6] involving definition of the intermediate data (bus traffic), estimation equations for power, performance and size as a function of bus parameters (size and encoding) and traffic, and an exhaustive search heuristic. We now describe the intermediate data and estimation equations necessary for cache parameter optimization, followed by a description of a method for coupling the cache methods with that previously developed for buses. We showed earlier [7] that the tight interdependency of cache and bus parameters requires such a coupling in order to find the best cache and bus configurations optimizing power, performance and size.
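The two-step flow above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `characterize` and `estimate` are hypothetical stand-ins for the one-time simulation and the constant-time estimation equations, and `app.c` and the dummy metric are our own placeholders.

```python
# Sketch of the two-step parameter optimization flow:
# (1) one characterizing simulation produces intermediate data;
# (2) constant-time estimation equations are evaluated for every
#     parameter configuration, so a huge space can still be searched.
from itertools import product

def characterize(app):
    # Placeholder for the slow, one-time simulation step.
    # Returns intermediate data (e.g., reference miss counts).
    return {"ref_misses": [100, 60, 80, 90, 50, 55]}

def estimate(intermediate, config):
    # Placeholder for the constant-time power/performance equations;
    # the formula below is a dummy metric for illustration only.
    size, width = config
    return intermediate["ref_misses"][0] / size + width

intermediate = characterize("app.c")            # step 1: run once
space = product([128, 256, 512], [4, 8, 16, 32])  # cache size x bus width
best = min(space, key=lambda c: estimate(intermediate, c))
print(best)
```

The point of the structure is that `characterize` runs once, while `estimate` can be called thousands of times at negligible cost.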
`5. Cache performance and power estimation
`In this section, we discuss the technique that we have
`employed for rapidly estimating cache metrics. We define the
`problem as follows. Given a trace of memory references
`
`
`
`
`control line and extra circuit logic to compute the Hamming
`distance (bit transitions) between two consecutive data items. If
`the Hamming distance is greater than ½ the bus width, then the
control line is asserted and the inverted data is sent over the
bus [12]:
`
P_bus^I = C_bus · ⌈n/k⌉ · ((k+1)/2) · (1 − C(k, k/2) / 2^k) · m   (transitions/sec)

where C(k, k/2) is the binomial coefficient "k choose k/2".
`
`Given the traffic m on a bus, power can be quickly
`estimated using analytical models as described above.
`Likewise, similar analytical models can be applied to compute
cache and memory power (and performance). These have been
`extensively modeled by [3].
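Under the random-data assumption, the expected switching for binary versus bus-invert encoding can be computed as in the following sketch. This is our own illustration of the equations above, not the authors' code; the function name and the example arguments are assumptions.

```python
# Sketch: expected bus transitions per second for binary vs.
# bus-invert encoding, assuming uniformly random data.
from math import ceil, comb

def transitions_per_sec(n, k, m, invert=False):
    """n-bit items, bus width k, m items transmitted per second."""
    transfers = ceil(n / k)          # transfers needed per item
    if not invert:
        per_transfer = k / 2         # each wire toggles with p = 1/2
    else:
        # k data wires plus one invert wire; inversion caps the
        # transition count at k/2, giving the closed form
        # (k+1)/2 * (1 - C(k, k/2) / 2^k) for even k.
        per_transfer = (k + 1) / 2 * (1 - comb(k, k // 2) / 2**k)
    return transfers * per_transfer * m

# Power then follows by scaling with the wire capacitance C_bus.
print(transitions_per_sec(32, 8, 1000))        # binary encoding
print(transitions_per_sec(32, 8, 1000, True))  # bus-invert: fewer transitions
```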
`7. Experiments
In order to verify our approach, we performed the following experiments. We used two applications written in C: a diesel engine controller (Diesel) and an encryption algorithm (Key). We explored power and performance for mapping each application to a system architecture including three parameterized parts: cache, CPU-cache bus, and cache-memory bus. The cache parameters and their possible values were: cache size of 128, 256, 512, 1K, 2K, 4K, 8K or 32K; cache line of 8, 16 or 32; and associativity of 2, 4 or 8. The parameters for each bus were: data width of 4, 8, 16 and 32; and bus-invert encoding either enabled or disabled.
We compared this paper's fast cache/bus estimation technique with the simulation approach of [7]. For the simulation approach, illustrated in Figure 3(a), we ran the C application through a trace stripper to generate a trace of memory references. Then, for each cache configuration, we
Figure 3: Experimental setup: (a) simulation approach — application code runs on an instruction-set simulator to produce a trace; a trace stripper feeds a cache simulation for each cache configuration, whose bus traffic drives a bus simulation for each bus configuration, yielding the metrics; (b) our model-based approach — the same trace-stripped simulation is run for only 6 cache configurations to obtain reference values, after which the cache model and bus model produce the metrics for every cache and bus configuration.
`
Here, R1/R2 denote the improvement we obtain by using maximum line-size/associativity when cache size is at its minimum. Likewise, R3/R4 denote the positive improvement we obtain by using maximum line-size/associativity when the cache size is at its maximum. Given these ratios, we estimate N for a given cache size, line size and associativity as follows:
`(
`)
`=
`S
`(
`)
`L
`j
`(
`A
`k
`(
`Ns
`(
`Rl
`(
`Ra
`,
`L
`
`s
`=
`l
`=
`a
`t
`t
`t
`
`1
`
`2
`
`3
`f
`
`=
`=
`=
`(
`S
`
`i
`
`S
`L
`
`Min
`
`Min
`A
`
`2
`
`3
`
`4
`,
`
`j
`
`A
`
`i
`
`1
`)
`
`2
`
`1
`
`2
`
`Max
`
`S
`L
`/
`+
`
`//
`
`Max
`)
`A
`Min
`Max
`)
`N
`N
`+
`R
`R
`1
`1
`+
`)
`R
`R
`--»
`1(
`)
`t
`t
`
`k
`
`1
`
`2
`
`t
`
`3
`
`).
`
The first three equations, s, l and a, normalize our parameters to be within a unit range. The next equation, t1, estimates cache misses using the lowest line size and associativity, by fitting a line through the points N1 and N2. If more simulation data is available, the least square approximation is used to compute t1. The next two equations, t2 and t3, estimate the expected improvement gained from higher line size or associativity. The last equation combines the previous equations to estimate the cache miss rate.
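The model can be transcribed directly into code. The sketch below is our own illustration: the function name, the exponent ranges and the six reference miss counts are made-up assumptions, standing in for the six cache-simulator runs.

```python
# Sketch: estimate misses N for any (size, line, assoc) from six
# simulated reference points N1..N6 (reference values are made up).
def estimate_misses(i, j, k, exps, refs):
    """i, j, k: log2 of cache size, line size, and associativity."""
    (s_min, s_max), (l_min, l_max), (a_min, a_max) = exps
    N1, N2, N3, N4, N5, N6 = refs
    # Normalize the three parameters to the unit range.
    s = (i - s_min) / (s_max - s_min)
    l = (j - l_min) / (l_max - l_min)
    a = (k - a_min) / (a_max - a_min)
    # Improvement ratios from max line size / associativity,
    # measured at minimum (R1, R2) and maximum (R3, R4) cache size.
    R1, R2 = N3 / N1, N4 / N1
    R3, R4 = N5 / N2, N6 / N2
    t1 = t = N1 + s * (N2 - N1)            # line through N1 and N2
    t2 = l * (1 - (R1 + s * (R3 - R1)))    # line-size improvement
    t3 = a * (1 - (R2 + s * (R4 - R2)))    # associativity improvement
    return t1 * (1 - t2 - t3)

exps = ((7, 13), (3, 5), (1, 3))           # assumed exponent ranges
refs = (1000, 100, 700, 800, 95, 98)       # assumed simulated misses
# Min cache size with max line size recovers roughly N3 = 700.
print(estimate_misses(7, 5, 1, exps, refs))
```

By construction the model reproduces the six simulated points and interpolates between them for every other configuration.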
`6. Combined cache/bus estimation
In this section, we describe how to extend the cache data into bus data for simultaneous cache/bus design space exploration. The technique described in the previous section allows us to rapidly estimate the number of cache misses, N, for a given cache parameter setting. This number, N, is a measure of cache to main-memory bus traffic. Likewise, the total number of cache accesses, i.e., the size of the trace-file, is a measure of CPU to cache bus traffic. Given this traffic, and assuming data of a random nature, we can use the equations of [6] to compute the bit switching activity on the bus and use it, along with wire capacitance models, to compute the power consumption of our system. In this work, we consider varying the number of data bus wires, e.g., 16 or 32 bits, and the data encoding, e.g., binary or bus-invert.
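The cache-to-bus coupling above can be sketched as follows. The function, its names, and the sample numbers are our own illustration; in particular, scaling misses by the line size to obtain cache-memory bytes is an assumption (each miss triggers a line fill), not a formula stated here.

```python
# Sketch: derive per-bus traffic from cache statistics.
# Every access crosses the CPU-cache bus; we assume each miss
# causes a full line fill over the cache-memory bus.
def bus_traffic(num_accesses, num_misses, line_size_bytes):
    cpu_cache_items = num_accesses                  # one item per access
    cache_mem_bytes = num_misses * line_size_bytes  # line fill per miss
    return cpu_cache_items, cache_mem_bytes

acc, miss = 1_000_000, 50_000   # hypothetical trace size and miss count
print(bus_traffic(acc, miss, 16))
```

These traffic figures are what the bus power equations below take as their `m` parameter.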
For our bus model, we assume that there are m n-bit items transmitted per unit time on a bus of width k using binary encoding. (Here m denotes the traffic on the bus and is obtained by estimating cache misses as described above.) The following equation gives the power consumption for the data bus:
`(
`)
`(
`)
`C
`k
`
`kn
`
`=
`
`P
`bus
`
`bus
`
`
`
`temtransfer/i
`
`
`
`erbit/transf
`
`
`
`/bittransition
`
`21
`
`)
`
`(
`
`m
`
`
`
`ditem/secon
`
`
`
`/sectransition
`
`( )(
`mk
`
`)
`
`œœøŒŒØ
`)
`
`kn
`
`(
`C
`
`bus
`
`21
`
`=
`
In this equation, bus capacitance is calculated using models
developed by Chern et al. [1]. Our equation is expanded to take
`into account bus-invert encoding. This method uses an extra
`
`
`
`magnitude faster than simulation-based approaches, but yields
`good accuracy. The technique therefore ena