`Design, Automation and Test
in Europe
`Conference and Exhibition 2000
`
`
`AMAZON 1006
`Page 1 of 12
`
`
`
`PROCEEDINGS
`Design, Automation and Test
`in Europe
`Conference and Exhibition 2000
`
`Paris, France
March 27-30, 2000
`
`Sponsors
`EDAA, EDAC, IEEE Computer Society - TTTC, IEEE Computer Society - DATC,
`ECSI, IFIP 10.5, RAS Russian Academy of Sciences, IPPM
`
`In cooperation with
`
`ACM-SIGDA, AEIA, ATI, CLRC, CNR, Estonian E SOC, GI,
`GMM, HTE, ITG, KVIV, VDE
`
`Editors
`
`Peter Marwedel, University of Dortmund, Germany
`Ivo Bolsens, IMEC, Belgium
`
IEEE Computer Society
Los Alamitos, California • Washington • Brussels • Tokyo
`
`
`
`
Copyright © 2000 by The Institute of Electrical and Electronics Engineers, Inc.
`All rights reserved
`
`Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries
`may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in
`this volume that carry a code at the bottom of the first page, provided that the per-copy fee
`indicated in the code is paid through the Copyright Clearance Center, 222 Rosewood Drive,
`Danvers, MA 01923.
`
`Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights
Manager, IEEE Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
`
`The papers in this book comprise the proceedings of the meeting mentioned on the cover and title
`page. They reflect the authors’ opinions and, in the interests of timely dissemination, are
`published as presented and without change. Their inclusion in this publication does not
`necessarily constitute endorsement by the editors, the IEEE Computer Society, or the Institute of
`Electrical and Electronics Engineers, Inc.
`
IEEE Computer Society Order Number PR00537
`ISBN 0-7695-0537-6
`ISBN 0-7695-0538-4 (case)
`ISBN 0-7695-0539-2 (microfiche)
`Library of Congress Number 99-69380
`
`Additional copies may be ordered from:
`
`IEEE Computer Society
`Customer Service Center
`10662 Los Vaqueros Circle
`P.O. Box 3014
`Los Alamitos, CA 90720-1314
Tel: +1-714-821-8380
Fax: +1-714-821-4641
`E-mail: cs.books@computer.org
`
`IEEE Service Center
`445 Hoes Lane
`P.O. Box 1331
`Piscataway, NJ 08855-1331
Tel: +1-732-981-0060
Fax: +1-732-981-9667
`mis.custserv@computer.org
`
`IEEE Computer Society
Asia/Pacific Office
Watanabe Bldg., 1-4-2
Minami-Aoyama
Minato-ku, Tokyo 107-0062
JAPAN
Tel: +81-3-3408-3118
Fax: +81-3-3408-3553
tokyo.ofc@computer.org
`
`Editorial production by Bob Werner
`Cover art production by Joe Daigle/Studio Productions
`Printed in the United States of America by The Printing House
`
`
`
`
`
`Table of Contents
`Design, Automation and Test in Europe - DATE 2000
DATE Sponsor Committee .................................................................................................................... xix
DATE Executive Committee ................................................................................................................. xix to xx
Technical Program Chairs ..................................................................................................................... xxi to xxii
Vendors Committee ............................................................................................................................... xxii
Technical Program Committee .............................................................................................................. xxiii to xxvi
Reviewers ............................................................................................................................................... xxvii to xxviii
Welcome to DATE 2000 ........................................................................................................................ xxix to xxx
Best Paper Awards ................................................................................................................................. xxxi
Tutorials ................................................................................................................................................. xxxii to xxxiv
Call for Papers for DATE 2001 ............................................................................................................. 771
`
`Plenary - Keynote Session: Connected, Smart Devices - Computing Beyond the Desktop
`
`Moderator: I. Bolsens, IMEC, B
`Speakers: Jerry Fiddler, Chairman and Co-founder of Wind River Systems, USA
`Wim Roelandts, CEO Xilinx, USA
`
`1A: Embedded Software Generation
`Moderators: L. Thiele, TU Zurich, CH
`J.C. Lopez, Castilla-La Mancha U, ES
Code Selection for Media Processors with SIMD Instructions ............................................................... 4
R. Leupers
Analysis of High-Level Address Code Transformations for Programmable Processors ........................ 9
S. Gupta, R. Gupta, M. Miranda, F. Catthoor
Free MDD-based Software Optimization Techniques for Embedded Systems .................................... 14
C. Kim, A. Sangiovanni-Vincentelli, L. Lavagno
`
`1B: Low-Power Issues in System-Level Design
`Moderators: A.M. Trullemans-Anckaert, UC Louvain, B
`C. Guardini, PDF Solutions, USA
`
Quantitative Comparison of Power Management Algorithms ............................................................... 20
Y. Lu, E. Chung, T. Simunic, G. De Micheli, L. Benini
Efficient Power Co-Estimation Techniques for System-On-Chip Design ............................................ 27
M. Lajolo, A. Raghunathan, S. Dey, L. Lavagno
A Discrete-Time Battery Model for High-Level Power Estimation ..................................................... 35
L. Benini, G. Castelli, A. Macii, E. Macii, M. Poncino, R. Scarsi
`
`
`
`
Automatic Lighthouse Generation for Directed State Space Search ................................................... 237
P. Yalagandula, A. Aziz, V. Singhal
Analyzing Real-Time Systems ............................................................................................................ 243
J. Ruf, T. Kropf
`
`4B: Multi-Processor Architectures and Design Methods
`Moderators: L. Nachtergaele, IMEC, B
`M. Bolle, Bosch Telecom, B
`
A Generic Architecture for On-Chip Packet-Switched Interconnections ........................................... 250
P. Guerrier, A. Greiner
Memory Arbitration and Cache Management in Stream-based Systems ............................................ 257
F. Harmsze, A. Timmer, J. van Meerbergen
HW/SW Codesign of an Engine Management System ....................................................................... 263
M. Baleani, A. Ferrari, A. Sangiovanni-Vincentelli, C. Turchetti
`
`4C: Logic Synthesis: Performance Optimization
`Moderators: L. Stok, IBM, USA
`J. Monteiro, INESC, PT
`
Wave Steered FSMs ............................................................................................................................ 270
L. Macchiarulo, S. Shu, M. Marek-Sadowska
Delay Minimization and Technology Mapping of Two-Level Structures and Implementation
using Clock-Delayed Domino Logic .................................................................................................. 277
J. Ciric, G. Yee, C. Sechen
Gate Sizing using a Statistical Delay Model ....................................................................................... 283
E. Jacobs, M. Berkelaar
`
`4D: TPG and Diagnosis in BIST
`Moderators: B. Becker, Freiburg U, D
`K. Kinoshita, Osaka U, JP
Optimal Hardware Pattern Generation for Functional BIST .............................................................. 292
S. Cataldo, S. Chiusano, P. Prinetto, H. Wunderlich
Built-in Generation of Weighted Test Sequences for Synchronous Sequential Circuits .................... 298
I. Pomeranz, S. Reddy
Diagnostic Testing of Embedded Memories using BIST .................................................................... 305
T. Bergfeld, E. Rudnick, D. Niggemeyer
`
`5A: Architectural-Level Synthesis
`Moderators: P. Eles, Linkoping U, SE
`R. Hermida, U Complutense Madrid, ES
`
Resolution of Dynamic Memory Allocation and Pointers for the Behavioral Synthesis from C ........ 312
L. Séméria, K. Sato, G. De Micheli
`
`
`
`
An Integrated Temporal Partitioning and Partial Reconfiguration Technique for
Design Latency Improvement ............................................................................................................. 320
S. Ganesan, R. Vemuri
Target Architecture Oriented High-Level Synthesis for Multi-FPGA based Emulation .................... 326
O. Bringmann, C. Menn, W. Rosenstiel
Fast Cache and Bus Power Estimation for Parameterized System-On-A-Chip Design ...................... 333
T. Givargis, F. Vahid, J. Henkel

5B: Analysis of Communication Circuits
Moderators: A. Konczykowska, France Telecom CNET, F
R. Schwencker, TU Munich, D

Stochastic Modeling and Performance Evaluation for Digital Clock and Data Recovery Circuits .... 340
A. Demir, P. Feldmann
A New Approach for Computation of Timing Jitter in Phase Locked Loops ..................................... 345
M. Gourary, S. Rusakov, S. Ulyanov, M. Zharov, K. Gullapalli, B. Mulvaney
Compact Modeling of Nonlinear Distortion in Analog Communication Circuits .............................. 350
P. Wambacq, P. Dobrovolny, S. Donnay, M. Engels, I. Bolsens

5C: Logic Synthesis: Covering and PTL Circuits
Moderators: M. Berkelaar, TU Eindhoven, NL
L. Stok, IBM, USA

On using Satisfiability-based Pruning Techniques in Covering Algorithms ...................................... 356
V. Manquinho, J. Marques-Silva
An Efficient Heuristic Approach to Solve the Unate Covering Problem ............................................ 364
R. Cordone, F. Ferrandi, D. Sciuto, R. Calvo
`
`On the Generation of Multiplexer Circuits for Pass Transistor Logic .................................................................... 372
`C. Scholl, B. Becker
`
`5D: Delay and Functional Testing
`Moderators: G. Kosonocky, Intel, USA
`S. Pravossoudovitch, LIRMM, F
`
`On Applying Incremental Satisfiability to Delay Fault Testing .............................................................................. 380
`J. Kim, J. Whittemore, K. Sakallah, J. Silva
`
`Automatic Test Bench Generation for Validation of RT-Level Descriptions: An Industrial Experience ............... 385
`F. Corno, M. Sonza Reorda, G. Squillero, A. Manzone, A. Pincetti
`
A VHDL Error Simulator for Functional Test Generation .................................................................. 390
A. Fin, F. Fummi
Functional Test Generation for Full Scan Circuits .............................................................................. 396
I. Pomeranz, S. Reddy
`
`
`
`
`Fast Cache and Bus Power Estimation for Parameterized System-on-a-Chip Design
`
`Tony D. Givargis, Frank Vahid
`Department of Computer Science and Engineering
`University of California, Riverside, CA 92521
`{givargis, vahid}@cs.ucr.edu
`
`Jörg Henkel
`C&C Research Laboratories, NEC USA
`4 Independence Way, Princeton, NJ 08540
`henkel@ccrl.nj.nec.com
`
`Abstract
We present a technique for fast estimation of the power consumed by the cache and bus sub-system of a parameterized system-on-a-chip design for a given application. The technique uses a two-step approach of first collecting intermediate data about an application using simulation, and then using equations to rapidly predict the performance and power consumption for each of thousands of possible configurations of system parameters, such as cache size and associativity and bus size and encoding. The estimations display good absolute as well as relative accuracy for various examples, and are obtained in dramatically less time than other techniques, making possible the future use of powerful search heuristics.
`Keywords
`System-on-a-chip, low power, estimation, intellectual property,
`cache, on-chip bus.
`1. Introduction
Silicon capacity continues to increase faster than the ability for designers to use that silicon, resulting in the well-known productivity gap [18]. Many researchers propose extensive reuse of pre-designed intellectual property cores to reduce this gap [8], where typical cores include microprocessors, microcontrollers, digital signal processors, encoders/decoders, bus interfaces, and numerous other common peripheral components. Two complementary core-based design approaches are emerging. One approach, based on a traditional capture-and-simulate [5] paradigm, assumes that a designer pieces together many cores obtained from various sources [24] (adding some custom logic), simulates extensively, and then generates new silicon implementing the system-on-a-chip. The other approach, which this paper addresses and which we refer to as configure-and-execute, assumes the designer starts with a pre-designed system-on-a-chip¹, and then configures that system (including adding and deleting some cores) before generating new silicon [16][20][21][22]. The configure-and-execute approach has an advantage of enabling software development on real silicon, reducing the need for lengthy hardware/software co-simulations. Several commercial products now support such an approach for various application domains [14][23], such as networks and communications.
`
¹ Such pre-designed silicon has been referred to as a reference design, cfg chip (configurable chip), and silicon platform by various authors.
`
A key to the success of a configure-and-execute approach is that the pre-designed system's architecture be heavily parameterized, so that design metrics like power, performance and size can be optimized for a particular application's design constraints, by selecting particular parameter values before generating new silicon. We focus in this paper on parameters of the system cache and its associated on-chip buses, the CPU-to-cache bus and the cache-to-main-memory bus, as cache and bus have been shown to contribute a significant percentage of system power. The main contribution of this paper is the creation of a fast cache power/performance estimation method and its coupling with a fast bus estimation method, enabling future heuristics that could simultaneously explore the large design space defined by cache and bus parameters. Such simultaneous exploration was recently shown to be crucial to optimizing deep-submicron designs [7], in which bus power consumption begins to surpass that of cache, and in which the cache and bus parameters must therefore be carefully tuned to one another.
`Section 2 highlights the basic idea of parameterized system
`design. Section 3 describes related work in cache and bus
`power estimation and optimization. Section 4 describes our
`overall estimation approach. Section 5 describes the cache
`model used. Section 6 shows how to couple the cache model
`with our previously developed bus model. Section 7 describes
`our experimental results showing the speed and excellent
`accuracy of our approach. Section 8 provides conclusions.
`2. Parameterized system design
`Our long-term goal is to develop an environment supporting
`
Figure 1: Steps in parameterized system design. [Flow: a reference design undergoes application development via a characterizing simulation, guided by optimization criteria, producing intermediate data; parameter exploration then couples search heuristics with estimation equations; parameter optimization follows, leading to new silicon generation.]
`
`
`
`
Figure 2: System-on-a-chip reference design's basic structure. [Microprocessor, cache, DMA and memory on a system bus; a bridge connects to a peripheral bus carrying peripherals and reconfigurable logic.]
`
a parameterized system design approach. Such an approach consists of three main steps, as illustrated in Figure 1.
1. Application development begins with a commercially available "reference design," implemented on a configurable prototyping system-on-a-chip ("cfg chip"). Figure 2 illustrates a typical reference design system-on-a-chip [24] consisting of a microprocessor core, cache, main memory, and direct-memory access (DMA) controller, all connected via a system bus. Also on that bus is a bridge to a set of peripheral cores, which differ depending on the class of intended applications (e.g., networking), and to reconfigurable logic or to add-on chips. The desired application is developed on this cfg chip, which supports in-circuit emulation and hence at-speed application execution, overcoming the problem of prohibitively long simulation time for systems-on-a-chip. Some additional cores could be added (using the reconfigurable logic) and unneeded ones shut off. Numerous system-on-a-chip developers have begun to emphasize the importance of starting with such a reference design rather than composing cores from scratch [16][20][21][22].
2. Parameter optimization occurs once the application has been developed with the aid of the cfg chip. The architecture's parameters are optimized for that application and its accompanying power, performance and size optimization criteria. Critical architectural parameters may include bus parameters like data size, address and data encoding techniques, multiplexing, etc.; cache parameters like cache size, associativity, write-back techniques, block size/line size, etc.; DMA parameters; and parameters relating to specific peripheral cores, like buffer sizes, resolutions, compression levels, etc. While the first step above has already had manifestations in commercial products (e.g., [23]), this second step still requires extensive research, and is beginning to receive some attention (such as work in [2], which optimizes a parameterized virtual memory system). Because the parameters are highly interdependent (for example, cache size impacts bus traffic), as well as strongly dependent on each application's features, this is a hard problem.
3. New silicon generation then results in a new chip implementing the optimized architecture, including any added cores and excluding any shut-off ones. For example, while the reference design may have had a 32K cache and 32-bit unencoded bus, the optimized architecture for a particular application may have a 4K cache and an 8-bit bus using bus-invert encoding to reduce power. Ideally, the silicon is correct on the first pass due to the extensive in-circuit emulation already performed.
Note the difference between a parameterized system design approach and the traditional system-level synthesis approach. In the traditional approach, an application is first described behaviorally, and then an architecture is synthesized for that application (processors, memories and buses are instantiated), and the behavior is then mapped to the architecture. In the parameterized system design approach, the basic architecture is pre-defined, and the application is developed on that architecture. The "system synthesis" going on is really a fine-tuning of the original architecture through selection of values for the architecture's parameters.
`
`3. Related work
Numerous techniques for high-level power estimation and optimization have evolved recently; an overview can be found in Raghunathan et al. [17].
Much attention has been given to developing detailed models of cache internals to accurately predict a cache's latency [25] as well as power consumption [3] for a given parameter configuration. Such detailed models would be used to estimate power in cache simulators, which we use as described later. Attention has also been given to exploring various cache configurations in terms of power, performance and size. Su and Despain [19] evaluate several cache design techniques with respect to power and performance. Henkel [11] used exhaustive trace-driven cache simulations to show that the best cache configuration, in terms of power, performance and size, differed greatly for different applications.
Noticing that large tradeoffs are possible by configuring a cache for an application, but that cache simulations are slow, many researchers have focused on speeding up cache simulations. Kirovski et al. [9] reduce the number of trace-driven cache simulations necessary for exploring different cache configurations by establishing bounds and hence pruning numerous inferior configurations without having to simulate them. Wu and Wolf [26] order the search of different cache configurations such that, after each cache simulation, they can reduce the size of a given input trace by removing redundant information ("trace stripping"), thus speeding up subsequent simulations of other configurations. Our work differs in that we couple cache parameters with bus parameters (and possibly other parameters in the future), resulting in an enormous design space and thus seemingly excluding any approach based on repeated simulation. While one-pass cache simulation [12] is a common technique, in which numerous cache configurations are evaluated simultaneously during one simulation, incorporating the myriad of other parameters that we wish to consider (bus, DMA, peripheral cores, etc.) into such an approach would likely become prohibitively complex.
`
`
`
`
(referred to hereafter as a trace-file), we are to compute the number of cache misses², denoted N, for all different caches. Two caches are different if they differ in their total cache size, line size (block size) or degree of associativity. We limit each of these three distinguishing parameters to a finite range:
`
`{
`i
`:2
`{
`i
`:2
`{
`i
`:2
`
`=
`=
`=
`
`S
`L
`
`A
`
`Min
`
`=
`Si
`=
`Li
`Min
`=
`Ai
`
`Min
`
`
`
`}
`}
`}.
`
`S
`Max
`L
`
`Max
`A
`Max
`
`Note that, for practical purposes, we only consider values that
`are powers of two for each of these parameters. Given a trace-
`file, we must define a function:
`(
`)
`fi··
`:
`ALS
`
`N
`
`.
`
`f
`
to compute the number of cache misses for any cache configuration. We assume that, with the aid of a cache simulator, we are able to compute the above function, for any value from the sets S, L and A, in linear time with respect to the size of the trace-file. Intuitively, our approach works as follows. We know that at low cache sizes, higher line size and associativity have a greater positive effect than they do at high cache sizes. For example, doubling the line size when cache size is 512B may reduce cache miss rate by 30%; however, when the cache size is 8K, it may not reduce the miss rate at all. Thus, we are interested in finding these improvement ratios at both low and high cache sizes, so that, by line fitting, the improvement ratio for any cache size can be estimated. This assumes a smooth design space between these points. We next describe our approach for estimating this function for all range values.
Our approach consists of three steps. First we simulate the trace-file for some selected S, L and A values and obtain the corresponding cache misses. Then we calculate a linear equation, using the least square approximation method. Last we use our linear equations to compute N for all cache parameters. We first simulate the following points in our domain space:
`(
`)
`Sf
`(
`)
`Sf
`(
`)
`Sf
`(
`)
`Sf
`(
`)
`Sf
`(
`)
`Sf
`
`,
`,
`,
`,
`,
`,
`
`Min
`
`Max
`
`Min
`
`Min
`
`Max
`
`Max
`
`Min
`
`Max
`
`L
`Min
`L
`L
`L
`Min
`L
`L
`
`N
`1
`N
`N
`N
`N
`N
`
`65432
`
`======
`
`.
`
`,
`A
`Min
`,
`A
`Min
`,
`A
`Min
`,
`A
`Max
`,
`A
`Min
`,
`A
`Max
`
`Max
`
`Min
`
`Then we compute the following ratios:
`=
`=
`
` / , RNN
`
`
` / NN
`R
`1
`4
`2
`1
`3
`1
`=
`=
`
`,
`
` / NN
` / NN
`R
`R
`2
`2
`
`3
`
`5
`
`4
`
`.
`
`6
`
`
² Other metrics, e.g., number of write-backs, can be estimated using our approach in a similar manner.
`
With the advent of deep-submicron technology and the accompanying increase in the bus' contribution to system power consumption, researchers have recently begun to focus on reducing bus power [17] and, more closely related to our work, on the inter-relationship of cache and bus power consumption. Compiler-level approaches, like that by Panda and Dutt [15], seek to generate executable code that minimizes power on a given cache/bus architecture. An architectural approach was presented by Fornaciari et al. [4], who investigated the power consumption of different bus encodings for various cache configurations. Givargis et al. [6] developed a fast bus power model and then coupled [7] this model with Li and Henkel's cache simulations to show that, in deep-submicron technologies, the best cache and best bus configurations are tightly interdependent and thus should be sought simultaneously. Our work is an improvement to this work in that the long cache simulations can be replaced by the fast models in this paper, reducing the time to evaluate all cache/bus configurations for a given application from days/weeks to seconds/minutes.
`4. Approach overview
Because a parameterized system design approach will have numerous interdependent parameters, an approach requiring simulation for each configuration would be computationally infeasible due to the exponential number of configurations. Thus, we instead use a two-step approach to parameter optimization, as shown in Figure 1. Characterizing simulation involves simulating the application with typical input vectors once, or a small number of times, just enough to provide the intermediate data needed to characterize the application for the second step. The second step, parameter exploration, uses heuristics to traverse the design space of possible parameter configurations, coupled with fast estimation equations that use the intermediate data to provide power, performance and size values for a given configuration. These equations evaluate in constant time, so they can deal with huge numbers of possible configurations.
We have chosen to focus initially on developing parameter optimization for a system's cache and bus sub-systems, because these typically consume a significant percentage of system power (we plan to soon extend our approach to also consider DMA). We have already developed an approach for buses [6] involving definition of the intermediate data (bus traffic), estimation equations for power, performance and size as a function of bus parameters (size and encoding) and traffic, and an exhaustive search heuristic. We now describe the intermediate data and estimation equations necessary for cache parameter optimization, followed by a description of a method for coupling the cache methods with that previously developed for buses. We showed earlier [7] that the tight interdependency of cache and bus parameters requires such a coupling in order to find the best cache and bus configurations optimizing power, performance and size.
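The two-step flow above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `characterize` and `estimate` are hypothetical stand-ins for the one-time simulation and the constant-time estimation equations, and `app.c` and the dummy metric are our own placeholders.

```python
# Sketch of the two-step parameter optimization flow:
# (1) one characterizing simulation produces intermediate data;
# (2) constant-time estimation equations are evaluated for every
#     parameter configuration, so a huge space can still be searched.
from itertools import product

def characterize(app):
    # Placeholder for the slow, one-time simulation step.
    # Returns intermediate data (e.g., reference miss counts).
    return {"ref_misses": [100, 60, 80, 90, 50, 55]}

def estimate(intermediate, config):
    # Placeholder for the constant-time power/performance equations;
    # the formula below is a dummy metric for illustration only.
    size, width = config
    return intermediate["ref_misses"][0] / size + width

intermediate = characterize("app.c")            # step 1: run once
space = product([128, 256, 512], [4, 8, 16, 32])  # cache size x bus width
best = min(space, key=lambda c: estimate(intermediate, c))
print(best)
```

The point of the structure is that `characterize` runs once, while `estimate` can be called thousands of times at negligible cost.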
`5. Cache performance and power estimation
`In this section, we discuss the technique that we have
`employed for rapidly estimating cache metrics. We define the
`problem as follows. Given a trace of memory references
`
`
`
`
`control line and extra circuit logic to compute the Hamming
`distance (bit transitions) between two consecutive data items. If
`the Hamming distance is greater than ½ the bus width, then the
control line is asserted and the inverted data is sent over the
bus [12]:
`
P_bus^I = C_bus · ⌈n/k⌉ · ((k+1)/2) · (1 − C(k, k/2) / 2^k) · m   (transitions/sec)

where C(k, k/2) is the binomial coefficient "k choose k/2".
`
`Given the traffic m on a bus, power can be quickly
`estimated using analytical models as described above.
`Likewise, similar analytical models can be applied to compute
cache and memory power (and performance). These have been
`extensively modeled by [3].
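Under the random-data assumption, the expected switching for binary versus bus-invert encoding can be computed as in the following sketch. This is our own illustration of the equations above, not the authors' code; the function name and the example arguments are assumptions.

```python
# Sketch: expected bus transitions per second for binary vs.
# bus-invert encoding, assuming uniformly random data.
from math import ceil, comb

def transitions_per_sec(n, k, m, invert=False):
    """n-bit items, bus width k, m items transmitted per second."""
    transfers = ceil(n / k)          # transfers needed per item
    if not invert:
        per_transfer = k / 2         # each wire toggles with p = 1/2
    else:
        # k data wires plus one invert wire; inversion caps the
        # transition count at k/2, giving the closed form
        # (k+1)/2 * (1 - C(k, k/2) / 2^k) for even k.
        per_transfer = (k + 1) / 2 * (1 - comb(k, k // 2) / 2**k)
    return transfers * per_transfer * m

# Power then follows by scaling with the wire capacitance C_bus.
print(transitions_per_sec(32, 8, 1000))        # binary encoding
print(transitions_per_sec(32, 8, 1000, True))  # bus-invert: fewer transitions
```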
`7. Experiments
In order to verify our approach, we performed the following experiments. We used two applications written in C: a diesel engine controller (Diesel) and an encryption algorithm (Key). We explored power and performance for mapping each application to a system architecture including three parameterized parts: cache, CPU-cache bus, and cache-memory bus. The cache parameters and their possible values were: cache size of 128, 256, 512, 1K, 2K, 4K, 8K or 32K; cache line of 8, 16 or 32; and associativity of 2, 4 or 8. The parameters for each bus were: data width of 4, 8, 16 and 32; and bus-invert encoding either enabled or disabled.
We compared this paper's fast cache/bus estimation technique with the simulation approach of [7]. For the simulation approach, illustrated in Figure 3(a), we ran the C application through a trace stripper to generate a trace of memory references. Then, for each cache configuration, we
Figure 3: Experimental setup: (a) simulation approach — application code runs on an instruction-set simulator to produce a trace; a trace stripper feeds a cache simulation for each cache configuration, whose bus traffic drives a bus simulation for each bus configuration, yielding the metrics; (b) our model-based approach — the same trace-stripped simulation is run for only 6 cache configurations to obtain reference values, after which the cache model and bus model produce the metrics for every cache and bus configuration.
`
Here, R1/R2 denote the improvement we obtain by using maximum line-size/associativity when cache size is at its minimum. Likewise, R3/R4 denote the positive improvement we obtain by using maximum line-size/associativity when the cache size is at its maximum. Given these ratios, we estimate N for a given cache size, line size and associativity as follows:
`(
`)
`=
`S
`(
`)
`L
`j
`(
`A
`k
`(
`Ns
`(
`Rl
`(
`Ra
`,
`L
`
`s
`=
`l
`=
`a
`t
`t
`t
`
`1
`
`2
`
`3
`f
`
`=
`=
`=
`(
`S
`
`i
`
`S
`L
`
`Min
`
`Min
`A
`
`2
`
`3
`
`4
`,
`
`j
`
`A
`
`i
`
`1
`)
`
`2
`
`1
`
`2
`
`Max
`
`S
`L
`/
`+
`
`//
`
`Max
`)
`A
`Min
`Max
`)
`N
`N
`+
`R
`R
`1
`1
`+
`)
`R
`R
`--»
`1(
`)
`t
`t
`
`k
`
`1
`
`2
`
`t
`
`3
`
`).
`
The first three equations, s, l and a, normalize our parameters to be within a unit range. The next equation, t1, estimates cache misses using the lowest line size and associativity, by fitting a line through the points N1 and N2. If more simulation data is available, the least square approximation is used to compute t1. The next two equations, t2 and t3, estimate the expected improvement gained from higher line size or associativity. The last equation combines the previous equations to estimate the cache miss rate.
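The model can be transcribed directly into code. The sketch below is our own illustration: the function name, the exponent ranges and the six reference miss counts are made-up assumptions, standing in for the six cache-simulator runs.

```python
# Sketch: estimate misses N for any (size, line, assoc) from six
# simulated reference points N1..N6 (reference values are made up).
def estimate_misses(i, j, k, exps, refs):
    """i, j, k: log2 of cache size, line size, and associativity."""
    (s_min, s_max), (l_min, l_max), (a_min, a_max) = exps
    N1, N2, N3, N4, N5, N6 = refs
    # Normalize the three parameters to the unit range.
    s = (i - s_min) / (s_max - s_min)
    l = (j - l_min) / (l_max - l_min)
    a = (k - a_min) / (a_max - a_min)
    # Improvement ratios from max line size / associativity,
    # measured at minimum (R1, R2) and maximum (R3, R4) cache size.
    R1, R2 = N3 / N1, N4 / N1
    R3, R4 = N5 / N2, N6 / N2
    t1 = t = N1 + s * (N2 - N1)            # line through N1 and N2
    t2 = l * (1 - (R1 + s * (R3 - R1)))    # line-size improvement
    t3 = a * (1 - (R2 + s * (R4 - R2)))    # associativity improvement
    return t1 * (1 - t2 - t3)

exps = ((7, 13), (3, 5), (1, 3))           # assumed exponent ranges
refs = (1000, 100, 700, 800, 95, 98)       # assumed simulated misses
# Min cache size with max line size recovers roughly N3 = 700.
print(estimate_misses(7, 5, 1, exps, refs))
```

By construction the model reproduces the six simulated points and interpolates between them for every other configuration.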
`6. Combined cache/bus estimation
In this section, we describe how to extend the cache data into bus data for simultaneous cache/bus design space exploration. The technique described in the previous section allows us to rapidly estimate the number of cache misses, N, for a given cache parameter setting. This number, N, is a measure of cache to main-memory bus traffic. Likewise, the total number of cache accesses, i.e., the size of the trace-file, is a measure of CPU to cache bus traffic. Given this traffic, and assuming data of a random nature, we can use the equations of [6] to compute the bit switching activity on the bus and use it, along with wire capacitance models, to compute the power consumption of our system. In this work, we consider varying the number of data bus wires, e.g., 16 or 32 bits, and the data encoding, e.g., binary or bus-invert.
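The cache-to-bus coupling above can be sketched as follows. The function, its names, and the sample numbers are our own illustration; in particular, scaling misses by the line size to obtain cache-memory bytes is an assumption (each miss triggers a line fill), not a formula stated here.

```python
# Sketch: derive per-bus traffic from cache statistics.
# Every access crosses the CPU-cache bus; we assume each miss
# causes a full line fill over the cache-memory bus.
def bus_traffic(num_accesses, num_misses, line_size_bytes):
    cpu_cache_items = num_accesses                  # one item per access
    cache_mem_bytes = num_misses * line_size_bytes  # line fill per miss
    return cpu_cache_items, cache_mem_bytes

acc, miss = 1_000_000, 50_000   # hypothetical trace size and miss count
print(bus_traffic(acc, miss, 16))
```

These traffic figures are what the bus power equations below take as their `m` parameter.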
For our bus model, we assume that there are m n-bit items transmitted per unit time on a bus of width k using binary encoding. (Here m denotes the traffic on the bus and is obtained by estimating cache misses as described above.) The following equation gives the power consumption for the data bus:
`(
`)
`(
`)
`C
`k
`
`kn
`
`=
`
`P
`bus
`
`bus
`
`
`
`temtransfer/i
`
`
`
`erbit/transf
`
`
`
`/bittransition
`
`21
`
`)
`
`(
`
`m
`
`
`
`ditem/secon
`
`
`
`/sectransition
`
`( )(
`mk
`
`)
`
`œœøŒŒØ
`)
`
`kn
`
`(
`C
`
`bus
`
`21
`
`=
`
In this equation, bus capacitance is calculated using models
developed by Chern et al. [1]. Our equation is expanded to take
`into account bus-invert encoding. This method uses an extra
`
`
`
`magnitude faster than simulation-based approaches, but yields
`good accuracy. The technique therefore ena