`
`ICCD 2003
`
`
`
21st International Conference on
`
`Computer Design
`
`HPE, Exh. 1020, p. 1
`
`
Copyright © 2003 by The Institute of Electrical and
Electronics Engineers, Inc.
`All rights reserved
`
`Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may
`photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume
that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid
`through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
`
`Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE
Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.
`
The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page.
They reflect the authors’ opinions and, in the interests of timely dissemination, are published as presented
and without change. Their inclusion in this publication does not necessarily constitute endorsement by the
editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.
`
`IEEE Computer Society Order Number PRO2025
`ISBN 0-7695-2025-1
`ISSN Number 1063-6404
`
Additional copies may be ordered from:
`
`IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
`P.O. Box 3014
`Los Alamitos, CA 90720-1314
`Tel: + 1-714-821-8380
`Fax: + 1-714-821-4641
`E-mail: cs.books@computer.org
`
`IEEE Service Center
`445 Hoes Lane
`P.O. Box 1331
`Piscataway, NJ 08855-1331
`Tel: + 1-732-981-0060
`Fax: + 1-732-981-9667
`http://shop.ieee.org/store/
`customer-service@ieee.org
`
IEEE Computer Society
`Asia/Pacific Office
`Watanabe Bldg., 1-4-2
`Minami-Aoyama
`Minato-ku, Tokyo 107-0062
`JAPAN
`Tel: + 81-3-3408-3118
`Fax: + 81-3-3408-3553
tokyo.ofc@computer.org

Individual paper REPRINTS may be ordered at: reprints@computer.org
`
`Editorial production by Bob Werner
Cover art production by Joe Daigle/Studio Productions
`Printed in the United States of America by The Printing House
`
`
`
`
`
`
`
`
`Table of Contents
`
`International Conference on Computer Design — ICCD 2003
`
`
`Welcome
`
`Organizing Committee
`Program Committee
`Additional Reviewers
`
`Keynotes
`
`High-Speed Link Design, Then and Now
`M. Horowitz
`
`Terascale Computing and BlueGene
`W. Pulleyblank
`
Advanced EDA Tools for High-Performance Design
`T. Vucurevich
`
`
`
`Session 1.1 Energy Efficiency
`
`Energy Efficient Asymmetrically Ported Register Files
`A. Aggarwal and M. Franklin
`
Power Efficient Data Cache Designs
J. Abella and A. González

On Reducing Register Pressure and Energy in Multiple-Banked Register Files
J. Abella and A. González
`
`Low Power Multiplication Algorithm for Switching Activity Reduction through
`Operand Decomposition
M. Ito, D. Chinnery, and K. Keutzer
`
`Session 1.2 Timing Verification
`
`Verification of Timed Circuits with Failure Directed Abstractions
`H. Zheng, C. Myers, D. Walter, S. Little, and T. Yoneda
`
Procedures for Identifying Untestable and Redundant Transition Faults in
Synchronous Sequential Circuits
G. Chen, S. Reddy, and I. Pomeranz

Event-Centric Simulation of Crosstalk Pulse Faults in Sequential Circuits
M. Phadoongsidhi and K. Saluja

Specifying and Verifying Systems with Multiple Clocks
E. Clarke, D. Kroening, and K. Yorav
`
`
`
Session 1.3 Electrical Analysis for System LSI

Enhanced QMM-BEM Solver for 3-D Finite-Domain Capacitance Extraction with Multilayered Dielectrics
W. Yu, Z. Wang, and X. Hong

An Improved Method for Fast Noise Estimation Based on Net Segmentation
C. Huang and A. Dasgupta

Symbolic Failure Analysis of Custom Circuits due to Excessive Leakage Current
H. Song, S. Bohidar, I. Bahar, and J. Grodstein

An Efficient Algorithm for Calculating the Worst-case Delay due to Crosstalk
V. Rajappan and S. Sapatnekar
`
Session 2.1 Power Optimization

A Compact Model for Analysis and Design of On-chip Power Network with Decoupling Capacitors
P. Zarkesh-Ha, K. Doniger, W. Loh, D. Sun, R. Stephani, and G. Priebe

Precomputation-Based Guarding for Dynamic and Leakage Power Reduction
A. Abdollahi, M. Pedram, F. Fallah, and I. Ghosh

Charge-Recycling Voltage Domains for Energy-Efficient Low-Voltage Operation of
Digital CMOS Circuits
S. Rajapandian, Z. Xu, and K. Shepard

Low Power Adder with Adaptive Supply Voltage
A. Suzuki, W. Jeong, and K. Roy

A Transparent Voltage Conversion Method and Its Application to a Dual-Supply-Voltage Register File
N. Tzartzanis and W. Walker

Session 2.2 Invited Session: Gene Chip Design

Detection of Biological Molecules: From Self-Assembled Films to Self-Integrated Devices
R. Levicky
Embedded Tutorial

Design Flow Enhancements for DNA Arrays
A. Kahng, I. Mandoiu, S. Reda, X. Xu, and A. Zelikovsky
Session 2.3 System Level Design

Bus Architecture Synthesis for Hardware-Software Co-Design of Deep Submicron Systems on Chip
N. Thepayasuwan, V. Damle, and A. Doboli

Dynamically Optimized Synchronous Communication for Low Power System on Chip Designs
V. Chandra, G. Carpenter, and J. Burns

Interface Synthesis Using Memory Mapping for an FPGA Platform
M. Luthra, S. Gupta, N. Dutt, R. Gupta, and A. Nicolau

Efficient Synthesis of Networks On Chip
A. Pinto, L. Carloni, and A. Sangiovanni-Vincentelli

Reducing Compilation Time Overhead in Compiled Simulators
M. Reshadi and N. Dutt
`
`Session 3.1 Systems Performance
`
`Profiling Interrupt Handler Performance through Kernel Instrumentation
`B. Moore, T. Slabach, and L. Schaelicke
`
`Design and Performance of Compressed Interconnects for High Performance Servers
K. Kant and R. Iyer
`
`Routed Inter-ALU Networks for ILP Scalability and Performance
`K. Sankaralingam, V. Singh, S. Keckler, and D. Burger
`
`Session 3.2 Micro Processor Test & Diagnosis
`
Automatic Generation of Critical-Path Tests for a Partial-Scan Microprocessor
`J. Grodstein, D. Bhavsar, V. Bettada, and R. Davies
`
`Test Generation for Non-separable RTL Controller-datapath Circuits Using a
`Satisfiability Based Approach
`L. Lingappan, S. Ravi, and N. Jha
`
`Cost-Effective Graceful Degradation in Speculative Processor Subsystems:
`The Branch Prediction Case
`S. Almukhaizim, T. Verdel, and Y. Makris
`
`Multiple Fault Diagnosis Using n-Detection Tests
`Z. Wang, M. Marek-Sadowska, K. Tsai, and J. Rajski
`
`Session 3.3 Physical Design
`
`A Physical Design Methodology for 1.3GHz SPARC64 Microprocessor
N. Ito, H. Komatsu, Y. Tanamura, R. Yamashita,
`H. Sugiyama, Y. Sugiyama, and H. Hamamura
`
Physical Design of the “2.5D” Stacked System
`Y. Deng and W. Maly
`
`Flow-Based Cell Moving Algorithm for Desired Cell Distribution
`B. Choi, H. Xu, M. Wang, and M. Sarrafzadeh
`
`Session 4.1 Performance Optimization
`
`NpBench: A Benchmark Suite for Control plane and Data plane Applications for Network Processors
`B. Lee and L. John
`
`Hardware-Only Compression of Underutilized Address Buses:
`Design and Performance, Power, and Cost Analysis
`N. Mahapatra, J. Liu, and K. Sundaresan
`
`
`Pipelined Multiplicative Division with IEEE Rounding
`G. Even and P. Seidel
`
`
`
`
`
`
`
`
`
`
Session 4.2 Clock & Signal Distribution

Design of Resonant Global Clock Distributions
S. Chan, K. Shepard, and P. Restle

Modeling and Mitigation of Jitter in Multi-Gbps Source-Synchronous I/O Links
G. Balamurugan and N. Shanbhag

A Mixed-Mode Delay-Locked Loop Architecture
D. Eckerbert, L. Svensson, and P. Larsson-Edefors

Optimal Inductance for On-chip RLC Interconnections
S. Das, K. Agarwal, D. Blaauw, and D. Sylvester

Session 4.3 Performance and Power-Driven Physical Design

Spec Based Flip-Flop and Buffer Insertion
N. Akkiraju and M. Mohan

A Microeconomic Model for Simultaneous Gate Sizing and Voltage Scaling for Power Optimization
N. Ranganathan and A. Murugavel

A Simple Yet Effective Merging Scheme for Prescribed-Skew Clock Routing
R. Chaturvedi and J. Hu
`
Session 5.1 Instruction Execution

Hardware-Based Pointer Data Prefetcher
S. Lai and S. Lu

A Dependence Driven Efficient Dispatch Scheme
S. Nadathur and A. Tyagi

An Efficient VLIW DSP Architecture for Baseband Processing
T. Lin, C. Chang, C. Lee, and C. Jen

Dynamic Thread Resizing for Speculative Multithreaded Processors
M. Zahran and M. Franklin

Session 5.2 Invited Session: Test Compression Technology

Care Bit Density and Test Cube Clusters: Multi-Level Compression Opportunities
B. Koenemann

XMAX: X-Tolerant Architecture for MAXimal Test Compression
S. Mitra and K. Kim

Test Data Compression and Compaction for Embedded Test of Nanometer Technology Designs
J. Rajski and J. Tyszer
`
`
`
`
`
`Session 5.3 Physical Design for Regular Fabrics and FPGA’s
`
Non-Crossing OBDDs for Mapping to Regular Circuit Structures
A. Cao and C. Koh

Interconnect Estimation for FPGAs under Timing Driven Domains
P. Kannan and D. Bhatia

ROAD: An Order-Impervious Optimal Detailed Router for FPGAs
H. Arslan and S. Dutt
`
`Session 6.1 Array Design Optimization
`
`
Reducing dTLB Energy through Dynamic Resizing
V. Delaluz, M. Kandemir, A. Sivasubramaniam, M. Irwin, and N. Vijaykrishnan

Distributed Reorder Buffer Schemes for Low Power
G. Kucuk, O. Ergin, D. Ponomarev, and K. Ghose

Virtual Page Tag Reduction for Low-Power TLBs
P. Petrov and A. Orailoglu

Dynamic Cluster Resizing
J. González and A. González
`
`Session 6.2 Test Compaction
`
`Independent Test Sequence Compaction through Integer Programming
`P. Drineas and Y. Makris
`
`
`
On Combining Pinpoint Test Set Relaxation and Run-Length Codes for Reducing Test Data Volume
S. Kajihara, Y. Doi, L. Li, and K. Chakrabarty

Static Test Compaction for Multiple Full-Scan Circuits
I. Pomeranz and S. Reddy

A Method to Find Don’t Care Values in Test Sequences for Sequential Circuits
Y. Higami, S. Kobayashi, Y. Takamatsu, S. Kajihara, and I. Pomeranz
`
`Session 6.3 Invited Session: Techniques for Synthesizing into Fabrics
`
Simplifying SoC Design with the Customizable Control Processor Platform
C. Ogilvie, R. Ray, R. Devins, M. Kautzman,
M. Hale, R. Bergamaschi, B. Lynch, and S. Gaur

Structured ASICs: Opportunities and Challenges
B. Zahiri

System LSI Implementation Fabrics for the Future
S. Kaptanoglu
`
`
`
`
`
`
`
`
`Session 7.1 Hardware Partitioning
`
Multiple-Vdd Scheduling/Allocation for Partitioned Floorplan
D. Kang, M. Johnson, and K. Roy

SCATOMi: Scheduling Driven Circuit Partitioning Algorithm for Multiple FPGAs
Using Time-multiplexed, Off-chip, Multicasting Interconnection Architecture
Y. Kwon, B. Park, and C. Kyung

A Study of Hardware Techniques that Dynamically Exploit Frequent Operands to
Reduce Power Consumption in Integer Function Units
K. Gandhi and N. Mahapatra
`
`Session 7.2 Energy-Aware Design and Application
`
`KnapBind: An Area-Efficient Binding Algorithm for Low-leakage Datapaths
C. Gopalakrishnan and S. Katkoori

A Novel Synthesis Strategy Driven by Partial Evaluation Based Circuit Reduction for
Application Specific DSP Circuits
M. Mukherjee and R. Vemuri

Power Fluctuation Minimization During Behavioral Synthesis Using ILP-Based Datapath Scheduling
S. Mohanty, N. Ranganathan, and S. Chappidi

An Energy-Aware Simulation Model and Transaction Protocol for
Dynamic Workload Distribution in Mobile Ad Hoc Networks
`F. Ghasemi-Tari, P. Rong, and M. Pedram
`
Session 7.3 Invited Session: High-Speed Design Issues and Test Challenges

CMOS High-Speed Serial I/Os — Present and Future
M. Lee, W. Dally, R. Farjad-Rad, H. Ng,
R. Senthinathan, J. Edmondson, and J. Poulton

Fully Differential Receiver Chipset for 40 Gb/s Applications Using GaInAs/InP
Single Heterojunction Bipolar Transistors
K. Kiziloglu, S. Seetharaman, K. Glass, C. Bil, H. Duong, and G. Asmanis
`
`Paradigm Shift for Jitter and Noise in Design and Test >GB/s Data Communication Systems
`M. Li and J. Wilstrup
`
`Session 8.1 Efficiency and Reliability
`
`Cost-Efficient Memory Architecture Design of NAND Flash Memory Embedded Systems
`C. Park, J. Seo, D. Seo, S. Kim, and B. Kim
`
`Exploiting Microarchitectural Redundancy for Defect Tolerance
`P. Shivakumar, S. Keckler, C. Moore, and D. Burger
`
`Reducing Multimedia Decode Power Using Feedback Control
`Z. Lu, J. Lach, M. Stan, and K. Skadron
`
`
`
`
`
`
Session 8.2 Novel Methods in Logic Synthesis

Structural Detection of Symmetries in Boolean Functions
G. Wang, A. Kuehlmann, and A. Sangiovanni-Vincentelli
`
`Boolean Decomposition Based on Cyclic Chains
`E. Dubrova, M. Teslenko, and J. Karlsson
`
`SAT-Based Algorithms for Logic Minimization
`S. Sapra, M. Theobald, and E. Clarke
`
`Session 9.1 Communications and Context Management
`
`Low-Density Parity-Check Decoder Architecture for High Throughput Optical Fiber Channels
`A. Selvarathinam, E. Kim, and G. Choi
`
Improving Branch Prediction Accuracy in Embedded Processors in the Presence of Context Switches
`S. Pasricha and A. Veidenbaum
`
`Reducing Operand Transport Complexity of Superscalar Processors Using Distributed Register Files
`S. Bunchua, D. Wills, and L. Wills
`
`xpipes: a Latency Insensitive Parameterized Network-on-chip Architecture for Multi-Processor SoCs
M. Dall’Osso, G. Biccari, L. Giovannini, D. Bertozzi, and L. Benini
`
`Session 9.2 Board Test and Power-Aware Test
`
`Aggressive Test Power Reduction through Test Stimuli Transformation
O. Sinanoglu and A. Orailoglu

Power-Time Tradeoff in Test Scheduling for SoCs
M. Nourani and J. Chin

Multiple Transition Model and Enhanced Boundary Scan Architecture to
Test Interconnects for Signal Integrity
M. Tehranipour, N. Ahmed, and M. Nourani

Author Index
`
`
`
`Cost-Efficient Memory Architecture Design of NAND Flash Memory
`Embedded Systems
`
Chanik Park, Jaeyu Seo, Dongyoung Seo, Shinhan Kim and Bumsoo Kim
Software Center, SAMSUNG Electronics, Co., Ltd.
{ci.park, pelican, dy76.seo, shinhank, bumsoo}@samsung.com
`
`Abstract
`
NAND flash memory has become an indispensable component in
embedded systems because of its versatile features such as
non-volatility, solid-state reliability, low cost, and high
density. Although NAND flash memory has gained popularity as
data storage, it can also be exploited as code memory for XIP
(execute-in-place). In this paper, we present a cost-efficient
memory architecture which incorporates NAND flash memory into an
existing memory hierarchy for code execution. The usefulness of
the proposed approach is demonstrated with real embedded
workloads on a real hardware prototyping board.
`
`1. Introduction
`
Memory architecture design is a main concern for embedded system
engineers since it dominates the cost, power, and performance of
embedded systems. The typical memory architecture of embedded
systems consists of ROM for initial bootstrapping and code
execution, RAM for working memory, and flash memory for
permanent data storage. In particular, flash memory, an emerging
memory technology, is becoming an indispensable component in
embedded systems due to its versatile features: non-volatility,
solid-state reliability, low power consumption, and so on. The
most popular flash types are NOR and NAND. NOR flash is
particularly well suited for code storage and execute-in-place
(XIP)¹ applications, which require high-speed random access.
While NAND flash provides high density and low-cost data
storage, it does not lend itself to XIP applications due to its
sequential access architecture and long access latency.
Table 1 shows the characteristics of various memory devices.
Mobile SDRAM has strong points in performance but requires
higher power consumption than the other memories. Fast SRAM or
low power SRAM can be selected according to the trade-off
between power consumption and performance, at a high cost. Among
non-volatile memories, NOR flash provides fast random access
speed and low power consumption, but has a high cost compared
with NAND flash. Even though NAND flash shows long random read
latency, it has advantages in low power consumption, storage
capacity, and fast erase/write performance in contrast to NOR
flash.

¹ XIP is the execution of an application directly from the
Flash instead of having to download the code into the
system's RAM before executing it.
`
Table 1. Characteristics of various memory devices. The
values in the table were calculated based on SAMSUNG
2003 memory data sheets [1-2].

Memory           $/Gb   Current (mA)            Random access (16 bit)
                        idle    active          read     write     erase
Mobile SDRAM     48     0.5     [illegible]     90ns     90ns      N.A.
Low power SRAM   320    0.005   3               55ns     55ns      N.A.
Fast SRAM        614    5       65              10ns     10ns      N.A.
NOR              96     0.03    32              200ns    210.5us   1.2sec
NAND             21     0.01    10              10.1us   200.5us   2ms
`
Even though NAND flash memory is widely used as data storage in
embedded systems, research on NAND flash memory as code storage
is hardly found in industry or academia.
In this paper, we present a new memory architecture that enables
NAND flash memory to provide XIP functionality. With XIP
functionality in NAND flash, the cost of the memory system can
be reduced since the NAND flash can be used not only as data
storage but also as code storage for execution. As a result, we
can obtain cost-efficient memory systems with reasonable
performance and power consumption.
The basic idea of our approach is to exploit the locality of the
code access pattern and devise a cache controller for repeatedly
accessed code. A prefetching cache is used to hide the memory
latency resulting from NAND memory access. In this paper we
concentrate on code execution, even though data memory is also
an important aspect of memory architecture. There are two major
contributions in this paper. First, we demonstrate that NAND XIP
is feasible in
`1063-6404/03 $17.00 © 2003 IEEE
`
`
real-life systems through a real hardware and commercial
environment. Second, we apply highly optimized caching
techniques geared toward the specific features of NAND flash.

The rest of this paper is organized as follows. In the next
section, we describe the trend of memory architecture for
embedded systems. Section 3 reviews related work in academia and
industry. In Sections 4 and 5, we present our new memory
architecture based on NAND XIP. In Section 6, we demonstrate the
proposed architecture with real workloads on a hardware
prototyping board and evaluate cost, performance, and power
consumption over existing memory architectures. Finally, our
conclusions and future work are drawn in Section 7.
`
`2. Motivational Systems: Mobile Embedded
`
`
`
Figure 1. Mobile System Trend
`
Figure 1 shows the mobile system trend in terms of memory
hierarchy. The first approach is to use NOR and SRAM for code
storage and working memory, respectively, as shown in Figure
1(a). It is appropriate for low-end phones, which require medium
performance and cost. However, as mobile systems evolve into
data-centric and multimedia-oriented applications, high
performance and huge capacity for permanent storage have become
necessary. The second architecture (Figure 1(b)) seems to meet
the requirements in terms of storage capacity through NAND flash
memory, but its performance is not enough to accommodate 3G
applications, which consist of real-time multimedia
applications. In addition, the increased number of components
increases system cost. The third architecture (Figure 1(c))
eliminates NOR flash memory and uses NAND flash memory with a
shadowing² technique. Copying all code into RAM offers the best
performance possible, but contributes to a slow boot process. A
large amount of SDRAM is necessary to hold both the OS and the
applications. The higher power consumption of power-hungry SDRAM
is another problem for battery-operated systems.
`
`
² During system booting time, the entire code image is copied
from permanent storage into the system's RAM for execution.
`
`
`
As an improvement on the third architecture in Figure 1(c),
demand paging can be used with the assistance of the operating
system, and it may reduce the size of SDRAM. However, this
approach is not applicable to low-end or mid-end mobile systems
since it requires heavy virtual memory management code and an
MMU.
Thus, it is important to investigate an efficient memory system
in terms of cost, performance, and power consumption.
`
`3. Related Work
`
In the past, researchers have exploited NOR flash memory as
caches for magnetic disks due to its low power consumption and
high-speed characteristics. eNVy focused on developing a
persistent storage architecture without magnetic disks [7]. Fred
et al. showed that flash memory can reduce energy consumption by
an order of magnitude, compared to magnetic disk, while
providing good read performance and acceptable write performance
[9]. B. Marsh et al. examined the impact of using flash memory
as a second-level file system buffer cache to reduce power
consumption and file access latency on a mobile computer [8].
Li-Pin et al. investigated the performance issues of NAND flash
memory storage subsystems with a striping architecture, which
uses I/O parallelism [10]. In industry [5], NAND XIP is
implemented using a small buffer and I/O interface conversion,
but the XIP area is limited to boot code, thus OS and
application code must be copied to system memory.
In summary, even though several studies have sought maximum
performance and low power consumption from data storage, few
efforts to support XIP in NAND flash are found in academia or
industry.
`
4. NAND XIP Architecture

In this section, we describe the NAND XIP architecture. First,
we look into the structure of NAND flash and illustrate a basic
implementation of NAND XIP based on a caching mechanism.
`
`4.1. Background
A NAND flash memory consists of a fixed number of blocks, where
each block has 32 pages and each page consists of 512 bytes of
main data and 16 bytes of spare data, as shown in Figure 2.
Spare data can be used to store auxiliary information such as
bad block identification and error correction code (ECC) for the
associated main data. NAND flash memories are subject to a
condition called “bad block”, in which a block cannot be
completely erased or
cannot be written due to partial or 2-bit errors. Bad blocks
may exist in NAND flash memory when shipped or may occur during
operation.

[Figure 2 labels: 1 block = 32 pages; each page = 512 bytes main data + 16 bytes spare data; I/O bus]

Figure 2. Structure of NAND flash memory

In order to implement NAND XIP, we should consider the
following points.

• Average memory access time
• Worst case handling
• Bad block management

The performance of a memory system is measured by average access
time [3]. In order to implement XIP functionality, the average
access time of NAND flash should be comparable to that of other
memories such as NOR, SRAM, and SDRAM. Though average memory
access time is a good metric for performance evaluation,
worst-case handling, that is, cache miss handling, is another
practical problem, since most mobile systems such as cellular
phones include time-critical interrupt handling such as call
processing. For instance, if a time-critical interrupt occurs
during cache miss handling, the system may not satisfy the given
deadline and, to make it worse, may lose data or the connection.
The third aspect to be considered in NAND XIP is the management
of bad blocks, which are inherent in NAND flash memory, because
bad blocks cause discontinuous memory space, which is
intolerable for code execution.

4.2. Basic Implementation

[Figure 3 signal labels: DATA[0:7], ADDR[0:12], CE#, OE#, WE#, BUSY#]

Figure 3. NAND XIP controller

The proposed architecture consists of a small amount of SRAM for
cache, interface conversion logic, control logic, and NAND
flash, as shown in Figure 3. Interface conversion is necessary
to connect the I/O interface of NAND flash to the memory bus.
For the cache mechanism, a direct-mapped cache with a victim
cache is adopted, based on Jouppi’s work in [4], with
optimizations for NAND flash. In [4], the victim cache is
accessed on a main cache miss; if the address hits the victim
cache, the data is returned to the CPU and at the same time
promoted to the main cache; the replaced block in the main cache
is moved to the victim cache, therefore performing a “swap”. If
the victim cache also misses, a NAND flash access is performed;
the incoming data fills the main cache, and the replaced block
is moved to the victim cache. In the next section, we modify the
above “swap” algorithm using system memory and a page address
translation table (PAT). The prefetching cache is used to hide
the memory latency resulting from NAND memory access. Several
hardware prefetching techniques can be found in the literature
[12]. In our case, prefetching information is obtained through a
profiling process and is stored in the spare data at code image
building time.

5. Intelligent Caching: Priority-based Caching

Though the basic implementation is suitable for application
code, which shows spatial and temporal locality, it may be less
effective for systems code, which has complex functionality, a
large size, and interrupt-driven control transfers among its
procedures [13]. Torrellas et al. showed that the operating
system has large sections of code that are rarely accessed and
suffers considerable interference within popular execution paths
[13]. For example, periodic timer interrupts, rarely-executed
special-case code, and plenty of loop-less code disrupt the
locality. On the other hand, real-time applications should be
retained as long as possible to satisfy their timing
constraints*. In this paper, we distinguish the different cache
behavior of system and application code, and adapt it to the
page-based NAND architecture. We apply profile-guided static
analysis of the code access pattern.

* In this paper, real-time applications indicate multimedia
applications with soft real-time constraints.

We can divide code pages into three categories depending on
their access cost: high-priority, mid-priority, and low-priority
pages. Even though the priority can be determined by various
objectives, we assign the priority of a page based on the number
of references to it and its criticality. For example, if a
specific page is referenced more frequently or has time-critical
code, it is classified as a high-priority page and should be
cached or retained in the cache to reduce the later access cost
in case the page is in NAND flash memory. OS-related code,
system libraries, and real-time applications have high-priority
pages. On the
`
`
`
other hand, a mid-priority page is defined to be normal
application code, which is handled by the normal caching policy.
Finally, a low-priority page corresponds to sequential code such
as initialization code, which is rarely executed. The PAT is
introduced to remap pages in bad blocks to pages in good blocks
and to remap requested pages to swapped pages in system memory.
We illustrate the caching mechanism in detail in Figure 4.
First, when page A with high priority is requested, it is cached
from NAND flash to the main cache. Next, when page B is
requested from the CPU, it should be moved to the main cache or
system memory. Here, assuming that page B is in conflict with
page A, page B is moved to system memory (SRAM/SDRAM) since page
B is a low-priority page (“L” in the spare area of NAND flash
memory means low priority). At the same time, the PAT is updated
so that later accesses to page B are referred to system memory.
Again, when page C is requested and is in conflict with page A,
page C replaces page A and page A is discarded from the main
cache since C’s priority is high. The evicted page A is moved to
the victim cache. In summary, on a NAND flash page demand, the
controller discards or swaps the existing cache page according
to the priority information stored in the spare area data. The
detailed algorithm is shown in Figure 5.

6. Experimental Setup

This section presents our experimental environment. Our
environment consists of a prototyping board, our in-house cache
simulator with pattern analysis, and a real workload, namely
PocketPC [6], as shown in Figure 6. The prototyping board is
composed of a main board and a daughter board (the yellow
rectangle in Figure 6). The main board has an ARM9-based
micro-controller, SDRAM, NOR flash, and so on. The daughter
board contains an FPGA for the cache controller and victim
cache, fast SRAM for tag and cache memory, and two NAND flash
memories. The daughter board is used not only to implement a
real cache configuration on the FPGA but also to gather memory
address traces from running applications. In Figure 7, one NAND
is dedicated to NAND XIP and the other NAND is dedicated to
collecting memory traces from the host bus. The trace collection
function is started and stopped using manual switches, and the
switch’s on/off interval determines the time period for trace
gathering. Collected address traces are stored for the cache
simulator.
The specification of the main processor and NAND flash is shown
in Table 2. The cache simulator explores various parameters such
as miss rate, replacement policy, associativity, and cache size
based on memory traces from the prototyping board. The real
embedded workload, PocketPC, supports an XIP-enabled image based
on the existing ROMFS file system, in which each application can
be directly executed without being loaded into RAM.
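The trace-driven exploration performed by the cache simulator in this section can be sketched as follows. This is a minimal illustration and not the authors' in-house simulator: the function name `miss_ratio`, the LRU replacement policy, and the synthetic looping trace are all assumptions made for the example.

```python
from collections import OrderedDict

def miss_ratio(trace, cache_bytes, line_bytes, ways):
    """Replay byte addresses through a set-associative LRU cache.

    ways=1 gives a direct-mapped cache. Returns misses / accesses.
    """
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        line = addr // line_bytes
        lines_in_set = sets[line % num_sets]
        if line in lines_in_set:
            lines_in_set.move_to_end(line)         # refresh LRU position on a hit
        else:
            misses += 1
            if len(lines_in_set) >= ways:
                lines_in_set.popitem(last=False)   # evict the least recently used line
            lines_in_set[line] = True
    return misses / len(trace)

# Hypothetical trace: a 4 KB code loop fetched word by word, executed 10 times.
trace = [addr for _ in range(10) for addr in range(0, 4096, 4)]
for size_kb in (1, 2, 4, 8):
    print(size_kb, "KB:", miss_ratio(trace, size_kb * 1024, 256, 1))
```

With this synthetic loop, the miss ratio drops sharply once the cache is large enough to hold the whole loop. A sweep over real traces would vary cache size, associativity, and line size in the same way.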
`
`
`
Table 2: The specification of the prototyping board

Parameter                    Configuration
CPU clock                    200 MHz
L1 Icache                    64-way, 32-byte line, 8KB
Bus width                    16 bit
NAND read initial latency    10us
NAND serial read time        50ns
SRAM read time               10ns
SDRAM read time              90ns
NOR read time                200ns
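As a rough worked example of how the NAND timing parameters in the table above combine, the sketch below estimates the time to fetch a run of bytes as the initial latency plus one serial read per byte. The additive model, the per-byte interpretation of the serial read time (suggested by the 8-bit DATA[0:7] interface in Figure 3), and the name `nand_fetch_ns` are assumptions for illustration; the paper does not state this formula.

```python
NAND_INITIAL_LATENCY_NS = 10_000   # 10us initial read latency
NAND_SERIAL_READ_NS = 50           # assumed to be the time per byte on the 8-bit NAND I/O

def nand_fetch_ns(num_bytes):
    """Estimated time in ns to read num_bytes sequentially from one NAND page."""
    return NAND_INITIAL_LATENCY_NS + num_bytes * NAND_SERIAL_READ_NS

for size in (2, 32, 256, 512):
    total = nand_fetch_ns(size)
    print(f"{size:4d} bytes: {total} ns total, {total / size:.1f} ns per byte")
```

Under these assumptions a 16-bit random read costs 10,100 ns, which agrees with the 10.1us NAND random read time listed in Table 1. The per-byte cost falls quickly as more bytes are fetched per access, because the initial latency is amortized.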
`
Figure 6. A prototyping board for NAND XIP
`
`
prefetching technique improves memory latency hiding
from miss-prediction at run-time.
[Figure 5 diagram labels: Data bus, Address bus]

Priority_Caching(address)
{
    page = convert(address);
    if (isInPAT(page))
        main memory hit;
    else if (isInMainCache(page))
        cache hit;
    else if (isInVictimCache(page))
        victim hit;
    else {  /* miss: fetch a page from NAND flash memory */
        page_priority = NAND[page].priority;
        if (cache[page % CACHE_SIZE].priority == HIGH)
            NAND[page] -> main memory;
        else if (cache[page % CACHE_SIZE].priority == MID)
        {
            cache[page % CACHE_SIZE] -> victim;
            NAND[page] -> cache[page % CACHE_SIZE];
        }
        else
            NAND[page] -> cache[page % CACHE_SIZE];
    }
}

Figure 5. Intelligent Caching Architecture
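The lookup order and miss handling of Figure 5 can be sketched in Python as below. This is an illustrative model under stated assumptions, not the authors' controller logic: the class name `PriorityCache`, the dictionary standing in for the priority stored in the NAND spare area, and the unbounded victim cache and PAT are all hypothetical simplifications. Like the legible branches of the figure, it decides on the priority of the page already resident in the conflicting cache line.

```python
HIGH, MID, LOW = "H", "M", "L"

class PriorityCache:
    def __init__(self, num_lines, page_priority):
        self.num_lines = num_lines            # plays the role of CACHE_SIZE in Figure 5
        self.page_priority = page_priority    # page -> priority (stands in for NAND spare data)
        self.main = {}                        # cache line index -> resident page
        self.victim = set()                   # pages moved to the victim cache (capacity not modeled)
        self.pat = set()                      # pages remapped to system memory via the PAT

    def access(self, page):
        idx = page % self.num_lines
        if page in self.pat:
            return "main memory hit"
        if self.main.get(idx) == page:
            return "cache hit"
        if page in self.victim:
            return "victim hit"
        # Miss: the page must be fetched from NAND flash.
        resident = self.main.get(idx)
        resident_priority = None if resident is None else self.page_priority.get(resident, LOW)
        if resident_priority == HIGH:
            self.pat.add(page)                # keep the high-priority page; new page goes to system memory
        elif resident_priority == MID:
            self.victim.add(resident)         # resident page moves to the victim cache
            self.main[idx] = page
        else:
            self.main[idx] = page             # empty line or low-priority resident: overwrite
        return "miss"

cache = PriorityCache(4, {0: HIGH, 4: LOW, 1: MID, 5: LOW})
for page in (0, 4, 4, 0, 1, 5, 1):
    print(page, cache.access(page))
```

In this run, page 4 conflicts with high-priority page 0 and is redirected to system memory, so its second access is served from there, while mid-priority page 1 is pushed to the victim cache when page 5 arrives.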
`
`
`
`
`
`
`
`
[Figure 7 block labels: Host, NAND flash for XIP, NAND flash for trace collection]

Figure 7. FPGA Prototyping for NAND XIP
`6.1. Experimental Results
`
In Figure 8, we compare the miss ratio over various
configuration parameters such as associativity, replacement
policy, and cache size. We collected address traces from
PocketPC while executing various applications such as “Media
Player” and “MP3 player”, since they are popular embedded
multimedia applications which involve real-time requirements.
Note that the cache size is the most important factor affecting
the miss ratio, as shown in Figure 8.
[Figure legend: cache sizes 32KB, 64KB, 128KB, 256KB, 512KB. Panel (b) axes: Energy (ns*mW) versus Cache Line Size (Bytes), with ticks 64, 128, 256]
`Figure 9. Cache line size versus (a) access time per 32-byte
`and (b) energy consumption
To find the optimal cache line size for the NAND XIP cache,
simulation was done with the memory traces gathered from the
prototyping board. A line size of 256 bytes shows better average
memory access time and energy consumption compared with all
other sizes, as shown in Figure 9. Therefore, the line size of
the NAND XIP controller is fixed at 256 bytes hereafter.
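The trade-off behind the choice of line size can be illustrated with the standard relation AMAT = hit time + miss ratio × miss penalty. In the sketch below, the SRAM hit time and NAND timings are the prototyping-board values, while the additive per-byte miss penalty model and the function name `amat_ns` are assumptions. The miss ratios are made-up placeholders chosen only to show the shape of the calculation; they are not measurements from the paper.

```python
SRAM_HIT_NS = 10                      # SRAM read time from the prototyping board
NAND_INITIAL_LATENCY_NS = 10_000      # NAND read initial latency
NAND_SERIAL_READ_NS_PER_BYTE = 50     # assumption: one serial read per byte fetched

def amat_ns(miss_ratio, line_bytes):
    """Average memory access time in ns for a cache backed directly by NAND."""
    miss_penalty = NAND_INITIAL_LATENCY_NS + line_bytes * NAND_SERIAL_READ_NS_PER_BYTE
    return SRAM_HIT_NS + miss_ratio * miss_penalty

# Placeholder miss ratios for illustration only (NOT taken from the paper).
example_miss_ratios = {64: 0.020, 128: 0.012, 256: 0.007, 512: 0.006}
for line_bytes, ratio in example_miss_ratios.items():
    print(line_bytes, "bytes:", round(amat_ns(ratio, line_bytes), 1), "ns")
```

Larger lines lower the miss ratio through spatial locality but raise the cost of each miss, so the average access time reaches a minimum at some intermediate line size that depends on the measured miss ratios.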
`
[Figure 10 residue: panels (a) average memory access time and (b) energy (ns*mW) versus architectural choice (SDRAM shadowing, NAND XIP (basic), NAND XIP (priority), NOR XIP), for cache sizes 32KB, 64KB, 128KB, 256KB, and 512KB]
`(b)
`Figure 10. Overall performance comparison of different
`memory architectures: (a) aver