Proceedings
`
`ICCD 2003
`
`
`
21st International Conference on
`
`Computer Design
`
`HPE, Exh. 1020, p. 1
`
`
Copyright © 2003 by The Institute of Electrical and
Electronics Engineers, Inc.
`All rights reserved
`
Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may
photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume
that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid
through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.
`
`Other copying, reprint, or republication requests should be addressed to: IEEE Copyrights Manager, IEEE
`Service Center, 445 Hoes Lane, P.O. Box 133, Piscataway, NJ 08855-1331.
`
The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page.
They reflect the authors' opinions and, in the interests of timely dissemination, are published as presented
and without change. Their inclusion in this publication does not necessarily constitute endorsement by the
editors, the IEEE Computer Society, or the Institute of Electrical and Electronics Engineers, Inc.
`
`IEEE Computer Society Order Number PRO2025
`ISBN 0-7695-2025-1
`ISSN Number 1063-6404
`
Additional copies may be ordered from:

IEEE Computer Society
Customer Service Center
10662 Los Vaqueros Circle
`P.O. Box 3014
`Los Alamitos, CA 90720-1314
`Tel: + 1-714-821-8380
`Fax: + 1-714-821-4641
`E-mail: cs.books@computer.org
`
`IEEE Service Center
`445 Hoes Lane
`P.O. Box 1331
`Piscataway, NJ 08855-1331
`Tel: + 1-732-981-0060
`Fax: + 1-732-981-9667
`http://shop.ieee.org/store/
`customer-service@ieee.org
`
IEEE Computer Society
`Asia/Pacific Office
`Watanabe Bldg., 1-4-2
`Minami-Aoyama
`Minato-ku, Tokyo 107-0062
`JAPAN
`Tel: + 81-3-3408-3118
`Fax: + 81-3-3408-3553
tokyo.ofc@computer.org

Individual paper REPRINTS may be ordered at: reprints@computer.org

Editorial production by Bob Werner
Cover art production by Joe Daigle/Studio Productions
`Printed in the United States of America by The Printing House
`
`Table of Contents
`
`International Conference on Computer Design — ICCD 2003
`
`
`Welcome
`
`Organizing Committee
`Program Committee
`Additional Reviewers
`
`Keynotes
`
`High-Speed Link Design, Then and Now
`M. Horowitz
`
`Terascale Computing and BlueGene
`W. Pulleyblank
`
Advanced EDA Tools for High-Performance Design
`T. Vucurevich
`
`
`
`Session 1.1 Energy Efficiency
`
`Energy Efficient Asymmetrically Ported Register Files
`A. Aggarwal and M. Franklin
`
Power Efficient Data Cache Designs
J. Abella and A. González

On Reducing Register Pressure and Energy in Multiple-Banked Register Files
J. Abella and A. González

Low Power Multiplication Algorithm for Switching Activity Reduction through
Operand Decomposition
M. Ito, D. Chinnery, and K. Keutzer
`
`Session 1.2 Timing Verification
`
Verification of Timed Circuits with Failure Directed Abstractions
H. Zheng, C. Myers, D. Walter, S. Little, and T. Yoneda

Procedures for Identifying Untestable and Redundant Transition Faults in
Synchronous Sequential Circuits
G. Chen, S. Reddy, and I. Pomeranz

Event-Centric Simulation of Crosstalk Pulse Faults in Sequential Circuits
M. Phadoongsidhi and K. Saluja

Specifying and Verifying Systems with Multiple Clocks
E. Clarke, D. Kroening, and K. Yorav
`

`

Session 1.3 Electrical Analysis for System LSI

Enhanced QMM-BEM Solver for 3-D Finite-Domain Capacitance Extraction with Multilayered Dielectrics
W. Yu, Z. Wang, and X. Hong

An Improved Method for Fast Noise Estimation Based on Net Segmentation
C. Huang and A. Dasgupta

Symbolic Failure Analysis of Custom Circuits due to Excessive Leakage Current
H. Song, S. Bohidar, I. Bahar, and J. Grodstein

An Efficient Algorithm for Calculating the Worst-case Delay due to Crosstalk
V. Rajappan and S. Sapatnekar

Session 2.1 Power Optimization

A Compact Model for Analysis and Design of On-chip Power Network with Decoupling Capacitors
P. Zarkesh-Ha, K. Doniger, W. Loh, D. Sun, R. Stephani, and G. Priebe

Precomputation-Based Guarding for Dynamic and Leakage Power Reduction
A. Abdollahi, M. Pedram, F. Fallah, and I. Ghosh

Charge-Recycling Voltage Domains for Energy-Efficient Low-Voltage Operation of
Digital CMOS Circuits
S. Rajapandian, Z. Xu, and K. Shepard

Low Power Adder with Adaptive Supply Voltage
A. Suzuki, W. Jeong, and K. Roy

A Transparent Voltage Conversion Method and Its Application to a Dual-Supply-Voltage Register File
N. Tzartzanis and W. Walker

Session 2.2 Invited Session: Gene Chip Design

Detection of Biological Molecules: From Self-Assembled Films to Self-Integrated Devices
R. Levicky

Embedded Tutorial
Design Flow Enhancements for DNA Arrays
A. Kahng, I. Mandoiu, S. Reda, X. Xu, and A. Zelikovsky

Session 2.3 System Level Design

Bus Architecture Synthesis for Hardware-Software Co-Design of Deep Submicron Systems on Chip
N. Thepayasuwan, V. Damle, and A. Doboli

Dynamically Optimized Synchronous Communication for Low Power System on Chip Designs
V. Chandra, G. Carpenter, and J. Burns

Interface Synthesis Using Memory Mapping for an FPGA Platform
M. Luthra, S. Gupta, N. Dutt, R. Gupta, and A. Nicolau

Efficient Synthesis of Networks On Chip
A. Pinto, L. Carloni, and A. Sangiovanni-Vincentelli

`
`
`Reducing Compilation Time Overhead in Compiled Simulators
`M. Reshadi and N. Dutt
`
`Session 3.1 Systems Performance
`
`Profiling Interrupt Handler Performance through Kernel Instrumentation
`B. Moore, T. Slabach, and L. Schaelicke
`
`Design and Performance of Compressed Interconnects for High Performance Servers
K. Kant and R. Iyer
`
`Routed Inter-ALU Networks for ILP Scalability and Performance
`K. Sankaralingam, V. Singh, S. Keckler, and D. Burger
`
`Session 3.2 Micro Processor Test & Diagnosis
`
Automatic Generation of Critical-Path Tests for a Partial-Scan Microprocessor
`J. Grodstein, D. Bhavsar, V. Bettada, and R. Davies
`
`Test Generation for Non-separable RTL Controller-datapath Circuits Using a
`Satisfiability Based Approach
`L. Lingappan, S. Ravi, and N. Jha
`
`Cost-Effective Graceful Degradation in Speculative Processor Subsystems:
`The Branch Prediction Case
`S. Almukhaizim, T. Verdel, and Y. Makris
`
`Multiple Fault Diagnosis Using n-Detection Tests
`Z. Wang, M. Marek-Sadowska, K. Tsai, and J. Rajski
`
`Session 3.3 Physical Design
`
`A Physical Design Methodology for 1.3GHz SPARC64 Microprocessor
N. Ito, H. Komatsu, Y. Tanamura, R. Yamashita,
H. Sugiyama, Y. Sugiyama, and H. Hamamura

Physical Design of the "2.5D" Stacked System
`Y. Deng and W. Maly
`
`Flow-Based Cell Moving Algorithm for Desired Cell Distribution
`B. Choi, H. Xu, M. Wang, and M. Sarrafzadeh
`
`Session 4.1 Performance Optimization
`
`NpBench: A Benchmark Suite for Control plane and Data plane Applications for Network Processors
`B. Lee and L. John
`
`Hardware-Only Compression of Underutilized Address Buses:
`Design and Performance, Power, and Cost Analysis
`N. Mahapatra, J. Liu, and K. Sundaresan
`
Pipelined Multiplicative Division with IEEE Rounding
G. Even and P. Seidel

Session 4.2 Clock & Signal Distribution

Design of Resonant Global Clock Distributions
S. Chan, K. Shepard, and P. Restle

Modeling and Mitigation of Jitter in Multi-Gbps Source-Synchronous I/O Links
G. Balamurugan and N. Shanbhag

A Mixed-Mode Delay-Locked Loop Architecture
D. Eckerbert, L. Svensson, and P. Larsson-Edefors

Optimal Inductance for On-chip RLC Interconnections
S. Das, K. Agarwal, D. Blaauw, and D. Sylvester

Session 4.3 Performance and Power-Driven Physical Design

Spec Based Flip-Flop and Buffer Insertion
N. Akkiraju and M. Mohan

A Microeconomic Model for Simultaneous Gate Sizing and Voltage Scaling for Power Optimization
N. Ranganathan and A. Murugavel

A Simple Yet Effective Merging Scheme for Prescribed-Skew Clock Routing
R. Chaturvedi and J. Hu

Session 5.1 Instruction Execution

Hardware-Based Pointer Data Prefetcher
S. Lai and S. Lu

A Dependence Driven Efficient Dispatch Scheme
S. Nadathur and A. Tyagi

An Efficient VLIW DSP Architecture for Baseband Processing
T. Lin, C. Chang, C. Lee, and C. Jen

Dynamic Thread Resizing for Speculative Multithreaded Processors
M. Zahran and M. Franklin

Session 5.2 Invited Session: Test Compression Technology

Care Bit Density and Test Cube Clusters: Multi-Level Compression Opportunities
B. Koenemann

XMAX: X-Tolerant Architecture for MAXimal Test Compression
S. Mitra and K. Kim

Test Data Compression and Compaction for Embedded Test of Nanometer Technology Designs
J. Rajski and J. Tyszer

Session 5.3 Physical Design for Regular Fabrics and FPGA's

Non-Crossing OBDDs for Mapping to Regular Circuit Structures
A. Cao and C. Koh

Interconnect Estimation for FPGAs under Timing Driven Domains
P. Kannan and D. Bhatia

ROAD: An Order-Impervious Optimal Detailed Router for FPGAs
H. Arslan and S. Dutt

Session 6.1 Array Design Optimization

Reducing dTLB Energy through Dynamic Resizing
V. Delaluz, M. Kandemir, A. Sivasubramaniam, M. Irwin, and N. Vijaykrishnan

Distributed Reorder Buffer Schemes for Low Power
G. Kucuk, O. Ergin, D. Ponomarev, and K. Ghose

Virtual Page Tag Reduction for Low-Power TLBs
P. Petrov and A. Orailoglu

Dynamic Cluster Resizing
J. González and A. González

Session 6.2 Test Compaction

Independent Test Sequence Compaction through Integer Programming
P. Drineas and Y. Makris

On Combining Pinpoint Test Set Relaxation and Run-Length Codes for Reducing Test Data Volume
S. Kajihara, Y. Doi, L. Li, and K. Chakrabarty

Static Test Compaction for Multiple Full-Scan Circuits
I. Pomeranz and S. Reddy

A Method to Find Don't Care Values in Test Sequences for Sequential Circuits
Y. Higami, S. Kobayashi, Y. Takamatsu, S. Kajihara, and I. Pomeranz

Session 6.3 Invited Session: Techniques for Synthesizing into Fabrics

Simplifying SoC Design with the Customizable Control Processor Platform
C. Ogilvie, R. Ray, R. Devins, M. Kautzman,
M. Hale, R. Bergamaschi, B. Lynch, and S. Gaur

Structured ASICs: Opportunities and Challenges
B. Zahiri

System LSI Implementation Fabrics for the Future
S. Kaptanoglu

Session 7.1 Hardware Partitioning

Multiple-Vdd Scheduling/Allocation for Partitioned Floorplan
D. Kang, M. Johnson, and K. Roy

SCATOMi: Scheduling Driven Circuit Partitioning Algorithm for Multiple FPGAs
Using Time-multiplexed, Off-chip, Multicasting Interconnection Architecture
Y. Kwon, B. Park, and C. Kyung

A Study of Hardware Techniques that Dynamically Exploit Frequent Operands to
Reduce Power Consumption in Integer Function Units
K. Gandhi and N. Mahapatra

Session 7.2 Energy-Aware Design and Application

KnapBind: An Area-Efficient Binding Algorithm for Low-leakage Datapaths
C. Gopalakrishnan and S. Katkoori

A Novel Synthesis Strategy Driven by Partial Evaluation Based Circuit Reduction for
Application Specific DSP Circuits
M. Mukherjee and R. Vemuri

Power Fluctuation Minimization During Behavioral Synthesis Using ILP-Based Datapath Scheduling
S. Mohanty, N. Ranganathan, and S. Chappidi

An Energy-Aware Simulation Model and Transaction Protocol for
Dynamic Workload Distribution in Mobile Ad Hoc Networks
F. Ghasemi-Tari, P. Rong, and M. Pedram

Session 7.3 Invited Session: High-Speed Design Issues and Test Challenges

CMOS High-Speed Serial I/Os - Present and Future
M. Lee, W. Dally, R. Farjad-Rad, H. Ng,
R. Senthinathan, J. Edmondson, and J. Poulton

Fully Differential Receiver Chipset for 40 Gb/s Applications Using GaInAs/InP
Single Heterojunction Bipolar Transistors
K. Kiziloglu, S. Seetharaman, K. Glass, C. Bil, H. Duong, and G. Asmanis

Paradigm Shift for Jitter and Noise in Design and Test >GB/s Data Communication Systems
M. Li and J. Wilstrup

Session 8.1 Efficiency and Reliability

Cost-Efficient Memory Architecture Design of NAND Flash Memory Embedded Systems
C. Park, J. Seo, D. Seo, S. Kim, and B. Kim

Exploiting Microarchitectural Redundancy for Defect Tolerance
P. Shivakumar, S. Keckler, C. Moore, and D. Burger

Reducing Multimedia Decode Power Using Feedback Control
Z. Lu, J. Lach, M. Stan, and K. Skadron

Session 8.2 Novel Methods in Logic Synthesis

Structural Detection of Symmetries in Boolean Functions
G. Wang, A. Kuehlmann, and A. Sangiovanni-Vincentelli

Boolean Decomposition Based on Cyclic Chains
E. Dubrova, M. Teslenko, and J. Karlsson

SAT-Based Algorithms for Logic Minimization
S. Sapra, M. Theobald, and E. Clarke

Session 9.1 Communications and Context Management

Low-Density Parity-Check Decoder Architecture for High Throughput Optical Fiber Channels
A. Selvarathinam, E. Kim, and G. Choi

Improving Branch Prediction Accuracy in Embedded Processors in the Presence of Context Switches
S. Pasricha and A. Veidenbaum

Reducing Operand Transport Complexity of Superscalar Processors Using Distributed Register Files
S. Bunchua, D. Wills, and L. Wills

xpipes: a Latency Insensitive Parameterized Network-on-chip Architecture for Multi-Processor SoCs
M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi, and L. Benini

Session 9.2 Board Test and Power-Aware Test

Aggressive Test Power Reduction through Test Stimuli Transformation
O. Sinanoglu and A. Orailoglu

Power-Time Tradeoff in Test Scheduling for SoCs
M. Nourani and J. Chin

Multiple Transition Model and Enhanced Boundary Scan Architecture to
Test Interconnects for Signal Integrity
M. Tehranipour, N. Ahmed, and M. Nourani

Author Index

`Cost-Efficient Memory Architecture Design of NAND Flash Memory
`Embedded Systems
`
Chanik Park, Jaeyu Seo, Dongyoung Seo, Shinhan Kim and Bumsoo Kim
Software Center, SAMSUNG Electronics, Co., Ltd.
{ci.park, pelican, dy76.seo, shinhank, bumsoo}@samsung.com
`
`Abstract
`
NAND flash memory has become an indispensable component in embedded systems because of its versatile features such as non-volatility, solid-state reliability, low cost, and high density. Even though NAND flash memory has gained popularity as data storage, it can also be exploited as code memory for XIP (execute-in-place). In this paper, we present a cost-efficient memory architecture which incorporates NAND flash memory into an existing memory hierarchy for code execution. The usefulness of the proposed approach is demonstrated with real embedded workloads on a real hardware prototyping board.
`
`1. Introduction
`
Memory architecture design is a main concern for embedded system engineers since it dominates the cost, power, and performance of embedded systems. The typical memory architecture of embedded systems consists of ROM for initial bootstrapping and code execution, RAM for working memory, and flash memory for permanent data storage. In particular, an emerging memory technology, flash memory, is becoming an indispensable component in embedded systems due to its versatile features: non-volatility, solid-state reliability, low power consumption, and so on. The most popular flash types are NOR and NAND. NOR flash is particularly well suited for code storage and execute-in-place (XIP)¹ applications, which require high-speed random access. While NAND flash provides high-density and low-cost data storage, it does not lend itself to XIP applications due to its sequential access architecture and long access latency.

Table 1 shows different characteristics of various memory devices. Mobile SDRAM has strong points in performance but requires high power consumption compared with the other memories. Fast SRAM or low power SRAM can be selected according to the trade-off between power consumption and performance, at a high cost. Among non-volatile memories, NOR flash provides fast random access speed and low power consumption, but has a high cost compared with NAND flash. Even though NAND flash shows long random read latency, it has advantages in low power consumption, storage capacity, and fast erase/write performance in contrast to NOR flash.

¹ XIP is the execution of an application directly from the Flash instead of having to download the code into the system's RAM before executing it.
`
Table 1. Characteristics of various memory devices. The values in the table were calculated based on SAMSUNG 2003 memory data sheets [1-2].

Memory            $/Gb    Current (mA)           Random Access (16bit)
                          idle     active        read       write       erase
Mobile SDRAM      48      0.5      [illegible]   90 ns      90 ns       N.A.
Low power SRAM    320     0.005    3             55 ns      55 ns       N.A.
Fast SRAM         614     5        65            10 ns      10 ns       N.A.
NOR               96      0.03     32            200 ns     210.5 us    1.2 sec
NAND              21      0.01     10            10.1 us    200.5 us    2 ms
`
Even though NAND flash memory is widely used as data storage in embedded systems, research on NAND flash memory as code storage is rarely found in industry or academia.
In this paper, we present a new memory architecture that enables NAND flash memory to provide XIP functionality. With XIP functionality in NAND flash, the cost of the memory system can be reduced since the NAND flash can be used not only as data storage but also as code storage for execution. As a result, we can obtain cost-efficient memory systems with reasonable performance and power consumption.
The basic idea of our approach is to exploit the locality of the code access pattern and devise a cache controller for repeatedly accessed code. The prefetching cache is used to hide memory latency resulting from NAND memory access. In this paper we concentrate on code execution even though data memory is also an important aspect of memory architecture. There are two major contributions in this paper. First, we demonstrate that NAND XIP is feasible in
`1063-6404/03 $17.00 © 2003 IEEE
`
real-life systems through a real hardware and commercial environment. Second, we apply highly optimized caching techniques geared toward the specific features of NAND flash.

The rest of this paper is organized as follows. In the next section, we describe the trend of memory architecture for embedded systems. Section 3 reviews related work in academia and industry. In Sections 4 and 5, we present our new memory architecture based on NAND XIP. In Section 6, we demonstrate the proposed architecture with real workloads on a hardware prototyping board and evaluate cost, performance, and power consumption over existing memory architectures. Finally, our conclusions and future work are drawn in Section 7.
`
`2. Motivational Systems: Mobile Embedded
`
`
`
Figure 1. Mobile System Trend

Figure 1 shows the mobile system trend in terms of memory hierarchy. The first approach is to use NOR and SRAM for code storage and working memory, respectively, as shown in Figure 1(a). It is appropriate for low-end phones, which require medium performance and cost. However, as mobile systems evolve into data-centric and multimedia-oriented applications, high performance and huge capacity for permanent storage have become necessary. The second architecture (Figure 1(b)) seems to meet the requirements in terms of storage capacity through NAND flash memory, but its performance is not enough to accommodate 3G applications, which consist of real-time multimedia applications. In addition, the increased number of components increases system cost. The third architecture (Figure 1(c)) eliminates NOR flash memory and uses NAND flash memory with the shadowing² technique. Copying all code into RAM offers the best performance possible, but contributes to a slow boot process. A large amount of SDRAM is necessary to hold both the OS and the applications. The higher power consumption of power-hungry SDRAM is another problem for battery-operated systems.
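
To give a rough sense of why shadowing lengthens the boot process, the following sketch estimates the time needed to read a whole code image out of NAND flash. It is an illustration only: the 512-byte page comes from Section 4.1 and the NAND timings (10 us initial read latency, 50 ns per 16-bit serial read) from the prototyping-board specification in Section 6, while the 16 MB image size and the function name are assumptions made for the example. RAM write time and ECC overhead are ignored.

```python
# Back-of-the-envelope estimate of the shadowing copy time at boot.
PAGE_BYTES = 512                   # main data per NAND page (Section 4.1)
BUS_BYTES = 2                      # 16-bit bus (Section 6)
NAND_INITIAL_LATENCY_NS = 10_000   # 10 us initial latency per page read
NAND_SERIAL_READ_NS = 50           # one 16-bit serial read


def shadow_copy_time_ms(image_bytes: int) -> float:
    """Time to read a whole code image out of NAND, in milliseconds."""
    pages = -(-image_bytes // PAGE_BYTES)            # ceiling division
    transfers_per_page = PAGE_BYTES // BUS_BYTES     # 256 transfers per page
    page_ns = NAND_INITIAL_LATENCY_NS + transfers_per_page * NAND_SERIAL_READ_NS
    return pages * page_ns / 1e6


if __name__ == "__main__":
    image = 16 * 1024 * 1024   # hypothetical 16 MB OS + application image
    print(f"{shadow_copy_time_ms(image):.1f} ms")
```

Under these assumptions a 16 MB image takes roughly three quarters of a second to read, before any RAM write or verification cost is counted.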
`
`
² During system boot, the entire code image is copied from permanent storage into the system's RAM for execution.
`
`
`
As an improved solution to the third architecture in Figure 1(c), demand paging can be used with the assistance of the operating system, and it may reduce the size of SDRAM. However, this approach is not applicable to low-end or mid-range mobile systems since it requires heavy virtual memory management code and an MMU.
Thus, it is important to investigate an efficient memory system in terms of cost, performance, and power consumption.
`
`3. Related Work
`
In the past, researchers have exploited NOR flash memory as caches for magnetic disks due to its low power consumption and high-speed characteristics. eNvy focused on developing a persistent storage architecture without magnetic disks [7]. Fred et al. showed that flash memory can reduce energy consumption by an order of magnitude, compared to magnetic disk, while providing good read performance and acceptable write performance [9]. B. Marsh et al. examined the impact of using flash memory as a second-level file system buffer cache to reduce power consumption and file access latency on a mobile computer [8].
Li-Pin et al. investigated the performance issue of NAND flash memory storage subsystems with a striping architecture, which uses I/O parallelism [10]. In industry [5], NAND XIP is implemented using a small buffer and I/O interface conversion, but the XIP area is limited to boot code, thus OS and application code must be copied to system memory.
In summary, even though several studies have sought to obtain maximum performance and low power consumption from data storage, few efforts to support XIP in NAND flash are found in academia or industry.
`
4. NAND XIP Architecture
`
`In this section, we describe NAND XIP architecture.
`First, we look into the structure of NAND flash and
`illustrate basic implementation of NAND XIP based on
`caching mechanism.
`
`4.1. Background
A NAND flash memory consists of a fixed number of blocks, where each block has 32 pages and each page consists of 512 bytes of main data and 16 bytes of spare data, as shown in Figure 2. Spare data can be used to store auxiliary information such as bad block identification and error correction code (ECC) for the associated main data. NAND flash memories are subject to a condition called "bad block", in which a block cannot be completely erased or cannot be written due to partial or 2-bit errors. Bad blocks
`
`
may exist in NAND flash memory when shipped or may occur during operation.

[Figure 2 diagram: each page holds 512 bytes of main data and 16 bytes of spare data; 1 block = 32 pages; data is transferred over the I/O bus]
Figure 2. Structure of NAND flash memory

In order to implement NAND XIP, we should consider the following points.

• Average memory access time
• Worst case handling
• Bad block management

The performance of a memory system is measured by average access time [3]. In order to implement XIP functionality, the average access time of NAND flash should be comparable to that of other memories such as NOR, SRAM and SDRAM. Though average memory access time is a good metric for performance evaluation, worst-case handling, or cache miss handling, is another problem from a practical point of view, since most mobile systems such as cellular phones include time-critical interrupt handling such as call processing. For instance, if a time-critical interrupt occurs during cache miss handling, the system may not satisfy the given deadline and, to make it worse, it may lose data or the connection. The third aspect to be considered in NAND XIP is the management of bad blocks, which are inherent in NAND flash memory, because bad blocks cause discontinuous memory space, which is intolerable for code execution.

4.2. Basic Implementation

[Figure 3 diagram: controller signals DATA[0:7], ADDR[0:12], CE#, OE#, WE#, BUSY#]
Figure 3. NAND XIP controller

The proposed architecture consists of a small amount of SRAM for cache, interface conversion logic, the control logic and NAND flash, as shown in Figure 3. Interface conversion is necessary to connect the I/O interface of NAND flash to the memory bus. For the cache mechanism, a direct-mapped cache with a victim cache is adopted based on Jouppi's work in [4], with optimization for NAND flash. In [4], the victim cache is accessed on a main cache miss; if the address hits the victim cache, the data is returned to the CPU and at the same time it is promoted to the main cache; the replaced block in the main cache is moved to the victim cache, therefore performing a "swap". If the victim cache also misses, a NAND flash access is performed; the incoming data fills the main cache, and the replaced block is moved to the victim cache. In the next section, we modify the above "swap" algorithm using system memory and a page address translation table (PAT). The prefetching cache is used to hide memory latency resulting from NAND memory access. Several hardware prefetching techniques can be found in the literature [12]. In our case, prefetching information is analyzed through a profiling process and is stored in spare data at code image building time.

5. Intelligent Caching: Priority-based Caching

Though the basic implementation is suitable for application code, which shows spatial and temporal locality, it may be less effective for system code, which has complex functionality, a large size, and interrupt-driven control transfers among its procedures [13]. Torrellas et al. showed that large sections of operating system code are rarely accessed and that the operating system suffers considerable interference within popular execution paths [13]. For example, periodic timer interrupts, rarely executed special-case code, and plenty of loop-less code disrupt the localities. On the other hand, real-time applications should be retained as long as possible to satisfy the timing constraints*. In this paper, we distinguish the different cache behavior of system and application code, and adapt it to the page-based NAND architecture. We apply profile-guided static analysis of the code access pattern.

* In this paper, real-time applications indicate multimedia applications with soft real-time constraints.

We can divide code pages into three categories depending on their access cost: high-priority, mid-priority, and low-priority pages. Even though the priority can be determined by various objectives, we set the priority of pages based on the number of references to pages and their criticality. For example, if a specific page is referenced more frequently or has time-critical code, it is classified as a high-priority page and should be cached or retained in cache to reduce the later access cost in case the page is in NAND flash memory. OS-related code, system libraries and real-time applications have high-priority pages. On the
`ther hand, mid-priority page is defined to be normal
`
`pplication code which is handled by normal caching
`
`This section presents our experiment environment. Our
`policy. Finally, low-priority page corresponds to sequential
`
`environment consists of a prototyping board, our in-house
`code such as initialization code, which is rarely executed.
`
`cache simulator with pattern analysis and a real workload,
`PATis introduced to remap pagesin bad blocks to pages in
`
`namely PocketPC [6] as shown in Figure 6. The prototyping
`good blocks and to remap requested pages to swapped
`
`board is composed of: main board and daughter board (a
`ges
`in
`system memory. We
`illustrate
`the caching
`
`yellow rectangle in Figure 6). The main board has ARM9-
`echanism in detail in Figure 4. First, when page A with
`
`based micro-controller, SDRAM, NORflash and so on. The
`igh-priority is requested,it is cached from NANDflash to
`
`daughter board contains an FPGA for cache controller and
`main cache. Next, when page B is requested from the CPU,
`
`victim cache, fast SRAM for tag and cache memory, and
`‘it should be moved to main cache or system memory. Here
`
`two NANDflash memories. The daughter boardis used not
`assumingthat pageBis in conflict with page A, page B is
`
`only to implement a real cache configuration on FPGA but
`‘moved to system memory (SRAM/SDRAM)since page B
`
`also to gather memory address
`traces
`from running
`is low priority page (“L” in spare area of NAND flash
`
`applications. In Figure 7, one NANDis dedicated to NAND
`‘memory means low-priority). At
`the same time, PATis
`
`XIP and the other NAND is dedicated for collecting
`“updated so thatlater access to page B is referred to system
`
`memory traces from host bus. Trace collection function is
`memory. Again, when page C is requested and in conflict
`
`started and stopped by using manual switches and manual
`with page A, page C replaces page A and page A is
`
`switch’s on/off interval determines the time period for trace
`discarded from main cache since C’s priority is high. The
`
`gathering. Collected address traces are stored for cache
`evicted page A is moved to victim cache. In summary, on
`
`simulator.
`NANDflash’s page demand,
`the controller discards or
`
`Thespecification of main processor and NANDflash is
`‘swaps existing cache page according to the priority
`
`shown in Table 2. The cache simulator explores various
`information stored in spare area data. The detail algorithm
`parameters
`such
`as miss ate,
`replacement
`policy,
`assoicativity, and cache size based on memory traces from
`the prototyping board. The real embedded workload,
`PocketPC supports XIP-enabled image based on the
`existing ROMFSfile systems in which each application can
`be directly executed without being loadedinto RAM.
`
`6. Experimental Setup
`
`
`
`Table 1: The specification of the prototyping board
`Parameter
`Configuration
`CPU clock
`200 MHz
`L1 Icache
`64way, 32byte line, 8KB
`Bus width
`16bit
`NAND readinitial latency
`10us
`NAND serial read time
`50ns
`SRAMread time
`10ns
`SDRAMread time
`90ns
`NORread time
`200ns
`
`Re
`
`poe Figure 6. A prototyping board for NAND XIP
`
`aed
`
`Bins
`
`;
`
`a ed
`:
`:
`ae ica
`‘
`Ts
`
`A477
`
`HPE, Exh. 1020, p. 13
`
`
`
`
`prefetching technique improves memory latency hiding
`om miss-predictionat run-time.
`Data bus
`Address bus
`
`
`
`Priority_Caching(address)
`    page = convert(address);
`    if (isInPAT(page))
`        main memory hit;
`    else if (isInMainCache(page))
`        cache hit;
`    else if (isInVictimCache(page))
`        victim hit;
`    else {  /* miss: fetch a page from NAND flash memory */
`        page_priority = NAND[page].priority;
`        if (cache[page % CACHE_SIZE].priority == HIGH)
`            NAND[page] -> main memory;
`        else if (cache[page % CACHE_SIZE].priority == MID) {
`            cache[page % CACHE_SIZE] -> victim;
`            NAND[page] -> cache[page % CACHE_SIZE];
`        } else
`            NAND[page] -> cache[page % CACHE_SIZE];
`    }
`
`
`Figure 5. Intelligent Caching Architecture
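The priority-based placement in the pseudocode can be sketched as runnable code. This is a simplified model under stated assumptions, not the controller's actual design: the class name `PriorityCache`, the direct-mapped main cache, the FIFO victim cache, and the use of a Python set for the PAT are all hypothetical choices made for illustration, and a victim hit here does not swap the page back into the main cache.

```python
from collections import deque

HIGH, MID, LOW = 2, 1, 0  # assumed encoding of the priority levels


class PriorityCache:
    def __init__(self, nand_priority, cache_size, victim_size):
        self.nand_priority = nand_priority        # page -> priority (models the spare area)
        self.cache_size = cache_size
        self.main = [None] * cache_size           # direct-mapped main cache, one page per slot
        self.victim = deque(maxlen=victim_size)   # small FIFO victim cache
        self.pat = set()                          # pages redirected to system memory

    def access(self, page):
        if page in self.pat:
            return "main memory hit"
        slot = page % self.cache_size
        if self.main[slot] == page:
            return "cache hit"
        if page in self.victim:
            return "victim hit"
        # Miss: the page comes from NAND; placement depends on the
        # priority of the page currently occupying the slot.
        resident = self.main[slot]
        if resident is None:
            resident_priority = LOW
        else:
            resident_priority = self.nand_priority.get(resident, LOW)
        if resident_priority == HIGH:
            self.pat.add(page)                    # resident stays; new page goes to system memory
        elif resident_priority == MID:
            self.victim.append(resident)          # resident moves to the victim cache
            self.main[slot] = page
        else:
            self.main[slot] = page                # resident, if any, is discarded
        return "miss"


cache = PriorityCache({0: HIGH, 1: MID}, cache_size=4, victim_size=2)
for p in (0, 4, 4, 1, 5, 1):
    print(p, cache.access(p))
```

Keeping the PAT lookup first mirrors the order of checks in the pseudocode, so a page redirected to system memory never touches the cache again.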
`
`Figure 7. FPGA prototyping for NAND XIP (blocks: host, NAND flash for XIP, NAND flash for trace collection)
`
`6.1. Experimental Results
`
`In Figure 8, we compare the miss ratio over various
`configuration parameters such as associativity, replacement
`policy, and cache size. We collected address traces from
`PocketPC while executing applications
`such as “Media Player” and “MP3 player”, since they are
`popular embedded multimedia applications that involve
`real-time requirements. Note that the cache size is the most
`important factor affecting the miss ratio, as shown in Figure 8.
`
`Figure 9. Cache line size versus (a) access time per 32 bytes
`and (b) energy consumption
`(panel (b) axes: energy (ns*mW) versus cache line size (bytes))
`
`To determine the optimal cache line size for the NAND XIP
`cache, we ran simulations with the memory traces
`gathered from the prototyping board. A line
`size of 256 bytes gives the best average memory
`access time and energy consumption, as shown in Figure 9.
`Therefore, the line size of the
`NAND XIP controller is fixed at 256 bytes
`hereafter.
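One factor behind the line-size trade-off can be illustrated with a small calculation: the NAND initial read latency is paid once per line fill, so longer lines amortize it over more bytes. The sketch below uses the timing values from Table 1, but it assumes that the 50 ns serial read time applies per 16-bit bus transfer, which the table does not state. It also ignores miss ratio and energy, so it shows only one side of the trade-off and does not reproduce Figure 9.

```python
NAND_INITIAL_LATENCY_NS = 10_000  # 10 us, from Table 1
NAND_SERIAL_READ_NS = 50          # from Table 1; assumed to be per bus transfer
BUS_WIDTH_BYTES = 2               # 16-bit bus, from Table 1


def line_fill_ns(line_bytes):
    """Time to fill one cache line from NAND under the assumptions above."""
    transfers = line_bytes // BUS_WIDTH_BYTES
    return NAND_INITIAL_LATENCY_NS + transfers * NAND_SERIAL_READ_NS


def fill_ns_per_32_bytes(line_bytes):
    """Fill time normalized to 32 bytes fetched."""
    return line_fill_ns(line_bytes) * 32 / line_bytes


for size in (64, 128, 256):
    print(size, line_fill_ns(size), fill_ns_per_32_bytes(size))
```

The normalized cost falls as the line grows, while the absolute miss penalty rises; the balance between the two depends on the miss ratio, which has to come from the trace-driven simulation.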
`
`
`Figure 10. Overall performance comparison of different
`memory architectures: (a) aver
`(x-axis: architectural choices: SDRAM shadowing, NAND XIP (basic), NAND XIP (priority), NOR XIP; y-axes: (a) average memory access, (b) energy (ns*mW); legend: 32 KB, 64 KB, 128 KB, 256 KB, 512 KB)
