`OF HARDWARE FROM C
`
`A DISSERTATION
`
`SUBMITTED TO THE DEPARTMENT OF
`
`ELECTRICAL ENGINEERING
`
`OF STANFORD UNIVERSITY
`
`IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
`
`FOR THE DEGREE OF
`
`DOCTOR OF PHILOSOPHY
`
`Luc Séméria
`
`June 2001
`
`XILINX, EX. 1024
`Page 1 of 146
`
`
`
`
`
`
`
` Copyright by Luc Séméria 2001
`All Rights Reserved
`
`XILINX, EX. 1024
`Page 2 of 146
`
`
`
`
`
`
`
`ii
`
`I certify that I have read this dissertation and that, in my opinion, it
`is fully adequate in scope and quality as a dissertation for the degree
`of Doctor of Philosophy.
`
`______________________________________
`Giovanni De Micheli, Principal Advisor
`
`I certify that I have read this dissertation and that, in my opinion, it
`is fully adequate in scope and quality as a dissertation for the degree
`of Doctor of Philosophy.
`
`______________________________________
`Monica Lam
`
`I certify that I have read this dissertation and that, in my opinion, it
`is fully adequate in scope and quality as a dissertation for the degree
`of Doctor of Philosophy.
`
`______________________________________
`Michael Flynn
`
`Approved for the University Committee on Graduate Studies:
`
`______________________________________
`
`XILINX, EX. 1024
`Page 3 of 146
`
`
`
`
`
`
`
`
`
`iii
`
`ABSTRACT
`
`With the advances in Computer Aided Design (CAD) technology, the design of digital
`
`circuits is becoming more and more as the development of software. Hardware is modeled
`
`using Hardware Description Languages (HDLs) very much as software is described using
`
`programming languages. Synthesis tools are used like compilers to map these HDL mod-
`
`els into hardware. However, modern systems, which consist of mixed software/hardware
`
`modules, are often initially modeled using programming languages instead of HDLs. Spe-
`
`cifically, C/C++-based languages with hardware support are used to quickly verify the
`
`functionality and estimate the performances at the system level.
`
`One of the greatest challenges in C/C++-based design methodology is to efficiently
`
`map C/C++ models into hardware. Many of the networking and multimedia applications
`
`implemented in hardware or mixed hardware/software systems are making use of complex
`
`data structures stored in one or multiple memories. As a result, many of the C/C++ fea-
`
`XILINX, EX. 1024
`Page 4 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`iv
`tures that were originally designed for software applications are now making their way
`
`into hardware. Such features include dynamic memory allocation and pointers used to
`
`manage data.
`
`In this thesis, I present a solution for efficiently mapping arbitrary C code with point-
`
`ers and
`malloc
`
`/
`
`free
`
` into hardware. In hardware, a pointer is not only the address of data
`
`in memory, but it may also reference data mapped to registers, ports or wires. Pointer anal-
`
`ysis is used to find the set of locations each pointer may reference in a program at compile
`
`time. The values of the pointers are then encoded, and branching statements are used to
`
`dynamically access data referenced by pointers. Dynamic memory allocation and deallo-
`
`cation are supported by instantiating hardware memory allocators tailored to an applica-
`
`tion and a memory architecture. Several optimizations may also be performed. A heuristic
`
`algorithm is presented to efficiently encode the values of the pointers. Compiler tech-
`
`niques may also be used to reduce storage before loads and stores.
`
`An implementation using the
`SUIF
`
` compiler framework is presented, followed by some
`
`examples of implementations taken from multimedia and networking applications.
`
`XILINX, EX. 1024
`Page 5 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`v
`
`ACKNOWLEDGMENTS
`
` I would like thank my advisor Giovanni De Micheli for giving me the opportunity of
`
`working on this thesis, for his support and his directions. This work couldn’t have been
`
`possible without the advises of several other people. First, I would like to acknowledge the
`
`support of Synopsys Inc. that initiated this research and financed it. Specifically, I would
`
`like to thank Raul Camposano, Abhijit Ghosh, Joachim Kunkel and Don Mc Millan for
`
`their support. I would also like to thank many of my friends met at Synopsys whose com-
`
`ments helped me at the different stages of this research: Olivier Coudert, Maurizio Dami-
`
`ani, Stan Liao, Preeti Panda, Srinivas Raghvendra, Mike Sharp. This research couldn’t
`
`have been possible without the help of the
`
`SUIF
`
` group at Stanford and more specifically
`
`Monica Lam and Dave Heine. Finally, thank you to Koichi Sato from
`
`NEC
`
` corp., who was
`
`a visiting researcher at Stanford in 1999 and worked with me on the synthesis of dynamic
`
`memory allocation and deallocation.
`
`XILINX, EX. 1024
`Page 6 of 146
`
`
`
`
`
`
`
`vi
`A thesis is more than its technical contributions, it is also a personal achievement. For
`
`this I would warmly thank my family and friends. First my wife Cécile who joined me in
`
`this adventure. Thank you for your loving trust and your support. I would also like to
`
`thank my family here, Bernard and Joan, and in France, my parents, my brother and sisters
`
`whose encouragements throughout my studies have always been the greatest help.
`
`This is also a good opportunity to thank some of the people who really made a differ-
`
`ence in my student life. Jean-Marie Fluchaire, high-school mathematics professor at the
`
`Lycée du Grésivaudan Meylan, and Henri Koen, professor of mathematics in Classes Pré-
`
`paratoires at Lycée Masséna Nice.
`
`XILINX, EX. 1024
`Page 7 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`1.
`
`2.
`
`3.
`
`4.
`
`5.
`
`6.
`
`vii
`
`TABLE OF CONTENTS
`
`Introduction
`1.1 Motivations
`1.2
`Design methodology
`1.3
`Objectives and Contributions
`Related Work
`2.1 Modeling and synthesizing hardware from C/C++
`2.2
`Software compilation of C and C++ onto parallel architectures
`2.3
`Application-specific memory management methodology
`Background
`3.1 Memory space representation
`3.2
`Pointer analysis
`3.3
`Definition of the subset
`Synthesis of Hardware From C
`4.1 Memory Refinement
`4.2
`Pointer Synthesis
`4.3
`Synthesis of
`malloc
`4.4
`Summary
`Implementation
`5.1
`Toolflow
`5.2
`Results
`Optimization of Loads and Stores
`6.1
`Optimization of loads
`
` and
`
`free
`
`1
`2
`4
`7
`10
`11
`18
`19
`23
`27
`30
`32
`35
`36
`40
`48
`51
`53
`53
`57
`62
`63
`
`XILINX, EX. 1024
`Page 8 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`viii
`
`Optimization of stores
`6.2
`Results
`6.3
`Summary
`6.4
`Encoding of Pointers
`7.1
`Definition of the problem
`7.2
`Problem formulation
`7.3
`Simplified problem
`7.4
`Encoding with folding
`7.5
`Encoding with splitting
`7.6
`Encoding algorithm
`7.7
`Results
`7.8
`Summary
`Library of Allocators and Optimization of
`malloc
`8.1
`Optimized general purpose allocator
`8.2
`Specific-purpose allocator
` and
`8.3
`Optimizing sequences of
`free
`malloc
`8.4
`Experimental results and discussion
`8.5
`Summary
`Conclusion
`
`7.
`
`8.
`
`9.
`
` and
`
`free
`
` calls
`
`66
`70
`71
`72
`73
`76
`79
`83
`86
`90
`96
`98
`99
`100
`102
`102
`104
`107
`108
`
`XILINX, EX. 1024
`Page 9 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`ix
`
`LIST OF TABLES
`
`3.1
`
`5.1
`
`6.1
`
`6.2
`
`7.1
`
`8.1
`
`8.2
`
`
`
`Location set examples (f
`
`
`
`=offset of field F), (s
`
`=stride or array element size) 28
`
`Result of the synthesis using target library tsmc.35u (area in library units). 61
`
`Area after synthesis and optimization using target library lsi_10k (in library
`units).
`70
`
`Timing after synthesis and optimization using target library lsi_10k (in ns). 70
`
`Area after synthesis and optimization using tsmc.35 library (in library units).
`For each example, P represents the number of pointers and N the number of
`variables.
`97
`
`Implementation of the different allocators (area in library units using the
`. and
`. represent respectively the area of
`tsmc.35 target library;
`comb
`non-comb
`combinational logic and non-combinational logic (i.e. registers, etc.) at
`100MHz)
`
`105
`
`Results for the different examples and optimizations (size in library units using
`the tsmc.35 target library; frequency 100MHz for test1, test2 and ATM,
`50MHz for Video Filter; CPU time for synthesis measured on Sun Ultra2 does
`not include high-level synthesis)
`106
`
`XILINX, EX. 1024
`Page 10 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`x
`
`LIST OF ILLUSTRATIONS
`
`1.1
`
`2.1
`
`3.1
`
`4.1
`
`4.2
`
`4.3
`
`4.4
`
`4.5
`
`5.1
`
`5.2
`
`5.3
`
`6.1
`
`Top-down design flow of a system
`
`5
`
`Interface of the allocator block implementing
` and
` functions. 21
`malloc
`free
`
`Representation of
`; the offset and stride
`struct{int a; int b;} r[]
`correspond to the locations
` where
` is integer.
`28
`r[i].b
`i
`
`Memory refinement from a continuous memory space to a set of memories,
`registers and wires
`36
`
`Memory layout of
`struct {char c1; char c2;
`.
`short s; int i;} csi
`
`Encoding of pointers in an array.
`
`38
`
`41
`
`Implementation of
`, where
` may point to
` or
` and
` may point to
`*q=*p+1
`p
`a
`b
`q
` or an element of
`.
`45
`c
`table[]
`
`Architecture for multiple memories and allocators.
`
`Position of SpC between high-level synthesis and architectural mapping.
`
`SystemC and traditional C-to-HDL methodology using SpC.
`
`Architecture of the 2D IDCT.
`
`Optimization of a load.
`
`51
`
`54
`
`56
`
`60
`
`64
`
`XILINX, EX. 1024
`Page 11 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`xi
`Example of code segment before and after optimizing load.
`
`CDFG for
` with
`*p=*p+1
`
`.
`p->{a,b}
`
`65
`
`68
`
`(a) Example of pointer-dependence graph and (b) definitions of the points-to
`set of each pointer.
`74
`
`Example of (a) non-optimal and (b) optimal encodings; codes that are changed
`in the optimal encoding are shown in bold.
`76
`
`Global encoding and selection of the relevant bits for each pointer.
`
`Example of (a) relation matrix and (b) corresponding affinity graph.
`
`Example of optimal encoding.
`
`Pointer-dependence graph and definitions of the points-to sets.
`
`Relation matrix and corresponding affinity graph before folding.
`
`79
`
`82
`
`83
`
`84
`
`85
`
`Relation matrix and corresponding affinity graph after folding
` and
`a
`
`e.
`
`85
`
`Result of the encoding after folding
` and
`a
`
`.
`e
`
`Pointer-dependence graph and definitions of the points-to sets.
`
`Relation matrix and corresponding affinity graph before splitting.
`
`Result of the encoding without splitting.
`
`86
`
`87
`
`87
`
`87
`
`Relation matrix and corresponding affinity graph after splitting symbol
`. 88
`a
`
`Result of the encoding after splitting symbol
`.
`a
`
`Graph embedding algorithm with splitting and folding.
`
`Pointer-dependence graph and definitions of the points-to sets.
`
`88
`
`91
`
`93
`
`(a) Relation matrix and (b) affinity graph at the beginning of iteration 1; (c) af-
`finity graph at the beginning of iteration 2.
`93
`
`
`
`Result of the encoding after splitting and folding (where ‘-’ is don’t care). 96
`
`
`
`Architecture of an optimized general-purpose allocator.
`
`101
`
`Encoding of pointers in an array for optimized general-purpose allocator. 101
`
`6.2
`
`6.3
`
`7.1
`
`7.2
`
`7.3
`
`7.4
`
`7.5
`
`7.6
`
`7.7
`
`7.8
`
`7.9
`
`7.10
`
`7.11
`
`7.12
`
`7.13
`
`7.14
`
`7.15
`
`7.16
`
`7.17
`
`7.18
`
`8.1
`
`8.2
`
`XILINX, EX. 1024
`Page 12 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`CHAPTER 1. Introduction
`
`Definition 1.0.
`
`Example 1.0.
`
`Proposition 1.0.
`
`Figure 1.0
`
`Table 1.0:
`
`Page 1
`
`(1.0)
`
`CHAPTER 1. INTRODUCTION
`
`In the last decades, the complexity of digital systems implemented on silicon has
`
`grown at the rate of 2x every 16 months according to Moore’s law. Within 5 years, inte-
`
`grated circuits (IC) will include as many as 190 million transistors [82]. However, the pro-
`
`ductivity of designers does not improve at the same rate. Studies have shown that
`
`designers or programmers write in average 10 lines of code per day. In order to bridge this
`
`productivity gap, Computer Aided Design (CAD) tools and methodologies have been
`
`developed to facilitate the task of designing systems on silicon. CAD tools enable more
`
`efficient system modeling by abstracting some of the implementation details that can be
`
`automatically derived from the specification.
`
`The most dramatic achievements are for the design of digital circuits. Since the seven-
`
`ties, the placement and the connections of standard logic cells from a netlist onto silicon
`
`have been automated in
`
`
`
`physical design (i.e. Place & Route)
`
`
`
`tools. The eighties have seen
`
`the development of
`
`logic synthesis
`
` and the release of tools that automatically generate a
`
`netlist from a cycle-accurate behavioral model expressed using
`
`Hardware Description
`
`
`
`Languages (HDLs). Examples of HDLs include Verilog HDL [44] and VHDL [48]. In the
`
`XILINX, EX. 1024
`Page 13 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`CHAPTER 1. Introduction
`last decade
`high-level synthesis
`
` (HLS)
`
`tools
`
` (aka
`
`behavioral
`
` or
`
`architectural
`
`
`
`Page 2
`synthesis
`
`tools
`
`) have been developed to automate the mapping of sequential behavioral descriptions
`
`into cycle-accurate representations. Such tools perform both scheduling of operations
`
`(arithmetic and logic operations as well as a memory and register accesses) and resource
`
`binding [14].
`
`1.1 Motivations
`
`Different languages have been used as input to high-level synthesis. HDLs are the
`
`most commonly used. However, designers often write system-level models using pro-
`
`gramming languages, such as C or C++, to estimate the system performances and verify
`
`the functional correctness of the design. C and C++ offer many advantages. The first moti-
`
`vation for using programming languages, as opposed to HDLs at the system level, is the
`
`growing amount of software running on any system. Many of today’s chips incorporate
`
`processor cores running instruction codes compiled from programming languages. More-
`
`over, among programming languages, C and C++ have been the most widely used in the
`
`last two decades. As a result there is a vast amount of legacy code and libraries that can be
`
`reused to quickly model systems.
`
`Even for hardware applications, C and C++ have been often used to accelerate the
`
`design process. C/C++ can be efficiently compiled onto today’s architectures and thus are
`
`used to develop fast simulation models (e.g. to model microprocessors and microcontrol-
`
`ers). Besides, describing both the software and hardware in the same language facilitates
`
`the integration and the verification of hardware components within a software environ-
`
`XILINX, EX. 1024
`Page 14 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`CHAPTER 1. Introduction
`Page 3
`ment. Recent initiatives [71,79,86,89] attempt to standardize a C/C++-based language for
`
`both hardware and software design.
`
`As a result, today, C/C++ or C/C++-based languages can be used to model systems for
`
`both software and hardware components. However, in order to physically map functional-
`
`ity onto hardware, one usually needed to manually translate C/C++ code into HDLs. This
`
`task is well-known for being both time consuming and error prone. The synthesis of hard-
`
`ware directly from C/C++ would then accelerate the design process. A coding style and a
`
`set of restrictions on the language are often defined to simplify the mapping of such mod-
`
`els onto hardware.
`
` Some of the components originally implemented in software may be also be mapped
`
`to specific hardware components for better performances (increased throughput or reduced
`
`latency), power savings, and/or smaller silicon area. To facilitate such software/hardware
`
`migration, synthesis tools would need to support all C/C++ language constructs.
`
`In order to help designers refine their code from a simulation model to a synthesizable
`
`behavioral description, we are trying to efficiently synthesize the full A
`
`NSI
`
` C standard
`
`[32,48]. C was originally designed to develop the UNIX operating system. It provides con-
`
`structs to directly access memory (through pointers) and to manage memories and I/O
`
`using the standard C library (
`,
`,...). These constructs are widely used in soft-
`malloc
`free
`
`ware. Nevertheless, many of the networking and multimedia applications implemented in
`
`hardware or mixed hardware/software systems are also using complex data structures
`
`stored in one or multiple memory banks. As a result, many of the C/C++ features that were
`
`originally designed for software applications are now making their way into hardware.
`
`XILINX, EX. 1024
`Page 15 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`CHAPTER 1. Introduction
`Page 4
`The synthesis of such constructs as dynamic memory allocation, function calls, recursions,
`
`’s, type castings and pointers, turns out to be particularly challenging.
`goto
`
`1.2 Design methodology
`
`The general goal of this research is to automate the generation of digital circuits from
`
`an algorithmic or behavioral description written in C. The realization of this research is a
`
`prototyping tool which fits today’s design methodology by taking a C function as an input
`
`and generating hardware automatically. To better understand where this research fits, it is
`
`important at this point to review the different steps involved in the design of a system. In
`
`the top-down approach shown on Figure 1.1, the system is first modeled at the system
`
`level using programming languages or more application-specific descriptions (such as
`
`Matlab, Mathematica, etc.). The models used at this level are designed to verify the func-
`
`tionality and estimate the performances of the system. Several optimizations are usually
`
`performed. They do not depend on whether the final implementation is hardware or soft-
`
`ware. Algorithmic optimizations are performed, for example to reduce the number of oper-
`
`ations in a given computation. In addition, for a given algorithm, data formats are refined,
`
`for example from floating point to fixed point. Data structures are also refined. In particu-
`
`lar, different implementations of queues or of hash tables could be evaluated.
`
`At the architectural level, the system usually consists of a set of communicating pro-
`
`cesses or threads. These processes can either be mapped to hardware or software. The
`
`communication between the different processes is refined at this point. On the software
`
`side, drivers are generated. On the hardware side, controllers are synthesized to implement
`
`communication protocols. Internal and shared storages are defined for each process.
`
`XILINX, EX. 1024
`Page 16 of 146
`
`
`
`
`
`
`
`
`
`CHAPTER 1. Introduction
`
`Page 5
`
`System-level
`Optimizations
`C/C++
`
`C/C++
`Architecture
`Mapping
`
`C/C++
`
`Compiler
`
`Assembly Code
`Software
`
`System Level
`
`Architectural Level
`
`Register-Transfer Level
`
`Netlist Level
`
`Layout Level
`
`C/C++
`
`HDL
`High-level
`Synthesis
`HDL
`
`HDL
`Logic
`Synthesis
`
`HDL
`Physical
`Design
`
`Polygone Representation
`Transistors
`
`Figure 1.1: top-down design flow of a system
`
`Shared storage is used for inter-process communication whereas internal storage stores
`
`local variables and arrays. After architectural mapping and interfaces definitions, the sys-
`
`tem consists of a set of processes to be mapped to software or hardware. On the software
`
`side, compilers can be used to generate assembly code to be executed on a processor. On
`
`the hardware side, high-level synthesis may be used to generate a cycle-accurate represen-
`
`tation of the design.
`
`XILINX, EX. 1024
`Page 17 of 146
`
`
`
`
`
`
`
`
`
`CHAPTER 1. Introduction
`Page 6
`High-level synthesis takes a sequential description of an algorithm and automatically
`
`implements it as data path and control logic. The data path consists of operators, storage,
`
`and steering logic. Operators, which can be shared, usually implement algorithmic or logic
`
`operations (e.g. add, subtract, multiply, etc.). Storage is usually implemented as a set of
`
`registers or memories. Steering logic represents the connection network between storage
`
`and operators. In addition, the control logic implemented as a state-machine or as micro-
`
`code controls the internal flow of data (i.e. multiplexer select signals, register update sig-
`
`nals, tri-state buffer set signals, etc.). The tasks performed during high-level synthesis
`
`consist of scheduling and resource binding according to latency, throughput and resource
`
`constraints. Scheduling and binding correspond to the tasks of mapping operations to time
`
`slots and operators respectively. The cycle-accurate design generated after high-level syn-
`
`thesis can then be mapped to hardware using a traditional CAD flow consisting of logic
`
`synthesis and physical design.
`
`The research presented here fits as a front end to high-level synthesis tools. It can be
`
`also seen as a back end to system-level tools performing hardware/software partitioning
`
`and interface synthesis. The objective is to take C code instead of HDL as an input to high-
`
`level synthesis. From a system-level perspective, the goal is to automate the transition
`
`from a architectural description written in C to an input understandable by current high-
`
`level synthesis tools.
`
`More specifically, this research in on automating the synthesis of hardware from a C
`
`code with such constructs as pointers (including pointer arithmetic and type casting), com-
`
`plex data structures and dynamic memory allocations. These language features are not part
`
`of any synthesizable HDL subsets. To make this research possible, the code is assumed
`
`XILINX, EX. 1024
`Page 18 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`CHAPTER 1. Introduction
`Page 7
`correct and fully visible (complete specification) at compile time. Existing developments
`
`both for high-level synthesis and compilers are also leveraged
`
`1.3 Objectives and Contributions
`
`The first goal for this research is to be applicable to a large subset of the C language.
`
`As a result, the front-end and initial analysis passes used can deal with any kind of C input
`
`without any assumption of a coding style or of a synthesizable subset. Better quality of
`
`results may be achieved by restricting the input to a limited subset of C or a given coding
`
`style. However, this would defeat our original goal of supporting the full ANSI C. To sup-
`
`port all C constructs, my implementation shares the same general-purpose front end and
`
`pointer analysis passes as advanced optimizing C compilers. The contribution here is on
`
`applying such advanced compiler techniques to the synthesis of hardware from C.
`
`The overall intention of this research is to support full synthesis of C. On the other
`
`hand, to make this research possible, existing tools are leveraged when it is possible. As a
`
`back end, commercial tools are used to perform high-level synthesis. The flip side of using
`
`commercial tools is that they cannot be modified and their internal data structures cannot
`
`be accessed in a research environment. In this research, HLS tools are used as black boxes.
`
`This leads to some limitations both on the architecture generated and the synthesizable
`
`subset. The main limitation is on function calls. First, the implementation of such feature
`
`as recursion in general remains a fundamental problem. Tail recursion or limited recursion
`
`could be implemented. Without recursion, one way of synthesizing functions is to inline
`
`them. However designers often want to map functions to separate components. Today’s
`
`commercial synthesis tools have many restrictions on functions mapped to components.
`
`XILINX, EX. 1024
`Page 19 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`CHAPTER 1. Introduction
`Page 8
`Moreover, the synthesis of C functions mapped to components would require some inter-
`
`action with the HLS tool. For example sharing information would be useful to efficiently
`
`synthesize functions with parameters passed by reference. As a result, the synthesis of
`
`functions is beyond the scope of this thesis and is left as an Appendix.
`
`The contributions of this thesis are the following.
`
`1) This thesis defines the necessary steps to automate the mapping of ANSI C code
`
`into hardware. This work represents the first attempt to efficiently synthesize such
`
`constructs as complex data structures, pointers, pointer arithmetic and type casting.
`
`The resulting tool bridges the gap between high-level synthesis and system-level
`
`tools.
`
`2) A methodology for efficiently supporting dynamic memory allocation and deallo-
`
`cation is also presented. It fits into the general methodology described in Section
`
`1.2 and enables the implementation of recursive data structures into hardware. This
`
`thesis also presents how C code with
`
` and
`malloc
`
` may be automatically
`free
`
`mapped to hardware after system-level optimizations (e.g. refining data structures)
`
`and architectural mapping (defining internal and shared storage).
`
`3) This thesis also describes a set of techniques to optimize the resulting hardware.
`
`- Compiler techniques are used to reduced storage before loads and stores. The
`
`motivation for these optimizations is to reduce the number of registers neces-
`
`sary in a design.
`
`- Encoding algorithm is developed to reduce both the bit-width of pointers and
`
`the size of the circuits translating pointers’ values in assignments and compari-
`
`sons. Assignments of pointers are especially common at function calls.
`
`XILINX, EX. 1024
`Page 20 of 146
`
`
`
`
`
`
`
`
`
`CHAPTER 1. Introduction
`Page 9
`- Novel architecture of hardware memory allocator/deallocator is also intro-
`
`duced. The idea is to encode information about the memory block allocated
`
`inside of the pointers’ value to speed up deallocation.
`
`The outline of the rest of this thesis is the following. In Chapter 2, related work on the
`
`different ways of mapping C code onto different target architectures is presented. Chapter
`
`3 is a background chapter in which compiler techniques, namely pointer analysis, and
`
`their underlaying memory representations are reviewed in the light of hardware synthesis.
`
`The different steps involved in the synthesis of hardware from C code are then presented
`
`in Chapter 4. The outcome of this research, presented in Chapter 5, is the tool SpC that
`
`automates the translation of C models with pointers, complex data structures, etc. into a
`
`code synthesizable by commercially available HLS tools. Results are presented for differ-
`
`ent application domains. Different optimizations can be applied for efficiently mapping C
`
`code onto hardware. In Chapter 6, a set of compiler optimizations to reduce storage before
`
`loads and stores are presented. Chapter 7 describes an encoding algorithm that reduces
`
`both the bit-width of the pointer variables and the complexity of the circuits implementing
`
`comparisons and assignments of pointers. It is important to note at this point that the opti-
`
`mizations presented in Chapters 6 and 7 are also motivated by the synthesis of functions in
`
`C as shown in Appendix A. Finally, Chapter 8 presents how memory allocation and deal-
`
`location can be optimized by selecting different allocator architectures and by applying
`
`compiler optimization techniques. Results are also presented at the end of Chapters 6, 7
`
`and 8 to highlight the benefits of each optimization.
`
`XILINX, EX. 1024
`Page 21 of 146
`
`
`
`
`
`
`
`
`
`
`
`CHAPTER 2. Related Work
`
`Definition 2.0.
`
`Example 2.0.
`
`Proposition 2.0.
`
`Page 10
`
`(2.0)
`
`Figure 2.0
`
`Table 2.0:
`
`CHAPTER 2. RELATED WORK
`
`C/C++ are two of the most common programming languages. C and C++ are both pro-
`
`cedural imperative languages. Their semantics relies on an implicit Von Neuman architec-
`
`ture. In order to be executed, C/C++ programs must be compiled for a given target
`
`architecture. C was originally designed to target a generic software architecture consisting
`
`of a continuous memory space in which all variables are stored and a microprocessor exe-
`
`cuting a sequence of instructions. C/C+ code is therefore sequential by definition.
`
`In the last decade, the implementation of compilers evolved significantly to take
`
`advantage of new target architectures such as multi-processor systems or very large
`
`instruction word (VLIW) machines. At the same time, the features of the language were
`
`exploited, for example using object-oriented features in C++, to support new models of
`
`computation to model mixed hardware-software systems and higher lever of abstractions.
`
`These new constructs may be used to express coarse-grain parallelism, as communicating
`
`sequential processes. However finer-grain parallelism at the operation level cannot easily
`
`be expressed. High-level synthesis as well as compiler analysis may be used to find such
`
`operation-level parallelism.
`
`XILINX, EX. 1024
`Page 22 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`CHAPTER 2. Related Work
`Page 11
`High-level synthesis (HLS) is defined as the set of transformations necessary to auto-
`
`mate the mapping of sequential behavioral descriptions into cycle-accurate descriptions.
`
`Current HLS tools input behavioral HDL descriptions. However, HLS tools supporting
`
`also C/C++ or C/C++-based languages are starting to appear.
`
`This chapter reviews first related work in the domain of high-level synthesis and mod-
`
`eling using C/C++. A selection of recent works in advanced compilers is then presented,
`
`followed by a presentation of current application-specific memory management methodol-
`
`ogy targeted at mixed software/hardware applications.
`
`2.1 Modeling and synthesizing hardware from C/C++
`
`2.1.1 High-level synthesis
`
`High-level synthesis (aka architectural or behavioral synthesis) consists of generating
`
`a structural view of a sequential architectural-level model [14]. The sequential architec-
`
`tural-level model typically consists of a set of parallel processes or tasks. HLS generates a
`
`circuit (i.e., mapping to a structural view) consisting of a data path composed of hardware
`
`resources and a control unit. The architectural-level model can be abstracted as a set of
`
`operations and dependencies. High-level synthesis then performs the following tasks:
`
`-
`
`-
`
`-
`
`
`
`identify the hardware resources (aka operators) implementing each operation
`
`
`
`
`
`schedule the execution time of the operations
`
`bind these scheduled operations to hardware resources (i.e.
`
`resource alloca-
`
`
`
`tion)
`
`In doing so the synthesis tool defines a structural model consisting of a data path, as an
`
`interconnection of hardware resources and a cycle-accurate model of a control unit that
`
`XILINX, EX. 1024
`Page 23 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`CHAPTER 2. Related Work
`Page 12
`issues the control signals to the data path according to the schedule. Resources in the data
`
`path represent not only the operators implementing operations in the original model, but
`
`also storage (e.g. registers) and steering logic which connects operators, storage elements
`
`and Input/Output (I/O) ports.
`
`There have been numerous research projects on high-level synthesis in the past decade
`
`[8,10,11,22,23,29,35,36]. Commercial tools are now available. They include Synopsys
`
`Behavioral Compiler [91], Mentor Graphics Monet [85], Get2chip Volare [83] and Arexys
`
`ArchiMate [72]. These tools take HDL (VHDL or Verilog HDL) as an input. High-level
`
`synthesis that take C/C++ as an input can be derived from these tools. Synopsys CoCentric
`
`SystemC compiler [92] is such an example.
`
`2.1.2 Hardware description languages and C/C++
`
`C/C++ are software programming languages and have little support for describing
`
`hardware efficiently. To model hardware in C/C++, we need the following language fea-
`
`tures present in HDL but not present in C/C++.
`
`-
`
`
`
`Concurrency: hardware is inherently parallel, while C/C++ programs are
`
`inherently sequential. The notion of processes (aka
`
` blocks in Verilog
`always
`
`HDL), which encapsulates programs that execute concurrently, is introduced.
`
`A system is described as a network of processes.
`
`-
`
`-
`
`
`
`Signals: hardware processes need to use signals (akin to wires or buffered
`
`channels) to communicate with one another.
`
`
`
`Reactivity: hardware systems are in continuous interaction with their environ-
`
`ment, i.e. they are reactive. The notion of reactivity is essential to describing
`
`XILINX, EX. 1024
`Page 24 of 146
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`CHAPTER 2. Related Work
`hardware systems at all levels of abstraction.
`
`Page 13
`
`-
`
`
`
`Data abstraction: C/C++ supports data abstractions that are useful for soft-
`
`ware programming. However, for hardware, one needs arbitrary precision
`
`signed and unsigned integers, bit vectors and fixed point types.
`
`With such a set of features added, C/C++ can efficiently model hardware/software sys-
`
`tems [58]. C/C++ contains many features which are not present in synthesizable subsets of
`
`HDLs.
`
`Different subsets of C/C++ and C-like HDLs have been defined and used for synthesis.
`
`I mention first those developed in the eighties. H
`ARDWARE
`
`C [36] is a language with a C-
`
`like syntax and a cycle-based semantics. It models hardware components at the architec-
`
`tural level using C constructs. H
`ARDWARE
`
`C can be fully synthesized. However, it makes
`
`both restrictions and additions to ANSI C. Constructs are added to define bit-widths, mod-
`
`ules and ports. The language includes a quite extended subset of the C constructs includ-
`
`ing unbounded loops and some form of f