Journal of VLSI Signal Processing, 1, 345-365 (1990)
© 1990 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
`
A Clock-Free Chip Set for High-Sampling Rate Adaptive Filters
`
TERESA H.-Y. MENG*
Department of Electrical Engineering, Stanford University, Stanford, California 94305.

ROBERT W. BRODERSEN AND DAVID G. MESSERSCHMITT
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720.
`
Received May 8, 1989. Revised October 24, 1989.
`
Abstract. As digital signal processing systems become larger and clock rates increase, the typical design approach using global clock synchronization will become increasingly difficult. The application of asynchronous clock-free designs to high-performance digital signal processing systems is one promising approach to alleviating this problem. To demonstrate this approach for a typical signal processing task, the system architecture and circuit design of a chip set for implementing high-rate adaptive lattice filters using asynchronous design techniques are presented.
`
1. Introduction

The issues in designing computing machines, both in hardware and in software, have always shifted in response to the evolution in technology. VLSI promises great processing power at low cost, but there are also new constraints that potentially prevent us from taking advantage of technology advances. The increase in available processing power is a direct consequence of scaling the digital IC process. As the scaling of the IC process continues, it is doubtful that the benefits of the faster devices can be fully exploited, due to other fundamental limitations. System clock speeds are starting to lag behind logic speeds in recent chip designs. While gate delays are well below 1 ns in advanced CMOS technologies, clock rates of more than 50 MHz are difficult to obtain, and where they have been attained they require extensive design and simulation effort.

One important design consideration that limits clock speeds is clock skew [1], which is the difference in phase of a global synchronization signal observed at different locations in the system. Clock skew can be reduced to a minimum with proper clock distribution [2], [3], but because of this global constraint high-performance circuitry is confined to a small chip area.
The alternative fully asynchronous design approach eliminates the need for a global clock and circumvents problems associated with clock skew. At the chip level, since there are no global timing considerations, the design time spent on layout and circuit timing simulation for the worst-case design is greatly reduced. At the board level, since the asynchronous approach provides design modularity, systems built using this approach can be easily extended without problems in global synchronization [4], [5].

*This research was sponsored in part by the Semiconductor Research Corporation and by DARPA.
Several recent dedicated hardware designs have achieved high-speed performance by eliminating clock skew problems through asynchronous designs [6], [7], and globally-asynchronous locally-synchronous systems [8]-[10] have been considered. Our work is highly motivated by feasibility and implementation. We are particularly concerned with performance, since we believe that the usefulness of a system will eventually be judged by its cost-effectiveness. Hence, we put special emphasis on exploiting maximum concurrency at both the architecture and the circuit levels, so that a high-performance implementation can be obtained without dependence on advanced technology.

Motivations for adopting an asynchronous design methodology can be summarized as follows.
Scaling and Technological Reasons. The main motivation for going to asynchronous circuitry is to eliminate the requirement for a global clock. Clock skew is a direct result of the RC time constants of the wires and the capacitive loading on the clock lines. While scaling the IC process reduces the area requirements for a given circuit, more circuitry is usually placed on the chip to take advantage of the extra space and to further reduce system costs [11]. Hence, for a global signal such as a clock, the capacitive load tends to stay constant with scaling: the capacitance associated with the interconnect layers cannot scale below a lower limit.
`
`
`
SAMSUNG EXHIBIT 1011
`
As the devices get smaller, the clock load represents a larger load relative to the scaling factor, and more buffering is needed to drive it, complicating the clock skew problem. Calculations show that it would take roughly 1 ns for a very large transistor (1/gm ≈ 50 Ω) to charge a capacitance of 10 pF in an advanced CMOS technology [12]. This delay constitutes a constant speed limitation if a global signal is to be transmitted through the whole chip, not to mention the whole board.
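This order of magnitude can be checked with a first-order RC estimate. A sketch follows: the 50 Ω and 10 pF values are taken from the text above, while the 2.2RC factor for a 10%-90% rise time is a standard rule of thumb, not from the original.

```python
# First-order estimate of the delay to drive a 10 pF global clock load
# through a driver with output resistance 1/gm = 50 ohms.
R = 50.0      # ohms (1/gm of a very large transistor, from the text)
C = 10e-12    # farads (global clock capacitance, from the text)

tau = R * C             # RC time constant
t_rise = 2.2 * tau      # 10%-90% rise time (standard approximation)

print(f"tau = {tau * 1e9:.2f} ns, rise time = {t_rise * 1e9:.1f} ns")
```

The estimate reproduces the roughly 1 ns figure quoted above.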
`
CAD Tools and Layout Factors. Increased circuit complexity has been addressed by the development of CAD tools to aid the designer. These tools often allow the designer to specify the functionality of a system at a structural level and have a computer program generate the actual chip layout according to the interconnection specified. The routing of clock wires, as well as the load on the clock signal, are global considerations which are difficult to extract from local connectivity. From a system design point of view, asynchrony means design modularity. A modular design approach greatly simplifies the design effort for complicated systems and fits well with the macro-cell design methodology often adopted by ASIC CAD tools.
`
Board Level Design. One of the primary advantages of using fully asynchronous design in implementing high-performance DSP systems is the ease of design at the board level, where clock distribution is not an issue. Inter-board communications have long used asynchronous links (for example, the VME bus protocol), and interchip intra-board communications are becoming asynchronous too (for example, the MC68000 series). As DSP systems become complex, it is advantageous to simplify the design task to only local considerations by using asynchronous components. This is particularly important for signal processing applications using pipelined architectures, where computation can be extended by pipelining without any degradation in overall system throughput.
One goal of this work is to demonstrate the ease with which asynchronous systems can be designed with a minimum amount of design effort. We have thus designed a chip set that realizes a vectorized adaptive lattice filter, with arbitrary vector size, using an asynchronous design methodology. Our adaptive filter architecture has localized forward-only interconnections, full pipelining in which the communication paths between chips constitute pipeline stages, and asynchronous interconnection eliminating the global constraint of clock distribution. Hence our architecture can achieve arbitrary throughput consistent with input/output rate limitations (such as the speed of A/D converters). The experimental implementation showed that asynchronous design simplifies the design process, since no effort was devoted to clock distribution, nor were any timing simulations to verify clock skew necessary.
Section 2 gives a short description of the vectorized adaptive lattice filter structure and the chip partition of the adaptation algorithm. Section 3 reviews the asynchronous design methodology used in the chip implementation. Section 4 illustrates the design of a fast self-timed array multiplier as an example of designing computation blocks with completion signal generation. Section 5 describes the design of the chip set for an adaptive lattice filter along with chip performance evaluations, and Section 6 gives the conclusions.
`
2. Vectorized Adaptive Filters

The realization of adaptive filters at high sampling rates is important in many applications but is made inherently difficult by the recursive nature of the algorithms. The block adaptive filter scheme was proposed, in which filters adapt coefficients at time T using the coefficients at time T-L, where L is the block size. In applications that require both fast convergence and fast tracking, such as some radar signal processing systems, introducing delays in the adaptation algorithm will degrade the filter's tracking capability and destabilize the filter dynamics.

The vectorized adaptive filter scheme introduced in [13] achieves two objectives simultaneously. The first objective is to allow arbitrarily high sampling rates for a given speed of hardware, at the expense of parallel hardware. The second objective is to not modify the input-output characteristics of the algorithm, and hence not affect the convergence and tracking capability. We chose to use a lattice filter [14] in our design because the adaptation feedback in a lattice filter can be limited to a single stage. This results in much less computation within the feedback path and higher inherent speed in a given technology. The successive stages of a lattice filter are independent and can therefore be pipelined.
`
2.1. The Adaptation Algorithm

Linear adaptive filtering is a special case of a first-order linear dynamical system. The adaptation of a
`
single stage of a lattice filter structure can be described by two equations: the state-update equation

k(T) = a(T)k(T-1) + b(T),    (1)

and the output computation

y(T) = f(k(T), T),    (2)
`
where a(T) and b(T) are read-out (memoryless) functions of the input signals at time T, and y(T) is a read-out function of the state k(T) and the input signals. Since the computation of y(T) is memoryless, the calculation can be pipeline-interleaved [15] and the computation throughput can be increased without theoretical limit, if input signals and state information can be provided at a sufficiently high speed. However, the state update represents a recursive calculation, which results in an iteration bound on the system throughput [16]. In order to relax the iteration bound without changing the system's dynamical properties, Equation (1) can be written as
`
k(T+L-1) = a(T+L-1)a(T+L-2)...a(T+1)a(T)k(T-1)
         + a(T+L-1)a(T+L-2)...a(T+1)b(T) + ...
         + a(T+L-1)b(T+L-2) + b(T+L-1)
         = c(T,L)k(T-1) + d(T,L)    (3)
`
where c(T, L) and d(T, L) are functions of the input signals, but independent of the state k(T). c(T, L) and d(T, L) can be calculated with high throughput using pipelining and parallelism, and the recursion k(T + L - 1) = c(T, L)k(T - 1) + d(T, L) need only be computed every L samples. Therefore, the iteration bound is increased by a factor of L. L is called the vector size, since a vector of L samples will be processed simultaneously.
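The equivalence between Equation (1) iterated sample by sample and the vectorized recursion of Equation (3) applied once every L samples can be sketched as follows; the a and b sequences here are arbitrary stand-ins, not filter data.

```python
# Sketch: iterate k(T) = a(T)k(T-1) + b(T) directly, and via the
# look-ahead form k(T+L-1) = c(T,L)k(T-1) + d(T,L) applied once
# every L samples; both must yield the same final state.
L = 3
a = [0.9, 1.1, 0.8, 1.05, 0.95, 1.0]   # stand-in input functions a(T)
b = [0.2, -0.1, 0.3, 0.05, -0.2, 0.1]  # stand-in input functions b(T)

k_direct = 0.0
for T in range(len(a)):
    k_direct = a[T] * k_direct + b[T]

k_vec = 0.0
for T0 in range(0, len(a), L):
    c, d = 1.0, 0.0
    for T in range(T0, T0 + L):        # c and d depend only on inputs
        c, d = a[T] * c, a[T] * d + b[T]
    k_vec = c * k_vec + d              # one state update per L samples

assert abs(k_direct - k_vec) < 1e-12
```

The inner loop that accumulates c and d involves no state feedback, which is why it can be pipelined and parallelized freely.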
`
2.2. The Chip Partition for the Normalized LMS Lattice Filter

A normalized least mean squares (LMS) or stochastic gradient adaptation algorithm is chosen for our example because of its simplicity and the absence of numerical overflow. A vectorized LMS lattice filter stage with L = 3 is shown in figure 1, and the operations that each processor needs to perform are shown in figure 2. The derivation of the algorithm for normalized LMS lattice filters can be found in [17]. Processors A1 and A2 jointly calculate the c(T, L) and d(T, L) in Equation (3), processors B1 and B2 calculate the state update for state k, and processor C1 performs the output computation. For every processing cycle, a
`
Fig. 1. An LMS adaptive lattice filter stage with a vector size of three.
`
`
Fig. 2. Operations required of each processor shown in figure 1.
`
vector of L input samples is processed and a vector of L output samples is calculated. The processing speed is L times slower than the filter sampling rate. Since there is no theoretical limit to the number of samples allowed in a vector, the filter sampling rate is not constrained by the processing speed; rather, it is limited only by the I/O bandwidth (such as the speed of data converters). However, the higher sampling rate is achieved at the expense of additional hardware complexity and system latency, and very high sampling rates will require multiple-chip realizations. For example, at a sampling rate of 100 MHz the computational requirement is approximately 3 billion operations per second per stage. Since there is no global clock routing at the board level with asynchronous design, pipelined computation hardware can be easily extended without any degradation in overall throughput as the vector size is increased and the number of lattice filter stages is increased.
`
Our goal is to design a set of chips that can be connected at the board level to achieve any sampling rate consistent with I/O limitations. This requires the partitioning of a single lattice filter stage into a set of chips which can be replicated to achieve any vector size. In this partitioning we attempted to minimize the number of different chips that had to be designed, allow flexibility in the vector size (that is, the amount of parallelism), and minimize the data bandwidth (number of pins) on each chip.

We found a partitioning that uses five different chips and meets these requirements. Block diagrams of four of them are shown in figure 3, and a fifth chip simply implements variable-length pipeline stages. Two of the four chips (PE1 and PE4) have built-in hardware multiplexers so that the chip function can be controlled by external signals to form a reconfigurable data path; otherwise eight different chip designs would have been required. The same chip set can also be used to construct a lattice LMS joint process estimator. These chips will be replicated as required and interconnected at the board level to achieve any specified system sampling rate.
`
3. Asynchronous System Design

An asynchronous processor is composed of two types of basic blocks: computation blocks and interconnection blocks. Computation blocks perform processor operations, which include any combinational logic such as multipliers and ALUs, and memory elements. A computation block must be designed such that the computation is started by an external request signal and generates an output signal to indicate that the computation has completed. The completion signal can be readily generated by using a class of logic family called Differential Cascode Voltage Switch Logic (DCVSL). A comprehensive discussion of DCVSL is given in [12]. The completion signal can be viewed as a locally generated clock that indicates to the succeeding blocks when the output data is ready to be fetched and requests a data transfer.

When computation blocks are to be interconnected, asynchronous interconnection blocks must be inserted among them to ensure correct data transfers under all temporal variations of the various completion signals.
`
`
Legend: squarer; multiplier; adder/subtracter; multiplexing; shifter; 16-bit to 4-bit converter; pipelining stage.

Fig. 3. A chip set designed to implement the LMS adaptive lattice filter shown in figure 1. The partition was done in such a way that any required filter sampling rate can be achieved by replicating these chips at the board level without redesign. PE1 and PE4 have built-in hardware multiplexers so that the chip function can be controlled by external signals to form a reconfigurable data path.
`
These interconnection blocks take the completion signals generated by computation blocks and issue request signals to start the computation in other computation blocks. It can be shown that the two binary signals, the request signal for start of computation and the completion signal for finish of computation, are necessary and sufficient for realizing general asynchronous computation with unbounded delays [18]. Interconnection blocks are composed of hazard-free asynchronous logic, which can be synthesized from a behavioral description specifying the sequence of operations that each interconnection block needs to perform [19]. We will now discuss the properties of asynchronous computation from a system point of view.
`
3.1. Properties of Asynchronous Computation Blocks

DCVSL computation blocks use so-called dual-rail [20] coded signals inside the logic, which compute both the data signal and its complement. Contrary to past expectations, if there is no carry chain in the design, asynchronous computation using DCVSL blocks always gives the same worst-case computation delay, independent of input data patterns, since both rising and falling transitions have to be completed before the completion signal can go high. Hence, the computational latency of each computation block for different input data patterns is approximately constant. This lack of data dependency, to first order, is actually an advantage for real-time signal processing, since the worst-case computation delay can be easily determined.
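The dual-rail completion idea can be sketched behaviorally as follows. This is a software model only; real DCVSL detects completion at the gate outputs, and the exact completion-detection gating is not specified here.

```python
# Dual-rail model: each bit travels on a (true, false) wire pair.
# After precharge both wires are low; evaluation raises exactly one
# wire per pair. Completion = every pair has exactly one wire high.
def encode(bits):
    """Dual-rail encode 0/1 values: bit b -> (b, 1-b)."""
    return [(b, 1 - b) for b in bits]

def completed(pairs):
    """True when every bit position has finished evaluating."""
    return all(t ^ f for (t, f) in pairs)

precharged = [(0, 0)] * 16            # no bit has evaluated yet
assert not completed(precharged)
assert completed(encode([1, 0] * 8))  # all 16 bits evaluated
```

Because completion requires every pair to make a transition, the delay is set by the slowest bit regardless of the data pattern, matching the constant-latency property described above.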
`
Once an asynchronous processor has been designed, there is no way to slow down the internal computation. The throughput can be controlled by issuing the request signal at a certain rate, but the computation within the processor is performed at full speed. This property has an impact on testing: unlike synchronous circuits, we cannot slow down the circuitry to make it work with a slower clock.

In asynchronous design it is tempting to introduce shortcuts that improve performance at the expense of speed-independence. However, these shortcuts necessitate extensive timing simulations to verify correctness under varying conditions. To minimize design complexity we therefore generally stayed true to the speed-independent design, with the result that timing simulations were required only to predict performance and not to verify correctness.
`
3.2. Properties of Asynchronous Interconnection Blocks

When computation blocks communicate with one another, an interconnection block is designed to control the data transfer mechanism. We require that an interconnection circuit be delay-insensitive; that is, the circuit's behavior is independent of the relative gate delays among all the physical elements in the circuit. Delay-insensitive circuits demand that the Boolean functions of these circuits not allow any hazard conditions. An automated synthesis procedure for designing hazard-free asynchronous interconnection circuits from a
`
behavioral level description has been developed [19] and used to generate all the interconnection circuits required in the adaptive filter chip set.

One example of an interconnection block, a four-phase pipeline handshake circuit, is shown in figure 4. The pipeline handshake circuit allows the two computation blocks connected to it to process data simultaneously. The C-element shown in the figure implements the Boolean function c = ab + bc + ca, where a and b are the two input signals and c is the output signal. The circuit is optimum in the sense that it allows the maximum concurrency under the hazard-free condition and requires a minimum amount of hardware for implementation. Other more complicated interconnection circuits such as multiplexers or bus controllers have been synthesized to realize a fully asynchronous programmable processor [21]. The hardware overhead incurred by these interconnection circuits is usually small, often on the order of tens of transistors. However, the time-delay overhead incurred by the circuit delays cannot be ignored, and will be discussed in Section 5 as part of the chip performance evaluation.
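The C-element's behavior can be sketched as a small state-holding function; this is a behavioral model only, not the gate-level realization.

```python
# Muller C-element: the output goes high when both inputs are high,
# low when both are low, and otherwise holds its previous value.
# This is exactly c = ab + bc + ca with the output c fed back.
class CElement:
    def __init__(self, c=0):
        self.c = c

    def step(self, a, b):
        self.c = (a & b) | (b & self.c) | (self.c & a)
        return self.c

ce = CElement()
assert ce.step(1, 0) == 0   # inputs disagree: hold low
assert ce.step(1, 1) == 1   # both high: go high
assert ce.step(0, 1) == 1   # disagree: hold high
assert ce.step(0, 0) == 0   # both low: go low
```

The hold behavior is what lets handshake circuits wait for both a request and an acknowledge transition before advancing.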
`
Fig. 4. An optimum four-phase pipeline handshake circuit that allows the two computation blocks connected to it to process data at the same time.
`
3.3. Connection of Registers to Interconnection Blocks

The connection between computation blocks and interconnection blocks is governed by the request-completion signal pairs. Registers are usually considered part of the interconnection blocks, since they implement a pipeline stage in the signal processing sense. For example, the connection of the pipeline handshake circuit (figure 4) to the corresponding register is shown in figure 5. The handshake circuit uses the rising edge of the output acknowledge signal A_out to control register latching. The register shown in figure 5 is an edge-triggered latch. The reason that an edge-triggered latch is needed instead of a level-triggered latch is fast acknowledge signal feedback. Notice that in figure 4, R_out is fed back to the first C-element without waiting for the completion of the block connected to it. It can be easily verified that if a feedback
Fig. 5. The connection between the pipeline register, the pipeline handshake circuit, and the computation blocks.
`
signal is generated only after the completion of the succeeding block, the handshake circuit will enforce that only alternate blocks can compute at the same time, and the hardware utilization is reduced to at most 50%.

In figure 5, the first C-element of the full handshake controls the feedback acknowledge signal, which indicates to the previous block whether its output data has been acknowledged, or received, by the register, so that the previous block can proceed to compute the next sample. The second C-element controls the request to the next block, which indicates to the next block when to start computation. A completion detection mechanism [22] is used to detect the completion of the register latching.
The data latching mechanism shown in figure 5 is logic-delay-insensitive, but not propagation-delay-insensitive, as the propagation delays of the data lines might differ from that of the request signal R_in. If the request signal and data lines do not match in propagation delay, differential pairs of data lines, or coded data lines, will have to be transmitted, and the completion signal R_in will be generated locally at the input to the C-element. In our design, we also assume that the propagation delays within each logic block are negligible. To gain some overall performance improvement, we could match the delays of the register latching and the second C-element. However, simulation showed that any attempt at logic delay matching proved to be unreliable, and thus we chose to stay with a delay-insensitive design.
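The four-phase (return-to-zero) discipline that these circuits implement can be summarized as an event sequence. This is a protocol sketch using the signal names of figure 5; timing and the internal C-element activity are abstracted away.

```python
# One four-phase handshake cycle on a single channel: the register
# latches on the rising edge of the acknowledge, and both wires
# return to zero before the next data item is offered.
def four_phase_cycle(events):
    events.append(("R_in", 1))    # sender: data valid, raise request
    events.append(("A_out", 1))   # receiver: data latched, raise ack
    events.append(("R_in", 0))    # sender: release request
    events.append(("A_out", 0))   # receiver: ready for next item

events = []
four_phase_cycle(events)
assert events == [("R_in", 1), ("A_out", 1), ("R_in", 0), ("A_out", 0)]
```

The two return-to-zero transitions are the "two set-reset cycles of a C-element" counted as handshake overhead in Section 3.5.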
`
3.4. System Initialization

An asynchronous system can be initialized in several states, depending on whether initial samples are necessary or not [21]. For a feed-forward calculation, the initial samples do not have any impact on output behavior, so we might as well set all the initial conditions (the request and acknowledge signals controlled by interconnection blocks) to zero. For feedback configurations, samples within the feedback loop are necessary to avoid deadlock; hence, either the request or the acknowledge signal has to be initially set high to represent a sample in a pipeline stage within a feedback loop. The output of a memory element, for example the C-element, can easily be set high initially, but when combined with DCVSL computation blocks the initialization process is slightly more complicated than just setting the C-element.
When the request signal of a DCVSL block is set high while its input register is being cleared at initialization, the output data of the DCVSL will be undefined, which may result in a garbage sample that will propagate around the feedback loop. This is no problem for data samples in our particular adaptive filtering application, since the effect of an arbitrary initial sample will fade away because of the adaptation procedure. On the other hand, an undefined logic state may deadlock the system right at the start. A two-phase initialization is adopted in our design to cope with this problem.

In the first initialization phase, every handshake signal is reset to zero to ensure that the input data to each DCVSL block is stable, and then in the second phase, those request signals corresponding to a sample delay in a feedback loop are set high so that a clean completion signal can subsequently be generated. The initialization can be easily implemented with a set-reset memory element. In our design, a set memory element can be set high by an init signal, and a reset memory element can be reset low by the same init signal, but set memory elements and reset memory elements are disjoint. A simple logic network as shown in figure 6 is used to realize the two-phase feedback initialization.

During the first phase, when init is high and feed is low, the logic output R_out is low, and, therefore, a clean precharged signal R_comp can be generated. During the second phase, feed is set high, which sets the output of the NAND gate low (R_req, the request signal, is also high initially), and R_out is set high through the select logic. A completion signal is subsequently generated by the DCVSL (R_comp), indicating that there is a sample in the DCVSL block. init can then be reset and the system can be started without deadlock.
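The two-phase start-up can be sketched as follows; the signal names are illustrative, and in the actual chips the set/reset behavior is built into the memory elements rather than computed in software.

```python
# Two-phase initialization: phase 1 (init high) clears every handshake
# signal so all DCVSL inputs are stable; phase 2 (feed high) raises
# only the request signals that stand for a sample in a feedback loop.
def initialize(signals, feedback_requests):
    for name in signals:            # phase 1: reset everything low
        signals[name] = 0
    for name in feedback_requests:  # phase 2: seed the feedback loops
        signals[name] = 1
    return signals

sig = initialize({"R1": 1, "A1": 1, "R_fb": 0, "A_fb": 1}, ["R_fb"])
assert sig == {"R1": 0, "A1": 0, "R_fb": 1, "A_fb": 0}
```

Ordering the two phases guarantees that each seeded request sees stable register data, so the resulting completion signal is clean.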
`
3.5. System Throughput

For a pipelined architecture, the system throughput is the reciprocal of the additive delays of the longest pipeline stage
`
Fig. 6. The logic used to implement the two-phase initialization. During the first phase, init is high and all the reset memory elements and register contents are reset low. During the second phase, feed is set high and the request signals (R_out) corresponding to a sample delay in a feedback loop are set high through the select logic. A completion signal is subsequently generated by the DCVSL (R_comp), indicating that there is a sample in the DCVSL block.
`
plus one handshake (two set-reset cycles of a C-element) plus the precharge. The delays of the handshake and the precharge can be considered overhead as compared to a synchronous design (although the synchronous design will also have some overhead in register latching), and may range from 8 ns to 20 ns (1.6 μm CMOS) depending on the data bit width and the handshake circuits used. A nominal delay of 10 ns was estimated from simulation using the 1.6 μm CMOS parameters.

A 10 ns overhead may seem too large to be practically acceptable. But this overhead will be reduced in direct proportion to the gate delay in more advanced technologies. We expect this overhead to be below 2 ns in 0.8 μm CMOS technology, which will allow system throughputs up to a few hundred MHz without distributing a global clock. Also worth mentioning is that the system throughput of a pipelined architecture is independent of the system complexity if the delay of the longest stage remains constant. A pipelined system can be extended without throughput degradation, since only local communication counts toward the throughput limitation and there is no global clock to constitute a measure of the system's physical size.
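As a back-of-the-envelope check of this throughput model: the 40 ns stage delay below is the multiplier-core latency quoted in Section 4, and the 10 ns overhead is the nominal handshake-plus-precharge figure above.

```python
# Pipeline cycle time = longest stage delay + handshake/precharge
# overhead; the system throughput is its reciprocal.
t_stage = 40e-9      # longest stage (multiplier core, Section 4)
t_overhead = 10e-9   # nominal handshake + precharge (1.6 um CMOS)

throughput = 1.0 / (t_stage + t_overhead)
print(f"processing rate = {throughput / 1e6:.0f} MHz")  # 20 MHz
```

With a vector size of L, the filter sampling rate is L times this processing rate, which is how the chip set reaches sampling rates beyond the speed of any single stage.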
`
4. A Self-Timed Array Multiplier

In this section we will describe the design of a fast self-timed array multiplier to illustrate the circuit design considerations often encountered in designing DCVSL computation blocks, and the solutions to these problems.

Divisions in LMS stochastic gradient adaptive filtering are often used for step-size normalization and can be approximated by shifters without much degradation in convergence properties, as verified by simulations [23]. Thus, in our application the most complicated computation block needed is a fast multiplier. The design of a self-timed multiplier will now be described.
`
`
4.1. Architecture of the Multiplier

A multiplication in adaptive filters is usually followed by an accumulation operation, and the final addition in the multiplication can be combined with the accumulation to save the delay time of one carry-propagate add. As shown in figure 7, the multiplier core outputs two partial products to be added in the succeeding accumulate-adder, which consists of one carry-save add and one final carry-propagate add. The multiplier core is composed of three basic cells: registers, Booth encoders, and carry-save adders. The 16 multiplier bits are grouped into eight groups, and each group is Booth encoded to either shift or negate the multiplicand [24]. Eight partial products are computed and added using a modified Wallace tree structure [25]. Four of these partial products can be added concurrently, and therefore the computation latency for eight additions is four times the delay of one carry-save add. Figure 7 shows the architecture of the multiplier core, which consists of eight Booth encoders operating concurrently and six carry-save adders, four of them operating sequentially.

The inv signals from the encoders to the carry-save adders implement the plus-one operation in the two's complement representation of a negative number. The maximum quantization error is designed to be less than one least-significant bit by adding 0.5 to one of the carry-save adders for rounding. The multiplier core occupies a silicon area of 2.7 mm × 3 mm in a 1.6 μm CMOS design, and a computational latency of 40 ns was estimated from simulation by irsim [26] using the 1.6 μm parameters. The test results of the design will be given in Section 5. To increase the system throughput, a pipeline stage was added between the multiplier core and the accumulate-adder. We could have deep-pipelined the multiplier core, because pipelined computation within a feedback loop can be compensated by look-ahead computation without changing the system's input-output behavior [27], [28]. However, since the total multiplication latency has been reduced to only one Booth encoding plus four carry-save adds for 16-bit data, the overhead incurred in adding pipeline stages, such as the circuitry for pipeline registers and the delay time for register latching, completion detection, and handshake operation,
`
Fig. 7. The architecture of the multiplier core, which consists of eight Booth encoders operating concurrently and six carry-save adders with four of them operating sequentially.
`
`
rules out the practicality of having one more stage of pipelining within the multiplier core.
The accumulate-adder consists of a carry-save adder at the front and a carry-propagate adder at the end, to add three input operands per processing cycle: two of the three operands are the output partial products from the multiplier core, with the third coming from a second source. The accumulate-adder occupies a silicon area of 1.1 mm × 1 mm in a 1.6 μm CMOS design. A propagation delay of 30 ns was estimated by simulation using the 1.6 μm parameters, and the test results will be given in Section 5.
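The carry-save step can be sketched at the word level; this is a bitwise model of the 3:2 compression, whereas the actual adders operate on dual-rail DCVSL signals.

```python
# 3:2 carry-save compression: three operands are reduced to a sum
# word and a carry word with no carry propagation; a single final
# carry-propagate add then produces the true result.
def carry_save(a, b, c):
    s = a ^ b ^ c                         # per-bit sum
    cy = (a & b) | (b & c) | (c & a)      # per-bit carry, weight 2
    return s, cy

pp0, pp1, acc = 0x1234, 0x0F0F, 0x00FF    # two partial products + third operand
s, cy = carry_save(pp0, pp1, acc)
assert s + (cy << 1) == pp0 + pp1 + acc   # final carry-propagate add
```

Deferring carry propagation to a single final add is what lets the Wallace tree sum the eight partial products in only four carry-save delays.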
`
4.2. Circuit Design for the Multiplier Core

The goal in designing the multiplier core was to reduce the DCVSL delay overhead by allowing maximum concurrent operation in both the precharge and evaluation phases. Layout is another important design factor, since a Wallace tree structure consumes more routing area than sequential adds. Data bits have to be shifted two places between each Booth encoder array to be aligned at the proper position for addition. Since each data bit is coded in differential form, four metal lines (two for the sum and two for the carry) have to be shifted two places per array, which constitutes a 50% routing area overhead as compared to the same structure in a synchronous design. In this subsection, the basic cells used in the multiplier core, modified from a design provided by Jacobs & Broder
