`Document Description: Provisional Cover Sheet (SB16)
`
`PTO/SB/16 (04-07)
`Approved for use through 06/30/2010 OMB 0651-0032
`U.S. Patent and Trademark Office: U.S. DEPARTMENT OF COMMERCE
Under the Paperwork Reduction Act of 1995, no persons are required to respond to a collection of information unless it displays a valid OMB control number.
`
`Provisional Application for Patent Cover Sheet
This is a request for filing a PROVISIONAL APPLICATION FOR PATENT under 37 CFR 1.53(c).
`
`Inventor(s)
`
`Inventor 1
`
`
`
`
`
`
`
`Customer Number
`
Given Name: Joseph
Middle Name:
Family Name: Bates
City: Baltimore
Country: US
`
`All Inventors Must Be Listed — Additional Inventor Information blocks may be
`generated within this form by selecting the Add button.
`
`
`
`Title of Invention Massively Parallel Processing with Compact Arithmetic Element
`
`Attorney Docket Number (if applicable)
`
`A0006-1001L
`
`Correspondence Address
`
Direct all correspondence to (select one):

(X) The address corresponding to Customer Number    ( ) Firm or Individual Name
`
`
`
The invention was made by an agency of the United States Government or under a contract with an agency of the United States Government.

(X) No.

( ) Yes, the name of the U.S. Government agency and the Government contract number are:
`
EFS - Web 1.0.1
`
Google Ex. 1055 - Page 1

Google Exhibit 1055
Google v. Singular
`
`
`
`Doc Code: TR.PROV
`
`Entity Status
`Applicant claims small entity status under 37 CFR 1.27
`
( ) Yes, applicant qualifies for small entity status under 37 CFR 1.27

( ) No
`
`Warning
`
Petitioner/applicant is cautioned to avoid submitting personal information in documents filed in a patent application that may contribute to identity theft. Personal information such as social security numbers, bank account numbers, or credit card numbers (other than a check or credit card authorization form PTO-2038 submitted for payment purposes) is never required by the USPTO to support a petition or an application. If this type of personal information is included in documents submitted to the USPTO, petitioners/applicants should consider redacting such personal information from the documents before submitting them to the USPTO. Petitioner/applicant is advised that the record of a patent application is available to the public after publication of the application (unless a non-publication request in compliance with 37 CFR 1.213(a) is made in the application) or issuance of a patent. Furthermore, the record from an abandoned application may also be available to the public if the application is referenced in a published application or an issued patent (see 37 CFR 1.14). Checks and credit card authorization forms PTO-2038 submitted for payment purposes are not retained in the application file and therefore are not publicly available.
`
`Signature
`
Please see 37 CFR 1.4(d) for the form of the signature.
`
`/Robert Plotkin/
`
`Date (YYYY-MM-DD)
`
`[2009-06-19
`
`First Name
`
`Robert
`
`Last Name
`
`Plotkin
`
`Registration Number
`(If appropriate)
`
`43861
`
This collection of information is required by 37 CFR 1.51. The information is required to obtain or retain a benefit by the public which is to file (and by the USPTO to process) an application. Confidentiality is governed by 35 U.S.C. 122 and 37 CFR 1.11 and 1.14. This collection is estimated to take 8 hours to complete, including gathering, preparing, and submitting the completed application form to the USPTO. Time will vary depending upon the individual case. Any comments on the amount of time you require to complete this form and/or suggestions for reducing this burden should be sent to the Chief Information Officer, U.S. Patent and Trademark Office, U.S. Department of Commerce, P.O. Box 1450, Alexandria, VA 22313-1450. DO NOT SEND FEES OR COMPLETED FORMS TO THIS ADDRESS. This form can only be used in conjunction with EFS-Web. If this form is mailed to the USPTO, it may cause delays in handling the provisional application.
`
`
Google Ex. 1055 - Page 2
`
`
`
`Privacy Act Statement
`
The Privacy Act of 1974 (P.L. 93-579) requires that you be given certain information in connection with your submission of the attached form related to a patent application or patent. Accordingly, pursuant to the requirements of the Act, please be advised that: (1) the general authority for the collection of this information is 35 U.S.C. 2(b)(2); (2) furnishing of the information solicited is voluntary; and (3) the principal purpose for which the information is used by the U.S. Patent and Trademark Office is to process and/or examine your submission related to a patent application or patent. If you do not furnish the requested information, the U.S. Patent and Trademark Office may not be able to process and/or examine your submission, which may result in termination of proceedings or abandonment of the application or expiration of the patent.
`
The information provided by you in this form will be subject to the following routine uses:

1. The information on this form will be treated confidentially to the extent allowed under the Freedom of Information Act (5 U.S.C. 552) and the Privacy Act (5 U.S.C. 552a). Records from this system of records may be disclosed to the Department of Justice to determine whether disclosure of these records is required by the Freedom of Information Act.

2. A record from this system of records may be disclosed, as a routine use, in the course of presenting evidence to a court, magistrate, or administrative tribunal, including disclosures to opposing counsel in the course of settlement negotiations.

3. A record in this system of records may be disclosed, as a routine use, to a Member of Congress submitting a request involving an individual, to whom the record pertains, when the individual has requested assistance from the Member with respect to the subject matter of the record.

4. A record in this system of records may be disclosed, as a routine use, to a contractor of the Agency having need for the information in order to perform a contract. Recipients of information shall be required to comply with the requirements of the Privacy Act of 1974, as amended, pursuant to 5 U.S.C. 552a(m).

5. A record related to an International Application filed under the Patent Cooperation Treaty in this system of records may be disclosed, as a routine use, to the International Bureau of the World Intellectual Property Organization, pursuant to the Patent Cooperation Treaty.

6. A record in this system of records may be disclosed, as a routine use, to another federal agency for purposes of National Security review (35 U.S.C. 181) and for review pursuant to the Atomic Energy Act (42 U.S.C. 218(c)).

7. A record from this system of records may be disclosed, as a routine use, to the Administrator, General Services, or his/her designee, during an inspection of records conducted by GSA as part of that agency's responsibility to recommend improvements in records management practices and programs, under authority of 44 U.S.C. 2904 and 2906. Such disclosure shall be made in accordance with the GSA regulations governing inspection of records for this purpose, and any other relevant (i.e., GSA or Commerce) directive. Such disclosure shall not be used to make determinations about individuals.

8. A record from this system of records may be disclosed, as a routine use, to the public after either publication of the application pursuant to 35 U.S.C. 122(b) or issuance of a patent pursuant to 35 U.S.C. 151. Further, a record may be disclosed, subject to the limitations of 37 CFR 1.14, as a routine use, to the public if the record was filed in an application which became abandoned or in which the proceedings were terminated and which application is referenced by either a published application, an application open to public inspection or an issued patent.

9. A record from this system of records may be disclosed, as a routine use, to a Federal, State, or local law enforcement agency, if the USPTO becomes aware of a violation or potential violation of law or regulation.
`
Google Ex. 1055 - Page 3
`
`
`
`Title
`
`Massively Parallel Processing with Compact Arithmetic Element
`
`Copyright Notice
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
`
`Field of the Invention
`
This invention relates to programmable computers, and more particularly to computers with very high performance relative to their cost or power usage. Still more particularly, it relates to massively parallel computers built with components that perform arithmetic using unusually small amounts of circuitry. In still more detail, it relates to massively parallel computers built with compact components that perform arithmetic at low precision but with high dynamic range.
`
`Background of the Invention
`
The ability to compute rapidly has become enormously important to humanity. Weather and climate prediction, medical applications (such as drug design and non-invasive imaging), national defense, geological exploration, financial modeling, Internet search, network communications, scientific research in varied fields, and even the design of new computing hardware have each become dependent on the ability to rapidly perform massive amounts of calculation. Future progress, such as the computer-aided design of complex nano-scale systems or development of consumer products that can see, hear, and understand, will demand economical delivery of even greater computing power.
`
Gordon Moore's prediction, that computing performance per dollar would double every two years, has proved valid for over 30 years and looks likely to continue in some form. But despite this rapid exponential improvement, the reality is that the inherent computing power available from silicon has grown far more quickly than it has been made available to software. In other words, although the theoretical computing power of computing hardware has grown exponentially, the interfaces through which software is required to access the hardware limit the ability of software to use hardware to perform computations at anything approaching the hardware's theoretical maximum computing power.
`
`Consider a modern silicon microprocessor chip containing about one billion transistors,
`clocked at roughly 1 GHz. On each cycle the chip delivers approximately one useful
`
Google Ex. 1055 - Page 4
`
`
`
arithmetic operation to the software it is running. For instance, a value might be transferred between registers, another value might be incremented, perhaps a multiply is accomplished. This is not terribly different from what chips did 30 years ago, though the clock rates are perhaps a thousand times faster today.
`
Real computers are built as physical devices, and the underlying physics from which the machines are built often exhibits complex and interesting behavior. For example, a silicon MOSFET transistor is a device capable of performing interesting non-linear operations, such as exponentiation. The junction of two wires can add currents. If configured properly, a billion transistors and wires should be able to perform some significant fraction of a billion interesting computational operations within a few propagation delays of the basic components (a "cycle" if the overall design is a traditional digital design). Yet, today's CPU chips use their billion transistors to enable software to perform merely a few such operations per cycle, not the significant fraction of the billion that might be possible.
`
There are valid reasons for microprocessors to be designed as they are. Besides the often essential requirement for software compatibility with earlier designs, they deliver great precision, performing exact arithmetic with integers typically 32 or 64 bits long and performing rather accurate and widely standardized arithmetic with 32 and 64 bit floating point numbers. Many applications need this kind of precision. But a hardware unit to perform arithmetic of this sort generally requires on the order of a million transistors to implement, and there are many economically important applications that desperately need a far greater fraction of the inherent computing power that those million transistors represent and which are not especially sensitive to precision. Current architectures for general purpose computing fail to deliver this power.
`
Because of the weaknesses of conventional computers, such as typical microprocessors, other kinds of computers have been developed to attain higher performance. These machines include single instruction stream/multiple data stream (SIMD) designs, multiple instruction stream/multiple data stream (MIMD) designs, reconfigurable architectures such as field programmable gate arrays (FPGAs), and graphics processing unit (GPU) designs which, when applied to general purpose computing, may be viewed as single instruction stream/multiple thread (SIMT) designs.
`
SIMD machines follow a sequential program, with each instruction performing operations on a collection of data. They come in two main varieties, vector processors and array processors. Vector processors stream data through a processing element (or small collection of such elements). Each component of the data stream is processed similarly. Vector machines gain speed by eliminating many instruction fetch/decode operations and by pipelining the processor so that the clock speed of the operations is increased.
`
Array processors distribute data across a grid of processing elements (PEs). Each element has its own memory. Instructions are broadcast to the PEs from a central control unit, sequentially. Each PE performs the broadcast instruction on its local data
`
Google Ex. 1055 - Page 5
`
`
`
`(often with the option to sit idle that cycle). Array processors gain speed by using silicon
`efficiently - using just one instruction fetch/decode unit to drive many small simple
`execution units in parallel.
`
Array processors have been built using a wide variety of bit widths, such as 1, 4, 8, and wider using fixed point, and using floating point arithmetic. Small bit widths allow the processing elements to be small, which allows more of them to fit in the computer, but many operations must be carried out in sequence to perform conventional arithmetic calculations. Wider widths allow conventional arithmetic operations to be completed in a single cycle. In practice, wider widths are desirable. Machines that were originally designed with small bit widths, such as the Connection Machine-1 and the Goodyear Massively Parallel Processor, which each used 1 bit wide processing elements, evolved toward wider data paths to better support fast arithmetic, producing machines such as the Connection Machine-2, which included 32 bit floating point hardware, and the MasPar machines, which succeeded the Goodyear machine and provided 4 bit processing elements in the MasPar-1 and 32 bit processing elements in the MasPar-2.
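The cost of narrow processing elements can be made concrete with a small sketch (ours, for illustration only; not taken from any of the machines named above): a 1 bit wide processing element adds two 8 bit numbers one bit per cycle, carrying between cycles, so a single conventional add consumes eight cycles.

```python
# Illustrative sketch: bit-serial addition as a 1-bit-wide processing
# element would perform it. Each loop iteration stands for one cycle.

def bit_serial_add(a, b, width=8):
    """Add two non-negative integers one bit per 'cycle'."""
    carry, result = 0, 0
    for i in range(width):              # one iteration = one cycle
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        s = abit ^ bbit ^ carry         # 1-bit full adder: sum bit
        carry = (abit & bbit) | (carry & (abit ^ bbit))  # carry out
        result |= s << i
    return result

print(bit_serial_add(100, 55))          # 155, after 8 simulated cycles
```

A wider datapath collapses this whole loop into a single cycle, which is exactly the trade-off the paragraph above describes.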
`
Array processors also have been designed to use analog representations of numbers and analog circuits to perform computations. The SCAMP is such a machine. These machines provide low precision arithmetic, in which each operation might introduce perhaps an error of a few percentage points in its results. They also introduce noise into their computations, so the computations are not repeatable. Further, they represent only a small range of values, corresponding for instance to 8 bit fixed point values rather than providing the large dynamic range of typical 32 or 64 bit floating point representations. Given these limitations, the SCAMP was not intended as a general purpose computer, but instead was designed and used for image processing and for modeling biological early vision processes. Such applications do not require a full range of arithmetic operations in hardware, and the SCAMP for example omits general division and multiplication from its design.
`
While SIMD machines were popular in the 1980s, as price/performance for microprocessors improved, designers began building machines from large collections of communicating microprocessors. These MIMD machines are fast and can have price/performance comparable to their component microprocessors, but they exhibit the same inefficiency as those components in that they deliver to their software relatively little computation per transistor.
`
Field Programmable Gate Arrays (FPGAs) are integrated circuits containing a large grid of general purpose digital elements with reconfigurable wiring between those elements. The elements originally were single digital gates, such as AND and OR gates, but evolved to larger elements that could, for instance, be programmed to map 6 inputs to 1 output according to any boolean function. This architecture allows the FPGA to be configured from external sources to perform a wide variety of digital computations, which allows the device to be used as a co-processor to a CPU to accelerate computation. However, arithmetic operations such as multiplication and division on integers, and especially on floating point numbers, require many gates and can absorb
`
Google Ex. 1055 - Page 6
`
`
`
a large fraction of an FPGA's general purpose resources. For this reason, modern FPGAs often devote a significant portion of their area to providing dozens or hundreds of multiplier blocks, which can be used instead of general purpose resources for computations requiring multiplication. These multiplier blocks typically perform 18 bit or wider integer multiplies, and use many transistors, as similar multiplier circuits do when they are part of a general purpose CPU.
`
Existing Field Programmable Analog Arrays (FPAAs) are analogous to FPGAs, but their configurable elements perform analog processing. These devices generally are intended to do signal processing, such as helping model neural circuitry. They are relatively low precision, have relatively low dynamic range, and introduce noise into computation. They have not been designed as, or intended for use as, general purpose computers. For instance, they are not seen as machines that can run the variety of complex algorithms with floating point arithmetic that typically run on high performance digital computers.
`
Finally, Graphics Processing Units are a variety of parallel processor that evolved to provide high speed graphics capabilities to personal computers. They offer standard floating point computing abilities with very high performance for certain tasks. Their computing model is sometimes based on having thousands of nearly identical threads of computing (SIMT), which are executed by a collection of SIMD-like internal computing engines, each of which is directed and redirected to perform work for which a slow external DRAM memory has provided data. Like other machines that implement standard floating point arithmetic, they use many transistors for that arithmetic. They are as wasteful of those transistors, in the sense discussed above, as are general purpose CPUs.
`
Some graphics processors include support for 16 bit floating point values (sometimes called the "Half" format). The graphics processor manufacturers, currently such as NVidia or AMD/ATI, describe this capability as being useful for rendering images with higher dynamic range than the usual 32 bit RGBA format, which uses 8 bits of fixed point data per color, while also saving space over using 32 bit floating point for color components. The special effects movie firm Industrial Light and Magic (ILM) independently defined an identical representation in their OpenEXR standard, which they describe as "a high dynamic-range (HDR) image file format developed by Industrial Light & Magic for use in computer imaging applications." Wikipedia (late 2008) describes the 16 bit floating point representation thusly: "This format is used in several computer graphics environments including OpenEXR, OpenGL, and D3DX. The advantage over 8-bit or 16-bit binary integers is that the increased dynamic range allows for more detail to be preserved in highlights and shadows. The advantage over 32-bit single precision binary formats is that it requires half the storage and bandwidth."
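For illustration (ours, not part of the original filing), the precision and range limits of this 16 bit half format can be made visible by round-tripping values through it; Python's standard library supports the IEEE binary16 layout directly via the `struct` format code 'e'.

```python
import struct

# Illustration of the IEEE binary16 ("Half") format discussed above:
# round-trip a double-precision Python float through 16 bits and back.

def to_half_and_back(x):
    """Encode x as a 16-bit half-precision float, then decode it."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

print(to_half_and_back(65504.0))   # largest finite half value survives exactly
print(to_half_and_back(0.1))       # rounds: only about 11 significant bits
print(to_half_and_back(1e-7))      # snaps to the nearest coarse subnormal step
```

Values beyond 65504 do not fit at all, which is the dynamic-range ceiling the quoted Wikipedia passage contrasts against 32-bit single precision.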
`
When a graphics processor includes support for 16 bit floating point, that support is alongside support for 32 bit floating point and, increasingly, 64 bit floating point. That is, the format is supported for those applications that want it, but the higher precision formats also are supported because they are needed for traditional graphics
`
Google Ex. 1055 - Page 7
`
`
`
applications and also for so-called "general purpose" GPU applications. We know of no graphics processor chip built in the belief that 16 bit floating point may be used as the primary means of arithmetic in a general purpose computational accelerator. Thus, existing GPUs devote substantial resources to 32 (and increasingly 64) bit arithmetic and are wasteful of transistors in the sense discussed above.
`
The variety of architectures we have mentioned are all attempts to get more performance from silicon than is available in a traditional processor design. But designers of traditional processors also have been struggling to use the enormous increase in available transistors to improve performance of their machines. These machines often are required, because of history and economics, to support long existing instruction sets, such as the Intel x86 instruction set. This is difficult, because of the law of diminishing returns, which does not enable twice the delivered performance from twice the transistor count. One facet of these designers' struggle has been to increase the precision of arithmetic operations, since transistors are abundant and some applications could be sped up significantly if the processor natively supported long (e.g., 64 bit) numbers. With the increase of native fixed point precision from 8 to 16 to 32 to 64 bits, programmers have come to think in terms of high precision and to develop algorithms assuming computers provide such precision, since it comes as an integral part of each new generation of silicon chips and thus is "free."
`
`Summary of the Invention
`
Embodiments of the present invention are directed to a programmable massively parallel processor which includes hardware elements designed to perform arithmetic operations, typically but not necessarily including addition and multiplication and in some embodiments additional operations, on numerical values of low precision but high dynamic range ("LPHDR arithmetic"). Such a processor may, for example, be implemented on a single chip. Whether or not implemented on a single chip, the number of LPHDR arithmetic elements in the processor in certain embodiments of the present invention significantly exceeds (e.g., by at least 20 more than three times) the number of arithmetic elements in the processor which are designed to perform high dynamic range arithmetic of traditional precision (such as 32 bit or 64 bit floating point arithmetic).
`
In some embodiments we may take low precision to mean that results of arithmetic operations commonly will differ from exact results by at least 0.1% (one tenth of one percent). This is far worse precision than the widely used IEEE 754 single precision floating point standard. Programmers of such a machine will need to develop algorithms that function adequately despite these unusually large relative errors. High dynamic range means that values representable to this precision span a range at least as large as from one millionth to one million.
`
If we were to represent and manipulate these values using the methods of floating point arithmetic, they would have binary mantissas of no more than 10 bits plus a sign bit and binary exponents of at least 5 bits plus a sign bit. However, the circuits to multiply and
`
Google Ex. 1055 - Page 8
`
`
`
divide these floating point values would be relatively large. One example of a better embodiment is to use logarithmic representations of the values. In such an approach, the values require the same number of bits to represent, but multiplication and division are implemented as addition and subtraction, respectively, of the logarithmic representations. Addition and subtraction of represented values is more difficult, but not terribly much more. As a result, the area of the arithmetic circuits remains relatively small and a greater number of computing elements can be fit into a given area of silicon. This means the machine can perform a greater number of operations per unit of time or per unit power, which gives it an advantage for those computations able to be expressed in the massively parallel LPHDR framework.
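The logarithmic approach can be mimicked in software as a sketch (ours, not the patent's circuit design): each nonzero value is stored as a sign plus the base-2 logarithm of its magnitude, so multiplication and division reduce to addition and subtraction of logs, while addition needs a correction term, log2(1 + 2^d), that hardware would typically approximate with a small lookup table.

```python
import math

# Illustrative logarithmic number system (LNS) sketch. A nonzero value
# is stored as (sign, log2 of magnitude); this is a software model of
# the general technique, not any particular hardware design.

def encode(x):
    return (1 if x >= 0 else -1, math.log2(abs(x)))

def decode(s, e):
    return s * 2.0 ** e

def lns_mul(a, b):
    """Multiplication is just addition of the log parts."""
    (sa, ea), (sb, eb) = a, b
    return (sa * sb, ea + eb)

def lns_div(a, b):
    """Division is just subtraction of the log parts."""
    (sa, ea), (sb, eb) = a, b
    return (sa * sb, ea - eb)

def lns_add(a, b):
    """Addition needs the correction term log2(1 +/- 2^d), d <= 0,
    which hardware would approximate with a small table."""
    (sa, ea), (sb, eb) = a, b
    if ea < eb:                       # make a the larger magnitude
        (sa, ea), (sb, eb) = (sb, eb), (sa, ea)
    d = eb - ea
    if sa == sb:
        return (sa, ea + math.log2(1.0 + 2.0 ** d))
    t = 1.0 - 2.0 ** d
    return (sa, ea + math.log2(t)) if t > 0 else (sa, float("-inf"))

x, y = encode(3.0), encode(4.0)
print(decode(*lns_mul(x, y)))   # 12.0
print(decode(*lns_add(x, y)))   # 7.0
```

Exact math functions stand in here for what would be small, approximate circuits; shrinking that correction table is precisely where the low precision of LPHDR arithmetic buys its area savings.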
`
Another embodiment is to use analog representations and processing mechanisms. Analog implementation of LPHDR arithmetic has the potential to be superior to digital implementation, because it tends to use the natural analog physics of transistors or other physical devices instead of using only the digital subset of the device's behavior. This fuller use of the devices' natural abilities may permit smaller mechanisms for doing LPHDR arithmetic. In recent years, in the field of silicon circuitry, analog methods have been supplanted by digital methods. In part, this is because of the ease of doing digital design compared to analog design. Also in part, it is because of the continued rapid scaling of digital technology ("Moore's Law") compared to analog technology. In particular, at deep submicron dimensions, analog transistors no longer work as they had in prior generations of larger-scale technology. This change of familiar behavior has made analog design still harder in recent years. However, digital transistors are in fact analog transistors used in a digital way, meaning digital circuits are really analog circuits designed to attempt to switch the transistors between completely on and completely off states. As scaling continues, even this use of transistors is starting to come face to face with the realities of analog behavior. Scaling of transistors for digital use is expected either to stall or to require digital designers increasingly to acknowledge and work with analog issues. For these reasons, digital embodiments may no longer be easy, reliable, and scalable, and analog embodiments of LPHDR arithmetic may come to dominate commercial architectures.
`
Varieties of massively parallel architectures are known and various methods of LPHDR arithmetic are known (such as short floating point representations, logarithmic number system implementations, and analog implementations). When combined, these methods can provide massive amounts of LPHDR computation in relatively little area or volume. However, this combination is not obvious, and in particular it has not been described or practiced as a means of doing general purpose computing, for at least two reasons. A first reason is that it is commonly believed that LPHDR computation, and in particular massive amounts of LPHDR computation, is not practical as a substrate for moderately general computing. A second reason is that it is commonly believed that massive amounts of even high precision computation on a single chip or in a single machine, as is enabled by a compact arithmetic processing unit, is not useful. We shall discuss both beliefs.
`
`An example of the former view is expressed in marketing literature for Intel's upcoming
`
Google Ex. 1055 - Page 9
`
`
`
Larrabee processor, which states "The Larrabee architecture fully supports IEEE standards for single and double precision floating-point arithmetic. Support for these standards is a pre-requisite for many types of tasks including financial applications." For decades the gold standard of high dynamic range arithmetic has been the IEEE 754 standard for 32 and 64 bit floating point. The reason is that it is easier to write programs when the programmer can count on arithmetic operations to produce results that are more than sufficiently accurate for the desired task. So the IEEE standard for single precision floating point arithmetic uses 23 bits of mantissa - far greater than the 10 or fewer bits used in our LPHDR approach.
`
Further evidence of the view that LPHDR arithmetic is not suitable for general computation is that today's GPU chips that include support for 16 bit floating point arithmetic provide at least as much support for 32 bit floating point and increasingly support 64 bit floating point, despite the great improvements in silicon efficiency that would result from supporting only 16 bit or shorter floating point values.
`
To our knowledge, there are no commercial implementations (or theoretical discussions) of massively parallel machines that provide LPHDR arithmetic as the intended means of doing general purpose arithmetic computation, and the common wisdom is that such a machine would not be useful for applications that need high dynamic range (that is, floating point applications). A simple argument used to support this view is that performing long sequences of LPHDR arithmetic, as is a common occurrence in algorithms that perform massive amounts of such arithmetic, is likely to cause accumulation of the small errors made at each step into overwhelming error in the final results. For instance, performing so simple a computation as averaging, say, one million values, using the standard algorithm and using arithmetic that introduces 0.1% error into each summation step, sometimes will result in enormous cumulative errors that render the results worthless. This may be a reason Intel states in particular that support for IEEE standards is a pre-requisite for financial applications. However, we shall demonstrate that there are methods for taming these errors sufficiently to make the present invention useful for a variety of applications, including financial applications. We expect that as programmers gain experience with massive LPHDR arithmetic they will develop new methods that further expand the range of use of the present invention.
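The error-taming point can be illustrated with a simple simulation (ours alone; the patent does not specify this method): if every addition injects up to 0.1% relative error, a naive running sum exposes each partial sum to all of the remaining noisy steps, while reorganizing the same additions into a balanced pairwise tree passes each value through only about log2(n) noisy steps.

```python
import random

# Illustrative simulation of LPHDR error accumulation. Each addition
# is perturbed by a random relative error of up to 0.1%.

def noisy_add(a, b, rel=1e-3):
    """Add, then inject up to 0.1% relative error."""
    return (a + b) * (1.0 + random.uniform(-rel, rel))

def naive_mean(xs):
    """Running sum: every partial sum is perturbed again at each step."""
    s = 0.0
    for x in xs:
        s = noisy_add(s, x)
    return s / len(xs)

def pairwise_sum(xs, lo=0, hi=None):
    """Balanced tree of the same adds: each value passes through only
    about log2(n) noisy steps instead of up to n."""
    if hi is None:
        hi = len(xs)
    if hi - lo == 1:
        return xs[lo]
    mid = (lo + hi) // 2
    return noisy_add(pairwise_sum(xs, lo, mid), pairwise_sum(xs, mid, hi))

random.seed(0)
data = [1.0] * (1 << 16)   # 65536 ones (scaled down from a million); exact mean is 1.0
print(abs(naive_mean(data) - 1.0))                 # error after 65536 noisy steps
print(abs(pairwise_sum(data) / len(data) - 1.0))   # error after only 16 levels
```

Pairwise summation is one standard error-taming technique; the pairwise result is guaranteed within about 1.6% of the true mean here (sixteen perturbations of at most 0.1% each), while the naive running sum accumulates a random walk of tens of thousands of perturbations.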
`
Separate from concerns about precision, there is a view held that massive amounts of even high precision arithmetic on a single chip or in a massively parallel machine is not useful. This view is justified by appeals both to the recent history of computer design and to a theoretical method for analyzing efficiencies of VLSI designs called "area-time" analysis (AT^2). Here are the first impressions of an experienced, knowledgeable, and highly regarded expert in parallel algorithms, Professor Guy Blelloch of the Carnegie Mellon University Computer Science Department, when considering a massively parallel machine to do (standard precision) arithmetic: "I think such a machine would only be useful for a few very limited applications, if that. It turns out that the game is in the wires, not the FPUs. For all practical purposes you can assume you have unbounded free FPUs and all you have to pay for is the wires between them. There was nice theory (AT^2 complexity) back around 1980 which basically showed this (i.e. that the
`
Google Ex. 1055 - Page 10
`
`
`
cost of a computation is all in the wires). The past 30 years of parallel machines have proven the theory correct, at least at the high level."
`
The caveat "at least at the high level" is usually considered a minor point. The prevailing wisdom is that the communication costs, between processing elements within a massively parallel machine and between such a machine and its conventional digital host machine, so dominate the overall cost of computing that there is no point investigating ways to fit very large numbers of arithmetic processing elements into a massively parallel machine.
`
Despite these views, that massive amounts of arithmetic on a chip or in a massively parallel machine are not useful, and that massive amounts of LPHDR arithmetic are even worse, we show below that the massively parallel LPHDR design is in fact useful and provides significant practical benefits in at least several significant applications.
`
`To conclude, modern digital computing systems pr



