throbber
Department of Mathematics and Computer Science
`The System Architecture and Networking group
`
`Accelerating SIFT Feature extraction with a vector DSP-
`A feasibility study
`
`Master Thesis
`
`Arjun Sathyanarayana Prasad
`
`Supervisors:
`Prof. Dr. Ir. C. H. (Kees) Van Berkel
`Ir. D. (David) Van Kampen
`Dr. Ir. W. E. H. (Wim) Kloosterhuis
`Prof. Dr. K. G. W. (Kees) Goossens
`
`This thesis work has been carried out at Ericsson. The content of this report is confidential until August 29, 2015
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`Page 1
`
`
`
`
`
`
`
`

`
`Acknowledgments
`I would like to express my deepest gratitude to Professor Dr. Ir. Kees Van Berkel from TU,
`Eindhoven for providing me an opportunity to conduct this thesis work under him. He has been
`motivating and has guided me to carry out this thesis work in a fruitful manner.
`
`I am extremely grateful to David van Kampen from Ericsson for tutoring me throughout this
`thesis work. He has given me valuable suggestions and feedback in each and every stage of my
`work. His willingness to give time so generously is immensely appreciated.
`
`Special thanks to Dr. Ir. Wim Kloosterhuis from Ericsson for advising me in the critical stages of
`the thesis work and providing me the constructive feedback. He has made time to review my
`work even with a busy schedule which is much appreciated.
`
`Lastly, I would also like to extend my thanks to Ericsson and specially the EVP team for being
`helpful and accommodating for this period. This has been a great learning experience.
`
`Arjun Sathyanarayana Prasad
`
`
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`Page 2
`
`

`
`Table of Contents
`1.
`Introduction ......................................................................................................................................... 5
`1.1. Digital Image ..................................................................................................................................... 6
`1.2. Embedded vector Processor ............................................................................................................... 8
`1.2.1. Implementation of Sobel kernel on EVP ........................................................................................ 9
`1.3. Problem Description ........................................................................................................................ 16
`1.4. Related work .................................................................................................................................... 17
`2. Scale invariant Feature Transform – Algorithm description ....................................................... 18
`2.1. Scale space extrema detection .......................................................................................................... 19
`2.1.1. Difference of Gaussian.............................................................................................................. 19
`2.1.2. Local Extrema detection ........................................................................................................... 21
`2.2. Accurate keypoint localization......................................................................................................... 22
`2.2.1. Peak/Contrast Threshold ........................................................................................................... 23
`2.2.2. Edge Threshold ......................................................................................................................... 25
`2.3. Orientation assignment .................................................................................................................... 26
`2.4. Keypoint Descriptor generation ....................................................................................................... 27
`2.5. Object recognition ............................................................................................................................ 27
`3. Architecture ....................................................................................................................................... 29
`3.1. Experimental setup ........................................................................................................................... 29
`3.2. Data flow .......................................................................................................................................... 30
`4. Mapping SIFT on EVP – Problems and Approaches .................................................................... 32
`4.1. Application ....................................................................................................................................... 32
`4.2. Kernels ............................................................................................................................................. 33
`4.2.1. Up-sampling .............................................................................................................................. 33
`4.2.2. Gaussian blurring ...................................................................................................................... 37
`4.2.3. Difference of Gaussian.............................................................................................................. 51
`4.2.4. Down-sampling ......................................................................................................................... 52
`4.2.5. Extrema detection ..................................................................................................................... 53
`4.2.6. Gradient and orientation assignment ......................................................................................... 57
`4.2.7. Keypoint refining ...................................................................................................................... 59
`5. Results and Observations ................................................................................................................. 61
`5.1 Data-type exploration for precision .................................................................................................. 61
`5.2 Performance of gaussian pyramid ..................................................................................................... 68
`
`
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`Page 3
`
`

`
`5.3 Performance of difference of gaussian, extrema detection and gradient pyramid ............................ 71
`5.4 Total execution time ......................................................................................................................... 73
`5.5 Benchmarking ................................................................................................................................... 73
`6. Conclusions and Future work .......................................................................................................... 80
`6.1 Conclusions ....................................................................................................................................... 80
`6.2 Future work ....................................................................................................................................... 80
`7. References .......................................................................................................................................... 81
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`Page 4
`
`
`
`
`
`
`
`
`
`

`
`1. Introduction
`Over the past decade we have seen rapid adoption of the mobile smartphones among users worldwide. It is estimated
`that the number of smartphones sold in the year 2013 itself was close to 1 billion [1]. It is not only that these devices
`are affordable but the reason for this evolution of the smartphone business can also be attributed to tremendous
`advancements in the processor, battery and memory technologies. These devices can handle tasks that were
`unimaginable few years ago and hence the name Smartphone. The number of applications being built into these
`mobile devices is ever increasing but so is the computational capability of mobile processors. One of the surveys [2]
`show that the majority of users purchase smartphones based on its ability to support personal applications. The
`report also indicates that users are most likely to use their phone for multimedia applications like playing games,
`photo editing, streaming audio/video content and web surfing. Also more and more intriguing applications like
`augmented reality is just around the corner and will soon find its way into our lives [3]. Hence, the ability to handle
`multimedia content is very crucial and this is not true just for smartphones but also for other technologies that are in
`use today such as tablets, high tech glasses, play stations etc. But, it is well known that processing of multimedia
`content is highly resource intensive and has a direct impact on the devices` performance.
`
`Of particular relevance in this report is the field of Image analysis and computer vision. Image analysis involves a
`series mathematical transformations performed on digital images to extract meaningful information. One of the
`ultimate goals that is sought out to be achieved through Image analysis is to use computers for emulating human
`vision. A computer needs to be capable of learning from its visual inputs, make inferences and take the necessary
`action. Computer vision is a field of study that encompasses wide variety of algorithms that are aimed at mimicking
`functionality of human visual system. Computer vision and Image analysis concepts are applied in numerous areas
`spanning from a simple photo editor to complex augmented reality applications.
`
`Sophisticated computer vision applications require performing transformations on images at very high frame rates.
`An image consists of huge chunk of data (pixels) in the orders of hundreds of KBs (Kilobytes) or few MBs
`(Megabytes) and has direct impact on performance of systems when analyzing them. Running Image analysis
`algorithms on GPPs (General purpose processors) is often not recommended as they are very inefficient. Especially
`in Embedded systems with tight constraints on timing, it is difficult to achieve real time performance on resource
`intensive algorithms by using only GPPs (CPU). Hence different alternatives have been explored in the form of Co-
`processors, DSPs (Digital signal processors), GPUs (Graphic processing units), SIMD/MIMD (Single instruction
`multiple data/ Multiple instructions multiple data) processors and hardware accelerators to work in tandem with
`CPUs to share the workload. GPUs are by far the most preferred choice in the state of the art mobile platforms while
`it is no surprise that custom accelerators are the most computationally efficient alternative. GPUs offer high degree
`of flexibility which can be exploited to improve performance of variety of algorithms, but they are power hungry.
`Accelerators, though very efficient, offer least degree of flexibility and hence are not desirable for varying standards.
`SIMD processors occupy a sweet spot in this hierarchy of processing units as they provide a better degree of
`flexibility when compared to accelerators and also they can be computationally efficient when compared to GPUs if
`the algorithms are vectorizable. In many of the image transformations this is often the case. Since SIMD processors
`have similar architectures when compared to GPUs, considering SIMD processors as a design alternative in mobile
`platforms for Image analysis is highly relevant in this context.
`
`In this report we focus on implementation of a widely popular and robust feature extraction algorithm called Scale
`invariant feature transform (SIFT) [4] that is applied in variety of computer vision applications. Some of the
`transformations used in SIFT exhibit high degree of data level parallelism and these are mapped on a SIMD
`processor developed by Ericsson called the Embedded vector Processor (EVP).
`
`The structure of this chapter is as follows: in section.1.1 we briefly discuss basics of a digital image and its
`characteristics. Section.1.2 gives a conceptual description of the architecture of embedded vector processor (EVP)
`
`
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`Page 5
`
`

`
`from Ericsson with an illustration of a simple image analysis algorithm mapped on EVP. In section.1.3 we formulate
`the problem statement for our work. In section.1.4 we briefly discuss the related work on SIFT.
`
`1.1. Digital Image
`A digital image can be considered as numerical representation of a 2 dimensional array. There are two types of
`representation of digital images: Raster images and Vector graphics. Normally digital images are associated with
`raster/bitmap images since most of the imaging devices, displays etc. are raster devices. A raster image consists of
`finite number of elements called pixels arranged in rows (Height) and columns (Width). Each pixel has a value
`(integers) assigned to it and indicates the brightness level of a particular color which can be components of RGB
`(Red, Blue and Green) colors or grayscale.
`Color depth/ bit depth of an image is the number of bits required/used to indicate the brightness level of a single
`pixel in an image. In case of color images a single pixel can be a group of sub pixels for individual color
`components (RGB) and color depth applies to each individual sub pixels represented as bits per channel (BPC) or
`bits per sample (BPS) or bits per pixel(BPP). In case of grayscale images, pixels have a single brightness value and
`the BPP indicates the number of gray levels that can be present in an image varying between and including white
`and black colors. Color depth of an image specifies the number of distinct colors that can be displayed but does not
`indicate which colors. There is another parameter called the color gamut. Color gamut specifies the subset of colors
`that can be represented accurately within a given color space (perceived by human eye) or by certain output device.
`Figure.1.1 called the CIE (international commission on illumination) diagram explains the concept of the color
`gamut. The figure shows a full range of colors perceived by the human eye and a subset of colors that a common
`HDTV can display. Sometimes along with the color components another channel to indicate transparency is also
`encoded in an image which is called the alpha channel. Typically images used are of 8 BPC color depth with or
`without alpha channel since the operations on images are computationally simpler and the number of colors are
`sufficient given the sensitivity of the human eye. Table.1.1 gives different variants of commonly used color systems
`along with their applications.
`
`Figure.1.1. Color gamut of a HDTV (CIE diagram) Source [5]
`
`Bitmap/Raster images are often stored in compressed or uncompressed file formats. Once rasterized, these images
`are transformed into grids of pixels with designated color depth for displaying it on the screen. Compression
`techniques are required to store images since storing images in raster format needs lot of memory space. The size of
`an image is directly correlated to the color depth and the number of pixels present in an image. Compression
`techniques can be either lossy or lossless.
`
`
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`Page 6
`
`

`
`Color systems
`
`8 bit true color
`
`16 bit direct color
`
`18 bit color
`
`24 bit true color
`
`Deep color(30/36/48 bit)
`
`Bits per channel
`(Color depth)
`3(R,G) and 2(B)
`
`4(RGB) + 4 alpha
`
`5(RGB) + 1alpha
`
`5(R) + 6 (G) + 5(B)
`
`6 BPC (RGB)
`
`8 BPC (RGB) +
`Alpha(optional)
`10/12/16 BPC
`
`Color gamut
`
`Applications
`
`256
`
`Early computers
`
`4096 colors + 16 level
`transparency
`
`32768 colors + 2 level
`transparency
`
`65536 colors
`
`262,144 colors
`
`Mobile phones
`Small device displays
`Photographic images
`
`LCD displays
`
`16,777,216 colors
`
`Modern desktops
`
`Billions of colors
`
`Graphic cards
`
`Table.1.1. Different coloring systems, their characteristics and applications
`
`Following are the different categories of file formats commonly used to store images.
`
`Uncompressed - Stores images in rasterized format. It is easy to perform Image transformations, as each pixel data
`can be easily parsed. Examples: BMP, PGM, PBM.
`
`Lossless Compression - In lossless compression an exact replica of original raster grid is stored but with reduced file
`size. The file size depends on the actual graphical complexity of an image. It is often advantageous when storing
`graphically simpler images (Continuous regions in an image like cartoons or intermediate edited images) as it can
`achieve better compression. Examples: PNG, GIF, TIFF.
`
`Lossy Compression - In lossy compression the image quality is compromised for file size. Often algorithms are
`designed to exploit the limitation of human eye characteristics and discard unnecessary information. So in actuality
`the image may appear to be perfectly good even though it is of low quality. They are generally better than lossless
`compression in achieving lower file size. They are typically used for images that are stored on the internet.
`Examples: JPEG, TIFF, EXIF.
`
`Table.1.2 summarizes different file formats, their characteristics and applications. For most of the image
`transformations it is required that the images are in uncompressed format. Hence in this report, test images that are
`chosen are of PGM file format and also they are of 8 bit color depth (Gray Scale) as SIFT is applied on gray images.
`But this concept can be extended to images with different color depths and color systems.
`
`File formats
`
`JPEG
`
`PNG
`
`GIF
`
`BMP
`
`
`
`Color depth
`
`Compression
`
`Use
`
`Type
`
`Lossy
`
`Lossless
`
`8 BPC (RGB, Gray)
`
`8/24/48 bit true color
`With/Without alpha
`
`++
`
`+
`
`+
`
`--
`
`Final Distribution,
`Internet
`
`Animation,
`Intermediate images
`
`Logos ,cartoons
`
`Cameras, photos,
`image processing
`
`Page 7
`
`Lossless
`
`8 bit(256 colors)
`
`uncompressed
`
`1,4,8,16,24,32
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`

`
`PGM
`
`uncompressed
`
`8 bit grayscale
`
`--
`
`Image processing
`
`Table.1.2. File formats, their characteristics and applications
`
`1.2. Embedded vector Processor
`Embedded vector processor (EVP) is a vector DSP developed by Ericsson and is being used in Ericsson modems in
`baseband processing for mobile standards [6]. Figure.1.2 gives a conceptual overview of the architecture of EVP.
`
`EVP has VLIW architecture and hence has the ability to pack multiple operations in one instruction and executes
`them in a single clock cycle on parallel Functional Units. It is made up of a scalar data computation unit (SDCU)
`and a vector data computation unit (VDCU). SDCU can be operate in parallel to VDCU and can perform common
`scalar operations. The SIMD (vector) width of EVP is 256 bits and the vector operations can be performed in the
`granularities of 8 bits (limited), 16 bits and 32 bits. Thus the number of elements per vector (P) that can be operated
`on in parallel can be 32, 16 or 8 depending on the granularity. From the implementation point of view the VALU,
`VLSU, VMAC, VBCST and VSHU units of the VCDU are of prime importance and will be discussed in more detail
`below.
`
`Figure.1.2. Architecture of EVP- Source [6]
`
`(cid:120)(cid:3) The VDCU unit consists of 16 general purpose vector registers shared among all functional units and each
`vector is 256 bit wide.
`(cid:120)(cid:3) The vector load and store unit (VLSU) forms the interface between vector units and the data memory. It
`supports aligned and unaligned loads of vectors from the memory. Aligned load operates on 256 bit
`boundaries. Unaligned loads can load vectors at 8 bit boundaries. Aligned vector loads have a latency of 3
`cycles while unaligned loads take 4 cycles. The initiation interval (Cycles between two operations of given
`type) for both kinds of loads is 1 cycle. The load and store unit internally maintains a fully associative
`
`
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`Page 8
`
`

`
`cache for unaligned loads and fetches consecutive vectors in a stream. This cache is limited in size and
`hence often restrictive (only a maximum of three data streams are supported efficiently).
`(cid:120)(cid:3) The vector arithmetic and logic unit (VALU) supports basic ALU operations such as addition, subtraction,
`comparison etc. and logical bitwise operations on 8/16/32 bit data types. The operations have a latency and
`initiation interval of 1 cycle.
`(cid:120)(cid:3) The vector multiplication and accumulation unit (VMAC) supports integer/fixed and floating point
`multiplications. EVP is capable of performing multiplications of only 16 bit arguments with accumulations
`supported up to 16/32/40 bits. The latency of integer MAC operations is 2 cycles with initiation interval of
`1 cycle. There is no support for 8 bit granularity. Capabilities such as saturation and rounding are also
`included.
`(cid:120)(cid:3) The vector shuffle unit (VSHU) is used to organize, shift/rotate elements in a vector. It uses shuffle patterns
`(16 patterns supported) that can be used to re-organize values inside a vector. The latency and initiation
`interval is 1 cycle each.
`(cid:120)(cid:3) Vector Broadcast unit (VBCST) can be used to broadcast a scalar element to all vector data paths.
`(cid:120)(cid:3) Vector mask registers are supported which is used in combination with various operations on EVP to
`manipulate only parts of vector data. For example, the shuffle operations in conjunction with mask registers
`can be used to merge two different vectors. EVP also consists of a logical unit called Vector mask
`arithmetic and logic units (VMALU) for performing logical operations on vectors masks.
`(cid:120)(cid:3) Almost all operations can be used as predicated operations. The predicated operations have an additional
`parameter of type bool which determines whether the operation is performed or not.
`(cid:120)(cid:3) Extra information: An IVU (intra vector unit) unit is present on EVP which supports operations on data
`elements within a vector. EVP supports bypass so that results of one functional unit can be directly passed
`to other units without the need for register store in the pipeline stages. An address computation unit (ACU)
`is also present for performing arithmetic operations on addresses. Specialized program control units that
`can run in parallel to VDCU and SDCU are included to support hardware loops, calls and branching.
`(cid:120)(cid:3) Software development is done in a language called EVP-C which is a superset of basic ANSI-C language to
`support specialized functions of EVP.
`
`
`1.2.1. Implementation of Sobel kernel on EVP
`To illustrate mapping of image analysis kernels on EVP let us take an example of a simple edge detection algorithm
`called the Sobel operator. The Sobel filter consists of two kernels Gx and Gy of size 3x3 convolved with an image
`(I) in X and Y direction as shown below.
`
`(cid:1833)(cid:1876)(cid:3404)(cid:3)(cid:3429)(cid:3398)(cid:883) (cid:882) (cid:883)
`
`
`
`(cid:3398)(cid:883) (cid:882) (cid:883)(cid:3433)(cid:1499)(cid:1835)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:1833)(cid:1877)(cid:3404)(cid:3)(cid:3429)(cid:3398)(cid:883) (cid:3398)(cid:884) (cid:3398)(cid:883)(cid:883) (cid:884) (cid:883)(cid:3433)(cid:1499)(cid:1835)
`(cid:3398)(cid:884) (cid:882) (cid:884)
`(cid:882)
`(cid:882)
`(cid:882)
`
`The two kernels compute the gradient variations in their respective directions. These gradients extract sharp
`variations in the brightness levels across the image and thus detect edges [6]. Figure1.3. shows the output of a Sobel
`filter applied on a standard test image of size 512x512.
`
`Convolution is performed by moving each of the kernels across the image, one pixel at a time computed as below.
`Consider the kernels as a kind of an overlay centered on a pixel of interest. Then the gradient is calculated as the
`weighted values of pixel of interest and its corresponding 8 neighbors. Figure.1.4 demonstrates the same.
`
`Consider “I11” as the pixel of interest. Then the resulting gradients at I11 can be calculated as follows.
`
`Gx11/Gy11 = G00*I00 + G01*I01 + G02*I02 + G10*I10 + G11*I11 + G12*I12 + G20*I20 + G21*I21 + G22*I22
`
`
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`Page 9
`
`

`
`The coefficients are integer values and are multiplied with pixel values of range [0:255]. After Gx11 and Gy11 are
`calculated the next pixel which is I12 is made as the pixel of interest and the filters are centered on it. The boundary
`pixels are not computed on EVP and are instead set to zero while storing back the output image.
`
`Figure.1.3.Sobel edge detection performed on EVP – “lena” standard test image
`
`Figure.1.4. Illustration of convolution using Sobel kernels
`
`After calculating the gradients in the two directions the gradient magnitude at each pixel is calculated using the
`following equation.
`
`For simplicity an approximate formula for calculating gradient magnitude is used which is as below.
`
`(cid:1833)(cid:3404)(cid:3)(cid:3495)(cid:1833)(cid:3025)(cid:2870)(cid:3397)(cid:1833)(cid:3052)(cid:2870)(cid:3)
`(cid:1833)(cid:3404)(cid:3)(cid:1313)(cid:1833)(cid:3051)(cid:1313)(cid:3397)(cid:3)(cid:3630)(cid:1833)(cid:3052)(cid:3630)
`
`
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`Page 10
`
`

`
`The gradient magnitude G is stored as the output image which will contain edges. The pixels at the corners are filled
`with zeroes and are not operated on by the kernels. Although there are 18 coefficients from the two kernels Gx and
`Gy, 6 of them are zeroes and hence in total we need to convolve the image with only 12 coefficients.
`
`Let us see how a sequential pseudo code would look like if we were to apply Sobel filters.
`W -> width of the image
`H -> height of the image
`I[W][H] -> input image
`O[W][H] -> output image
`
`for(i = 1; i < H-1; i++)
`{
`
`
`for(j = 1; j < W-1; j++)
`{
`
`
`
`
`
`
`
`
`
`
`}
`
`}
`
`X = -1*I[i-1][j-1];
`X += -2*I[i][j-1];
`X += -1*I[i+1][j-1]; // Convolution with Gx kernel
`X += 1*I[i-1][j+1];
`X += 2*I[i][j+1];
`X += 1*I[i+1][j+1];
`
`Y = -1*I[i-1][j-1];
`Y += 1*I[i+1][j-1];
`y += -2* I[i-1][j];
`y += 2* I[i+1][j];
`Y += -1*I[i-1][j+1];
`Y += 1*I[i+1][j+1];
`
`// Convolution with Gy kernel
`
` G
`
` = abs(X) + abs(Y);
`G = min(G,255); // Saturate
`O[i][j] = G;
`
`Now let us consider implementation of Sobel kernels on a SIMD machine which in this case is EVP. We re-
`represent the whole image as array of vectors with each vector having 16 (P) elements. This is called vectorization.
`For example if we consider a 512x512 image, then we divide each row into 32 vectors containing 16 pixels each.
`Figure.1.5 demonstrates an example of how convolution is performed on vectors for Gx kernel. In the Y direction
`three consecutive rows are convolved with the corresponding coefficients. In the X direction, shifted versions of
`input stream are obtained and convolved and this is illustrated in the figure for row Ri. The same can be applied to
`other rows Ri+1 and Ri+2. The whole process is repeated in similar fashion for Gy kernel.
`
`The shifted versions of input streams are obtained either by using simple unaligned loads on EVP or by using
`aligned loads in combination with shuffle and rotate operations. Figure.1.6. shows how this can be achieved. These
`operations help us in accessing neighbors of each element within a particular vector. Thus when convolved, the
`output will have the weighted sum of the relevant neighboring pixel values.
`
`There are two possible approaches for performing convolution which is explained in the following sections.
`
`
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`Page 11
`
`

`
`Figure.1.5. Illustration of convolution on EVP
`
`
`
`
`
`Figure.1.6. Illustration of aligned and unaligned loads
`
`
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`Page 12
`
`

`
`Approach 1
`Let us consider implementation on EVP that is similar to the sequential implementation that we discussed in the
`previous section. The operations are performed on 16 pixels together and the resulting output is for all 16 pixels.
`Figure.1.5 actually illustrates this approach. Below is the pseudo code for the same.
`I [H][W/P] -> input vector of elements
`for(i = 0; i< H-2; i++)
`{
`
`
`j = 0;
`
`I1 = I[i+0][j*P : j*P+P-1];
`I2 = I[i+1][j*P : j*P+P-1]; // Aligned load of first vector in a row
`I3 = I[i+2][j*P : j*P+P-1];
`
`I11 = I[i+0][(j+1)*P : (j+1)*P+P-1];
`I22 = I[i+1][(j+1)*P : (j+1)*P+P-1]; // Load the next vector in the row
`I33 = I[i+2][(j+1)*P : (j+1)*P+P-1];
`for(j = 0; j < W/P; j++)
`{
`
`X[ 0 : P-1] = (-1) .* I1;
`X[ 0 : P-1] += (-2) .* I2 // Mac operations with coefficients
`X[ 0 : P-1] +=(-1) .* I3;
`
`Y[ 0 : P-1] = (-1).* I1;
`Y[ 0 : P-1] += (1).* I3;
`
`T1 = (I1[1:P-1],I11[0]); //Rotate one position
`T3 = (I3[1:P-1],I33[0]); //Merge and rotate 1 position on VSHU
`
`Y[ 0 : P-1] += (-2).* T1;
`Y[ 0 : P-1] += (2).* T3;
`
`T1 = (I1[2:P-1],I11[0],I11[1]);
`T2 = (I2[1:P-1],I22[0],I22[1]);// Merge and Rotate 2 positions
`T3 = (I3[2:P-1],I33[0],I33[1]);
`X[ 0 : P-1] += (2) .* T2;
`X[ 0 : P-1] += (1) .* T1;
`X[ 0 : P-1] += (1) .* T3;
`
`Y[ 0 : P-1] += (-1).* T1
`Y[ 0 : P-1] += (1).* T3;
`
`G[0 : P-1] = abs(X[0 : P-1)) + abs(Y[0 : P-1)); // Final gradient
`
`G[0 : P-1] = min(S[0 : P-1],bcst(255)); // Clipping/Saturation
`O[i+1][j*P :j*P+P-1] = G[0 : P-1]; //Store
`
`I1 = I11;
`
`
`
`
`
`I2 = I22;
` // Use input vector for next iteration
`I3 = I33;
`
`I11 = I[i+0][(j+2)*P : (j+2)*P+P-1];
`I22 = I[i+1][(j+2)*P : (j+2)*P+P-1]; // Load a new vector
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`Page 13
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`

`
`I33 = I[i+2][(j+2)*P : (j+2)*P+P-1];
`
`}
`
`}
`
`The implementation generates two bogus values at the end of each row. The schedule generated by compiler for the
`innermost loop on EVP is as shown in the figure1.7. It takes 12 cycles to calculate output for 16 pixels. The number
`of cycles in this case is determined by the load on the MAC unit
`
`Figure.1.7. Schedule of innermost loop for approach 1
`
`Approach 2
`Now, in the schedule for approach 1 we see that there are 11 MAC operations in 12 cycles and thus VMAC becomes
`critical resource as it is used 11 out of 12 cycles to compute the output and thus determines the performance of the
`innermost loop. Since the coefficients of the kernels are simple, many of these MAC operations can be replaced with
`simple addition and subtraction operations. This way we can exploit the VLIW architecture of EVP to reduce the
`load on critical resource by sharing loads on different functional units. Below is the pseudo code for this approach.
`I [H][W/P] -> input vector of elements
`for(i = 0; i< H-2; i++)
`{
`
`
`
`
`
`
`
`
`
`
`
`
`
`
`j = 0;
`I1 = I[i+0][j*P : j*P+P-1]; // aligned load
`I2 = I[i+1][j*P : j*P+P-1];
`
`I3 = I[i+2][j*P : j*P+P-1];
`
`I11 = I[i+0][(j+1)*P : (j+1)*P+P-1];
`I22 = I[i+1][(j+1)*P : (j+1)*P+P-1]; // next vector in the row
`I33 = I[i+2][(j+1)*P : (j+1)*P+P-1];
`Xmul1[ 0 : P-1] = (2) .* I2;
`// Partial values of convolution
`Ymul1[ 0 : P-1] = (2) .* I1;
`
`Ymul3[ 0 : P-1] = (2) .* I3;
`
`
`X1[ 0 : P-1] = I1 + I3 + Xmul1;
`Y1[ 0 : P-1] = I3 + I1;
`Y2[ 0 : P-1] = Ymul1+ Ymul3;
`for(j = 0; j < W/P; j++)
`
`
`
`Bradium Technologies LLC Exhibit 2002
`Microsoft Corp. v. Bradium Tech., IPR2016-00449
`
`Page 14
`
`

`
`{
`
`
`
`
`
`
`
`
`
`
`
`
`
`}
`
`}
`
`Xmul10[ 0 : P-1] = (2) .* I2;
`Ymul30[ 0 : P-1] = (2) .* I3;
`Ymul10[ 0 : P-1] = (2) .* I1;
`X10[ 0 : P-1] = I10 + I30 + Xmul10;
`Y10[ 0 : P-1] = I30 – I10;
`Y20[ 0 : P-1] = Ymul30 – Ymul10;
`
`T1 = (X1[2:P-1],X10[0],X10[1]);
`// Merge two vectors, rotate 2 positions
`
`X[ 0 : P-1] = T1 - X1;
`
`T2 = (Y2[1:P-1],Y20[0]); //Merge two vectors, rotate 1 position
`
`Y[ 0 : P-1] = Y1 + T2;
`
`T2 = (Y1[1:P-1],Y10[0], Y10[1]);
`
`Y[ 0 : P-1] = Y + T2;
`G[0 : P-1] = abs(X[0 : P-1)) + abs(Y[0 : P-1));
`G[0 : P-1] = min(G[0 : P-1],bcst(2

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket