throbber
Performance Comparison of the AES Submissions
`
`Bruce Schneier∗
`
`John Kelsey†
`
`Doug Whiting‡
`Niels Ferguson (cid:107)
`
`Version 2.0
`February 1, 1999
`
`David Wagner§
`
`Chris Hall¶
`
`1 Introduction
`
`The principal goal guiding the design of any encryp-
`tion algorithm must be security. In the real world,
`however, performance and implementation cost are
`always of concern. Making the assumption that the
`major AES candidates are secure (a big assumption,
`to be sure, but one that is best dealt with in an-
`other paper), the most important properties the al-
`gorithms will be judged on will be the performance
`and cost of implementation.
`In this paper, we will completely ignore secu-
`rity.
`Instead, we will compare the performance of
`the leading AES candidates on a variety of common
`platforms: 32-bit CPUs, 64-bit CPUs, cheap 8-bit
`smart-card CPUs, and dedicated hardware. For each
`platform, we first make some general observations
`on the performance issues for each of the platforms,
`then compare the various AES candidates, and fi-
`nally look at the specific issues for each of the can-
`didates.
`
`2 Performance as a Function of
`Key Length
`
`of the algorithms have different performance char-
`acteristics for different key lengths. The results are
`summarized in Table 1.
`Nine algorithms—CAST-256, Crypton, DFC,
`E2, Frog, HPC, Mars, RC6, and Serpent—are com-
`pletely independent of key length. Two algorithms—
`Loki97 and Twofish—encrypt and decrypt indepen-
`dent of key length, but take different times to set up
`different-size keys1. Twofish takes longer to set up
`a longer key, while Loki97 actually takes less time
`to set up a longer key than it does to set up a
`shorter key. Four algorithms—DEAL, Magenta, Ri-
`jndael, and SAFER+—encrypt and decrypt at dif-
`ferent speeds, depending on key length.
`In this paper, we will concentrate on key setup
`and encryption for 128-bit keys. Obviously, encryp-
`tion speed is more important than key setup speed.
`The reader should keep in mind that DEAL and Ma-
`genta encrypt 33% slower using 256-bit keys. Rijn-
`dael encrypts 20% slower for 192-bit keys and 40%
`slower for 256-bit keys. SAFER+ has the largest
`performance degradation for large keys; it is 50%
`slower for 192-bit keys and 100% slower for 256-bit
`keys.
`
`3 Performance on 32-bit CPUs
`
`The speed (both encryption and key setup) of most
`Efficiency on 32-bit CPUs is one of NIST’s stated
`AES candidates is independent of key length. That
`performance criteria. Actually, there are two dif-
`is, the time required to set up a key and encrypt
`ferent 32-bit efficiencies that need to be considered.
`a block of text is the same, regardless of whether
`The first is what NIST delineated: performance on
`the key is 128, 192, or 256 bits long. A minority
`∗Counterpane Systems; 101 E Minnehaha Parkway, Minneapolis, MN 55419, USA; schneier@counterpane.com.
`†Counterpane Systems; kelsey@counterpane.com.
`‡Hi/fn, Inc., 5973 Avenida Encinas Suite 110, Carlsbad, CA 92008, USA; dwhiting@hifn.com.
`§University of California Berkeley, Soda Hall, Berkeley, CA 94720, USA; daw@cs.berkeley.edu.
`¶Counterpane Systems; hall@counterpane.com.
`(cid:107)Counterpane Systems; niels@counterpane.com.
`1Twofish is actually much more complicated. There are ways to implement Twofish where the encryption speed is different
`for different key lengths, but they are only suitable for encrypting small text blocks. See [SKW+98a, WS98, SKW+99] for
`more details.
`
`1
`
`DivX, LLC Exhibit 2007
`Page 2007 - 1
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`

`

`Algorithm
`Name
`CAST-256 [Ada98]
`Crypton [Lim98]
`DEAL [Knu98]
`
`DFC [GGH+98]
`E2 [NTT98]
`Frog [GLC98]
`HPC [Sch98]
`Loki97 [BP98]
`Magenta [JH98]
`
`Mars [BCD+98]
`RC6 [RRS+98]
`Rijndael [DR98a]
`
`SAFER+ [CMK+98]
`
`Serpent [ABK98a]
`Twofish [SKW+98a]
`
`constant
`constant
`constant
`constant
`decreasing
`increasing
`
`constant
`constant
`increasing
`
`Key Setup Encryption
`constant
`constant
`constant
`constant
`increasing
`128,192: 6 rounds
`256: 8 rounds
`constant
`constant
`constant
`constant
`constant
`128,192: 6 rounds
`256: 8 rounds
`constant
`constant
`128: 10 rounds
`192: 12 rounds
`256: 14 rounds
`128: 8 rounds
`192: 12 rounds
`256: 16 rounds
`constant
`constant
`
`increasing
`
`constant
`increasing
`
`Table 1: Speed of AES Candidates for Different Key Lengths.
`
`the high-end Pentium Pro (and similar) CPUs that
`are likely to dominate desktop computing for the
`next few years. To be sure, this is an arbitrary
`measure—in ten years, using Pentium Pro perfor-
`mance as a yardstick will seem as quaint as com-
`paring algorithm performance on 80286 chips does
`today—but it is a good measure of relative high-
`end performance. Even when 64-bit CPUs become
`commonplace, algorithms that are the fastest on the
`Pentium Pro are likely to remain the fastest on these
`new microprocessors.
`The second is performance on low-end 32-bit
`CPUs—such as the 80386 and 68000 variants, as
`well as various simple RISC chips—that will increas-
`ingly replace 8-bit CPUs on high-end smart cards
`and computing-intensive embedded applications.
`Performance on these commodity 32-bit CPUs is
`much more important than performance on the Pen-
`tium Pro/II. Over the course of AES’s lifetime, the
`architecture of desktop computing will likely change
`beyond recognition. The Pentium Pro/II architec-
`ture has oddities not present in other 32-bit CPUs
`and also not present in Merced (now called Pentium
`III) and other 64-bit CPUs. But just as 8-bit CPUs
`are still being used today (and will still be used
`tomorrow), commodity 32-bit CPUs will find their
`
`particular price/performance “sweet spot” in widely
`deployed embedded applications. Performance on
`these CPUs will still be important a decade from
`now.
`
`3.1 Comparing Performance on 32-
`bit CPUs
`
`We looked at the performance of the major AES can-
`didates on both the Pentium and the Pentium Pro.
`We either used the data in the candidate’s AES sub-
`mission documentation or, where such data was un-
`available or unreliable, calculated our own or used
`the data from Brian Gladman’s web page [Gla98].
`Where we were unsure if a particular optimization
`technique would work, we gave the algorithm the
`benefit of the doubt. When there were multiple
`sources of data, we took the fastest. The results are
`shown in Table 2. Numbers derived from estimates
`and not from an actual implementation are marked
`with an asterisk. Crypton and Rijndael have very
`different key setup speeds for encryption and decryp-
`tion. Rijndael’s encryption key setup is 300 clocks,
`and its decryption key setup is 1400 clocks. Cryp-
`tion’s encryption and decryption key setups are 540
`and 1370 clocks, respectively. For all tables except
`
`2
`
`DivX, LLC Exhibit 2007
`Page 2007 - 2
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`

`

`Algorithm
`Name
`
`CAST-256
`Crypton
`DEAL
`DFC
`E2
`Frog
`HPC
`Loki97
`Magenta
`Mars
`RC6
`Rijndael
`SAFER+
`Serpent
`Twofish
`
`Encrypt
`Encrypt
`Encrypt
`Key Setup
`Pentium
`Pentium Pro
`Pentium Pro Pentium Pro
`C (clocks)
`C (clocks) ASM (clocks) ASM (clocks)
`4300
`660
`600*
`600*
`955
`476
`345
`390
`4000*
`2600
`2200
`2200
`7200
`1700
`750
`?
`2100
`720
`410
`410*
`1386000
`2600
`?
`?
`120000
`1600
`?
`?
`7500
`2150
`?
`?
`50
`6600
`?
`?
`4400
`390
`320*
`550*
`1700
`260
`250
`700*
`850
`440
`291
`320
`4000
`1400
`800*
`1100*
`2500
`1030
`900*
`1100*
`8600
`400
`258
`290
`
`Table 2: AES Candidates’ Performance with 128-bit Keys on Pentium-Class CPUs.
`
`the 32-bit CPU (RC6 and Mars), while others are
`that for hashing, we used the average.
`largely CPU-independent.
`The C performance numbers are included be-
`cause they were required by NIST, and were in-
`For bulk encryption, the fastest algorithms on
`cluded in each of the algorithms’ submission doc-
`general 32-bit CPUs are (in order) Twofish, Rijn-
`umentation. Since in any application where encryp-
`dael, Crypton, and E2, followed by Mars and RC6.
`tion speed is a potential bottleneck, the algorithm
`Both Mars and RC6 are significantly faster on 32-bit
`will be hand-coded in optimized assembly language,
`CPUs with fast 32-bit multiplications and variable
`we primarily focused on assembly language imple-
`rotations, like the Pentium Pro and Pentium II, but
`mentations. The algorithm is a discrete and well-
`are significantly slower on CPUs without these fea-
`defined chunk of code with a single entry point that
`tures. In fact, both of these algorithms also run more
`needs to run fast: a perfect example of something
`slowly than the four fastest on a DEC Alpha. Re-
`that should be coded in assembly.2
`stricting the comparison to the Pentium Pro/II (the
`This comparison is for 128-bit keys; as noted ear-
`NIST reference platform), the fastest algorithms are
`lier, some algorithms have a running time that de-
`(in order) RC6, Twofish, Mars, Rijndael, Crypton,
`pends on the size of the key. We concentrated on
`and E2. All candidates other than these six are sig-
`128-bit keys, as these are the fastest (key setup in
`nificantly slower across all 32-bit CPUs.
`Loki97 is the only exception), and will provide ad-
`For comparison, on a Pentium, DES encrypts at
`equate security for virtually any application for the
`43 clock cycles per byte, IDEA at 74 clock cycles per
`next several decades (at least).
`byte, 16-round RC5 at 25 clock cycles per byte, and
`The AES submissions vary greatly in their 32-bit
`Blowfish at 20 clock cycles per byte [PRB98].
`CPU performance, from 250 clock cycles per block
`For encryption of short blocks of plaintext, it
`(RC6) to 6600 (Magenta). Some candidates’ perfor-
`makes sense to look at both encryption speed and
`mances depends heavily on the particular details of
`2For fastest assembly-language performance on current Intel CPUs, Twofish’s implemementation compiles key-dependent
`data directly in the encryption and decryption code. In this implementation, a separate copy of the executable code is stored
`for each key along with the key-dependent S-box tables. This copy of the code is modified by patching immediate values in
`the code with actual subkey values, removing the need to load the subkey from data cache during execution and freeing up
`registers. Although this key-specific code could be technically considered to be “self-modifying,” the code is totally reentrant,
`and the process is much more akin to compilation. This compilation trick seems to be helpful only in register-poor CPUs like
`the Pentium or Pentium Pro, where the availability of an extra register can make a significant difference in performance. Most
`other AES candidates could also benefit somewhat from using a similar compiled mode, but the resulting gains are believed to
`be fairly small (considerably less than 10%). Twofish has been specially designed to take advantage of this compiled option,
`achieving about 20% faster execution than a naive assembly language implementation on a Pentium Pro.
`
`3
`
`DivX, LLC Exhibit 2007
`Page 2007 - 3
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`

`

`Twofish are slower, but also provide acceptable hash
`function performance. SAFER+ and Serpent are
`marginal.
`All of these speeds are significantly slower than
`dedicated hash functions.
`SHA-1,
`for example,
`hashes at a rate of 13 clock cycles per byte on a
`Pentium, and RIPEMD-160 hashes at 16 clock cy-
`cles per byte on the same machine [PRB98]. Thus,
`it is likely that AES will be used as a hash function
`only when space (either code space in software or
`gates in hardware) is at a premium.
`
`3.2 Algorithm-Specific Comments on
`32-bit Performance
`3.2.1 CAST-256
`
`CAST-256 is an incomplete Feistel network. Each
`round is fast—17 clock cycles—but 48 rounds gives
`a total speed of 815 clock cycles per block. Hence,
`CAST-256 is about three times slower than the
`fastest AES candidates, even though it uses mostly
`32-bit operations. CAST’s use of only simple RISC
`operations implies that the relative performance of
`CAST-256 will be fairly uniform across processor
`types.
`
`key setup speed. Again, RC6’s and Mars’s reliance
`on CPU-specific operations means that we need
`to consider general 32-bit CPUs and the Pentium
`Pro/II separately. Table 3 compares the speed to
`key and encrypt of six of the AES candidates for
`different size texts on the Pentium (and by exten-
`sion, most general-purpose 32-bit CPUs.
`To calculate key setup speeds in assembly, we es-
`timated assembly-language key setup time by taking
`the percentage speedup between C and assembly en-
`cryption on the Pentium Pro/II, and applying that
`to the assembly-language key setup.
`Rijndael is the fastest algorithm for small blocks
`of text, followed closely by Crypton. What is sur-
`prising, though, is how quickly all the algorithms
`converge to their raw encryption speed. For 1024-
`byte texts—the size of a very small e-mail message—
`all the algorithms are within 15%.
`For 4096-
`byte texts—the size of a more reasonable e-mail
`message—the speed ordering of the algorithms is the
`same as it would be if the key schedule were not
`taken into account. Note that the AES candidates
`not listed here are slower.
`Table 4 compares the speed to key and encrypt of
`seven of the AES candidates for different size texts
`on the Pentium Pro/II. The primary difference of
`note is how much faster RC6 and Mars are on this
`one platform.
`The speed of the key schedule becomes even more
`important when we look at the algorithm as a hash
`function. There are many hash function construc-
`tions that rely on block ciphers as primitives, and
`all involve changing the encryption key with every
`block. Table 5 summarizes this data.
`These values are the sum of the number of clock
`cycles required to set up a key plus the number of
`clock cycles required to encrypt one block. Starting
`with the data in Table 2, we calculated the Pentium
`Pro/II assembly key setup time by multiplying the
`C key setup time by the percentage speed improve-
`ment between the encryption speed in C and assem-
`bly; this is a reasonable first estimate of the speed
`improvement of taking the key setup from C to as-
`sembly. Then we added in the encryption speeds in
`both Pentium Pro/II and Pentium. While the real
`performance numbers are likely to be somewhat dif-
`ferent, this is a reasonable first estimate.
`The fastest hash function is Rijndael, followed
`closely by Crypton.3 These two algorithms are par-
`ticularly suited as hash functions because key setup
`in the encryption direction is much faster than key
`DEAL has basically the same performance charac-
`setup in the decryption direction. E2, RC6, and
`teristics as triple-DES. It is a six-round Feistel net-
`3Previous versions of this report mentioned possible weaknesses in Crypton’s key schedule, making it unsuitable as a hash
`function. In the interest of fairness, we withhold judgment on this issue pending better documentation of the weaknesses.
`
`Crypton is a very clever enhancement to the Square
`algorithm [DKR97]. It is structured so that encryp-
`tion and decryption are identical operations (with
`reversed subkey schedules), and there are two sep-
`arate 8-by-8-bit S-boxes used in the transforma-
`tion, which are easily built from 4-bit permutations.
`Crypton replaces Square’s MDS matrix with a sim-
`pler (and admittedly less powerful) diffusion opera-
`tion that can be implemented very cheaply in hard-
`ware and on smart cards. Perhaps because the S-
`boxes and diffusion are weaker than in Square, Cryp-
`ton consists of twelve rounds instead of eight for
`Square (and ten, twelve, or fourteen for Rijndael).
`An optimized version on a 32-bit CPU looks iden-
`tical to Square except the fixed lookup tables are dif-
`ferent, so performance scales based on the number
`of rounds. Encryption speed is 390 clocks per block
`in Pentium/Pentium Pro assembler, and is uniform
`across different 32-bit CPUs.
`
`3.2.2 Crypton
`
`3.2.3 DEAL
`
`4
`
`DivX, LLC Exhibit 2007
`Page 2007 - 4
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`

`

`Text Size
`(bytes)
`16
`32
`64
`128
`256
`512
`210
`211
`212
`213
`214
`215+
`
`Crypton
`73
`49
`37
`30
`27
`26
`25
`25
`25
`24
`24
`24
`
`E2 Mars RC6 Rijndael
`100
`260
`146
`59
`63
`147
`95
`39
`44
`91
`69
`30
`35
`63
`57
`25
`30
`48
`50
`22
`38
`41
`47
`21
`27
`38
`45
`21
`26
`36
`45
`20
`26
`35
`44
`20
`26
`35
`44
`20
`26
`35
`44
`20
`26
`34
`44
`20
`
`Serpent Twofish
`205
`175
`137
`119
`103
`91
`86
`70
`77
`48
`73
`38
`71
`31
`70
`25
`69
`22
`69
`21
`69
`20
`69
`19
`
`Table 3: Clock Cycles, per Byte, to Key and Encrypt Different Text Sizes on a Pentium.
`
`Text Size
`(bytes)
`16
`32
`64
`128
`256
`512
`210
`211
`212
`213
`214
`215+
`
`Crypton
`70
`46
`34
`28
`25
`23
`22
`22
`22
`22
`22
`22
`
`E2 Mars RC6 Rijndael
`100
`246
`118
`53
`63
`133
`67
`36
`44
`76
`41
`27
`35
`48
`28
`23
`30
`34
`22
`20
`28
`27
`19
`19
`27
`24
`17
`19
`26
`22
`16
`18
`26
`21
`16
`18
`26
`20
`16
`18
`26
`20
`16
`18
`26
`20
`16
`18
`
`Serpent Twofish
`193
`132
`125
`93
`90
`73
`73
`64
`65
`48
`61
`33
`58
`25
`57
`20
`57
`18
`57
`17
`56
`17
`56
`16
`
`Table 4: Clock Cycles, per Byte, to Key and Encrypt Different Text Sizes on a Pentium Pro/II.
`
`work with DES itself used as the round function.
`The Pentium and Pentium Pro performance is thus
`about 2000 clocks per block. In a sense, DEAL is
`a straw-man algorithm that provides a worst-case
`performance benchmark for AES.
`
`the fastest candidates. Due to the heavy reliance on
`multiply operations, we expect the performance on
`general 32-bit CPUs to be significantly worse. With
`only eight rounds, the speed vs.
`security tradeoff
`for DFC feels mismatched compared to most of the
`other AES candidates.
`
`3.2.4 Decorrelated Fast Cipher
`
`DFC is an eight-round Feistel cipher, in which the
`round function includes a 64-bit modular addition
`and multiplication over 264 + 13, as well as a single
`small (6-by-32-bit) S-box lookup. The modular mul-
`tiply does a nice job of diffusing bits, but it also hurts
`performance significantly. Despite the low number
`of rounds, DFC takes 750 clocks per block in assem-
`bly language on a Pentium Pro. This is about the
`speed of DES and nearly three times slower than
`
`3.2.5 E2
`
`E2 is a twelve-round Feistel cipher, with each round
`function including two levels of subkey XOR and
`8-by-8-bit S-box lookup, with linear mixing in be-
`tween. Only simple RISC operations are used, so
`the performance should be relatively uniform across
`processor types. The speed of E2 is reasonable at
`410 clocks per block in assembly language (making
`it the fourth fastest on the Pentium and the sixth
`
`5
`
`DivX, LLC Exhibit 2007
`Page 2007 - 5
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`

`

`Algorithm
`Name
`
`CAST-256
`Crypton
`DEAL
`DFC
`E2
`Frog
`HPC
`Loki97
`Magenta
`Mars
`RC6
`Rijndael
`SAFER+
`Serpent
`Twofish
`
`Hash Speed
`Hash Speed
`Pentium
`Pentium Pro
`ASM (clocks) ASM (clocks)
`282*
`282*
`46*
`49*
`349*
`349*
`245*
`?
`100*
`100*
`?
`?
`?
`?
`?
`?
`?
`?
`246*
`260*
`118*
`146*
`32*
`34*
`193*
`212*
`193*
`205*
`132
`175
`
`Table 5: Hash-Function Performance, per Byte, of AES Candidates (128-bit key) on Pentium and Pentium
`Pro/II.
`
`fastest on the Pentium Pro/II), and is uniform across
`different 32-bit CPUs.
`
`3.2.6 Frog
`
`Frog has a byte-oriented structure. It seems that any
`implementation has to operate on individual bytes,
`which means that Frog cannot take advantage of 32-
`bit operations. This hurts the performance on larger
`CPUs. Frog’s key schedule is by far the slowest of
`any AES candidate.
`
`3.2.7 Hasty Pudding Cipher
`
`HPC was optimized for 64-bit CPUs. The heavy use
`of rotations of 64-bit words are difficult to implement
`(and therefore slow) on most 32-bit CPUs.
`
`3.2.8 Loki97
`
`Loki97 is one of the few algorithms that uses a bit-
`level permutation. Although these are very fast in
`hardware, their cost in software is similar to that of
`a full S-box layer. Performance on 32-bit CPUs is
`poor compared to most other candidates, and may
`be slower than triple-DES.
`
`3.2.9 Magenta
`
`Magenta is, by far, the slowest of any of the AES
`candidates. The core of the algorithm is formed by
`
`a byte-level FFT-like structure. This seems to force
`any 32-bit implementation to use 8-bit manipula-
`tions, which is generally inefficient. Magenta also
`has a very large number of operations per round.
`Each Π function contains 16 8-bit table lookups.
`There are four Π functions in each T , and three
`T functions in each round for a total of 192 S-box
`lookups per round, and 1152 per encryption with a
`128-bit key. (As a comparison, Frog uses a total of
`128 S-box lookups per encryption.) There seems to
`be no way of implementing Magenta in software at
`a speed comparable to the other candidates.
`
`3.2.10 Mars
`
`Mars is a fast cipher that relies on a veritable
`“kitchen sink” of primitive operator elements, in-
`cluding rotation, multiplication, and 8-by-32-bit S-
`box lookups. Mars has the fastest claimed speed for
`a C version (390 clocks per block) on the Pentium
`Pro of any of the AES candidates. However, this
`speed advantage is largely due to the use of a dif-
`ferent optimizing compiler (DJGPP), which would
`probably benefit other candidates dramatically as
`well; using the standard AES compiler (Borland),
`Mars is fairly slow at 900 clocks/block.
`Mars encryption has three phases: an initial for-
`ward mixing, the cryptographic “core,” and a final
`mixing. The mixing phases are unkeyed (except for
`input/output whitening) and constitute roughly two
`
`6
`
`DivX, LLC Exhibit 2007
`Page 2007 - 6
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`

`

`cryptographic cycles each. The cryptographic core
`consists of sixteen rounds of an unbalanced Feistel ci-
`pher (totaling roughly four cycles), with each round
`using a 32-bit to 96-bit keyed expansion function E.
`Both fixed and data-dependent rotations are used
`extensively throughout the cipher. A careful study
`of the encryption round reveals a fairly long crit-
`ical path through the E function that limits the
`throughput. On a Pentium Pro, this critical path
`appears to be at least 12 clock cycles, which leads
`to a core processing time of at least 192 clocks per
`block. With the forward and reverse mixing requir-
`ing at least 100 clocks, the overall speed of this algo-
`rithm in assembler on a Pentium Pro appears to be
`no faster than 310-320 clocks. Because of its use of
`data-dependent rotation and multiplication, the per-
`formance of Mars on a Pentium or Pentium MMX
`chip is considerably slower: about 550 clocks/block.
`As with RC6, the relative performance of Mars
`across processor types will vary considerably, and
`timing attacks are also of concern in implementa-
`tions where multiplies and data-dependent rotations
`do not take constant time. Compared to RC6, Mars
`is somewhat slower and far less elegant, yet it retains
`most of the performance concerns discussed below
`for RC6.
`
`3.2.11 RC6
`
`RC6 is by far the most elegant and easily under-
`stood AES candidate.
`It is remarkably simple, as
`has become almost expected of Rivest’s ciphers (e.g.,
`RC4, RC5). On the AES target platform (Pentium
`Pro/Pentium II), it is also the fastest algorithm, at
`about 250 clocks per block in assembly language.
`The unrolled code size of the encryption routine is
`very modest, and a looping version with very good
`performance can be implemented with an extremely
`small code size (under 200 bytes on the Pentium
`family). The performance in C is very good, even
`though it is hurt considerably by the lack of intrin-
`sic support for rotations in most C compilers.
`However, RC6 does not perform anywhere near
`as well on many other common platforms. For ex-
`ample, on the Pentium (or Pentium MMX) chip,
`multiplication and variable rotation opcodes do not
`pair, and they require 10 and 4 clocks, respectively.
`Thus, the time required for multiplication and rota-
`tion alone is 560 clocks per block on a Pentium, with
`the other RC6 operations increasing this total to at
`least 700 clocks for an optimized assembly version.
`Thus, RC6 is nearly three times slower on a Pentium
`than on a Pentium Pro. By comparison, most other
`candidate algorithms have identical or nearly identi-
`
`cal performance on the two platforms. On a 386/486
`(and many embedded CPUs), the multiplication and
`rotation speed is also fairly slow.
`In general, be-
`cause it does not rely on “basic” RISC instructions
`as most other AES candidates do, RC6’s relative
`performance will be much less uniform across pro-
`cessor types. This makes it much less attractive for
`low-end 32-bit CPUs.
`The fact that many processors (e.g., 486, 68000,
`smart cards) have data-dependent execution times
`for multiplication/rotation is also a concern, since
`RC6 could potentially leak key information to a tim-
`ing attack. Constant-time implementations of data-
`dependent rotations on those platforms is possible,
`but at a significant speed penalty.
`
`3.2.12 Rijndael
`
`Rijndael is another variant of Square: a fast cipher
`that works very well across all platforms. It boasts
`a clean mathematical structure involving only ta-
`ble lookup and XOR operations. Although the en-
`cryption and decryption algorithms are not exactly
`identical, their general structure and performance
`are virtually indistinguishable.
`Rijndael’s assembly language on both the Pen-
`tium and Pentium Pro processors is about 300 clocks
`per block. Unlike RC6 and Mars, there are no known
`CPU platforms (8-bit or 32-bit) on which Rijndael’s
`relative performance would be unduly negatively af-
`fected or on which timing attacks would be possible.
`
`3.2.13 SAFER+
`
`SAFER+ is obviously designed for 8-bit CPUs. Ev-
`ery operation in SAFER+ is a simple RISC oper-
`ation, but there are no 32-bit operations anywhere.
`This lack of 32-bit operations, particularly in the dif-
`fusion phase involving the PHT and the “Armenian
`shuffle,” significantly hurts the relative performance
`of this algorithm on modern CPUs. A rough esti-
`mate is that SAFER+ with a 128-bit key requires
`800 clocks per block on a Pentium Pro in assembly,
`which is slower per-byte than DES. On a Pentium,
`the speed slows to about 1100 clocks per block. The
`throughput decreases further for the larger key sizes;
`in fact, SAFER+ with a 256-bit key is slower than
`triple-DES. As an SP-network, SAFER+ with a 128-
`bit key has only one-fourth the number of cycles of
`Serpent, yet it is approximately the same speed. In
`general, the speed vs. security tradeoff for SAFER+
`feels mismatched compared to most of the other AES
`candidates.
`
`7
`
`DivX, LLC Exhibit 2007
`Page 2007 - 7
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`

`

`3.2.14 Serpent
`
`Serpent is an extremely conservative SP-network ci-
`pher, carefully designed to allow “bit-slice” imple-
`mentations on 32-bit CPUs. Its authors claim that
`Serpent’s performance on a Pentium Pro is at least
`as fast as DES [ABK98a], but this claim involves
`C implementations, which are always subject to the
`vagaries of compiler optimization.
`In assembly language, Serpent is actually con-
`siderably slower than a good DES implementation,
`which can run at about 45 clocks/byte on a Pen-
`tium. A conservative estimate, counting only ALU
`operations and assuming (very optimistically) that
`all memory accesses can be accomplished for “free,”
`shows that Serpent requires at least 900 clocks per
`block on a Pentium Pro (56 clocks/byte); in fact, 960
`clocks seems like a more likely practical bound. On
`a Pentium, this number rises to at least 1100 clocks
`per block. An optimized C version can achieve about
`1020 clocks per block on a Pentium Pro, proving that
`Serpent compiles very well from C (much better than
`does DES)—which is an interesting fact but hardly
`of great importance here.
`The fact remains that the best Serpent software
`implementation will be 3–4 times slower than other
`candidates, including Mars, Twofish, RC6, and Ri-
`jndael. Thus, even a non-conservative version of Ser-
`pent with only sixteen rounds would be considerably
`slower than these other algorithms. The nature of
`the algorithm, which uses basic RISC operations ex-
`clusively, strongly suggests that its relative perfor-
`mance will not be negatively affected on any plat-
`form.
`
`3.2.15 Twofish
`
`Twofish is the fastest AES candidate on a Pentium
`(and other 32-bit CPUs), and is second-fastest on
`the Pentium Pro/II. Its key setup is about average
`(eight are faster, and six are slower); however, for
`applications where key setup is a large factor in en-
`cryption speed (for small amounts of text), there
`are alternate implementations of Twofish that trade
`off encryption speed for key setup speed. As a hash
`function, for example, Twofish is much more efficient
`using one of these alternative implementations. See
`[SKW+98a] for details.
`Twofish uses only basic RISC operations, and
`Twofish’s performance on other 32-bit platforms will
`be comparable.
`
`of
`
`Performance
`3.3 Comparing
`“Minimal Secure Variants”
`In [Bih98], Biham introduced the notion of compar-
`ing the algorithms based on their “minimal secure
`variants.” Different design teams were more or less
`conservative than each other; the number of rounds
`in their final submissions was not a fixed percent-
`age of the number of rounds they could success-
`fully cryptanalyze. Biham tried to normalize the
`algorithms by determining the minimal number of
`rounds that is secure (either as described by the de-
`signers or other cryptographers, or Biham’s “best
`guess”), and then added a standard two cycles.
`We do not believe that this measurement is a fair
`way to compare algorithms. Two examples illustrate
`this. One, Biham maintains that the minimal num-
`ber of rounds for DFC is 7 and for Serpent is 15.
`It seems unreasonable to add two rounds to each,
`when that is a 29% increase for DFC and only a
`13% increase for Serpent. The comparative analy-
`sis clearly shows that one round of DFC is stronger
`than one round of Serpent. Two, Biham’s measures
`don’t take into account initial and final transforma-
`tions for some of the ciphers: e.g., the whitening in
`Twofish and the initial/final transformations in E2.
`At least in Twofish’s case, the transformations have
`the effect of adding a round to the cipher without
`the performance penalty.
`Aside from the fairness issue, we also question
`the usefulness of this measure: we feel that each al-
`gorithm should be judged on its own merits, and that
`changing the number of rounds is no different than
`changing the structure of the rounds. Also, NIST
`has not indicated that they would allow algorithms
`to be selected with modifications.
`And finally, in some cases we even question Bi-
`ham’s choice of values. Being about to break ten
`rounds of Twofish, for example, seems overly opti-
`mistic. If such a metric seems useful, we recommend
`that several cryptanalysts be polled for their opin-
`ions.
`Regardless of these concerns, we include Table 6,
`based on Biham’s numbers. We urge caution in us-
`ing this data. Biham also presented a table of rel-
`ative performance, but he used speeds for compiled
`C code [Bih98], which makes no sense for any appli-
`cation where speed is important.
`
`4 Performance on 64-bit CPUs
`
`We have some data regarding performance on the
`Alpha in Table 7. Aside from DFC and HPC, these
`numbers are based on analyzing the algorithms and
`
`8
`
`DivX, LLC Exhibit 2007
`Page 2007 - 8
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`

`

`Algorithm
`Name
`
`CAST-256
`Crypton
`DEAL
`DFC
`E2
`Frog
`HPC
`Loki97
`Magenta
`Mars
`RC6
`Rijndael
`SAFER+
`Serpent
`Twofish
`
`Rounds
`
`48
`12
`6
`8
`12
`8
`8
`16
`6
`32
`20
`10
`8
`32
`16
`
`Minimal MSR Encrypt MSR Encrypt
`Secure
`Pentium Pro
`Pentium
`Rounds ASM (clocks) ASM (clocks)
`40
`500*
`500*
`11
`316
`358
`9
`3300
`3300
`9
`844
`?
`10
`342*
`342*
`?
`?
`?
`?
`?
`?
`> 36
`?
`?
`> 10
`?
`?
`20
`200*
`344*
`20
`250
`700*
`8
`233
`256
`7
`700*
`963*
`17
`478*
`584*
`12
`194
`218
`
`Table 6: Minimal Secure Round Performance of AES Candidates with 128-bit Keys on Pentium-class CPUs.
`
`Algorithm
`Name
`CAST-256
`Crypton
`DEAL
`DFC
`E2
`Frog
`HPC
`Loki97
`Magenta
`Mars
`RC6
`Rijndael
`SAFER+
`Serpent
`Twofish
`
`Cycles
`600
`408
`2528*
`304
`471
`?
`376
`?
`?
`478
`467*
`340*
`656
`915
`360*
`
`Table 7: AES Candidate Performance on the DEC Alpha.
`
`9
`
`DivX, LLC Exhibit 2007
`Page 2007 - 9
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`

`

`studying data sheets and not on actual implemen-
`tations [Alm99]. Perhaps the most interesting thing
`is that, for many of the AES candidates, the Al-
`pha is actually slower, in terms of clocks per block,
`than the Pentium II. DFC is the fastest algorithm on
`the Alpha, followed by Rijndael, Twofish, HPC, and
`Crypton. All the other submisions are considerably
`slower.
`
`4.1 RC6
`RC6’s performance on the DEC Alpha processor is
`also hurt by its reliance on multiplications and rota-
`tions. The 21164 can run up to 700 MHz (or faster)
`and has a 4-way superscalar architecture. The Alpha
`can perform a signed 32-bit multiply in 8 clocks, and
`a second multiply operation can be started 4 clocks
`after the first, so two of them take 12 clocks to-
`tal. This multiply opcode should work for RC6’s un-
`signed multiplies, assuming that the signed multiply
`overflow is properly handled; otherwise, we would
`need to use the slower 64-by-64-bit multiply, adding
`8 more clocks. The remainder of the RC6 round
`logic takes an additional 11 clocks, due in no small
`part to the lack of rotation opcodes, which must be
`synthesized out of shifts. Thus, an RC6 round re-
`quires a total of 23 clocks on the Alpha. At twenty
`rounds, this is 460 clocks per block, about twice the
`clock count on a Pentium II. In other words, a 600
`MHz Alpha runs RC6 at a slower absolute speed (in
`Mbits/s) th

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket