`
`Bruce Schneier∗
`
`John Kelsey†
`
`Doug Whiting‡
`Niels Ferguson (cid:107)
`
`Version 2.0
`February 1, 1999
`
`David Wagner§
`
`Chris Hall¶
`
`1 Introduction
`
`The principal goal guiding the design of any encryp-
`tion algorithm must be security. In the real world,
`however, performance and implementation cost are
`always of concern. Making the assumption that the
`major AES candidates are secure (a big assumption,
`to be sure, but one that is best dealt with in an-
`other paper), the most important properties the al-
`gorithms will be judged on will be the performance
`and cost of implementation.
`In this paper, we will completely ignore secu-
`rity.
`Instead, we will compare the performance of
`the leading AES candidates on a variety of common
`platforms: 32-bit CPUs, 64-bit CPUs, cheap 8-bit
`smart-card CPUs, and dedicated hardware. For each
`platform, we first make some general observations
`on the performance issues for each of the platforms,
`then compare the various AES candidates, and fi-
`nally look at the specific issues for each of the can-
`didates.
`
`2 Performance as a Function of
`Key Length
`
`of the algorithms have different performance char-
`acteristics for different key lengths. The results are
`summarized in Table 1.
`Nine algorithms—CAST-256, Crypton, DFC,
`E2, Frog, HPC, Mars, RC6, and Serpent—are com-
`pletely independent of key length. Two algorithms—
`Loki97 and Twofish—encrypt and decrypt indepen-
`dent of key length, but take different times to set up
`different-size keys1. Twofish takes longer to set up
`a longer key, while Loki97 actually takes less time
`to set up a longer key than it does to set up a
`shorter key. Four algorithms—DEAL, Magenta, Ri-
`jndael, and SAFER+—encrypt and decrypt at dif-
`ferent speeds, depending on key length.
`In this paper, we will concentrate on key setup
`and encryption for 128-bit keys. Obviously, encryp-
`tion speed is more important than key setup speed.
`The reader should keep in mind that DEAL and Ma-
`genta encrypt 33% slower using 256-bit keys. Rijn-
`dael encrypts 20% slower for 192-bit keys and 40%
`slower for 256-bit keys. SAFER+ has the largest
`performance degradation for large keys; it is 50%
`slower for 192-bit keys and 100% slower for 256-bit
`keys.
`
`3 Performance on 32-bit CPUs
`
`The speed (both encryption and key setup) of most
`Efficiency on 32-bit CPUs is one of NIST’s stated
`AES candidates is independent of key length. That
`performance criteria. Actually, there are two dif-
`is, the time required to set up a key and encrypt
`ferent 32-bit efficiencies that need to be considered.
`a block of text is the same, regardless of whether
`The first is what NIST delineated: performance on
`the key is 128, 192, or 256 bits long. A minority
`∗Counterpane Systems; 101 E Minnehaha Parkway, Minneapolis, MN 55419, USA; schneier@counterpane.com.
`†Counterpane Systems; kelsey@counterpane.com.
`‡Hi/fn, Inc., 5973 Avenida Encinas Suite 110, Carlsbad, CA 92008, USA; dwhiting@hifn.com.
`§University of California Berkeley, Soda Hall, Berkeley, CA 94720, USA; daw@cs.berkeley.edu.
`¶Counterpane Systems; hall@counterpane.com.
`(cid:107)Counterpane Systems; niels@counterpane.com.
`1Twofish is actually much more complicated. There are ways to implement Twofish where the encryption speed is different
`for different key lengths, but they are only suitable for encrypting small text blocks. See [SKW+98a, WS98, SKW+99] for
`more details.
`
`1
`
`DivX, LLC Exhibit 2007
`Page 2007 - 1
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`
`
`Algorithm
`Name
`CAST-256 [Ada98]
`Crypton [Lim98]
`DEAL [Knu98]
`
`DFC [GGH+98]
`E2 [NTT98]
`Frog [GLC98]
`HPC [Sch98]
`Loki97 [BP98]
`Magenta [JH98]
`
`Mars [BCD+98]
`RC6 [RRS+98]
`Rijndael [DR98a]
`
`SAFER+ [CMK+98]
`
`Serpent [ABK98a]
`Twofish [SKW+98a]
`
`constant
`constant
`constant
`constant
`decreasing
`increasing
`
`constant
`constant
`increasing
`
`Key Setup Encryption
`constant
`constant
`constant
`constant
`increasing
`128,192: 6 rounds
`256: 8 rounds
`constant
`constant
`constant
`constant
`constant
`128,192: 6 rounds
`256: 8 rounds
`constant
`constant
`128: 10 rounds
`192: 12 rounds
`256: 14 rounds
`128: 8 rounds
`192: 12 rounds
`256: 16 rounds
`constant
`constant
`
`increasing
`
`constant
`increasing
`
`Table 1: Speed of AES Candidates for Different Key Lengths.
`
`the high-end Pentium Pro (and similar) CPUs that
`are likely to dominate desktop computing for the
`next few years. To be sure, this is an arbitrary
`measure—in ten years, using Pentium Pro perfor-
`mance as a yardstick will seem as quaint as com-
`paring algorithm performance on 80286 chips does
`today—but it is a good measure of relative high-
`end performance. Even when 64-bit CPUs become
`commonplace, algorithms that are the fastest on the
`Pentium Pro are likely to remain the fastest on these
`new microprocessors.
`The second is performance on low-end 32-bit
`CPUs—such as the 80386 and 68000 variants, as
`well as various simple RISC chips—that will increas-
`ingly replace 8-bit CPUs on high-end smart cards
`and computing-intensive embedded applications.
`Performance on these commodity 32-bit CPUs is
`much more important than performance on the Pen-
`tium Pro/II. Over the course of AES’s lifetime, the
`architecture of desktop computing will likely change
`beyond recognition. The Pentium Pro/II architec-
`ture has oddities not present in other 32-bit CPUs
`and also not present in Merced (now called Pentium
`III) and other 64-bit CPUs. But just as 8-bit CPUs
`are still being used today (and will still be used
`tomorrow), commodity 32-bit CPUs will find their
`
`particular price/performance “sweet spot” in widely
`deployed embedded applications. Performance on
`these CPUs will still be important a decade from
`now.
`
`3.1 Comparing Performance on 32-
`bit CPUs
`
`We looked at the performance of the major AES can-
`didates on both the Pentium and the Pentium Pro.
`We either used the data in the candidate’s AES sub-
`mission documentation or, where such data was un-
`available or unreliable, calculated our own or used
`the data from Brian Gladman’s web page [Gla98].
`Where we were unsure if a particular optimization
`technique would work, we gave the algorithm the
`benefit of the doubt. When there were multiple
`sources of data, we took the fastest. The results are
`shown in Table 2. Numbers derived from estimates
`and not from an actual implementation are marked
`with an asterisk. Crypton and Rijndael have very
`different key setup speeds for encryption and decryp-
`tion. Rijndael’s encryption key setup is 300 clocks,
`and its decryption key setup is 1400 clocks. Cryp-
`tion’s encryption and decryption key setups are 540
`and 1370 clocks, respectively. For all tables except
`
`2
`
`DivX, LLC Exhibit 2007
`Page 2007 - 2
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`
`
`Algorithm
`Name
`
`CAST-256
`Crypton
`DEAL
`DFC
`E2
`Frog
`HPC
`Loki97
`Magenta
`Mars
`RC6
`Rijndael
`SAFER+
`Serpent
`Twofish
`
`Encrypt
`Encrypt
`Encrypt
`Key Setup
`Pentium
`Pentium Pro
`Pentium Pro Pentium Pro
`C (clocks)
`C (clocks) ASM (clocks) ASM (clocks)
`4300
`660
`600*
`600*
`955
`476
`345
`390
`4000*
`2600
`2200
`2200
`7200
`1700
`750
`?
`2100
`720
`410
`410*
`1386000
`2600
`?
`?
`120000
`1600
`?
`?
`7500
`2150
`?
`?
`50
`6600
`?
`?
`4400
`390
`320*
`550*
`1700
`260
`250
`700*
`850
`440
`291
`320
`4000
`1400
`800*
`1100*
`2500
`1030
`900*
`1100*
`8600
`400
`258
`290
`
`Table 2: AES Candidates’ Performance with 128-bit Keys on Pentium-Class CPUs.
`
`the 32-bit CPU (RC6 and Mars), while others are
`that for hashing, we used the average.
`largely CPU-independent.
`The C performance numbers are included be-
`cause they were required by NIST, and were in-
`For bulk encryption, the fastest algorithms on
`cluded in each of the algorithms’ submission doc-
`general 32-bit CPUs are (in order) Twofish, Rijn-
`umentation. Since in any application where encryp-
`dael, Crypton, and E2, followed by Mars and RC6.
`tion speed is a potential bottleneck, the algorithm
`Both Mars and RC6 are significantly faster on 32-bit
`will be hand-coded in optimized assembly language,
`CPUs with fast 32-bit multiplications and variable
`we primarily focused on assembly language imple-
`rotations, like the Pentium Pro and Pentium II, but
`mentations. The algorithm is a discrete and well-
`are significantly slower on CPUs without these fea-
`defined chunk of code with a single entry point that
`tures. In fact, both of these algorithms also run more
`needs to run fast: a perfect example of something
`slowly than the four fastest on a DEC Alpha. Re-
`that should be coded in assembly.2
`stricting the comparison to the Pentium Pro/II (the
`This comparison is for 128-bit keys; as noted ear-
`NIST reference platform), the fastest algorithms are
`lier, some algorithms have a running time that de-
`(in order) RC6, Twofish, Mars, Rijndael, Crypton,
`pends on the size of the key. We concentrated on
`and E2. All candidates other than these six are sig-
`128-bit keys, as these are the fastest (key setup in
`nificantly slower across all 32-bit CPUs.
`Loki97 is the only exception), and will provide ad-
`For comparison, on a Pentium, DES encrypts at
`equate security for virtually any application for the
`43 clock cycles per byte, IDEA at 74 clock cycles per
`next several decades (at least).
`byte, 16-round RC5 at 25 clock cycles per byte, and
`The AES submissions vary greatly in their 32-bit
`Blowfish at 20 clock cycles per byte [PRB98].
`CPU performance, from 250 clock cycles per block
`For encryption of short blocks of plaintext, it
`(RC6) to 6600 (Magenta). Some candidates’ perfor-
`makes sense to look at both encryption speed and
`mances depends heavily on the particular details of
`2For fastest assembly-language performance on current Intel CPUs, Twofish’s implemementation compiles key-dependent
`data directly in the encryption and decryption code. In this implementation, a separate copy of the executable code is stored
`for each key along with the key-dependent S-box tables. This copy of the code is modified by patching immediate values in
`the code with actual subkey values, removing the need to load the subkey from data cache during execution and freeing up
`registers. Although this key-specific code could be technically considered to be “self-modifying,” the code is totally reentrant,
`and the process is much more akin to compilation. This compilation trick seems to be helpful only in register-poor CPUs like
`the Pentium or Pentium Pro, where the availability of an extra register can make a significant difference in performance. Most
`other AES candidates could also benefit somewhat from using a similar compiled mode, but the resulting gains are believed to
`be fairly small (considerably less than 10%). Twofish has been specially designed to take advantage of this compiled option,
`achieving about 20% faster execution than a naive assembly language implementation on a Pentium Pro.
`
`3
`
`DivX, LLC Exhibit 2007
`Page 2007 - 3
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`
`
`Twofish are slower, but also provide acceptable hash
`function performance. SAFER+ and Serpent are
`marginal.
`All of these speeds are significantly slower than
`dedicated hash functions.
`SHA-1,
`for example,
`hashes at a rate of 13 clock cycles per byte on a
`Pentium, and RIPEMD-160 hashes at 16 clock cy-
`cles per byte on the same machine [PRB98]. Thus,
`it is likely that AES will be used as a hash function
`only when space (either code space in software or
`gates in hardware) is at a premium.
`
`3.2 Algorithm-Specific Comments on
`32-bit Performance
`3.2.1 CAST-256
`
`CAST-256 is an incomplete Feistel network. Each
`round is fast—17 clock cycles—but 48 rounds gives
`a total speed of 815 clock cycles per block. Hence,
`CAST-256 is about three times slower than the
`fastest AES candidates, even though it uses mostly
`32-bit operations. CAST’s use of only simple RISC
`operations implies that the relative performance of
`CAST-256 will be fairly uniform across processor
`types.
`
`key setup speed. Again, RC6’s and Mars’s reliance
`on CPU-specific operations means that we need
`to consider general 32-bit CPUs and the Pentium
`Pro/II separately. Table 3 compares the speed to
`key and encrypt of six of the AES candidates for
`different size texts on the Pentium (and by exten-
`sion, most general-purpose 32-bit CPUs.
`To calculate key setup speeds in assembly, we es-
`timated assembly-language key setup time by taking
`the percentage speedup between C and assembly en-
`cryption on the Pentium Pro/II, and applying that
`to the assembly-language key setup.
`Rijndael is the fastest algorithm for small blocks
`of text, followed closely by Crypton. What is sur-
`prising, though, is how quickly all the algorithms
`converge to their raw encryption speed. For 1024-
`byte texts—the size of a very small e-mail message—
`all the algorithms are within 15%.
`For 4096-
`byte texts—the size of a more reasonable e-mail
`message—the speed ordering of the algorithms is the
`same as it would be if the key schedule were not
`taken into account. Note that the AES candidates
`not listed here are slower.
`Table 4 compares the speed to key and encrypt of
`seven of the AES candidates for different size texts
`on the Pentium Pro/II. The primary difference of
`note is how much faster RC6 and Mars are on this
`one platform.
`The speed of the key schedule becomes even more
`important when we look at the algorithm as a hash
`function. There are many hash function construc-
`tions that rely on block ciphers as primitives, and
`all involve changing the encryption key with every
`block. Table 5 summarizes this data.
`These values are the sum of the number of clock
`cycles required to set up a key plus the number of
`clock cycles required to encrypt one block. Starting
`with the data in Table 2, we calculated the Pentium
`Pro/II assembly key setup time by multiplying the
`C key setup time by the percentage speed improve-
`ment between the encryption speed in C and assem-
`bly; this is a reasonable first estimate of the speed
`improvement of taking the key setup from C to as-
`sembly. Then we added in the encryption speeds in
`both Pentium Pro/II and Pentium. While the real
`performance numbers are likely to be somewhat dif-
`ferent, this is a reasonable first estimate.
`The fastest hash function is Rijndael, followed
`closely by Crypton.3 These two algorithms are par-
`ticularly suited as hash functions because key setup
`in the encryption direction is much faster than key
`DEAL has basically the same performance charac-
`setup in the decryption direction. E2, RC6, and
`teristics as triple-DES. It is a six-round Feistel net-
`3Previous versions of this report mentioned possible weaknesses in Crypton’s key schedule, making it unsuitable as a hash
`function. In the interest of fairness, we withhold judgment on this issue pending better documentation of the weaknesses.
`
`Crypton is a very clever enhancement to the Square
`algorithm [DKR97]. It is structured so that encryp-
`tion and decryption are identical operations (with
`reversed subkey schedules), and there are two sep-
`arate 8-by-8-bit S-boxes used in the transforma-
`tion, which are easily built from 4-bit permutations.
`Crypton replaces Square’s MDS matrix with a sim-
`pler (and admittedly less powerful) diffusion opera-
`tion that can be implemented very cheaply in hard-
`ware and on smart cards. Perhaps because the S-
`boxes and diffusion are weaker than in Square, Cryp-
`ton consists of twelve rounds instead of eight for
`Square (and ten, twelve, or fourteen for Rijndael).
`An optimized version on a 32-bit CPU looks iden-
`tical to Square except the fixed lookup tables are dif-
`ferent, so performance scales based on the number
`of rounds. Encryption speed is 390 clocks per block
`in Pentium/Pentium Pro assembler, and is uniform
`across different 32-bit CPUs.
`
`3.2.2 Crypton
`
`3.2.3 DEAL
`
`4
`
`DivX, LLC Exhibit 2007
`Page 2007 - 4
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`
`
`Text Size
`(bytes)
`16
`32
`64
`128
`256
`512
`210
`211
`212
`213
`214
`215+
`
`Crypton
`73
`49
`37
`30
`27
`26
`25
`25
`25
`24
`24
`24
`
`E2 Mars RC6 Rijndael
`100
`260
`146
`59
`63
`147
`95
`39
`44
`91
`69
`30
`35
`63
`57
`25
`30
`48
`50
`22
`38
`41
`47
`21
`27
`38
`45
`21
`26
`36
`45
`20
`26
`35
`44
`20
`26
`35
`44
`20
`26
`35
`44
`20
`26
`34
`44
`20
`
`Serpent Twofish
`205
`175
`137
`119
`103
`91
`86
`70
`77
`48
`73
`38
`71
`31
`70
`25
`69
`22
`69
`21
`69
`20
`69
`19
`
`Table 3: Clock Cycles, per Byte, to Key and Encrypt Different Text Sizes on a Pentium.
`
`Text Size
`(bytes)
`16
`32
`64
`128
`256
`512
`210
`211
`212
`213
`214
`215+
`
`Crypton
`70
`46
`34
`28
`25
`23
`22
`22
`22
`22
`22
`22
`
`E2 Mars RC6 Rijndael
`100
`246
`118
`53
`63
`133
`67
`36
`44
`76
`41
`27
`35
`48
`28
`23
`30
`34
`22
`20
`28
`27
`19
`19
`27
`24
`17
`19
`26
`22
`16
`18
`26
`21
`16
`18
`26
`20
`16
`18
`26
`20
`16
`18
`26
`20
`16
`18
`
`Serpent Twofish
`193
`132
`125
`93
`90
`73
`73
`64
`65
`48
`61
`33
`58
`25
`57
`20
`57
`18
`57
`17
`56
`17
`56
`16
`
`Table 4: Clock Cycles, per Byte, to Key and Encrypt Different Text Sizes on a Pentium Pro/II.
`
`work with DES itself used as the round function.
`The Pentium and Pentium Pro performance is thus
`about 2000 clocks per block. In a sense, DEAL is
`a straw-man algorithm that provides a worst-case
`performance benchmark for AES.
`
`the fastest candidates. Due to the heavy reliance on
`multiply operations, we expect the performance on
`general 32-bit CPUs to be significantly worse. With
`only eight rounds, the speed vs.
`security tradeoff
`for DFC feels mismatched compared to most of the
`other AES candidates.
`
`3.2.4 Decorrelated Fast Cipher
`
`DFC is an eight-round Feistel cipher, in which the
`round function includes a 64-bit modular addition
`and multiplication over 264 + 13, as well as a single
`small (6-by-32-bit) S-box lookup. The modular mul-
`tiply does a nice job of diffusing bits, but it also hurts
`performance significantly. Despite the low number
`of rounds, DFC takes 750 clocks per block in assem-
`bly language on a Pentium Pro. This is about the
`speed of DES and nearly three times slower than
`
`3.2.5 E2
`
`E2 is a twelve-round Feistel cipher, with each round
`function including two levels of subkey XOR and
`8-by-8-bit S-box lookup, with linear mixing in be-
`tween. Only simple RISC operations are used, so
`the performance should be relatively uniform across
`processor types. The speed of E2 is reasonable at
`410 clocks per block in assembly language (making
`it the fourth fastest on the Pentium and the sixth
`
`5
`
`DivX, LLC Exhibit 2007
`Page 2007 - 5
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`
`
`Algorithm
`Name
`
`CAST-256
`Crypton
`DEAL
`DFC
`E2
`Frog
`HPC
`Loki97
`Magenta
`Mars
`RC6
`Rijndael
`SAFER+
`Serpent
`Twofish
`
`Hash Speed
`Hash Speed
`Pentium
`Pentium Pro
`ASM (clocks) ASM (clocks)
`282*
`282*
`46*
`49*
`349*
`349*
`245*
`?
`100*
`100*
`?
`?
`?
`?
`?
`?
`?
`?
`246*
`260*
`118*
`146*
`32*
`34*
`193*
`212*
`193*
`205*
`132
`175
`
`Table 5: Hash-Function Performance, per Byte, of AES Candidates (128-bit key) on Pentium and Pentium
`Pro/II.
`
`fastest on the Pentium Pro/II), and is uniform across
`different 32-bit CPUs.
`
`3.2.6 Frog
`
`Frog has a byte-oriented structure. It seems that any
`implementation has to operate on individual bytes,
`which means that Frog cannot take advantage of 32-
`bit operations. This hurts the performance on larger
`CPUs. Frog’s key schedule is by far the slowest of
`any AES candidate.
`
`3.2.7 Hasty Pudding Cipher
`
`HPC was optimized for 64-bit CPUs. The heavy use
`of rotations of 64-bit words are difficult to implement
`(and therefore slow) on most 32-bit CPUs.
`
`3.2.8 Loki97
`
`Loki97 is one of the few algorithms that uses a bit-
`level permutation. Although these are very fast in
`hardware, their cost in software is similar to that of
`a full S-box layer. Performance on 32-bit CPUs is
`poor compared to most other candidates, and may
`be slower than triple-DES.
`
`3.2.9 Magenta
`
`Magenta is, by far, the slowest of any of the AES
`candidates. The core of the algorithm is formed by
`
`a byte-level FFT-like structure. This seems to force
`any 32-bit implementation to use 8-bit manipula-
`tions, which is generally inefficient. Magenta also
`has a very large number of operations per round.
`Each Π function contains 16 8-bit table lookups.
`There are four Π functions in each T , and three
`T functions in each round for a total of 192 S-box
`lookups per round, and 1152 per encryption with a
`128-bit key. (As a comparison, Frog uses a total of
`128 S-box lookups per encryption.) There seems to
`be no way of implementing Magenta in software at
`a speed comparable to the other candidates.
`
`3.2.10 Mars
`
`Mars is a fast cipher that relies on a veritable
`“kitchen sink” of primitive operator elements, in-
`cluding rotation, multiplication, and 8-by-32-bit S-
`box lookups. Mars has the fastest claimed speed for
`a C version (390 clocks per block) on the Pentium
`Pro of any of the AES candidates. However, this
`speed advantage is largely due to the use of a dif-
`ferent optimizing compiler (DJGPP), which would
`probably benefit other candidates dramatically as
`well; using the standard AES compiler (Borland),
`Mars is fairly slow at 900 clocks/block.
`Mars encryption has three phases: an initial for-
`ward mixing, the cryptographic “core,” and a final
`mixing. The mixing phases are unkeyed (except for
`input/output whitening) and constitute roughly two
`
`6
`
`DivX, LLC Exhibit 2007
`Page 2007 - 6
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`
`
`cryptographic cycles each. The cryptographic core
`consists of sixteen rounds of an unbalanced Feistel ci-
`pher (totaling roughly four cycles), with each round
`using a 32-bit to 96-bit keyed expansion function E.
`Both fixed and data-dependent rotations are used
`extensively throughout the cipher. A careful study
`of the encryption round reveals a fairly long crit-
`ical path through the E function that limits the
`throughput. On a Pentium Pro, this critical path
`appears to be at least 12 clock cycles, which leads
`to a core processing time of at least 192 clocks per
`block. With the forward and reverse mixing requir-
`ing at least 100 clocks, the overall speed of this algo-
`rithm in assembler on a Pentium Pro appears to be
`no faster than 310-320 clocks. Because of its use of
`data-dependent rotation and multiplication, the per-
`formance of Mars on a Pentium or Pentium MMX
`chip is considerably slower: about 550 clocks/block.
`As with RC6, the relative performance of Mars
`across processor types will vary considerably, and
`timing attacks are also of concern in implementa-
`tions where multiplies and data-dependent rotations
`do not take constant time. Compared to RC6, Mars
`is somewhat slower and far less elegant, yet it retains
`most of the performance concerns discussed below
`for RC6.
`
`3.2.11 RC6
`
`RC6 is by far the most elegant and easily under-
`stood AES candidate.
`It is remarkably simple, as
`has become almost expected of Rivest’s ciphers (e.g.,
`RC4, RC5). On the AES target platform (Pentium
`Pro/Pentium II), it is also the fastest algorithm, at
`about 250 clocks per block in assembly language.
`The unrolled code size of the encryption routine is
`very modest, and a looping version with very good
`performance can be implemented with an extremely
`small code size (under 200 bytes on the Pentium
`family). The performance in C is very good, even
`though it is hurt considerably by the lack of intrin-
`sic support for rotations in most C compilers.
`However, RC6 does not perform anywhere near
`as well on many other common platforms. For ex-
`ample, on the Pentium (or Pentium MMX) chip,
`multiplication and variable rotation opcodes do not
`pair, and they require 10 and 4 clocks, respectively.
`Thus, the time required for multiplication and rota-
`tion alone is 560 clocks per block on a Pentium, with
`the other RC6 operations increasing this total to at
`least 700 clocks for an optimized assembly version.
`Thus, RC6 is nearly three times slower on a Pentium
`than on a Pentium Pro. By comparison, most other
`candidate algorithms have identical or nearly identi-
`
`cal performance on the two platforms. On a 386/486
`(and many embedded CPUs), the multiplication and
`rotation speed is also fairly slow.
`In general, be-
`cause it does not rely on “basic” RISC instructions
`as most other AES candidates do, RC6’s relative
`performance will be much less uniform across pro-
`cessor types. This makes it much less attractive for
`low-end 32-bit CPUs.
`The fact that many processors (e.g., 486, 68000,
`smart cards) have data-dependent execution times
`for multiplication/rotation is also a concern, since
`RC6 could potentially leak key information to a tim-
`ing attack. Constant-time implementations of data-
`dependent rotations on those platforms is possible,
`but at a significant speed penalty.
`
`3.2.12 Rijndael
`
`Rijndael is another variant of Square: a fast cipher
`that works very well across all platforms. It boasts
`a clean mathematical structure involving only ta-
`ble lookup and XOR operations. Although the en-
`cryption and decryption algorithms are not exactly
`identical, their general structure and performance
`are virtually indistinguishable.
`Rijndael’s assembly language on both the Pen-
`tium and Pentium Pro processors is about 300 clocks
`per block. Unlike RC6 and Mars, there are no known
`CPU platforms (8-bit or 32-bit) on which Rijndael’s
`relative performance would be unduly negatively af-
`fected or on which timing attacks would be possible.
`
`3.2.13 SAFER+
`
`SAFER+ is obviously designed for 8-bit CPUs. Ev-
`ery operation in SAFER+ is a simple RISC oper-
`ation, but there are no 32-bit operations anywhere.
`This lack of 32-bit operations, particularly in the dif-
`fusion phase involving the PHT and the “Armenian
`shuffle,” significantly hurts the relative performance
`of this algorithm on modern CPUs. A rough esti-
`mate is that SAFER+ with a 128-bit key requires
`800 clocks per block on a Pentium Pro in assembly,
`which is slower per-byte than DES. On a Pentium,
`the speed slows to about 1100 clocks per block. The
`throughput decreases further for the larger key sizes;
`in fact, SAFER+ with a 256-bit key is slower than
`triple-DES. As an SP-network, SAFER+ with a 128-
`bit key has only one-fourth the number of cycles of
`Serpent, yet it is approximately the same speed. In
`general, the speed vs. security tradeoff for SAFER+
`feels mismatched compared to most of the other AES
`candidates.
`
`7
`
`DivX, LLC Exhibit 2007
`Page 2007 - 7
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`
`
`3.2.14 Serpent
`
`Serpent is an extremely conservative SP-network ci-
`pher, carefully designed to allow “bit-slice” imple-
`mentations on 32-bit CPUs. Its authors claim that
`Serpent’s performance on a Pentium Pro is at least
`as fast as DES [ABK98a], but this claim involves
`C implementations, which are always subject to the
`vagaries of compiler optimization.
`In assembly language, Serpent is actually con-
`siderably slower than a good DES implementation,
`which can run at about 45 clocks/byte on a Pen-
`tium. A conservative estimate, counting only ALU
`operations and assuming (very optimistically) that
`all memory accesses can be accomplished for “free,”
`shows that Serpent requires at least 900 clocks per
`block on a Pentium Pro (56 clocks/byte); in fact, 960
`clocks seems like a more likely practical bound. On
`a Pentium, this number rises to at least 1100 clocks
`per block. An optimized C version can achieve about
`1020 clocks per block on a Pentium Pro, proving that
`Serpent compiles very well from C (much better than
`does DES)—which is an interesting fact but hardly
`of great importance here.
`The fact remains that the best Serpent software
`implementation will be 3–4 times slower than other
`candidates, including Mars, Twofish, RC6, and Ri-
`jndael. Thus, even a non-conservative version of Ser-
`pent with only sixteen rounds would be considerably
`slower than these other algorithms. The nature of
`the algorithm, which uses basic RISC operations ex-
`clusively, strongly suggests that its relative perfor-
`mance will not be negatively affected on any plat-
`form.
`
`3.2.15 Twofish
`
`Twofish is the fastest AES candidate on a Pentium
`(and other 32-bit CPUs), and is second-fastest on
`the Pentium Pro/II. Its key setup is about average
`(eight are faster, and six are slower); however, for
`applications where key setup is a large factor in en-
`cryption speed (for small amounts of text), there
`are alternate implementations of Twofish that trade
`off encryption speed for key setup speed. As a hash
`function, for example, Twofish is much more efficient
`using one of these alternative implementations. See
`[SKW+98a] for details.
`Twofish uses only basic RISC operations, and
`Twofish’s performance on other 32-bit platforms will
`be comparable.
`
`of
`
`Performance
`3.3 Comparing
`“Minimal Secure Variants”
`In [Bih98], Biham introduced the notion of compar-
`ing the algorithms based on their “minimal secure
`variants.” Different design teams were more or less
`conservative than each other; the number of rounds
`in their final submissions was not a fixed percent-
`age of the number of rounds they could success-
`fully cryptanalyze. Biham tried to normalize the
`algorithms by determining the minimal number of
`rounds that is secure (either as described by the de-
`signers or other cryptographers, or Biham’s “best
`guess”), and then added a standard two cycles.
`We do not believe that this measurement is a fair
`way to compare algorithms. Two examples illustrate
`this. One, Biham maintains that the minimal num-
`ber of rounds for DFC is 7 and for Serpent is 15.
`It seems unreasonable to add two rounds to each,
`when that is a 29% increase for DFC and only a
`13% increase for Serpent. The comparative analy-
`sis clearly shows that one round of DFC is stronger
`than one round of Serpent. Two, Biham’s measures
`don’t take into account initial and final transforma-
`tions for some of the ciphers: e.g., the whitening in
`Twofish and the initial/final transformations in E2.
`At least in Twofish’s case, the transformations have
`the effect of adding a round to the cipher without
`the performance penalty.
`Aside from the fairness issue, we also question
`the usefulness of this measure: we feel that each al-
`gorithm should be judged on its own merits, and that
`changing the number of rounds is no different than
`changing the structure of the rounds. Also, NIST
`has not indicated that they would allow algorithms
`to be selected with modifications.
`And finally, in some cases we even question Bi-
`ham’s choice of values. Being about to break ten
`rounds of Twofish, for example, seems overly opti-
`mistic. If such a metric seems useful, we recommend
`that several cryptanalysts be polled for their opin-
`ions.
`Regardless of these concerns, we include Table 6,
`based on Biham’s numbers. We urge caution in us-
`ing this data. Biham also presented a table of rel-
`ative performance, but he used speeds for compiled
`C code [Bih98], which makes no sense for any appli-
`cation where speed is important.
`
`4 Performance on 64-bit CPUs
`
`We have some data regarding performance on the
`Alpha in Table 7. Aside from DFC and HPC, these
`numbers are based on analyzing the algorithms and
`
`8
`
`DivX, LLC Exhibit 2007
`Page 2007 - 8
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`
`
`Algorithm
`Name
`
`CAST-256
`Crypton
`DEAL
`DFC
`E2
`Frog
`HPC
`Loki97
`Magenta
`Mars
`RC6
`Rijndael
`SAFER+
`Serpent
`Twofish
`
`Rounds
`
`48
`12
`6
`8
`12
`8
`8
`16
`6
`32
`20
`10
`8
`32
`16
`
`Minimal MSR Encrypt MSR Encrypt
`Secure
`Pentium Pro
`Pentium
`Rounds ASM (clocks) ASM (clocks)
`40
`500*
`500*
`11
`316
`358
`9
`3300
`3300
`9
`844
`?
`10
`342*
`342*
`?
`?
`?
`?
`?
`?
`> 36
`?
`?
`> 10
`?
`?
`20
`200*
`344*
`20
`250
`700*
`8
`233
`256
`7
`700*
`963*
`17
`478*
`584*
`12
`194
`218
`
`Table 6: Minimal Secure Round Performance of AES Candidates with 128-bit Keys on Pentium-class CPUs.
`
`Algorithm
`Name
`CAST-256
`Crypton
`DEAL
`DFC
`E2
`Frog
`HPC
`Loki97
`Magenta
`Mars
`RC6
`Rijndael
`SAFER+
`Serpent
`Twofish
`
`Cycles
`600
`408
`2528*
`304
`471
`?
`376
`?
`?
`478
`467*
`340*
`656
`915
`360*
`
`Table 7: AES Candidate Performance on the DEC Alpha.
`
`9
`
`DivX, LLC Exhibit 2007
`Page 2007 - 9
`Netflix Inc. et al. v. DivX, LLC, IPR2020-00614
`
`
`
`studying data sheets and not on actual implemen-
`tations [Alm99]. Perhaps the most interesting thing
`is that, for many of the AES candidates, the Al-
`pha is actually slower, in terms of clocks per block,
`than the Pentium II. DFC is the fastest algorithm on
`the Alpha, followed by Rijndael, Twofish, HPC, and
`Crypton. All the other submisions are considerably
`slower.
`
`4.1 RC6
`RC6’s performance on the DEC Alpha processor is
`also hurt by its reliance on multiplications and rota-
`tions. The 21164 can run up to 700 MHz (or faster)
`and has a 4-way superscalar architecture. The Alpha
`can perform a signed 32-bit multiply in 8 clocks, and
`a second multiply operation can be started 4 clocks
`after the first, so two of them take 12 clocks to-
`tal. This multiply opcode should work for RC6’s un-
`signed multiplies, assuming that the signed multiply
`overflow is properly handled; otherwise, we would
`need to use the slower 64-by-64-bit multiply, adding
`8 more clocks. The remainder of the RC6 round
`logic takes an additional 11 clocks, due in no small
`part to the lack of rotation opcodes, which must be
`synthesized out of shifts. Thus, an RC6 round re-
`quires a total of 23 clocks on the Alpha. At twenty
`rounds, this is 460 clocks per block, about twice the
`clock count on a Pentium II. In other words, a 600
`MHz Alpha runs RC6 at a slower absolute speed (in
`Mbits/s) th