doi:10.1006/abio.2000.4744, available online at http://www.idealibrary.com
`
`A Holistic Approach for Protein Secondary Structure
`Estimation from Infrared Spectra in H2O Solutions
`
`Ganesh Vedantham,* H. Gerald Sparks,†,1 Samir U. Sane,†,2 Stelios Tzannis,†,3
`and Todd M. Przybycien*,4
`*Applied Biophysics Laboratory, Department of Chemical Engineering, Carnegie Mellon University,
`Pittsburgh, Pennsylvania 15213; and †Howard P. Isermann Department of Chemical Engineering,
`Rensselaer Polytechnic Institute, Troy, New York 12180
`
`Received January 31, 2000
`
`We present an improved technique for estimating pro-
`tein secondary structure content from amide I and
`amide III band infrared spectra. This technique com-
`bines the superposition of reference spectra of pure sec-
`ondary structure elements with simultaneous aromatic
`side chain, water vapor, and solvent background sub-
`traction. Previous attempts to generate structural refer-
`ence spectra from a basis set of reference protein spec-
`tra have had limited success because of inaccuracies
`arising from sequential background subtractions and
`spectral normalization, arbitrary spectral band trunca-
`tion, and attempted resolution of spectroscopically de-
`generate structure classes. We eliminated these inaccu-
`racies by defining a single mathematical function for
`protein spectra, permitting all subtractions, normaliza-
`tions, and amide band deconvolution steps to be per-
`formed simultaneously using a single optimization algo-
`rithm. This approach circumvents many of the problems
`associated with the sequential nature of previous meth-
`ods, especially with regard to removing the subjectivity
`involved in each processing step. A key element of this
`technique was the calculation of reference spectra for
`ordered helix, unordered helix, sheet, turns, and unor-
`dered structures from a basis set of spectra of well-char-
`acterized proteins. Structural reference spectra were
generated in the amide I and amide III bands, both of
which have been shown to be sensitive to protein secondary
structure content. We accurately account for overlaps
between amide and nonamide regions and allow different
structure types to have different extinction coefficients.
The agreement between our structure estimates, for proteins
both inside and outside the basis set, and the corresponding
determinations from X-ray crystallography is good.
© 2000 Academic Press
Key Words: infrared spectroscopy; spectral deconvolution;
protein secondary structure; reference spectra.

1 Current address: DuPont Experimental Station, Route 141 and
Henry Clay Road, Wilmington, DE 19880.
2 Current address: Genentech, Inc., 1 DNA Way, South San
Francisco, CA 94080.
3 Current address: Inhale Therapeutics, 150 Industrial Road, San
Carlos, CA 94070.
4 To whom correspondence should be addressed. Fax: (412) 268-7139.
E-mail: todd@andrew.cmu.edu

0003-2697/00 $35.00
Copyright © 2000 by Academic Press
All rights of reproduction in any form reserved.
`
`Fourier-transformed infrared (FTIR)5 spectroscopy
`is perhaps the most versatile spectroscopic technique
`for analyzing protein secondary structure in diverse
physicochemical environments. FTIR spectroscopy has
`been applied to investigate protein structure in solu-
`tion (1, 2), in aggregates and inclusion bodies (3, 4), as
`well as during lyophilization (5–7) and freeze/thaw pro-
`cessing (8). In addition, attenuated total reflection
`(ATR) FTIR spectroscopy is ideal for studying protein
`adsorption onto catheter surfaces (9), chromatographic
`media (10 –12), and a variety of other polymeric sur-
`faces (13–15).
`In the past decade, a plethora of methods to estimate
`protein secondary structure contents via analysis of
amide I, II, and III band spectra have been reported.
`These methods include, but are not limited to, solitary
`use or combinations of factor analysis (FA) (16 –20),
`singular value decomposition (SVD) (21, 22), Fourier
`self-deconvolution (FSD; or resolution enhancement)
`(14, 23–26), second derivative (SD) band identification
`and fitting (27–29), and the development of spectral
`correlation coefficients (30, 31). Recent reviews of these
`
`5 Abbreviations used: FTIR, Fourier-transformed infrared; ATR,
`attenuated total reflection; FA, factor analysis; SVD, singular value
`decomposition; FSD, Fourier self-deconvolution; SD, second deriva-
`tive; R.H.S., right-hand side; GL, Gaussian–Lorentzian.
`
`
`Chugai Exhibit 2306
`Pfizer, Inc. v. Chugai Pharmaceutical Co. Ltd.
`IPR2017-01357
`Page 00001
`
`
`
`
`VEDANTHAM ET AL.
`
`techniques by Pelton and McLean (32) and Jackson
`and Mantsch (33) are instructive. This large body of
`work devoted to protein secondary structure estima-
`tion from infrared spectra has led to a number of dis-
`crepancies that persist throughout the literature.
`In a classic work cited by virtually every researcher
`in the field, Byler and Susi (24) used FSD to analyze
the spectra of 21 globular proteins in ²H₂O and were
able to assign components of amide I band spectra to
helices, β-sheet, turns, and random (unordered) structure.
By their method, segments with similar structure
do not necessarily exhibit peaks with identical frequencies
from protein to protein. For example, Byler and
Susi (24) reported frequencies varying from 1651 to
1657 cm⁻¹ for helical vibrations in proteins, and frequencies
for homopolypeptides in helical conformations
have been reported as low as 1634 cm⁻¹ (24). Also,
`Chirgadze et al. (34) reported that for helical struc-
`tures, the corresponding peak width increases with
`decreasing helical order. In light of this, when decon-
`voluting protein amide bands, many algorithms in-
`volve subjective peak assignments or allow the peak
`positions and widths to vary during the structure esti-
`mation procedure. To circumvent these difficulties,
`many authors have invoked either resolution enhance-
`ment or second derivative techniques to help identify
`the positions of relevant peaks, followed by the assign-
`ment of a structure type to each peak and a fit of each
`peak with Gaussian and/or Lorentzian distribution
`functions. However, significant bias in the results can
`still be introduced because choices of the resolution
`enhancement factor for FSD and the peak assignments
`in both methods are subjective.
An alternative to case-specific peak assignment
methods is the direct or indirect development of structural
reference spectra, or eigenspectra, that theoretically
represent either pure motifs, such as α-helix,
β-sheet, and turns, or linear combinations of pure motifs
(33). These idealized spectra are then fit to the
spectrum of a protein of unknown structure by varying
the corresponding motif fractions. These fractions
`serve as the weighting factors in a linear superposition
`scheme. The reference spectra are generated by the
`decomposition of a calibration set or basis set of real
`protein spectra covering a broad range of structural
`fractions, utilizing methods such as SVD, band fitting,
`or matrix inversion (17, 19, 21, 35). The reference spec-
`tra approach has been successfully applied to both CD
`spectra (36) and Raman spectra (37), but has had
`mixed results when applied to FTIR spectra (19, 21).
`Contrary to the results of Byler and Susi (24), the
`reference spectra method assigns fixed positions to
`peaks representing the various structure motifs. How-
`ever, as will be demonstrated by the results of this
`work, the mixed success of the past reference spectra
`methods for protein secondary structure predictions
`
`from FTIR spectra is associated with the structure
`class assignments and not the seeming contradiction
`with the work of Byler and Susi (24).
`In addition to uncertainty in peak positions and as-
`signments, the shortcomings of most previous routines
involve the sequential subtraction of background solvent
and water vapor contributions to the protein solution
spectra, followed by an arbitrary baseline assignment
to isolate the amide region of interest.
Baseline correction can be a function of operator experience
with the subtraction procedure (38). Obtaining a
so-called “flat-region” in the 1750–2200 cm⁻¹ frequency
range is the typical criterion used for bulk water subtraction.
The degree of background subtraction is often
determined manually and “flat” is rarely quantified.
After all background subtractions, the amide I region is
often isolated for analysis by truncating the spectrum
at 1600 and 1700 cm⁻¹, followed by the subtraction of a
linear baseline to zero the ends of the spectrum. When
examining the amide I and II regions together, end
points of 1480 and 1700 cm⁻¹ are typically used, while
the amide III region is often bounded at 1200 and 1300
cm⁻¹ (39). In this subjective approach, an early error in
`sequential background and baseline subtractions will
`be carried through to the band fitting or reference
`spectra routine and will produce potentially erroneous
`results. Additionally, choosing arbitrary end points for
`a baseline subtraction ignores any contributions from
`adjacent vibrational modes that tail into the amide
`regions and vice versa.
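
The conventional truncate-and-zero step criticized above can be sketched as follows. This is only an illustration of the earlier sequential practice, not of the method developed in this paper; the function name `endpoint_baseline` and the toy spectrum are hypothetical.

```python
def endpoint_baseline(freqs, intensities, lo=1600.0, hi=1700.0):
    """Truncate a spectrum to [lo, hi] and subtract the straight line
    joining its two end points, so that both ends are zeroed."""
    band = [(f, y) for f, y in zip(freqs, intensities) if lo <= f <= hi]
    if len(band) < 2:
        raise ValueError("need at least two points inside the band")
    band.sort()
    (f0, y0), (f1, y1) = band[0], band[-1]
    slope = (y1 - y0) / (f1 - f0)
    return [(f, y - (y0 + slope * (f - f0))) for f, y in band]


# Toy data: a triangular "band" sitting on a sloping background.
freqs = [1580.0 + 10.0 * k for k in range(15)]            # 1580 ... 1720 cm^-1
peak = [max(0.0, 1.0 - abs(f - 1650.0) / 40.0) for f in freqs]
spectrum = [p + 0.002 * f for p, f in zip(peak, freqs)]
corrected = endpoint_baseline(freqs, spectrum)
```

Any intensity that a neighbouring vibrational mode contributes at the chosen end points is removed together with the line, which is the source of bias the paragraph describes.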
`No current algorithm for protein secondary structure
`estimation from infrared spectra accounts for the impact
`of solutes on background solvent spectra or the possibility
`that different secondary structure motifs may absorb
`with varying extinction coefficients. As demonstrated via
`Raman spectroscopy, the O–H bending and stretching
`vibrations of water undergo significant changes in the
`presence of proteins and other solutes (37, 40, 41). In-
`creasing evidence also supports the idea that different
`molar extinction coefficients exist for the various struc-
`ture types contributing to the protein amide vibrations
`(33, 42, 43). Accurate subtraction of background solvent
`and assignment of the proper weights to the amide band
`components are critical for obtaining reliable secondary
`structure estimates, especially in cases involving low pro-
`tein concentrations.
`Another major discrepancy in current protein struc-
`ture estimation algorithms concerns the paradox seem-
`ingly generated when normalizing spectra. It is com-
`mon practice during analysis to normalize a spectrum
`after all background subtractions have been performed
`and a particular amide band has been isolated. How-
`ever, to accurately account for all the overlapping re-
`gions between peaks that correlate with protein struc-
`ture and those that do not, the amide region should be
normalized before subtraction. In addition, possible
variations in secondary structure extinction coefficients
imply that the areas of the amide bands also
depend on the overall protein secondary structure content.
This enigma can be resolved by performing the
subtractions, normalization, and deconvolution of the
amide band of interest simultaneously.

TABLE 1
List of Proteins Used for FTIR Spectroscopic Studies

Abbreviation  Protein                    Source                  Cat. No.  Lot No.   PDB fileᵇ
ALA           α-Lactalbumin              Bovine milk             L-5385    92H7015   1hfx
BGH           Bovine growth hormone      E. coli (recombinant)             M901-004  1bst
BLB           β-Lactoglobulin            Bovine milk             L-8005    13H7150   1beb
CAL           Conalbumin                 Chicken egg white       C-0755    116H7035  1aiv
CAN           Carbonic anhydrase         Bovine erythrocytes     C-3934    47H1358   2cba
CHYᵃ          α-Chymotrypsin             Bovine pancreas         C-7762    27H7010   5cha
CONᵃ          Concanavalin A             Canavalia ensiformis    C-7275    118F7160  1apn
CYT           Cytochrome c               Horse heart             C-7752    25H7045   1hrc
HSA           Human serum albumin        Human serum             A-9511    24H9314   1bj5
LYSᵃ          Lysozyme                   Chicken egg white       L-6876    65H7025   1azf
MYOᵃ          Myoglobin                  Sperm whale             M-7527    17H6660   104m
PAPᵃ          Papain                     Papaya latex            P-4762    107H7015  1ppn
PEP           Pepsin                     Porcine stomach mucosa  P-6887    120H8095  4pep
RNAᵃ          RNase                      Bovine pancreas         R-5500    86H7046   3rn3
SUBᵃ          Subtilisin BPN′            Bacillus licheniformis  101129    69618     2st1
TPIᵃ          Triosephosphate isomerase  Rabbit muscle           T-6258    96H9554   1ag1

ᵃ Included in the basis set for generation of the reference spectra.
ᵇ Protein Data Bank file listing. URL: http://www.rcsb.org/pdb/.

`In this paper, we describe a holistic reference spectra
`calculation technique for the generation of idealized ref-
`erence infrared spectra in the amide I and amide III
`regions, followed by a procedure for the estimation of
`protein secondary structure for unknown samples. Our
`prediction technique did not make use of the amide II
`region because this vibrational mode has been shown to
be less sensitive to variations in protein secondary structure
content (39). In the calculation of the reference spectra,
all subtractions, normalization, and amide band
deconvolution steps are performed simultaneously,
following the method Sane and co-workers (37) developed
for Raman spectral deconvolution. All non-structure-related
vibrational peaks are fit using equally weighted
`Gaussian–Lorentzian product functions; peaks correlat-
`ing with protein secondary structure are allowed to have
`different molar extinctions. This method places no re-
`strictions on the frequency ranges analyzed: overlaps be-
`tween non-structure- and structure-associated peaks are
`accounted for since all components are fit simulta-
`neously. The introduction of a protein-dependent effec-
`tive concentration variable solved the normalization
problem. The calculation of reference spectra involved
multivariate nonlinear least-squares minimization,
which was implemented in Matlab 5.0 (Mathworks Inc.,
Natick, MA). The idealized reference spectra were optimized
for internal consistency via a bootstrapping algorithm.
FTIR spectra of proteins outside the basis protein
set were then analyzed to validate the secondary structure
estimation algorithm. Results presented here for
`calculated structural reference spectra compare well with
`those in the literature and provide good secondary struc-
`ture estimates for proteins.
`
`MATERIALS AND METHODS
`Materials
`The proteins in and outside the reference set were
`chosen to cover a broad range of secondary structure
`motifs; a list of the proteins studied is given in Table 1.
`The protein’s secondary structure assignment is de-
`pendent on the choice of assignment algorithm (44, 45).
`In this report, all secondary structure assignments
`were made using the STRIDE algorithm of Frishman
`and Argos (45). The use of a single assignment algo-
`rithm eliminates the discrepancies that ensue from the
`application of dissimilar criteria and algorithms to
`crystallographic data (46). The STRIDE secondary
`structure assignments of the proteins analyzed in this
`report, both within and outside the reference set, are
shown on a triangular diagram in Fig. 1. We have
assigned STRIDE-identified 3₁₀ helices as well as α-helices
of three or fewer contiguous residues as unordered
helices in this work. The Protein Data Bank files used
`to generate the STRIDE estimates are listed in Table 1.
`All the proteins studied exhibit significant ordered sec-
`ondary structure content in their native states.
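
The helix reassignment rule stated above can be sketched on a STRIDE one-letter string. Only the two helix rules come from the text; the handling of the remaining STRIDE codes (E, B, b as sheet; T as turn; everything else as random), the function names, and the example string are assumptions made for illustration.

```python
def classify_stride(codes, max_short_helix=3):
    """Map a STRIDE one-letter string to five structure classes.

    From the text:  G (3-10 helix) and alpha-helical runs (H) of
                    max_short_helix or fewer residues -> 'Hu'.
    Assumed here:   longer H runs -> 'Ho'; E, B, b -> 'S'; T -> 'T';
                    any other code -> 'R'.
    """
    classes = []
    i = 0
    while i < len(codes):
        j = i
        while j < len(codes) and codes[j] == codes[i]:
            j += 1                      # j is one past the end of this run
        run, code = j - i, codes[i]
        if code == "H":
            label = "Ho" if run > max_short_helix else "Hu"
        elif code == "G":
            label = "Hu"
        elif code in "EBb":
            label = "S"
        elif code == "T":
            label = "T"
        else:
            label = "R"
        classes.extend([label] * run)
        i = j
    return classes


def class_fractions(codes):
    """Fraction of residues in each class for one protein."""
    labels = classify_stride(codes)
    return {k: labels.count(k) / len(labels) for k in ("Ho", "Hu", "S", "T", "R")}


example = "CCHHHHHHTTGGGCEEEEECHHHC"   # made-up 24-residue assignment
fractions = class_fractions(example)
```

Fractions of this kind, computed per protein, are what populate the structure class fraction matrices used later in the data analysis.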
Subtilisin BPN′ was purchased from ICN Biomedicals
Inc. (Irvine, CA). Bovine growth hormone was a gift from
Monsanto (St. Louis, MO). All other proteins, see Table 1
for abbreviations used throughout this work, and reagents
for buffers were purchased from Sigma Chemical
Co. (St. Louis, MO). The final buffer conditions used for
all protein solutions are listed in Table 2. Several proteins
required processing to remove additives. Lysozyme
and papain were dissolved into their respective buffers
and then dialyzed with Spectra/Por Biotech 500 MWCO
cellulose ester membranes (Cat. No. 08-750-1A), purchased
from Fisher Scientific Inc. (Pittsburgh, PA), to
remove sodium acetate; dialyses were conducted against
500-mL reservoirs of final buffer solutions for 12 h, with
three intermediate reservoir changes. In addition, solids
remaining in the lysozyme solution were sedimented in an
Eppendorf 5415C microcentrifuge (Brinkmann Instruments,
Westbury, NY) at 14,000 rpm for 15 min and the
supernatant was pipetted off for study. Triosephosphate
isomerase was dialyzed, as described above, to remove
borate and EDTA. Myoglobin was obtained in liquid form
at a concentration of 4.8 mg/mL. All other proteins were
dissolved directly into the corresponding buffers listed in
Table 2. α-Chymotrypsin was centrifuged, as above, to
remove residual solids. After dissolution, the concanavalin A
protein solution remained slightly cloudy; however,
centrifugation precipitated the protein and thus the turbid
solution was used for analysis. In addition, myoglobin
and papain were concentrated in a Beckman Instruments,
Inc. (Palo Alto, CA), TJ-6 centrifuge at 3000 rpm
to a final volume of 250 µL with Centricon-3, 3000
MWCO, centrifugal membrane concentrators from Amicon,
Inc. (Beverly, MA). Prior to protein dissolution, all
buffers were filtered through syringe filters with 0.45-µm
nylon membranes to remove dust and undissolved salts.
The proteins included in the reference set are CHY, CON,
LYS, MYO, PAP, RNA, SUB, and TPI.

FIG. 1. Secondary structure assignments for proteins analyzed in
this work: 1, ALA; 2, BGH; 3, BLB; 4, CAL; 5, CAN; 6, CHY; 7, CON; 8,
CYT; 9, HSA; 10, LYS; 11, MYO; 12, PAP; 13, PEP; 14, RNA; 15, SUB;
16, TPI. All structure assignments were based on the STRIDE algorithm
of Frishman and Argos (45). Symbols: S, total sheet; Ho, ordered
helix; T + R + Hu, turn + random coil + unordered helix.
`
`FTIR Spectroscopy
`All protein spectra were recorded in H2O solution. All
`spectra were collected with a Nicolet Magna 550 Series II
`FTIR spectrometer (Madison, WI) with a horizontal ATR
`accessory from SpectraTech, Inc. (Shelton, CT). The ATR
accessory used a trapezoidal germanium crystal (7.0 ×
1.0 cm; length × width), with ends cut to 45° generating
12 internal reflections, that was mounted into a
sample-boat/trough. The spectrometer was equipped with a
liquid nitrogen-cooled mercury cadmium telluride detector.
To reduce the contributions of water vapor and carbon
dioxide, the IR system was continuously purged with air
from a Balston, Inc. (Haverhill, MA) 75-45 FTIR Purge
Gas Generator at 30 standard cubic feet per minute and
supplemented with nitrogen gas from the vent of a liquid
nitrogen tank. To obtain protein solution and corresponding
buffer background spectra, approximately 250 µL of
each solution was spread evenly to completely cover the
germanium crystal. The crystal was then sealed with
parafilm to minimize evaporation during acquisition.
Protein concentrations above 20 mg/mL ensured that less
than 2% of the FTIR signal derived from molecules adsorbed
to the germanium crystal, assuming a worst case
scenario of monolayer coverage attained by random sequential
adsorption with a jamming limit of 55%. All
ATR-corrected spectra were collected in the 1000 to 4000
cm⁻¹ range as sets of 2048 time-averaged, double-sided
interferograms with Happ–Genzel apodization. Spectral
resolution was set at 2 cm⁻¹ and a gain of 8 and an
aperture of 38 were used. After each experiment, the
exposed surface of the germanium crystal was cleaned
via a five-step process: (1) rinsing with DI water, (2)
soaking in a 1% (w/w) SDS solution for 10 min, (3) rinsing
thoroughly with DI water, (4) rinsing thoroughly with a
50% (w/w) aqueous ethanol solution, and (5) drying with
compressed air filtered through cotton to remove oils and
particulates. Amide I band signal-to-noise (S/N) ratios
varied from 877 to 161, whereas amide III band S/N
ratios varied from 166 to 12, as shown in Table 2. Amide
band S/N ratios were calculated as 2.5 times the maximum
intensity of the background-subtracted band divided
by 3 times the standard deviation of the intensity
between 1850 and 2200 cm⁻¹.

TABLE 2
Protein Solution Conditions and Spectral Quality

                                                            Protein conc.  S/N             S/N
Protein  Buffer                                             (mg/mL)        (amide I band)  (amide III band)
ALA      10 mM sodium phosphate, pH 6.0, with 100 mM NaCl   28             209             22
BGH      DI with a trace of HCl, pH 3.8ᵃ                    8              178             33
BLB      50 mM sodium phosphate, pH 7.0                     40             161             65
CAL      100 mM NaCl, pH 6.0                                20             167             103
CAN      DI water                                           24             669             74
CHY      DI with a trace of HCl, pH 3.8ᵃ                    18             184             42
CON      DI with a trace of HCl, pH 3.8ᵃ                    21             439             38
CYT      25 mM sodium phosphate, pH 6.0, with 100 mM NaCl   15             186             35
HSA      25 mM sodium phosphate, pH 7.0, with 100 mM NaCl   28             225             167
LYS      DI water                                           32             877             73
MYO      20 mM Tris–HCl, pH 8.0                             18             172             25
PAP      DI with a trace of HCl, pH 3.8ᵃ                    27             195             27
PEP      25 mM sodium phosphate, pH 7.0, with 100 mM NaCl   22             172             160
RNA      DI with a trace of HCl, pH 3.8ᵃ                    18             402             52
SUB      25 mM sodium phosphate, pH 6.0, with 100 mM NaCl   19             389             48
TPI      DI water                                           18             211             13

ᵃ Trace is defined as approximately 50 to 100 µL of 2 M HCl in 1 liter DI water.
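
The S/N definition used for the amide bands can be written out as a short sketch. The text does not say whether a sample or population standard deviation was used; the population form (`statistics.pstdev`) is assumed here, and the function name and toy numbers are hypothetical.

```python
import statistics


def amide_band_snr(band_intensities, freqs, intensities,
                   noise_lo=1850.0, noise_hi=2200.0):
    """S/N = 2.5 * max(band) / (3 * std of the intensity in the noise window).

    band_intensities: background-subtracted amide band intensities.
    freqs, intensities: the spectrum from which the noise window is taken.
    """
    noise = [y for f, y in zip(freqs, intensities) if noise_lo <= f <= noise_hi]
    sigma = statistics.pstdev(noise)     # population std: an assumption
    return 2.5 * max(band_intensities) / (3.0 * sigma)


# Toy numbers: alternating +/-0.001 noise and a band maximum of 0.6.
noise_freqs = [1850.0 + 50.0 * k for k in range(8)]       # 1850 ... 2200 cm^-1
noise_vals = [0.001 if k % 2 == 0 else -0.001 for k in range(8)]
snr = amide_band_snr([0.2, 0.6, 0.3], noise_freqs, noise_vals)
```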
`
`Data Analysis
Mathematical representation of protein FTIR spectra.
In addition to the secondary structure-sensitive
amide I and amide III bands, there are several other
vibrational modes active in the spectral region of interest,
including the amide II band. Protein solution FTIR spectra
also contain background contributions from buffer
and water vapor. In addition, spectra may have a contribution
from a sloping baseline. By assuming that the
contributions of all underlying spectral components are
additive, invoking the principle of superposition, any set
of spectra from p proteins (p > 1) can be represented in
matrix form as

$$
I^{\mathrm{calc}}_{z \times p} = \nu_{z \times 1}\, a_{1 \times p} + 1_{z \times 1}\, b_{1 \times p} + B_{z \times m} A_{m \times p} + N_{z \times n} D_{n \times p} + \left[ S^{I}_{z \times q} E^{I}_{q \times r} F^{I}_{r \times p} + S^{III}_{z \times s} E^{III}_{s \times t} F^{III}_{t \times p} \right] C^{\mathrm{eff}}_{p \times p}, \qquad [1]
$$

where $I^{\mathrm{calc}}_{z \times p}$ is the calculated spectral intensity for p
proteins at z frequencies. All subscripts in Eq. [1] correspond
to the dimensions (rows × columns) of the
associated matrices, each of which will be elaborated
upon below.
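
To make the dimensions in Eq. [1] concrete, the superposition can be assembled with plain nested lists. This is a minimal sketch, not the authors' Matlab code; the helper names and all numerical values are made up, with z = 3 frequencies, p = 2 proteins, m = n = 1, q = r = 2, and s = t = 1.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matadd(*Ms):
    """Element-wise sum of equally sized matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*Ms)]


def calc_spectra(nu, a, b, B, A, N, D, S1, E1, F1, S3, E3, F3, C_eff):
    """Evaluate the right-hand side of Eq. [1] term by term."""
    z = len(nu)
    baseline = matadd(matmul([[v] for v in nu], [a]),      # nu (z x 1) times a (1 x p)
                      matmul([[1.0]] * z, [b]))            # ones (z x 1) times b (1 x p)
    background = matmul(B, A)                              # z x p
    nonstructure = matmul(N, D)                            # z x p
    amide = matmul(matadd(matmul(matmul(S1, E1), F1),      # amide I
                          matmul(matmul(S3, E3), F3)),     # amide III
                   C_eff)                                  # scale by effective conc.
    return matadd(baseline, background, nonstructure, amide)


I_calc = calc_spectra(
    nu=[1600.0, 1650.0, 1700.0], a=[0.0, 0.0], b=[0.1, 0.2],
    B=[[1.0], [1.0], [1.0]], A=[[0.5, 0.25]],
    N=[[0.0], [0.0], [0.0]], D=[[1.0, 1.0]],
    S1=[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], E1=[[2.0, 0.0], [0.0, 1.0]],
    F1=[[0.5, 1.0], [0.5, 0.0]],
    S3=[[0.0], [0.0], [1.0]], E3=[[1.0]], F3=[[1.0, 1.0]],
    C_eff=[[1.0, 0.0], [0.0, 2.0]],
)
```

Each column of `I_calc` is one calculated protein spectrum; the columns of `F1` and `F3` hold that protein's class fractions and each sums to one.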
The first two terms on the right-hand side (R.H.S.) of
Eq. [1] describe a linear baseline for the spectral range
of interest, 1000 to 2200 cm⁻¹, during the optimization
routine. Here $\nu_{z \times 1}$ and $1_{z \times 1}$ are vectors of length z
containing frequencies and ones, respectively. The
baseline slope and intercept for each protein spectrum
are compiled in the vectors $a_{1 \times p}$ and $b_{1 \times p}$, respectively.
Background contributions from buffer (or solvent;
m = 1), water vapor (m = 2), and, where necessary,
an underlying surface (m = 3), are accounted for in the
third term on the R.H.S. of Eq. [1]. The matrix $B_{z \times m}$,
representing m independently measured background
spectra recorded at z frequencies, is multiplied by the
matrix of background signal magnitudes (or amplitudes),
$A_{m \times p}$, containing the respective background
contributions to each protein spectrum.
`The fourth term on the R.H.S. of Eq. [1] accounts for
`the vibrational peaks in the frequency range analyzed
`that are not correlated with protein secondary struc-
`ture, here on designated as nonstructure peaks. These
`peaks embody vibrations associated with amino acid
`side chains and the amide II band. We have not in-
`cluded individual side-chain resonances that contrib-
`ute intensity in the amide I and III bands (47). These
`resonances typically account for 5 to 15% of the signal
`intensity in the amide I region (43), but are highly
`variable in position from protein to protein (33).
Each individual peak i is expressed as a Gaussian–
Lorentzian (GL) product function

$$
GL_i = \left[ \frac{2\,\mathrm{pw}_i}{\pi \left( \mathrm{pw}_i^{2} + 4(\bar{\nu}_i - \nu)^{2} \right)} \right]^{Y} \times \left\{ \frac{2\sqrt{\ln(2)}}{\sqrt{\pi}\,\mathrm{pw}_i} \exp\left[ \frac{-4\ln(2)\,(\bar{\nu}_i - \nu)^{2}}{\mathrm{pw}_i^{2}} \right] \right\}^{(1-Y)}, \qquad [2]
$$

each of which has an associated mean frequency position,
$\bar{\nu}_i$, and peak width at half-height, $\mathrm{pw}_i$. Equation
[2] is used to generate n nonstructure peaks at z frequencies
across the whole spectral range, forming the
matrix $N_{z \times n}$. The matrix $D_{n \times p}$ contains the nonstructure
peak magnitudes (or amplitudes) for each corresponding
protein in the reference set. In our formulation,
the number, associated mean peak positions, and
peak widths of nonstructure GL peaks are identical for
each protein (i.e., protein independent); however, the
amplitudes corresponding to the contribution of each
nonstructure peak to an individual protein spectrum
vary from protein to protein. The exponent Y in Eq. [2]
is a weighting factor that determines the relative
Gaussian–Lorentzian character of the nonstructure
peaks. Following the results of Sane and co-workers
(37), Y was set equal to 0.5 and used for all nonstructure
peaks throughout this work, although other values
have been used successfully (48).
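
A single GL peak can be evaluated with a few lines of code. The sketch assumes that Eq. [2] is the product of an area-normalized Lorentzian raised to the power Y and an area-normalized Gaussian raised to the power 1 − Y, both written in terms of the full width at half-height; the function name `gl_product` is hypothetical.

```python
import math


def gl_product(nu, center, width, y=0.5):
    """Gaussian-Lorentzian product peak (assumed form of Eq. [2]).

    nu: frequency at which to evaluate; center: mean peak position;
    width: full width at half-height; y: Lorentzian weighting exponent.
    """
    d2 = (center - nu) ** 2
    lorentz = 2.0 * width / (math.pi * (width ** 2 + 4.0 * d2))
    gauss = (2.0 * math.sqrt(math.log(2.0)) / (math.sqrt(math.pi) * width)
             * math.exp(-4.0 * math.log(2.0) * d2 / width ** 2))
    return lorentz ** y * gauss ** (1.0 - y)


peak = gl_product(1650.0, 1650.0, 20.0)   # value at the band centre
half = gl_product(1660.0, 1650.0, 20.0)   # value half a width away
```

Under this form both factors fall to half of their maximum at a distance of width/2 from the centre, so the product keeps the same width at half-height for any Y between 0 and 1.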
The final term on the R.H.S. of Eq. [1] represents the
amide I and amide III band contributions to the calculated
spectral intensities. The columns of the matrices
$S^{I}_{z \times q}$ and $S^{III}_{z \times s}$ contain Gaussian–Lorentzian peaks,
again generated by Eq. [2], with q and s peaks correlated
with protein structure in the amide I and III
regions, respectively. The molar extinction coefficients
for the various amide I and amide III structure classes
are contained in the columns of matrices $E^{I}_{q \times r}$ and
$E^{III}_{s \times t}$, respectively. Multiple GL peaks in the matrices
$S^{I}_{z \times q}$ and $S^{III}_{z \times s}$ (i.e., multiple columns) may represent
a single structure class. As a result the matrices
$E^{I}_{q \times r}$ and $E^{III}_{s \times t}$ are block diagonal. The different
amide I and III structure class percentages (or fractions)
for each protein represented by Eq. [1] are contained
in the matrices $F^{I}_{r \times p}$ and $F^{III}_{t \times p}$. Finally, the
effective protein concentration matrix, $C^{\mathrm{eff}}_{p \times p}$, is diagonal,
with one nonzero element for each protein.
As with the nonstructure peaks,
the number of peaks as well as the mean position and
peak widths of each structure-related GL component
peak are identical for each protein. The molar extinction
coefficients for each structure class component peak
are protein independent as well.
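
The block structure of the extinction coefficient matrices can be illustrated as follows. In this sketch each row of E corresponds to one GL peak and carries a single coefficient in the column of the class that peak belongs to; the peak-to-class assignment, the coefficient values, and the helper names are all hypothetical.

```python
def extinction_matrix(peak_classes, coefficients, n_classes):
    """Build E (q x r): row i holds coefficients[i] in column peak_classes[i]."""
    E = [[0.0] * n_classes for _ in peak_classes]
    for i, (cls, eps) in enumerate(zip(peak_classes, coefficients)):
        E[i][cls] = eps
    return E


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


# Five peaks grouped into four classes; the first class owns two peaks.
E = extinction_matrix([0, 0, 1, 2, 3], [1.2, 0.9, 1.0, 0.7, 0.8], 4)

# S holds the five GL peaks evaluated at two frequencies (made-up values).
S = [[1.0, 0.5, 0.0, 0.0, 0.0],
     [0.2, 1.0, 0.3, 0.0, 0.0]]
class_spectra = matmul(S, E)      # 2 frequencies x 4 classes
```

With the peaks ordered by class, the nonzero entries of E form contiguous blocks, and each column of the product S·E can be read as the weighted sum of the peaks belonging to one class.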
Reference spectra generation. In this work, four
structure classes were associated with the amide I and
III bands. We performed an SVD analysis on the isolated
amide I and III bands using the eight proteins in the
reference set. Our analysis of the singular values suggested
that we could reliably extract between three and
five different linearly independent pieces of information
or secondary structure classes from the amide
band spectra. Based on an eigenvector analysis of the
isolated amide band matrix we restricted our structure
classes to four. We decomposed the amide I band into
ordered helix (Ho), unordered helix and random (Hu +
R), sheet (S), and turn (T) classes (r = 4). The amide
III band was decomposed into helix (H), sheet (S), turn
(T), and random (R) classes (t = 4). Differing classes in
`the amide I and III regions reflect the differing over-
`laps between underlying component peaks in the two
`regions. Treating these regions separately also aids in
`determining the goodness of fit to the two regions in-
`dependently.
`Each of the proteins outside the reference set was
`added to the reference set, one at a time, to determine
`whether we could confidently deconvolute more than
`four structure classes. SVD analysis suggested that
`augmentation of the reference set does not increase the
`information content of the isolated amide I and III
`band spectra. The size of our reference protein set is
`small. However, adding more proteins to the reference
`
`set degraded the condition number of the matrix of
`isolated amide bands as the set of spectra become in-
`creasingly linearly dependent. Sarver and Krueger (17)
`also parsed secondary structure elements into four
`classes based on an SVD analysis of the amide I bands
`of 17 proteins in aqueous solution; this is consistent
`with our analysis. The number of proteins included in
`the calculation of structural reference spectra is not as
`important as the structure content space that set of
`proteins spans.
For a given basis set of p proteins, the known variables
are the spectral intensities ($I^{\mathrm{meas}}_{z \times p}$) measured at z
frequencies ($\nu_{z \times 1}$), the corresponding background spectra
contributions ($B_{z \times m}$), and the various structure
class fractions ($F^{I}_{r \times p}$ and $F^{III}_{t \times p}$) for each protein. Because
`the background and nonstructure peak subtractions as
`well as the amide region fits are performed simulta-
`neously, it is impossible to calculate the area under the
`amide bands a priori. In addition, the amide band area
`is also a function of the relative content of different
`classes of secondary structure since we permit the mo-
`lar extinction coefficients of each structure class to
`vary. To circumvent this problem, the areas under the
`amide I and III bands for each protein spectrum are
`normalized by the effective concentration parameter.
Given the known variables outlined above, the unknown
variables can be used as fitting parameters to fit
Eq. [1] to the set of measured solution spectra of the
basis set proteins. The fitted parameters include the
following: all mean peak positions, $\bar{\nu}_i$, and peak widths,
$\mathrm{pw}_i$; the linear baseline parameters, $a_{1 \times p}$ and $b_{1 \times p}$; the
background and nonstructure peak amplitudes, $A_{m \times p}$
and $D_{n \times p}$; the molar extinction coefficients, $E^{I}_{q \times r}$
and $E^{III}_{s \times t}$; and the effective protein concentrations,
$C^{\mathrm{eff}}_{p \times p}$. The objective function for optimizing all the
fitted parameters is based on the sum of squared differences
between the calculated and measured total spectral intensities:

$$
\mathrm{objective} = \mathrm{minimize}\ \left\| I^{\mathrm{calc}}_{z \times p} - I^{\mathrm{meas}}_{z \times p} \right\|^{2}. \qquad [3]
$$
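
Read as a sum of squared differences over all frequencies and proteins, the objective of Eq. [3] amounts to the following; the function name and the small matrices are illustrative only.

```python
def objective(I_calc, I_meas):
    """Sum of squared differences between calculated and measured
    intensities, taken over every frequency (row) and protein (column)."""
    return sum((c - m) ** 2
               for row_c, row_m in zip(I_calc, I_meas)
               for c, m in zip(row_c, row_m))


value = objective([[1.0, 2.0], [3.0, 4.0]],
                  [[1.0, 1.0], [1.0, 4.0]])
```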
`
`The computer code for the optimization routine was
`written in a format suitable for the Matlab 5.0 (Math-
`works Inc., Natick, MA) software package.
`The method used by Sane and co-workers (37) to
`separate the linear and nonlinear unknown parame-
`ters during the optimization is a unique feature of this
`algorithm, reducing the problem dimensionality in
`nonlinear space thus leading to significantly faster con-
`vergence. The format of the algorithm is depicted in the
`flowchart in Fig. 2. The calculated spectral intensities
are linearly related to the baseline parameters, $a_{1 \times p}$
and $b_{1 \times p}$, background and nonstructure peak amplitudes,
$A_{m \times p}$ and $D_{n \times p}$, and effective protein concentrations
$C^{\mathrm{eff}}_{p \times p}$. The spectra are nonlinear functions of
amide I and III mean peak positions, $\bar{\nu}_i$, and peak
`
`The “nnls” routine solves Eq. [4] subject to constraint that
`G is positive, semidefinite; all peak amplitudes and effec-
`tive protein concentrations must be greater than or equal
`to zero to be physically meaningful. However, the slope
`and intercept baseline parameters may be less than zero,
`which violates the “nnls” routine constraint. To resolve
`this difficulty, an arbitrary but known positive slope and
`intercept was added to each spectrum prior to each invo-
`cation of the “nnls” routine; this arbitrary linear back-
`ground addition was subsequently subtracted before con-
`tinuing with the next iteration of the “constr” routine. At
`each iteration step, “constr” updated guesses for the non-
`linear parameters by employing an analytical Jacobian of
`the objective function. Matlab continued the iterative
`procedure until the objective function, Eq. [3], reached a
`minimum, indicating that the best fit between the calcu-
`lated and measured FTIR protein solution spectra had
`been obtained.
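
The division of labor between the linear and nonlinear parameters can be illustrated on a toy one-peak problem. In this sketch a grid search over the peak centre stands in for the constrained nonlinear optimizer, and a closed-form non-negative amplitude for a single basis vector stands in for the “nnls” step; the Gaussian band shape, the function names, and the synthetic data are assumptions made for illustration and are not the authors' Matlab routine.

```python
import math


def gaussian(nu, center, width):
    """Unit-height Gaussian with full width at half-height `width`."""
    return math.exp(-4.0 * math.log(2.0) * (nu - center) ** 2 / width ** 2)


def best_amplitude(shape, data):
    """Non-negative least-squares amplitude for one basis vector."""
    num = sum(s * d for s, d in zip(shape, data))
    den = sum(s * s for s in shape)
    return max(0.0, num / den)


def separable_fit(freqs, data, centers, width):
    """Outer search over the nonlinear parameter (peak centre); the linear
    amplitude is solved exactly at every trial centre."""
    best = None
    for c in centers:
        shape = [gaussian(f, c, width) for f in freqs]
        amp = best_amplitude(shape, data)
        sse = sum((amp * s - d) ** 2 for s, d in zip(shape, data))
        if best is None or sse < best[0]:
            best = (sse, c, amp)
    return best


freqs = [1600.0 + 2.0 * k for k in range(51)]              # 1600 ... 1700 cm^-1
data = [0.8 * gaussian(f, 1654.0, 20.0) for f in freqs]    # synthetic band
trial_centers = [1640.0 + 2.0 * k for k in range(16)]      # 1640 ... 1670 cm^-1
sse, center, amplitude = separable_fit(freqs, data, trial_centers, 20.0)
```

Because the amplitude never enters the outer search, the nonlinear problem here has one dimension instead of two, which is the kind of reduction in dimensionality the text credits for faster convergence.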
`There are several complications in obtaining ideal-
`ized reference spectra that can be dealt with by the
`method of Sane and co-workers (37) in a rather unique
way. For a basis set of p proteins with m background
signals, n nonstructure peaks, and q + s structure
peaks, the total number of unknowns in Eq. [1] is quite
substantial. The number of unknown linear parameters
is (3 + m + n)p and the number of unknown
nonlinear parameters is 2n + 3(q + s) − 1. Because
`the problem of developing a best fit to the measured
`spectra is nonlinear, multiple solutions can potentially
`arise. Finally, it is not possible to know a priori how
`many nonstructure and structure peaks are necessary
`to describe protein solution spectra.
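
The two parameter counts quoted above are easy to tabulate for a given model size. The function name is hypothetical, and apart from p = 8 (the size of the reference set) the example values of m, n, q, and s are made up, since the text notes that the required numbers of peaks are not known a priori.

```python
def parameter_counts(p, m, n, q, s):
    """Counts quoted in the text for a basis set of p proteins with
    m backgrounds, n nonstructure peaks, and q + s structure peaks."""
    linear = (3 + m + n) * p
    nonlinear = 2 * n + 3 * (q + s) - 1
    return linear, nonlinear


# p = 8 reference proteins; m, n, q, s chosen only for illustration.
linear, nonlinear = parameter_counts(p=8, m=2, n=10, q=6, s=6)
# linear == 120, nonlinear == 55
```

One plausible reading of the linear count is a slope, an intercept, and an effective concentration per protein, plus m background and n nonstructure amplitudes per protein.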
`To circumvent these problems, two separate meth-
`ods were utilized in a bootstrapping fashion to gener-
`ate two sets of idealized reference spectra. The first
`method is to generate reference spectra directly from
`the structure-related Gaussian–Lorentzian product
functions fit to each protein solution spectrum via the
`solution to Eq. [3]. Once all the unknowns have been
`fit, reference spectra for the amide I and III re