`
`Review
`
`www.elsevier.com/locate/jhep
`
`Methodology of superiority vs. equivalence trials
`and non-inferiority trials
`
`Erik Christensen*
`
`Clinic of Internal Medicine I, Bispebjerg University Hospital, Bispebjerg Bakke 23, DK-2400 Copenhagen NV, Copenhagen, Denmark
`
`The randomized clinical trial (RCT) is generally accepted as the best method of comparing effects of therapies. Most
`often the aim of an RCT is to show that a new therapy is superior to an established therapy or placebo, i.e. they are planned
`and performed as superiority trials. Sometimes the aim of an RCT is just to show that a new therapy is not superior but
`equivalent to or not inferior to an established therapy, i.e. they are planned and performed as equivalence trials or non-
`inferiority trials. Since the types of trials have different aims, they differ significantly in various methodological aspects.
`The awareness of the methodological differences is generally quite limited. This paper reviews the methodology of these
`types of trials with special reference to differences in respect to planning, performance, analysis and reporting of the trial.
`In this context the relevant basal statistical concepts are reviewed. Some of the important points are illustrated by
`examples.
© 2007 European Association for the Study of the Liver. Published by Elsevier B.V. All rights reserved.
`
`1. Introduction
`
`The randomized clinical trial (RCT) is generally
`accepted as the best method of comparing effects of ther-
`apies [1,2]. Most often the aim of an RCT is to show
`that a new therapy is superior to an established therapy
`or placebo, i.e. they are planned and performed as supe-
`riority trials. Sometimes the aim of an RCT is just to
`show that a new therapy is not superior but equivalent
`to or not inferior to an established therapy, i.e. they
`are planned and performed as equivalence trials or
`non-inferiority trials [3]. Since these types of trials have
`different aims, they differ significantly in various meth-
`odological aspects [4]. The awareness of the methodo-
`logical differences
`is generally quite limited. For
`example it is a rather common belief that failure of find-
`ing a significant difference between therapies in a superi-
`ority trial implies that the therapies have the same effect
or are equivalent [5–10]. However, such a conclusion is
not correct because of a considerable risk of overlooking
a clinically relevant effect due to insufficient sample size.

* Tel.: +45 3531 2854; fax: +45 3531 3556.
E-mail address: ec05@bbh.hosp.dk

`The purpose of this paper is to review the method-
`ology of the different types of trials, with special refer-
`ence to differences in respect to planning, performance,
`analysis and reporting of the trial. In this context the
`relevant basal statistical concepts will be reviewed.
`Some of the important points will be illustrated by
`examples.
`
`2. Superiority trials
`
`2.1. Sample size estimation and power of an RCT
`
`An important aspect in the planning of any RCT is to
`estimate the number of patients necessary i.e. the sample
`size. The various types of trials differ in this respect
`[1,2,11]. A superiority trial aims to demonstrate the
`superiority of a new therapy compared to an established
`therapy or placebo. The following description applies to
`a superiority trial. The features, by which an equivalence
`or a non-inferiority trial differ, will be described later.
`
0168-8278/$32.00 © 2007 European Association for the Study of the Liver. Published by Elsevier B.V. All rights reserved.
`doi:10.1016/j.jhep.2007.02.015
`
`Mylan Exhibit 1059
`Mylan v. Regeneron, IPR2022-01226
`Page 1
`
`
`
`
`E. Christensen / Journal of Hepatology 46 (2007) 947–954
`
`To estimate the sample size one needs to consider some
`important aspects described in the following.
By how much should the new therapy be better than
the reference therapy? This extra effect of the new compared
to the reference therapy is called the Least Relevant
Difference or the Clinical Significance. It is often
denoted by the Greek letter Δ (Fig. 1).
By how much would the difference in effect between
the two groups be influenced by random factors? Like
any other biological measurement a treatment effect is
subject to a considerable "random" variation, which
needs to be determined and taken into account. The
magnitude of the variation is described in statistical
terms by the standard deviation S or the variance S²
(see Fig. 1c). The variance of the effect variable would
need to be obtained from a pilot study or from previously
published similar studies. The trial should demonstrate
as precisely as possible the true difference in
effect between the treatments. However, because of
the random variation the final result of the trial may
deviate from the true difference and give erroneous
results. If for example the null hypothesis H0 of no
difference were true, the trial could still in
some cases show a difference. This type of error,
the type 1 error ("false positive") (Fig. 1), would
have the consequence of introducing an ineffective
therapy. If on the other hand the alternative hypothesis
HΔ of the difference being Δ were true, the trial
could in some cases fail to show a difference. This type
of error, the type 2 error ("false negative") (Fig. 1),
would have the consequence of rejecting an effective
therapy.
Thus one needs to specify how large risks of type 1
and type 2 errors would be acceptable for the trial. Ideally
the type 1 and type 2 error risks should be near zero,
but this would need extremely large trials. Limited
resources and patient numbers make it necessary to
accept some small risk of type 1 and 2 errors.
Most often the type 1 error risk α would be specified
as 5%. In this paper, α means the type 1 error risk in one
direction, i.e. either up or down from H0, i.e. α = 5%.
However, in many situations one would be interested
in detecting both beneficial and harmful effects of the
new therapy compared to the control therapy, i.e. one
would be interested in "two-sided" testing for a difference
in both "upward" and "downward" direction
(Fig. 1). Hence we would instead specify the type 1 error
risk to be 2α (i.e. α_upwards + α_downwards), i.e. 2α = 5%.
The type 2 error risk β would normally be specified
as 10–20%. Since a given value of Δ is always either
above or below zero (H0), the type 2 error risk β is
always one-sided. The smaller β, the larger the complementary
probability 1 − β of accepting HΔ when
it is in fact true. 1 − β is called the power of the trial
because it states the probability of finding Δ if this
difference truly exists.
`
Fig. 1. Illustration of factors influencing the sample size of a trial. The
effect difference found in a trial will be subject to random variation. The
variation is illustrated by bell-shaped normal distribution curves for a
difference of zero corresponding to the null hypothesis (H0) and for a
difference of Δ corresponding to the alternative hypothesis (HΔ),
respectively. Defined areas under the curves indicate the probability of
a given difference being compatible with H0 or HΔ, respectively. If the
difference lies near H0, one would accept H0. The farther the difference
would be from H0, the less probable H0 would be. If the probability of H0
becomes very small (less than the specified type 1 error risk 2α, being α
in either tail of the curve) one would reject H0. The sample distribution
curves show some overlap. A large overlap will result in considerable risk
of interpretation error; in particular the type 2 error risk may be
substantial as indicated in the figure. An important issue would be to
reduce the type 2 error risk β (and increase the power 1 − β) to a
reasonable level. Three ways of doing that are shown in (b–d), (a) being the
reference situation. (b) Isolated increase of 2α will decrease β and
increase power. Conversely, isolated decrease of 2α will increase β and
decrease power. (c) Isolated narrowing of the sample distribution curves,
by increasing sample size 2N and/or decreasing variance of the difference
S², will decrease β and increase power. Conversely, isolated widening of
the sample distribution curves, by decreasing sample size and/or
increasing variance of the difference, will increase β and decrease power.
(d) Isolated increase of Δ (larger therapeutic effect) will decrease β and
increase power. Conversely, isolated decrease of Δ (smaller therapeutic
effect) will increase β and decrease power.
`
From given values of Δ, S², α and β the needed number
(N) of patients in each group can be estimated using
this relatively simple general formula:
`
`
$$N = (Z_{2\alpha} + Z_{\beta})^2 \times S^2 / \Delta^2,$$

where Z_2α and Z_β are the standardized normal deviates
corresponding to the defined values of 2α
(Table 1, left) and β (Table 1, right), respectively. If
for some reason one wants to test for difference in only
one direction ("one-sided" testing) one should replace
Z_2α with Z_α in the formula and apply the right side of
Table 1. The formula is approximate, but it gives in
most cases a good estimation of the necessary number
of patients. For a trial with two parallel groups of equal
size the total sample size will be 2N.
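The general formula can be tried out numerically. The following is a minimal sketch, not part of the original paper: the function names are hypothetical, and exact normal deviates from Python's standard library are used instead of the rounded values in Table 1, so results differ slightly from hand calculations. The second function solves the same formula for Z_β to obtain the power for a given N, as the paper does later in Example 1.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_upper(p):
    """Z-value whose upper-tail (one-sided) probability is p (right side of Table 1)."""
    return STD_NORMAL.inv_cdf(1.0 - p)


def sample_size_per_group(delta, variance, type1_risk=0.05, beta=0.20, two_sided=True):
    """Approximate N per group: (Z_alpha + Z_beta)^2 * S^2 / delta^2."""
    # For two-sided testing the type 1 risk (2*alpha) is split between both tails.
    tail = type1_risk / 2.0 if two_sided else type1_risk
    return (z_upper(tail) + z_upper(beta)) ** 2 * variance / delta ** 2


def power_for_n(n_per_group, delta, variance, type1_risk=0.05, two_sided=True):
    """Solve the same formula for Z_beta and return the power 1 - beta."""
    tail = type1_risk / 2.0 if two_sided else type1_risk
    z_beta = sqrt(n_per_group / variance) * delta - z_upper(tail)
    return STD_NORMAL.cdf(z_beta)


# Values taken from Example 1: p1 = 0.4, p2 = 0.6, hence S^2 = 0.48 and delta = 0.2.
n = sample_size_per_group(delta=0.2, variance=0.48)
power = power_for_n(60, delta=0.2, variance=0.48)
print(round(n, 1), round(power, 2))
```

With the Example 1 inputs this gives roughly 94 patients per group, and a power of about 0.61 when only 60 patients per group are available, in line with the hand calculation in that example.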
The values used for 2α, β and Δ should be decided by
the researcher, not by the statistician. The values chosen
should take into account the disease, its stage, the effectiveness
and side effects of the control therapy and an
estimate of how much extra effect may be reasonably
expected of the new therapy.
If for example the disease is rather benign with a relatively
good prognosis and the new therapy is more
expensive and may have more side effects than a rather
effective control therapy, one should specify a relatively
larger Δ and β and a smaller 2α, because the new therapy
would only be interesting if it is markedly better than the
control therapy.
If on the other hand the disease is aggressive, the new
therapy is less expensive or may have fewer side effects
than a not very effective control therapy, one should
specify a relatively smaller Δ and β and a larger 2α,
because the new therapy would be interesting even if it
is only slightly better than the control therapy.

As mentioned above 2α would normally be specified
as 5% or 0.05, but one may justify values of 0.10 or
0.01 in the situations just described. β would
normally be specified as 0.10–0.20, but in special situations
a higher or lower value may be justified. Δ should
be decided on clinical grounds as the least relevant therapeutic
gain of the new therapy considering the disease
and its prognosis, the efficacy of the control therapy
and what may reasonably be expected of the new therapy.
Preliminary data from pilot studies or historical
observational data can be guidelines for the choice of
Δ. Even if it may be tempting to specify a relatively large
Δ as fewer patients will then be needed, Δ should never
be specified larger than what is biologically reasonable.
It will always be unethical to perform trials with unrealistic
aims. Fig. 1 illustrates the effects on the type 2 error
risk β and hence also on the power (1 − β) of changing
2α, N, S² and Δ. Thus β will be decreased and the power
1 − β will be increased if 2α is increased (Fig. 1b), if the
sample size is increased (Fig. 1c), and if Δ is increased
(Fig. 1d).
`The estimated sample size should be increased in pro-
`portion to the expected loss of patients during follow-up
`due to drop-outs and withdrawals.
`
`2.2. The confidence interval
`
`An important concept indicating the confidence of
`the result obtained in an RCT is the width of the confi-
`dence interval of the difference D in effect between the
`therapies investigated [1,2]. The narrower the confidence
`
Table 1
Abbreviated table of the standardized normal distribution (adapted for this paper)

Two-sided probability    One-sided probability
Z_2α     2α              Z_α or Z_β   α or β      Z_α or Z_β   α or β
3.72     0.0002          3.72         0.0001       0.00        0.50
3.29     0.001           3.29         0.0005      −0.13        0.55
3.09     0.002           3.09         0.001       −0.25        0.60
2.58     0.01            2.58         0.005       −0.39        0.65
2.33     0.02            2.33         0.010       −0.52        0.70
1.96     0.05            1.96         0.025       −0.67        0.75
1.64     0.1             1.64         0.05        −0.84        0.80
1.28     0.2             1.28         0.10        −1.04        0.85
1.04     0.3             1.04         0.15        −1.28        0.90
0.84     0.4             0.84         0.20        −1.64        0.95
0.67     0.5             0.67         0.25        −1.96        0.975
0.52     0.6             0.52         0.30        −2.33        0.990
0.39     0.7             0.39         0.35        −2.58        0.995
0.25     0.8             0.25         0.40        −3.09        0.999
0.13     0.9             0.13         0.45        −3.29        0.9995
0.00     1.0             0.00         0.50        −3.72        0.9999

Note. The total area under the normal distribution curve is one. The area under a given part of the curve gives the probability of an observation being
in that part. The y-axis indicates the "probability density", which is highest in the middle of the curve and decreases in either direction toward the
tails of the curve. The normal distribution is symmetric, i.e. the probability from Z to plus infinity (right side of the table) is the same as from −Z to
−∞. The right side of the table gives the one-sided probability from a given Z-value on the x-axis to +∞. The left side of the table gives the two-sided
probability as the sum of the probability from a given positive Z-value to +∞ and the probability from the corresponding negative Z-value to −∞.
`
`
`interval would be, the more reliable the result would be.
`In general the width of the confidence interval is deter-
`mined by the sample size. A large sample size would
`result in a narrow confidence interval. Normally the
`95% confidence interval would be estimated. The 95%
`confidence interval is the interval, which would on aver-
`age include the true difference in 95 out of 100 similar
`studies. This is illustrated in Fig. 2 where 100 trial sam-
`ples of the same size have been randomly drawn from
`the same population. It is important to note that in 5
`of the 100 samples the 95% confidence interval of the
`difference in effect D does not include the true difference
`found in the population. When the sorted confidence
`intervals are aligned to their middle (Fig. 2c), the varia-
`tion in relation to the true value in the population
`becomes even clearer. If simulation is carried out on
`an even greater scale, the likelihood distribution of the
`true difference in the population, given the results from
`a certain trial sample, will follow a normal distribution
`like that presented in Fig. 3 [2]. It is seen that the likeli-
`hood of the true difference in the population is maxi-
`mum at the difference D found in the sample and that
`it decreases with higher and lower values. The figure also
`
`Fig. 2. Illustration of the variation of confidence limits in random
`samples (computer simulation). (a) ninety-five percent confidence inter-
`vals in 100 random samples of same size from the same population
`aligned according to the true value in the population. In 5 of the samples
`the 95% confidence interval does not include the true value found in the
`population. (b) The same confidence intervals are here sorted according
`to their values. (c) When the sorted confidence intervals are aligned to
`their middle, their variation in relation to the true value in the population
`is again clearly seen. This presentation corresponds to how investigators
`would see the world. They investigate samples in order to extrapolate the
`findings to the population. However,
`the potential
`imprecision of
`extrapolating from a sample to the population is apparent – especially
`if the confidence interval is wide. Thus keeping confidence intervals rather
`narrow is important. This would mean relatively large trials.
`
`Fig. 3. (a) Histogram showing the distribution of the true difference in
`the population in relation to the difference D found in the trial sample
`(computer simulation of 10,000 samples). (b) The normally distributed
`likelihood curve of the true difference in the population in relation to the
`difference D found in a trial sample. The 95% confidence interval (CI) is
`shown.
`
illustrates the 95% confidence interval, which is the
interval that includes the middle 95% of the total likelihood
area under the normal curve. This interval can be calculated
from the difference D and its standard error
SE_D. To be surer that the true difference is included in
the confidence interval, one may calculate a 99% confidence
interval, which would be wider, since it should
include the middle 99% of the total likelihood area.
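As an illustration of how such an interval is obtained from D and SE_D, here is a minimal sketch for a difference between two proportions. It assumes the approximate standard error formula that the paper gives in Example 1; the function name and argument names are hypothetical, and the counts are those reported in Example 1.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci(x_control, n_control, x_new, n_new, level=0.95):
    """Difference in proportions (new - control) with a normal-approximation CI."""
    p1 = x_control / n_control
    p2 = x_new / n_new
    d = p2 - p1
    # Approximate standard error of the difference between two proportions.
    se_d = sqrt(p1 * (1 - p1) / n_control + p2 * (1 - p2) / n_new)
    # Two-sided deviate: 1.96 for a 95% interval, about 2.58 for a 99% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return d, se_d, d - z * se_d, d + z * se_d


# Counts from Example 1: 26 of 60 responders (control) and 35 of 60 (new therapy).
d, se_d, lower, upper = diff_ci(26, 60, 35, 60)
lower99 = diff_ci(26, 60, 35, 60, level=0.99)[2]
print(round(d, 2), round(se_d, 2), round(lower, 3), round(upper, 3))
```

The 99% interval computed on the last line of the sketch has a lower limit below that of the 95% interval, reflecting the wider interval described above.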
`
2.3. The type 2 error risk of having overlooked
a difference Δ

If the 95% confidence interval of D includes zero,
then there is no significant difference in effect between
the two therapies. However, this does not mean that
one can conclude that the effects of the therapies are
the same. There may still be a true difference in effect
between the therapies, which the RCT has just not been
able to detect, e.g. because of insufficient sample size and
power. The risk of having overlooked a certain difference
in effect of Δ between the therapies is the type 2
error risk β. In some cases this risk may be substantial.
Example 1 gives an illustration of this.
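The risk β of having overlooked Δ can be sketched numerically from the observed difference D and its standard error, using the relation Z_β = (Δ − D)/SE_D that Example 1 applies. This is only an illustrative sketch: the function name is hypothetical, and the exact normal distribution replaces interpolation in Table 1.

```python
from statistics import NormalDist


def type2_risk(delta, d_observed, se_d):
    """One-sided risk beta of having overlooked delta, from Z_beta = (delta - D) / SE_D."""
    z_beta = (delta - d_observed) / se_d
    # Upper-tail probability beyond Z_beta under the standard normal curve.
    return 1.0 - NormalDist().cdf(z_beta)


# Rounded values from Example 1: delta = 0.20, D = 0.15, SE_D = 0.09.
beta = type2_risk(delta=0.20, d_observed=0.15, se_d=0.09)
print(round(beta, 2))
```

The closer the observed D lies to Δ, or the larger SE_D is, the larger this risk becomes.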
`
Example 1. In naïve cases of chronic hepatitis C
genotype 1, pegylated interferon plus ribavirin for 3
months induces sustained virologic response in about
40%. One wishes to test if a new therapeutic regimen can
increase the sustained response in this type of patients to
60% with a power (1 − β) of 80%. The type 1 error risk
(2α) should be 5%. One needs to estimate the number of
patients necessary for this trial. For comparison of
proportions as in this trial, the variance of the
difference (S²) is equal to p1(1 − p1) + p2(1 − p2), where
p1 and p2 are the proportions with response in the
compared groups. So we have:

2α = 0.05 ⇒ Z_2α = 1.96
β = 0.20 ⇒ Z_β = 0.84
p1 = 0.4
p2 = 0.6
Δ = 0.2

Using N = (Z_2α + Z_β)² × [p1(1 − p1) + p2(1 − p2)]/Δ² one
gets:

$$N = (1.96 + 0.84)^2 \times (0.4 \times 0.6 + 0.6 \times 0.4)/0.2^2 = 7.84 \times 0.48/0.04 = 94.$$

Therefore the necessary number of patients (2N) would
be 188 patients.
However, due to various difficulties only 120 patients
(60 in each group) of this kind could be recruited. By
solving the general sample size formula for Z_β
one obtains:

$$Z_{\beta} = \frac{\sqrt{N}}{S} \times \Delta - Z_{2\alpha}.$$

Using this formula, the power of the trial with the
reduced number of patients can be estimated as follows:

$$Z_{\beta} = \frac{\sqrt{60}}{\sqrt{0.48}} \times 0.2 - Z_{2\alpha}$$

$$Z_{\beta} = 7.75/0.69 \times 0.2 - 1.96 = 0.29$$

Using Table 1 (right part) with interpolation β becomes
0.39. Thus with this limited number of patients, the
power 1 − β is now only 0.61 or 61% (a reduction from
80%). This markedly reduced power seriously diminishes
the chances of demonstrating a significant treatment
effect. A post hoc power calculation like this can
only be used to explain why a superiority trial is inconclusive;
it can never be used to support a negative result
of a superiority trial.
The result of the trial was as follows: sustained
virologic response was found in 26 of 60 (0.43 or 43%)
in the control group and in 35/60 (0.58 or 58%) in the
new therapy group. The difference D is 0.15 or 15%,
but it is not statistically significant (p > 0.10). A simple
approximate formula for the standard error of the
difference is:

$$SE_D = \sqrt{p_1(1 - p_1)/n_1 + p_2(1 - p_2)/n_2} = \sqrt{0.43 \times 0.57/60 + 0.58 \times 0.42/60} = 0.09$$

The 95% confidence interval for D is
D ± Z_2α × SE_D = 0.15 ± 1.96 × 0.09 or −0.026 to
0.326 (−2.6% to 32.6%), which is rather wide, as it
includes both zero and Δ. The type 2 error risk of overlooking
an effect of 20% (corresponding to Δ) can be
estimated as follows: Z_β = (Δ − D)/SE_D = (0.20 − 0.15)/
0.09 = 0.55. Using Table 1 (right part) with interpolation
β becomes 0.29. Thus the risk of having overlooked
an effect of 20% is 29%. This is a consequence of the
smaller number of patients included and the reduced
power of the trial. The situation corresponds to that
illustrated in Fig. 4. As seen from this figure the result
of a negative RCT like this does not rule out that the
true difference may be Δ, since the type 2 error risk β
of having overlooked an effect of Δ is substantial.

Fig. 4. Illustration of the type 2 error risk β in an RCT showing a
difference D in effect, which is not significant, since zero (0) difference lies
between the lower (L) and upper (U) 95% confidence limits. The type 2
error risk of having overlooked an effect of Δ is substantial.

3. Equivalence trials

The purpose of an equivalence trial is to establish
identical effects of the therapies being compared [12–
17,15]. Completely equivalent effects would mean a Δ-value
of zero. As seen from the formula for estimation
of the sample size (see above) this would mean division
by zero, which is not possible. Dividing by a very
small Δ-value would result in an unrealistically large
estimated sample size. Therefore, as a manageable
compromise, the aim of an equivalence trial would
be to determine if the difference in effects between
two therapies lies within a specified small interval −Δ
to +Δ.
An equivalence trial would be relevant if the new
therapy is simpler, associated with fewer side-effects or
less expensive, even if it is not expected to have a larger
therapeutic effect than the control therapy.
It is crucial to specify a relevant size of Δ [14,17]. This
is not simple. One should aim at limiting as much as
possible the acceptance of a new therapy, which is inferior
to the control therapy. Therefore Δ should be specified
rather small and in any case smaller than the
smallest value that would represent a clinically meaningful
difference. As a crude general rule Δ should be specified
to no more than half the value which may be used
in a superiority trial [13]. Equivalence between the therapies
would be demonstrated if the confidence interval
`
for the difference in effect between the therapies turns
out to lie entirely between −Δ and +Δ [13]. Fig. 5 illustrates
the conclusions that can be drawn from the position
of the confidence limits for the difference in effect
found in the performed trial.
In the equivalence trial the roles of the null and alternative
hypotheses are reversed. In the equivalence trial
the relevant null hypothesis is that a difference of at least
Δ exists, and the aim of the trial is to disprove this in
favor of the alternative hypothesis that no difference
exists [13]. Even if this situation is like a mirror image
of the situation for the superiority trial, it turns out that
the method for sample size estimation is similar in the
two types of trial, although Δ has different meanings
in the superiority and equivalence trials.
`
Fig. 5. Examples of observed treatment differences (new therapy −
control therapy) with 95% confidence intervals and conclusions to be
drawn. (a) The new therapy is significantly better than the control
therapy. However, the magnitude of the effect may be clinically
unimportant. (b–d) The therapies can be considered having equivalent
effects. (e–f) Result inconclusive. (g) The new therapy is significantly
worse than the control therapy, but the magnitude of the difference may
be clinically unimportant. (h) The new therapy is significantly worse than
the control therapy.

Example 2. In the same patients as described in Example 1
one wishes to test in an RCT the therapeutic
equivalence of the current regimen of pegylated interferon
plus ribavirin (giving a sustained response in 40%)
and another new inexpensive therapeutic regimen having
fewer side-effects.
One needs to estimate the number of patients
necessary for this trial. The power (1 − β) of the trial
should be 80%. The type 1 error risk (2α) should be 5%.
The therapies would be considered equivalent if the
confidence interval for the difference in proportion with
sustained response falls entirely within the interval
±0.10 or ±10%. Thus Δ is specified as 0.10. So we
have:

2α = 0.05 ⇒ Z_2α = 1.96
β = 0.20 ⇒ Z_β = 0.84
p1 = 0.4
p2 = 0.4
Δ = 0.10

Using the same expression for the variance of the difference
(S²) as in Example 1 this result is obtained:

$$N = (1.96 + 0.84)^2 \times (0.4 \times 0.6 + 0.4 \times 0.6)/0.1^2 = 7.84 \times 0.48/0.01 = 376.$$

Therefore the necessary number of patients (2N) would
be 752 patients.
The trial was conducted and the result of the trial was
as follows: Sustained virologic response was found in
145 of 372 (0.39 or 39%) in the control group and in 156/
380 (0.41 or 41%) in the new therapy group. The
difference D is 0.02 or 2%, but it is not statistically
significant (p > 0.50). The standard error of the difference
is:

$$SE_D = \sqrt{p_1(1 - p_1)/n_1 + p_2(1 - p_2)/n_2} = \sqrt{0.39 \times 0.61/372 + 0.41 \times 0.59/380} = 0.036$$

The 95% confidence interval for D is
D ± Z_2α × SE_D = 0.02 ± 1.96 × 0.036 or −0.050 to
0.091 (−5.0% to 9.1%). Since this confidence interval lies
completely within the specified interval for Δ from −0.1
to +0.1, the effects of the two therapies can be considered
equivalent. The situation corresponds to (b) or (c)
in Fig. 5.
As in this example, the necessary sample size in an
equivalence trial will often be at least 4 times that of a
corresponding superiority trial. Therefore the necessary
resources will be larger.
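The equivalence criterion used in Example 2 (the whole 95% confidence interval must lie between −Δ and +Δ) can be sketched as follows. The function name is hypothetical and the approximate standard error formula from the examples is assumed; because unrounded proportions and an exact normal deviate are used, the limits differ slightly in the third decimal from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist


def equivalence_check(x_control, n_control, x_new, n_new, delta, two_sided_risk=0.05):
    """Return (lower, upper, equivalent) for the difference new - control."""
    p1, p2 = x_control / n_control, x_new / n_new
    d = p2 - p1
    se_d = sqrt(p1 * (1 - p1) / n_control + p2 * (1 - p2) / n_new)
    z = NormalDist().inv_cdf(1.0 - two_sided_risk / 2.0)
    lower, upper = d - z * se_d, d + z * se_d
    # Equivalence is shown only if the entire interval lies inside (-delta, +delta).
    return lower, upper, (-delta < lower and upper < delta)


# Counts from Example 2 with delta = 0.10.
lower, upper, equivalent = equivalence_check(145, 372, 156, 380, delta=0.10)
# Counts from Example 1 judged against the same margin, for contrast.
small_trial = equivalence_check(26, 60, 35, 60, delta=0.10)
print(round(lower, 3), round(upper, 3), equivalent, small_trial[2])
```

Applied to the much smaller trial of Example 1, the same check fails because the interval is too wide, which illustrates why a non-significant superiority trial cannot be read as showing equivalence.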
`
`4. Non-inferiority trials
`
The non-inferiority trial, which is related to the equivalence
trial, aims not at showing equivalence but only at
showing that the new therapy is no worse than the reference
therapy. Thus the non-inferiority trial is designed to
demonstrate that the difference in effect (new therapy −
control therapy) is no less than −Δ. Non-inferiority
of the new therapy would then be demonstrated if
the lower confidence limit for the difference in effect
between the therapies turns out to lie above −Δ. The
position of the upper confidence limit is not of primary
interest. Thus the non-inferiority trial is designed as a
one-sided trial. For that reason the necessary number
of patients would be less than for a corresponding equivalence
trial as illustrated in the following example.
`
Example 3. We want to conduct the trial described in
Example 2 not as an equivalence trial but as a non-inferiority
trial. Thus the trial should be one-sided
instead of the two-sided equivalence trial. The only
difference would be that one should use Z_α instead of
Z_2α. For α = 0.05 one gets Z_α = 1.64 (Table 1, right
side). Thus we obtain:

$$N = (1.64 + 0.84)^2 \times (0.4 \times 0.6 + 0.4 \times 0.6)/0.1^2 = 6.15 \times 0.48/0.01 = 295.$$

Therefore the necessary number of patients (2N) would
be 590 patients.
The trial was conducted and the result of the trial was
as follows: Sustained virologic response was found in 114
of 292 (0.39 or 39%) in the control group and in 125/298
(0.42 or 42%) in the new therapy group. The difference D
is 0.03 or 3%, but it is not statistically significant
(p > 0.50). The standard error of the difference is:

$$SE_D = \sqrt{p_1(1 - p_1)/n_1 + p_2(1 - p_2)/n_2} = \sqrt{0.39 \times 0.61/292 + 0.42 \times 0.58/298} = 0.040$$

The lower one-sided 95% confidence limit would be D −
Z_α × SE_D = 0.03 − 1.64 × 0.040 = −0.036 (−3.6%).
Since the lower confidence limit lies above the specified limit
for Δ of −0.1, the effect of the new therapy is not inferior
to the control therapy. If the two-sided 95% confidence
interval (which is recommended by some even for the
non-inferiority trial [18]) is being estimated, one obtains
D ± Z_2α × SE_D = 0.03 ± 1.96 × 0.040 or −0.048 to 0.108
(−4.8% to 10.8%). The lower confidence limit still lies
above −0.1, but the upper confidence limit lies above 0.1
(the upper limit for equivalence, see Example 2). Therefore
the new therapy may be slightly better than the control
therapy. The type 2 error risk of having overlooked an effect
of 0.1 or 10% could be estimated as follows:
Z_β = (Δ − D)/SE_D = (0.10 − 0.03)/0.04 = 1.75. Using
Table 1 (right part) with interpolation β becomes 0.04, a
rather small risk.
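The one-sided criterion of Example 3 can be sketched in the same way. Again the function name is hypothetical and the approximate standard error formula from the examples is assumed; only the lower confidence limit is compared with −Δ.

```python
from math import sqrt
from statistics import NormalDist


def non_inferiority_check(x_control, n_control, x_new, n_new, delta, one_sided_risk=0.05):
    """Return (lower_limit, non_inferior) for the difference new - control."""
    p1, p2 = x_control / n_control, x_new / n_new
    d = p2 - p1
    se_d = sqrt(p1 * (1 - p1) / n_control + p2 * (1 - p2) / n_new)
    # One-sided deviate Z_alpha (about 1.64 for alpha = 0.05).
    z_alpha = NormalDist().inv_cdf(1.0 - one_sided_risk)
    lower = d - z_alpha * se_d
    # Non-inferiority is shown if the lower limit lies above -delta.
    return lower, lower > -delta


# Counts from Example 3 with delta = 0.10.
lower_limit, non_inferior = non_inferiority_check(114, 292, 125, 298, delta=0.10)
print(round(lower_limit, 3), non_inferior)
```

Passing `one_sided_risk=0.025` reproduces the lower limit of the two-sided 95% interval, the more conservative choice that the text notes is recommended by some.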
`
`5. Other factors
`
`Since the aim of an equivalence or non-inferiority
`trial is to establish equivalence between the therapies
`or non-inferiority of the new therapy, there is not the
`same incentive to remove factors likely to obscure any
`difference between the treatments as in a superiority
`trial. Thus in some cases finding of equivalence may
`be due to trial deficiencies like small sample size, lack
`of double blinding, lack of concealed random allocation,
`incorrect doses of drugs, effects of concomitant medicine
`or spontaneous recovery of patients without medical
`intervention [19].
`An equivalence or non-inferiority trial should mirror
`as closely as possible the methods used in previous superi-
`ority trials assessing the effect of the control therapy ver-
`
`sus placebo. In particular it is important that the inclusion
`and exclusion criteria, which define the patient popula-
`tion, the blinding, the randomization, the dosing schedule
`of the standard treatment, the use of concomitant medica-
`tion and other interventions, the primary response vari-
`able and its schedule of measurements, are the same as
`in the preceding superiority trials, which have evaluated
`the reference therapy being used in the comparison. In
`addition one should pay attention to patient compliance,
the response during any run-in period, and the scale of
`patient losses and the reasons for them. These should
`not be different from previous superiority trials.
`
`6. Analysis: both ‘‘intention to treat’’ and ‘‘per protocol’’
`
`An important point in the analysis of equivalence and
`non-inferiority trials concerns whether to use an ‘‘inten-
`tion to treat’’ or a ‘‘per protocol’’ analysis. In a superi-
`ority trial, where the aim is to decide if two treatments
`are different, an intention to treat analysis is generally
`conservative: the inclusion of protocol violators and
`withdrawals will usually tend to make the results from
`the two treatment groups more similar. However, for
`an equivalence or non-inferiority trial this effect is no
`longer conservative: any blurring of
`the difference
`between the treatment groups will increase the chance
`of finding equivalence or non-inferiority.
`A per protocol analysis compares patients according
`to the treatment actually received and includes only
`those patients who satisfied the entry criteria and prop-
`erly followed the protocol. In a superiority trial this
`approach may tend to enhance any difference between
`the treatments rather than diminishing it, because unin-
`formative ‘‘noise’’ is removed. In an equivalence or non-
`inferiority trial both types of analysis should be per-
`formed and equivalence or non-inferiority can only be
`established if both analyses support it. To ensure the
`best possible quality of the analysis it is important to
`collect complete follow-up data on all randomized
`patients as per protocol, irrespective of whether they
`are subsequently found to have failed entry criteria,
`withdraw from trial medication prematurely, or violate
`the protocol
`in some other way [20]. Such a rigid
`approach to data collection allows maximum flexibility
`during later analysis and hence provides a more robust
`basis for decisions.
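The requirement that both analyses must support the conclusion amounts to a simple decision rule, sketched below for the non-inferiority case. The function name and the confidence limits are hypothetical and invented purely for illustration; they are not results from any trial.

```python
def non_inferiority_established(lower_limits, delta):
    """True only if every analysis gives a lower confidence limit above -delta.

    lower_limits maps an analysis name to the lower confidence limit of the
    difference (new therapy - control therapy) found in that analysis.
    """
    required = {"intention_to_treat", "per_protocol"}
    if not required.issubset(lower_limits):
        raise ValueError("both intention-to-treat and per-protocol results are needed")
    return all(limit > -delta for limit in lower_limits.values())


# Hypothetical limits: the per-protocol analysis falls below -delta.
conflicting = {"intention_to_treat": -0.036, "per_protocol": -0.112}
# Hypothetical limits: both analyses lie above -delta.
agreeing = {"intention_to_treat": -0.036, "per_protocol": -0.050}
print(non_inferiority_established(conflicting, 0.10),
      non_inferiority_established(agreeing, 0.10))
```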
`The most common problem in reported equivalence
`or non-inferiority studies is that they are planned and
`analyzed as if they were superiority trials and that the
`lack of a statistically significant difference is taken as
`proof of equivalence [7–10]. Thus there seems to be a
`need for a better knowledge of how equivalence and
`non-inferiority studies should be planned, performed,
`analyzed and reported.
`
`
`7. Ensuring a high quality
`
`
`A recent paper reported on the quality of reporting of
`published equivalence trials [21]. They found that some
`trials had been planned as superiority trials but were
`reported as if they had been equivalence trials after fail-
`ure to demonstrate superiority, since they did not
`include an equivalence margin. They also found that
`one-third of the reports which included a sample size
`calculation had omitted elements needed to reproduce
`it; one third of the reports described a confidence inter-
`val whose size was not