PowerTOST • Average Bioequivalence


Details and examples of other methods are accessible via the menu bar at the top of the page and in the online manual of all functions.

library(PowerTOST) # attach the library

Defaults

Parameter       Argument     Purpose                                         Default
\(\alpha\)      alpha        Nominal level of the test                       0.05
\(\pi\)         targetpower  Minimum desired power                           0.80
logscale        logscale     Analysis on log-transformed or original scale?  TRUE
\(\theta_{0}\)  theta0       ‘True’ or assumed deviation of T from R         see below
\(\theta_{1}\)  theta1       Lower BE limit                                  see below
\(\theta_{2}\)  theta2       Upper BE limit                                  see below
CV              CV           Coefficient of variation                        none
design          design       Planned design                                  "2x2"
method          method       Algorithm                                       "exact"
robust          robust       ‘Robust’ evaluation (Senn’s basic estimator)    FALSE
print           print        Show information in the console?                TRUE
details         details      Show details of the sample size search?         FALSE
imax            imax         Maximum number of iterations                    100

Defaults depending on the argument logscale:

Parameter       Argument  logscale = TRUE  logscale = FALSE
\(\theta_{0}\)  theta0    0.95             +0.05
\(\theta_{1}\)  theta1    0.80             −0.20
\(\theta_{2}\)  theta2    1.25             +0.20

Arguments targetpower, theta0, theta1, theta2, and CV have to be given as fractions, not percentages.
The CV is generally the within-subject coefficient of variation; only for design = "parallel" is it the total (a.k.a. pooled) CV.
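
For instance, an assumed CV of 30% and a target power of 90% are specified as fractions (a minimal sketch; the values are arbitrary):

# CV and targetpower given as fractions, not percentages
sampleN.TOST(CV = 0.30, targetpower = 0.90,
             print = FALSE, details = FALSE)[["Sample size"]]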

The terminology of the argument design follows this pattern: treatments x sequences x periods. The conventional TR|RT (a.k.a. AB|BA) design can be abbreviated as "2x2". Some call the "parallel" design a ‘one-sequence’ design. The design "paired" has two periods but no sequences, e.g., in studying linear pharmacokinetics a single dose is followed by multiple doses. A profile in steady state (T) is compared to the one after the single dose (R). Note that the underlying model assumes no period effects.
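
The design argument is simply passed to the functions (a sketch; the CV of 0.25 is a hypothetical value):

# sample size for the paired design described above
sampleN.TOST(CV = 0.25, design = "paired",
             print = FALSE, details = FALSE)[c("Design", "Sample size")]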

Implemented exact algorithms are "exact" / "owenq" (Owen’s Q function, default) and "mvt" (bivariate non-central t-distribution). Approximations are "noncentral" / "nct" (non-central t-distribution) and "shifted" / "central" (‘shifted’ central t-distribution).
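
All of them are available in power.TOST() as well, so the algorithms can be compared for a given scenario (a sketch using the setting of Example 1 below; "nct" is a very accurate approximation, whereas "shifted" is slightly conservative):

# compare the exact algorithms with the approximations
sapply(c("exact", "mvt", "nct", "shifted"),
       function(m) power.TOST(CV = 0.30, n = 40, method = m))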

"robust=TRUE" forces the degrees of freedom to n-seq and is used only in higher-order crossover designs. It could be used if the evaluation was done with a mixed-effects model.

With sampleN.TOST(..., details = FALSE, print = FALSE) results are returned as a data.frame with nine columns: Design, alpha, CV, theta0, theta1, theta2, Sample size, Achieved power, and Target power.
To access, e.g., the sample size use either sampleN.TOST(...)[1, 7] or sampleN.TOST(...)[["Sample size"]]. We suggest using the latter in your code for clarity.
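
A minimal sketch of both ways of accessing the result:

res <- sampleN.TOST(CV = 0.30, details = FALSE, print = FALSE)
res[1, 7]            # 7th column by position
res[["Sample size"]] # by column name (clearer)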

The estimated sample size always gives the total number of subjects (not subjects per sequence in crossovers or subjects per group in parallel designs, as in some other software packages).

Sample size

Designs with one (parallel) to four periods (replicates) are supported.

#      design                        name   df
#  "parallel"           2 parallel groups  n-2
#       "2x2"               2x2 crossover  n-2
#     "2x2x2"             2x2x2 crossover  n-2
#     "2x2x3"   2x2x3 replicate crossover 2n-3
#     "2x2x4"   2x2x4 replicate crossover 3n-4
#     "2x4x4"   2x4x4 replicate crossover 3n-4
#     "2x3x3"   partial replicate (2x3x3) 2n-3
#     "2x4x2"            Balaam’s (2x4x2)  n-2
#    "2x2x2r" Liu’s 2x2x2 repeated x-over 3n-2
#    "paired"                paired means  n-1

Example 1

Estimate the sample size for assumed intra-subject CV 0.30.

sampleN.TOST(CV = 0.30)
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25 
# True ratio = 0.95,  CV = 0.3
# 
# Sample size (total)
#  n     power
# 40   0.815845

To get only the sample size:

sampleN.TOST(CV = 0.30, details = FALSE, print = FALSE)[["Sample size"]]
# [1] 40

Note that the sample size is always rounded up to give balanced sequences (here a multiple of two). Since the achieved power is higher than our target, rounding up likely occurred here. Let us check:
which power would we get with a sample size of 39?

power.TOST(CV = 0.30, n = 39)
# Unbalanced design. n(i)=20/19 assumed.
# [1] 0.8056171

This confirms that with 39 subjects we would already reach the target power. It also means that one dropout will not compromise power. We can explore that further in a Power Analysis.

Maybe the clinical site has a limited capacity. Any study can also be performed in a replicate design and assessed for ABE. As a rule of thumb, the total sample size in a 3-period replicate is ~¾ of that in a 2×2×2 design, and in a 4-period replicate ~½. The number of treatments – and hence of biosamples, which drives the study’s cost – will be roughly the same.

designs <- c("2x2x2", "2x2x3", "2x3x3", "2x2x4")
# data.frame of results
res <- data.frame(design = designs, n = NA, power = NA, n.do = NA,
                  power.do = NA, stringsAsFactors = FALSE)
for (i in 1:4) {
  # print = FALSE and details = FALSE suppress output to the console
  # we are only interested in columns 7-8
  # let's also calculate power for one dropout
  res[i, 2:3] <- signif(
                   sampleN.TOST(CV = 0.30, design = res$design[i],
                                print = FALSE, details = FALSE)[7:8], 6)
  res[i, 4]   <-  res[i, 2] - 1
  res[i, 5]   <- suppressMessages(
                   signif(
                     power.TOST(CV = 0.30, design = res$design[i],
                                n = res[i, 4]), 6))
}
print(res, row.names = FALSE)
#  design  n    power n.do power.do
#   2x2x2 40 0.815845   39 0.805617
#   2x2x3 30 0.820400   29 0.806873
#   2x3x3 30 0.820400   29 0.806383
#   2x2x4 20 0.820240   19 0.799151

As expected; as a bonus we get a small gain in power, though in the 4-period design with one dropout power will be slightly compromised.
But why is power in the replicate designs higher than in the 2×2×2? If residual variances are equal, the width of the confidence interval depends only on the t-value, in particular on the degrees of freedom – which in turn depend on the design and the sample size.

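The degrees of freedom and t-values in the table below can be verified with a few lines of base R (a sketch; sample sizes as estimated above):

# one-sided 95% t-quantiles for the four designs
n  <- c(40, 30, 30, 20) # 2x2x2, 2x2x3, 2x3x3, 2x2x4
df <- c(n[1] - 2, 2 * n[2] - 3, 2 * n[3] - 3, 3 * n[4] - 4)
round(qt(1 - 0.05, df), 3)
# [1] 1.686 1.672 1.672 1.673
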
#  design                      name  n formula df t.value
#   2x2x2           2x2x2 crossover 40     n-2 38   1.686
#   2x2x3 2x2x3 replicate crossover 30   2*n-3 57   1.672
#   2x3x3 partial replicate (2x3x3) 30   2*n-3 57   1.672
#   2x2x4 2x2x4 replicate crossover 20   3*n-4 56   1.673

If the capacity is 24 beds, we would opt for a 4-period full replicate.

As another option (e.g., if the blood volume is limited and/or there are concerns about a higher dropout rate in a multiple-period study) we could stay with the 2×2×2 design but split the sample size into groups. In Europe (and for the FDA if certain conditions are fulfilled), there are no problems pooling the data and using the conventional model.

sequence + subject(sequence) + period + treatment

However, some regulators prefer to incorporate group-terms in the model.

group + sequence + subject(group × sequence) + 
period(group) + group × sequence + treatment

Since we have more terms in the model, we will lose some degrees of freedom. Let us explore in simulations how that would impact power. By default the function power.TOST.sds() performs 100,000 simulations.

grouping <- function(capacity, n) {
  # split sample size into >=2 groups based on capacity
  if (n <= capacity) { # make equal groups
    ngrp <- rep(ceiling(n/2), 2)
  } else {             # at least one = capacity
    ngrp    <- rep(0, ceiling(n / capacity))
    grps    <- length(ngrp)
    ngrp[1] <- capacity
    for (j in 2:grps) {
      n.tot <- sum(ngrp) # what we have so far
      if (n.tot + capacity <= n) {
        ngrp[j] <- capacity
      } else {
        ngrp[j] <- n - n.tot
      }
    }
  }
  return(ngrp = list(grps = length(ngrp), ngrp = ngrp))
}
CV        <- 0.30
capacity  <- 24 # clinical capacity
res       <- data.frame(n = NA, grps = NA, n.grp = NA, m.1 = NA, m.2 = NA)
x         <- sampleN.TOST(CV = CV, print = FALSE, details = FALSE)
res$n     <- x[["Sample size"]]
res$m.1   <- x[["Achieved power"]]
x         <- grouping(capacity = capacity, n = res$n)
res$grps  <- x[["grps"]]
ngrp      <- x[["ngrp"]]
res$n.grp <- paste(ngrp, collapse = "|")
res$m.2   <- power.TOST.sds(CV = CV, n = res$n, grps = res$grps,
                            ngrp = ngrp, gmodel = 2, progress = FALSE)
res$loss <- 100*(res$m.2 - res$m.1)/res$m.1
names(res)[2:6] <- c("groups", "n/group", "pooled model",
                     "group model", "loss (%)")
res[1, 4:6] <- sprintf("%6.4f", res[1, 4:6])
cat("Achieved power and relative loss\n"); print(res, row.names = FALSE)
# Achieved power and relative loss
#   n groups n/group pooled model group model loss (%)
#  40      2   24|16       0.8158      0.8120  -0.4664

At ~0.5%, the relative loss in power is practically negligible.

Example 2

Estimate the sample size for equivalence of the ratio of two means with normality on original scale based on Fieller’s (‘fiducial’) confidence interval.1 Crossover design, within-subject CVw 0.20, between-subject CVb 0.40.

sampleN.RatioF(CV = 0.20, CVb = 0.40)
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#     based on Fieller's confidence interval
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover
# Ratio of means with normality on original scale
# alpha = 0.025, target power = 0.8
# BE margins = 0.8 ... 1.25 
# True ratio = 0.95,  CVw = 0.2,  CVb = 0.4
# 
# Sample size
#  n     power
# 28   0.807774

In this function the default \(\alpha\) is 0.025, since it is intended for studies with clinical endpoints, where the 95% confidence interval is usually used for equivalence testing.2

Example 3

Estimate the sample size based on the results of a 2×2×2 pilot study in 16 subjects where we found an intra-subject CV 0.20 and \(\theta_{0}\) 0.92.

Basic

If we believe [sic] that in the pivotal study both the \(\theta_{0}\) and CV will be exactly as in the pilot, this is a straightforward exercise. We simply plug in the necessary arguments.

sampleN.TOST(CV = 0.20, theta0 = 0.92)
# 
# +++++++++++ Equivalence test - TOST +++++++++++
#             Sample size estimation
# -----------------------------------------------
# Study design: 2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25 
# True ratio = 0.92,  CV = 0.2
# 
# Sample size (total)
#  n     power
# 28   0.822742

This approach is called by some ‘carved in stone’ because it relies on very strong assumptions which are likely not justified. Power curves are relatively flat close to unity (i.e., the impact on power is small when moving from, say, \(\theta_{0}\) 1 to 0.95) but become increasingly steep the farther one moves away from unity.

Fig. 1 Power curve (\(\alpha\) 0.05, 2×2×2 design, CV 0.20;
blue n = 28, black n = 22, red n = 16)

Both the \(\theta_{0}\) and the CV carry (as does every estimate) some uncertainty, which depends on the degrees of freedom (sample size, design). Hence, it might not be a good idea to perform very small pilot studies (e.g., in only six subjects). Although the CV in the pivotal study might indeed turn out lower than the one observed in the pilot, assuming a lower one when planning the pivotal study would be even riskier than the ‘carved in stone’ approach.

With the function CVCL() we can calculate confidence limits of the CV. It is advisable to use the upper CL as a conservative approach. As a side effect – if the CV will be lower than assumed – we get a ‘safety margin’ for the T/R ratio.

df <- 16 - 2 # degrees of freedom of the 2x2x2 crossover pilot
CVCL(CV = 0.20, df = df, side = "upper", alpha = 0.20)[["upper CL"]]
# [1] 0.2443631

I prefer \(\alpha=0.20\) in analogy to the producer’s risk \(\beta=0.20\) when planning for power \(\pi=1-\beta=0.80\). Gould3 suggested the more liberal 0.25. Let us repeat the sample size estimation with the upper limit.

CL.upper <- CVCL(CV = 0.20, df = 16 - 2, side = "upper",
                 alpha = 0.20)[["upper CL"]]
res <- sampleN.TOST(CV = CL.upper, theta0 = 0.92, print = FALSE)
print(res[7:8], row.names = FALSE)
#  Sample size Achieved power
#           40       0.816919

Of course, this has a massive impact on the sample size, which increases from 28 to 40. It might be difficult to convince the management to invest ~40% more than with the ‘carved in stone’ approach.

However, we can also explore how power would be affected if our assumption is true and the study will nevertheless be performed with only 28 subjects.

CL.upper <- CVCL(CV = 0.20, df = 16 - 2, side = "upper",
                 alpha = 0.20)[["upper CL"]]
power.TOST(CV = CL.upper, theta0 = 0.92, n = 28)
# [1] 0.679253

There will be a drop in power from the ~0.82 the management expects to only ~0.68. That’s just slightly better than betting on two dozens in roulette…

Fig. 2 Power for CV 0.244

As mentioned above, if the CV turns out to be lower than assumed, we gain ‘headroom’ for the T/R ratio. Let us explore that: we perform the study with 40 subjects, the CV turns out to be 0.22 (less than the ~0.24 we assumed), and the T/R ratio 0.90 (worse than the 0.92 we assumed).

power.TOST(CV = 0.22, theta0 = 0.90, n = 40)
# [1] 0.7686761

Below our target but still acceptable.
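
Instead of single scenarios we can scan a small grid of assumptions (a sketch; the CVs and T/R ratios are arbitrary picks around our estimates):

# power with 40 subjects for combinations of CV and theta0
grid       <- expand.grid(CV = c(0.20, 0.22, 0.24),
                          theta0 = c(0.90, 0.92, 0.95))
grid$power <- NA
for (i in 1:nrow(grid)) {
  grid$power[i] <- power.TOST(CV = grid$CV[i],
                              theta0 = grid$theta0[i], n = 40)
}
print(signif(grid, 4), row.names = FALSE)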

Advanced

In the basic approach we concentrated mainly on the uncertainty of the CV. But this is not the end of the story. Clearly the \(\theta_{0}\) is uncertain as well. With the function expsampleN.TOST() we can dive deeper into this area. Let us start with only the CV.

expsampleN.TOST(CV = 0.20, theta0 = 0.92, prior.type = "CV",
                prior.parm = list(m = 16, design = "2x2x2"))
# 
# ++++++++++++ Equivalence test - TOST ++++++++++++
#        Sample size est. with uncertain CV
# -------------------------------------------------
# Study design:  2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25 
# Ratio = 0.92
# CV = 0.2 with 14 df
# 
# Sample size (ntotal)
#  n   exp. power
# 30   0.806069

Not that bad. The sample size increases only slightly from the 28 of the ‘carved in stone’ approach to 30 but is substantially lower than the 40 we estimated based on the upper confidence limit of the CV.

Let us keep the CV ‘fixed’ and take only the uncertainty of \(\theta_{0}\) into account.

expsampleN.TOST(CV = 0.20, theta0 = 0.92, prior.type = "theta0",
                prior.parm = list(m = 16, design = "2x2x2"))
# 
# ++++++++++++ Equivalence test - TOST ++++++++++++
#      Sample size est. with uncertain theta0
# -------------------------------------------------
# Study design:  2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25 
# Ratio = 0.92
# CV = 0.2
# 
# Sample size (ntotal)
#  n   exp. power
# 46   0.805236

It starts to hurt. We learned already that power curves get steep when the T/R ratio is not close to unity. Our \(\theta_{0}\) of 0.92 was not very favorable, and in the pivotal study it might be even lower – which has a larger impact on power than the CV.

Now for the ‘worst case scenario’ where we take both uncertainties into account.

expsampleN.TOST(CV = 0.20, theta0 = 0.92, prior.type = "both",
                prior.parm = list(m = 16, design = "2x2x2"),
                details = FALSE)
# 
# ++++++++++++ Equivalence test - TOST ++++++++++++
#   Sample size est. with uncertain CV and theta0
# -------------------------------------------------
# Study design:  2x2 crossover 
# log-transformed data (multiplicative model)
# 
# alpha = 0.05, target power = 0.8
# BE margins = 0.8 ... 1.25 
# Ratio = 0.92 with 14 df
# CV = 0.2 with 14 df
# 
# Sample size (ntotal)
#  n   exp. power
# 54   0.802440

This sample size is almost twice the 28 your boss got from his Excel sheet. If you are not fired for suggesting such a study, take it as a warning of what might happen.
At least, if a pivotal study with a lower sample size fails, you will know why.

Fig. 3 Expected power for uncertain estimates (from top: CV, \(\theta_{0}\), both)

If you are adventurous, consider an Adaptive Two-Stage Sequential Design with sample size re-estimation. Various methods are supported in the package Power2Stage.

Higher-Order Designs

Designs with three and four periods are supported.

#   design            name   df
#    "3x3"   3x3 crossover 2n-4
#  "3x6x3" 3x6x3 crossover 2n-4
#    "4x4"   4x4 crossover 3n-6

"3x3" denotes the Latin Square (ABC|BCA|CAB), "3x6x3" a 6-sequence Williams’ design (ABC|ACB|BAC|BCA|CAB|CBA), "4x4" the Latin Square (ABCD|BCDA|CDAB|DABC) or any of the possible Williams’ designs with four periods (ADBC|BACD|CBDA|DCAB, ADCB|BCDA|CABD|DBAC, ACDB|BDCA|CBAD|DABC, ACBD|BADC|CDAB|DBCA, ABDC|BCAD|CDBA|DACB, ABCD|BDAC|CADB|DCBA).

Which design argument you should use in studies with more than two periods depends on the planned evaluation.

Suppose we have a bioequivalence study with three treatments – A, B, and C – and the objective of the study is to make pairwise comparisons among the treatments. Suppose further that treatment C is different in kind from A and B, so that the assumption of homogeneous variance among the three treatments is questionable. One way to do the analyses, under normality assumptions, is Two at a Time – e.g., to test hypotheses about A and B, use only the data from A and B. Another way is All at Once – include the data from all three treatments in a single analysis, making pairwise comparisons within this analysis. If the assumption of homogeneous variance is correct, the All at Once approach will provide more d.f. for estimating the common variance, resulting in increased power. If the variance of C differs from that of A and B, the All at Once approach may have reduced power or an inflated type I error rate, depending on the direction of the difference in variances.

I suggest opting for the former (‘Two at a Time’) since the latter (‘All at Once’) may lead to biased estimates and an inflated type I error.5

Three treatments, as in the ‘Two at a Time’ example above.

CV  <- 0.20
res <- data.frame(design = c("3x6x3", "2x2x2"), n = NA, power = NA,
                  stringsAsFactors = FALSE)
for (i in 1:2) {
  res[i, 2:3] <- signif(
                   sampleN.TOST(CV = CV, design = res$design[i],
                                details = FALSE, print = FALSE)[7:8], 5)
}
print(res, row.names = FALSE)
#  design  n   power
#   3x6x3 18 0.80895
#   2x2x2 20 0.83468

Four treatments (e.g., Test fasting/fed, Reference fasting/fed).

CV  <- 0.20
res <- data.frame(design = c("4x4", "2x2x2"), n = NA, power = NA,
                  stringsAsFactors = FALSE)
for (i in 1:2) {
  res[i, 2:3] <- signif(
                  sampleN.TOST(CV = CV, design = res$design[i],
                               details = FALSE, print = FALSE)[7:8], 5)
}
print(res, row.names = FALSE)
#  design  n   power
#     4x4 20 0.85280
#   2x2x2 20 0.83468

Power

Let us first recap the hypotheses in bioequivalence.

  1. The two one-sided testing procedure (TOST)6 \[H_{0L}:\frac{\mu_T}{\mu_R} \leq \theta_1\:vs\:H_{1L}:\frac{\mu_T}{\mu_R}>\theta_1\] \[H_{0U}:\frac{\mu_T}{\mu_R} \geq \theta_2\:vs\:H_{1U}:\frac{\mu_T}{\mu_R}<\theta_2\]

  2. The confidence inclusion approach (preferred in regulatory guidelines) \[H_0:\frac{\mu_T}{\mu_R}\notin \left [ \theta_1, \theta_2 \right]\:vs\:H_1:\theta_1<\frac{\mu_T}{\mu_R}<\theta_2\]

Note that the null hypotheses state bioinequivalence.

From a regulatory perspective the outcome of a comparative BA study is dichotomous. Either the study demonstrated bioequivalence (CI entirely within \(\theta_{1},\theta_{2}\)) or it did not.7 Only if the CI lies entirely outside \(\theta_{1},\theta_{2}\) is the null hypothesis not rejected and further studies not warranted. In any case, calculation of post hoc (a.k.a. a posteriori, retrospective) power is futile.8

There is simple intuition behind results like these: If my car made it to the top of the hill, then it is powerful enough to climb that hill; if it didn’t, then it obviously isn’t powerful enough. Retrospective power is an obvious answer to a rather uninteresting question. A more meaningful question is to ask whether the car is powerful enough to climb a particular hill never climbed before; or whether a different car can climb that new hill. Such questions are prospective, not retrospective.

If a study passes – despite lower than desired power – there is no reason to reject the study. It only means that assumptions (!) in sample size estimation were not realized. The CV might have been higher, the T/R-ratio worse, or the dropout-rate higher than anticipated. On the other hand, if post hoc power is higher than desired, this does not further support a study which already showed BE.

Nevertheless, exploring power is useful when trying to understand why a study failed and to plan another one. Let us continue with the example from above. Ignoring our concerns, the pivotal study was performed with 28 subjects. The T/R ratio was slightly worse (0.90), the CV higher (0.25), and we had one dropout in the first sequence and two in the second. The function CI.BE() comes in handy.

n     <- 28
n.seq <- rep(n/2, 2) - c(1, 2)
round(100*CI.BE(pe = 0.90, CV = 0.25, n = n.seq), 2)
#  lower  upper 
#  79.87 101.42

The study failed, albeit by a small margin. One might be tempted to repeat the study with an – only slightly – higher sample size. But what was the post hoc power of the failed study?

power.TOST(CV = 0.25, theta0 = 0.90, n = c(13, 12))
# [1] 0.4963175

Actually the chance of passing was worse than flipping a coin.

NB, in calculating post hoc power the observed T/R ratio has to be used. In some reports a high ‘power’ is given even for a failed study, which is not even wrong. Unfortunately, a T/R ratio of 1 is still used in some software packages.

power.TOST(CV = 0.25, theta0 = 1, n = c(13, 12))
# [1] 0.8558252

Since all estimates were worse than assumed, how could we have a ‘power’ even higher than desired, despite the fact that the study failed? Nonsense.

Pooling

When planning the next study one can use the entire arsenal from above. Since we have more accurate estimates (from 25 subjects instead of the 16 of the pilot) the situation is clearer now.
As a further step we can take the information of both studies into account with the function CVpooled().

CVs <- ("  CV |  n | design | study
         0.20 | 16 |  2x2x2 | pilot
         0.25 | 25 |  2x2x2 | pivotal")
txtcon <- textConnection(CVs)
data   <- read.table(txtcon, header = TRUE, sep = "|",
                     strip.white = TRUE, as.is = TRUE)
close(txtcon)
print(CVpooled(data, alpha = 0.20), digits = 4, verbose = TRUE)
# Pooled CV = 0.2322 with 37 degrees of freedom
# Upper 80% confidence limit of CV = 0.2603

Before pooling, variances are weighted by their degrees of freedom. Hence, the new estimate of ~0.23 is closer to the 0.25 of the larger study. Note also that the upper confidence limit of ~0.26 is higher than the pilot study’s ~0.24.
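
The weighting can be verified by hand (a sketch: on the log scale variances are pooled as a df-weighted average and then converted back to a CV):

CV  <- c(0.20, 0.25)          # pilot, pivotal
df  <- c(16, 25) - 2          # 2x2x2 design: n - 2
s2  <- log(CV^2 + 1)          # CV to variance on the log scale
s2p <- sum(df * s2) / sum(df) # pooled variance, weighted by df
sqrt(exp(s2p) - 1)            # back to CV: ~0.2322, as above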

Caveats

Don’t pool data blindly. In the ideal situation you know the entire background of all studies (clinical performance, bioanalytics). Even if all studies were performed at the same CRO, more things are important. One purpose of pilot studies is to find a suitable sampling schedule. If the schedule of the pilot was not ideal (e.g., you get the impression that Cmax was not sufficiently described), pooling is not a good idea. It might well be that in the pivotal study – with a ‘better’ schedule – this CV is more reliable. On the other hand, the AUC is less sensitive to differences in the sampling schedule.

Pooling data from the literature should be done with great caution (if at all). Possibly critical information is missing. Consider using a CV from the upper end of values instead. Common sense helps.

An example where pooling could be misleading: Cmax data of pilot and pivotal studies in five different designs with 11 to 39 subjects, fasting/fed, three different bioanalytical methods (GC/ECD, LC-MS/MS, GC/MS), chiral and achiral (which is not relevant for this drug since the active enantiomer is ~95% of the total drug and there is no in vivo interconversion). Note that most pivotal studies were ‘overpowered’.

Fig. 4 Blue deltoid: the pooled CV 0.163; dotted line: upper CL 0.167.

This is an apples-and-oranges comparison. Red squares show CVs which were above the upper CL of the pooled CV. Granted, only for two studies did the lower CL not overlap the upper CL of the pooled CV.

Which side of the great divide are you on? Do you believe that meta is better or do you hold instead that pooling is fooling? Well, to nail my colours to the mast, I belong to the former school. It seems to me that there is no other topic in medical statistics, with the possible exceptions of cross-over trials, bioequivalence and n-of-1 studies, which has the same capacity as this one to rot the brains.

Hints

Power Analysis

Although we suggest exploring the various options shown in Example 3, it is worthwhile to have a first look with the function pa.ABE().

pa.ABE(CV = 0.20, theta0 = 0.92)
# Sample size plan ABE
#  Design alpha  CV theta0 theta1 theta2 Sample size Achieved power
#     2x2  0.05 0.2   0.92    0.8   1.25          28       0.822742
# 
# Power analysis
# CV, theta0 and number of subjects which lead to min. acceptable power of at least 0.7:
#  CV= 0.2377, theta0= 0.9001
#  n = 21 (power= 0.7104)

Fig. 5 Power Analysis (in each panel one argument is varied and others kept constant)

This exercise confirms what we already know. The most critical parameter is \(\theta_{0}\), whereas dropouts are the least important.

More details are given in the vignette Power Analysis.

Dropouts

As we have seen, the impact of dropouts on power is rather limited. CROs regularly suggest additional subjects to ‘compensate for the potential loss in power’ – IMHO, milking sponsors to make wealthy CROs richer. Note that the dropout rate is based on dosed subjects. Hence, the correct formula for an adjusted sample size is \(n_{adj}=n / (1-{dropout\:rate})\), and not \(n_{adj}=n \times (1+{dropout\:rate})\).

balance <- function(x, seqs) {
  # round x up and add the remainder modulo seqs
  # (for seqs = 2 this gives the next even number)
  x <- ceiling(x) + ceiling(x) %% seqs
  return(x)
}
do.rate   <- 0.10 # anticipated dropout-rate 10%
seqs      <- 3    # number of sequences
n         <- seq(12, 120, 12)
res       <- data.frame(n = n,
                        ad1 = balance(n * (1 + do.rate), seqs),
                        el1 = NA, diff1 = NA,
                        ad2 = balance(n / (1 - do.rate), seqs),
                        el2 = NA, diff2 = NA)
res$el1   <- floor(res$ad1 * (1 - do.rate))
res$diff1 <- sprintf("%+i", res$el1 - n)
res$el2   <- floor(res$ad2 * (1 - do.rate))
res$diff2 <- sprintf("%+i", res$el2 - n)
# take el2 where it meets the planned n, otherwise el1
res$opt   <- ifelse(res$el2 - n >= 0, res$el2, res$el1)
res$diff  <- sprintf("%+i", res$opt - n)
print(res, row.names = FALSE)
#    n ad1 el1 diff1 ad2 el2 diff2 opt diff
#   12  16  14    +2  16  14    +2  14   +2
#   24  27  24    +0  27  24    +0  24   +0
#   36  41  36    +0  41  36    +0  36   +0
#   48  55  49    +1  54  48    +0  48   +0
#   60  66  59    -1  68  61    +1  61   +1
#   72  82  73    +1  82  73    +1  73   +1
#   84  93  83    -1  95  85    +1  85   +1
#   96 107  96    +0 109  98    +2  98   +2
#  108 121 108    +0 120 108    +0 108   +0
#  120 132 118    -2 136 122    +2 122   +2

With the wrong formula – especially for high dropout rates – you might end up with fewer eligible subjects than planned. On the other hand, with the correct one you may end up with slightly too many (due to rounding up to obtain balanced sequences). Of course, if you want to be on the safe side, you can select the ‘best’ of both (column opt).

Literature Data

Sometimes the CV is not given in the literature. By means of the function CVfromCI() we can calculate it from the confidence interval, the design, and the sample size.

CVfromCI(lower = 0.8323, upper = 1.0392,
         design = "2x2x4", n = 26)
# [1] 0.3498608

The method is exact if the number of subjects per sequence is known. In the literature, however, generally only the total sample size is given. What if the study was unbalanced?

n     <- 26
n1    <- balance(seq(n, 12, -1), 2) / 2
n2    <- n - n1
nseqs <- unique(data.frame(n1 = n1, n2 = n2, n = n))
res   <- data.frame(n1 = nseqs$n1, n2 = nseqs$n2, CV = NA)
for (i in 1:nrow(res)) {
  res$CV[i] <- CVfromCI(lower = 0.8323, upper = 1.0392,
                        design = "2x2x4",
                        n = c(res$n1[i], res$n2[i]))
}
print(res, row.names = FALSE)
#  n1 n2        CV
#  13 13 0.3498608
#  12 14 0.3487635
#  11 15 0.3454550
#  10 16 0.3398848
#   9 17 0.3319624
#   8 18 0.3215478
#   7 19 0.3084333
#   6 20 0.2923127

The true CVs of an unbalanced study are lower than the one calculated assuming balanced sequences. That means, if we use the CV – falsely – assuming balanced sequences, our sample size estimations will always be conservative.

Direction of Deviation

If you are unsure about the direction of the deviation of T from R (lower or higher) always assume \(\theta_{0}\) <1.

CV   <- 0.21
d    <- 0.05 # delta 5%, direction unknown
n    <- sampleN.TOST(CV = CV, theta0 = 1 - d, print = FALSE,
                     details = FALSE)[["Sample size"]]
res1 <- data.frame(CV = CV, theta0 = c(1 - d, 1 / (1 - d)),
                   n = n, power = NA)
for (i in 1:nrow(res1)) {
  res1$power[i] <- power.TOST(CV = CV, theta0 = res1$theta0[i], n = n)
}
n    <- sampleN.TOST(CV = CV, theta0 = 1 + d, print = FALSE,
                     details = FALSE)[["Sample size"]]
res2 <- data.frame(CV = CV, theta0 = c(1 + d, 1 / (1 + d)),
                   n = n, power = NA)
for (i in 1:nrow(res2)) {
  res2$power[i] <- power.TOST(CV = CV, theta0 = res2$theta0[i], n = n)
}
res <- rbind(res1, res2)
print(signif(res[order(res$n, res$theta0), ], 4), row.names = FALSE)
#    CV theta0  n  power
#  0.21 0.9524 20 0.8081
#  0.21 1.0500 20 0.8081
#  0.21 0.9500 22 0.8374
#  0.21 1.0530 22 0.8374

If you use 1.05 (sample size 20), power will be maintained down to its reciprocal (0.9524) but not down to 0.95 (where you would already need a sample size of 22). On the other hand, 0.95 preserves power up to 1.053. The statement ‘sample size based on a deviation of ±5%’ seen in many protocols does not make sense.

Authors

function                                   author(s)
sampleN.TOST, power.TOST, power.TOST.sds,
sampleN.RatioF, CVCL, CI.BE, CVpooled      Detlew Labes
expsampleN.TOST                            Benjamin Lang, Detlew Labes
CVfromCI                                   Detlew Labes, Helmut Schütz, Benjamin Lang
pa.ABE                                     Helmut Schütz, Detlew Labes

License

Helmut Schütz 2020-08-04

GPL-2 | GPL-3


  1. Fieller EC. Some Problems in Interval Estimation. J Royal Stat Soc B. 1954; 16(2): 175–85. JSTOR:2984043.↩︎

  2. EMEA, CPMP. Points to Consider on Switching between Superiority and Non-Inferiority. London, 27 July 2000. CPMP/EWP/482/99.↩︎

  3. Gould AL. Group Sequential Extension of a Standard Bioequivalence Testing Procedure. J Pharmacokin Biopharm. 1995; 23(1): 57–86. doi:10.1007/BF02353786.↩︎

  4. EMA, CHMP. Guideline on the Investigation of Bioequivalence. London, 20 January 2010. CPMP/EWP/QWP/1401/98 Rev. 1/ Corr **.↩︎

  5. D’Angelo P. Testing for Bioequivalence in Higher‐Order Crossover Designs: Two‐at‐a‐Time Principle Versus Pooled ANOVA. 2nd Conference of the Global Bioequivalence Harmonisation Initiative. Rockville, 15–16 September, 2016.↩︎

  6. Schuirmann DJ. A Comparison of the Two One-Sided Tests Procedure and the Power Approach for Assessing the Equivalence of Average Bioavailability. J Pharmacokin Biopharm. 1987; 15(6): 657–80. doi:10.1007/BF01068419.↩︎

  7. If the CI overlaps \(\theta_{1},\theta_{2}\), the outcome is indecisive (called ‘the grey zone’ by some). As long as \(\theta_{0}\) lies within \(\theta_{1},\theta_{2}\) you can hope to show BE in a larger sample size. However, once \(\theta_{0}\) approaches one of the limits, if might be futile – even with a very small CV. Try
      sampleN.TOST(CV = 0.075, theta0 = 0.81) and then
      sampleN.TOST(CV = 0.075, theta0 = 0.80) to get an idea.↩︎

  8. Hoenig JM, Heisey DM. The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis. Am Stat. 2001; 55(1): 19–24.↩︎