4b) Errors and power — Torturing the t-test

Data Simulation with Monte Carlo Methods

Marko Bachl

University of Hohenheim

Errors and power — Torturing the t-test, part II

Agenda

  1. Introduction & overview

  2. Monte Carlo Simulation?

  3. Proof by simulation — The Central Limit Theorem (CLT)

  4. Errors and power — Torturing the t-test

  5. Misclassification and bias — Messages mismeasured

  6. Outlook: What’s next?

4. Errors and power — Torturing the t-test

  1. Comparison of Student’s and Welch’s t-tests

  2. A priori power calculation for Welch’s t-test

Student’s t-test & Welch’s t-test

  • Old school advice: Student’s t-test for equal variances, Welch’s t-test for unequal variances.

    • Higher power of Student’s t-test if assumptions hold.
  • Modern advice: Always use Welch’s t-test.

    • Better if assumptions are violated; not worse if assumptions hold.

    • e.g., Delacre, M., Lakens, D., & Leys, C. (2017). Why Psychologists Should by Default Use Welch’s t-test Instead of Student’s t-test. International Review of Social Psychology, 30(1). https://doi.org/10.5334/irsp.82

  • For those who don’t care about t-tests: Idea also applies to heteroskedasticity-consistent standard errors.
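A minimal sketch of that idea in R (my addition; assumes the sandwich and lmtest packages are installed):

# Robust-by-default inference for an OLS model, analogous to
# preferring Welch's t-test whether or not the variances differ
library(sandwich) # vcovHC(): heteroskedasticity-consistent covariance matrix
library(lmtest)   # coeftest(): coefficient tests with a user-supplied vcov
fit = lm(mpg ~ wt, data = mtcars)
coeftest(fit, vcov = vcovHC(fit, type = "HC3")) # HC3 robust standard errors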

Second simulation study: Power

  1. Question: What is the goal of the simulation?

    • Comparison of Student’s and Welch’s t-tests
  2. Quantities of interest: What is measured in the simulation?

    • Power (\(1 - \beta\)) of the tests
  3. Evaluation strategy: How are the quantities assessed?

    • Comparison of statistical power to reject the null hypothesis when the alternative hypothesis is true
  4. Conditions: Which characteristics of the data generating model will be varied?

    • Sample size, effect size (mean difference)
  5. Data generating model: How are the data simulated?

    • Random draws from normal distributions with different group means and different overall sample sizes, but equal group sizes and standard deviations

Additional practical considerations

  • How to adapt the simulation function to include different group means

  • How to adapt the experimental conditions

    • to include different group means and different total sample sizes

    • to include many different levels of the total sample size

Adapt the simulation function

  • Include new argument M_diff

    • Mean of the treatment condition; equals the mean difference, as the mean of the control group is fixed at 0.

    • Mean difference equals the standardized effect size d if SDR, and consequently both group standard deviations, are 1.
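In symbols, with the control group mean fixed at 0 and both group SDs equal to 1:

\[ d = \frac{\mu_2 - \mu_1}{\sigma} = \frac{M_{\mathrm{diff}} - 0}{1} = M_{\mathrm{diff}} \]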

sim_ttest = function(n = 200, GR = 1, SDR = 1, M_diff = 0) {
  n1 = round(n / (GR + 1))
  n2 = round(n1 * GR)
  sd1 = 1
  sd2 = sd1 * SDR # sd2/sd1
  g1 = rnorm(n = n1, mean = 0, sd = sd1) # control
  g2 = rnorm(n = n2, mean = M_diff, sd = sd2) # treatment
  welch = t.test(g1, g2)$p.value
  student = t.test(g1, g2, var.equal = TRUE)$p.value
  res = list("Welch" = welch, "Student" = student)
  res
}
sim_ttest(n = 100, M_diff = 0.3)
$Welch
[1] 0.03260774

$Student
[1] 0.03260287

Adapt the experimental conditions

  • GR and SDR are now fixed at 1.

  • New between-simulation factors: n and M_diff

conditions = expand_grid(
  GR = 1, 
  SDR = 1,
  n = c(100, 200, 300),
  M_diff = c(0.2, 0.4, 0.6)) %>% 
  rowid_to_column(var = "condition")
conditions
# A tibble: 9 × 5
  condition    GR   SDR     n M_diff
      <int> <dbl> <dbl> <dbl>  <dbl>
1         1     1     1   100    0.2
2         2     1     1   100    0.4
3         3     1     1   100    0.6
4         4     1     1   200    0.2
5         5     1     1   200    0.4
6         6     1     1   200    0.6
7         7     1     1   300    0.2
8         8     1     1   300    0.4
9         9     1     1   300    0.6

Run simulation experiment

set.seed(326)
i = 10000
tic()
sims = map_dfr(1:i, ~ conditions) %>% # each condition i times
  rowid_to_column(var = "sim") %>% # within simulation comparison
  rowwise() %>%
  mutate(p.value = list(sim_ttest(n = n, M_diff = M_diff))) %>% 
  unnest_longer(p.value, indices_to = "method")
toc()
22.864 sec elapsed

Check results: Simulation vs. analytical solution

sims %>% 
  filter(M_diff == 0.4 & method == "Student") %>% 
  group_by(n) %>% 
  summarise(P_p05 = mean(p.value < 0.05)) %>% 
  ungroup() %>% 
  mutate(across(where(is.numeric), ~ round(.x, 3)))
# A tibble: 3 × 2
      n P_p05
  <dbl> <dbl>
1   100 0.505
2   200 0.802
3   300 0.928
power.t.test(n = c(100, 200, 300)/2,
             delta = 0.4, sd = 1,
             sig.level = 0.05)

     Two-sample t test power calculation 

              n = 50, 100, 150
          delta = 0.4
             sd = 1
      sig.level = 0.05
          power = 0.5081451, 0.8036466, 0.9322752
    alternative = two.sided

NOTE: n is number in *each* group
  • Close enough for government work
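How close is close enough? A quick check (my addition): the Monte Carlo standard error of a simulated power estimate is \(\sqrt{p(1-p)/i}\).

# MCSE of the power estimate for n = 100 based on i = 10,000 runs
p_hat = 0.505
i = 10000
sqrt(p_hat * (1 - p_hat) / i) # ~0.005, so 0.505 vs. 0.508 is well within noise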

Results: Power comparison

sims %>% 
  group_by(n, M_diff, method) %>% 
  summarise(P_p05 = mean(p.value < 0.05)) %>% 
  ggplot(aes(factor(n), P_p05, color = method, group = method)) +
  geom_point(position = position_dodge(width = 0.3), size = 3) + 
  geom_line(position = position_dodge(width = 0.3), size = 1.5) +
  facet_wrap(vars(M_diff), labeller = label_both) +
  labs(x = "n", y = "Power")

Summary of the t-test comparison

  • Student’s t-test does not maintain the nominal false positive rate if variances and group sizes are unequal (simulation study 1).

  • Welch’s t-test has similar statistical power as Student’s t-test if the assumption of equal group variances holds (simulation study 2).

  • We should always use Welch’s t-test as a default.

  • For those who don’t use t-tests: We probably should also use heteroskedasticity-consistent standard errors by default in OLS linear models.

Questions?

A priori power calculation for Welch’s t-test

A priori power calculation for Welch’s t-test

  • So far we have learnt that we should prefer Welch’s t-test over Student’s t-test when comparing two group means.

  • We also know that different group sizes and group standard deviations influence statistical power.

  • However, the standard options for a priori power calculation often do not allow for varying group sizes and varying group SDs.

  • This is a good segue to using Monte Carlo simulation experiments for a priori power calculation with a (seemingly) simple example.

Using a model fitted to simulated data as a power calculator

  • We start with a simple example: The relationship between sample size and statistical power.

  • For now, we assume equal group sample sizes and group standard deviations.

  • We also limit the example to one effect size.

Calculator: Adapt the simulation function

  • We no longer need Student’s t-test.
    • The output is simply the p-value of Welch’s t-test.
  • Beyond that, the simulation function stays the same.
sim_ttest = function(n = 200, GR = 1, SDR = 1, M_diff = 0) {
  n1 = round(n / (GR + 1))
  n2 = round(n1 * GR)
  sd1 = 1
  sd2 = sd1 * SDR # sd2/sd1
  g1 = rnorm(n = n1, mean = 0, sd = sd1)
  g2 = rnorm(n = n2, mean = M_diff, sd = sd2)
  res = t.test(g1, g2)$p.value
  return(res)
}
sim_ttest(n = 300, M_diff = 0.4)
[1] 0.1350032

Calculator: Adapt the experimental conditions

Option 1: More fine-grained factor n

conditions = expand_grid(GR = 1, 
                         SDR = 1,
                         n = seq(from = 50, to = 500, by = 10),
                         M_diff = 0.4) %>% 
  rowid_to_column(var = "condition")
i = 250
conditions = map_dfr(1:i, ~ conditions) %>%
  rowid_to_column(var = "sim")
conditions
# A tibble: 11,500 × 6
     sim condition    GR   SDR     n M_diff
   <int>     <int> <dbl> <dbl> <dbl>  <dbl>
 1     1         1     1     1    50    0.4
 2     2         2     1     1    60    0.4
 3     3         3     1     1    70    0.4
 4     4         4     1     1    80    0.4
 5     5         5     1     1    90    0.4
 6     6         6     1     1   100    0.4
 7     7         7     1     1   110    0.4
 8     8         8     1     1   120    0.4
 9     9         9     1     1   130    0.4
10    10        10     1     1   140    0.4
# … with 11,490 more rows

Option 2: Many random draws of n from a distribution

i = 250*46 # same total number of simulations
set.seed(86)
conditions = tibble(sim = 1:i,
                    GR = 1, 
                    SDR = 1,
                    n = sample(x = 50:500, size = i, replace = TRUE),
                    M_diff = 0.4)
conditions
# A tibble: 11,500 × 5
     sim    GR   SDR     n M_diff
   <int> <dbl> <dbl> <int>  <dbl>
 1     1     1     1   244    0.4
 2     2     1     1   462    0.4
 3     3     1     1   306    0.4
 4     4     1     1    78    0.4
 5     5     1     1   285    0.4
 6     6     1     1   396    0.4
 7     7     1     1   419    0.4
 8     8     1     1   392    0.4
 9     9     1     1   226    0.4
10    10     1     1   278    0.4
# … with 11,490 more rows

Calculator: Run simulation

tic()
sims = conditions %>% 
  rowwise() %>%
  mutate(p.value = sim_ttest(n = n, M_diff = M_diff, GR = GR, SDR = SDR))
toc()
1.482 sec elapsed

Calculator: Visual results

sims %>% 
  mutate(p05 = as.integer(p.value < 0.05)) %>% 
  ggplot(aes(n, p05)) +
  geom_smooth(method = "gam", 
              formula = y ~ s(x, bs = "cr"), 
              method.args = list(family = "binomial")) + 
  labs(y = "Power")

Calculator: Predict results for any n

Fit (non-linear) model (e.g., mgcv::gam)

library(mgcv) # provides gam()
power_mod = sims %>% 
  mutate(p05 = as.integer(p.value < 0.05)) %>%
  gam(p05 ~ s(n, bs = "cr"), family = binomial, data = .)

Predict power for any n (within range of simulated values!)

tibble(n = c(120, 256, 404)) %>% 
  mutate(power = predict(power_mod, newdata = .,
                         type = "response"))
# A tibble: 3 × 2
      n     power
  <dbl> <dbl[1d]>
1   120     0.587
2   256     0.884
3   404     0.978
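The fitted curve can also be inverted, e.g., to find the smallest n that reaches a target power (a sketch, my addition; assumes power_mod from above, uniroot() is base R):

# Solve predicted power = 0.9 for n (within the simulated range!)
target = 0.9
f = function(n) {
  as.numeric(predict(power_mod, newdata = tibble(n = n), type = "response")) - target
}
uniroot(f, interval = c(50, 500))$root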

Check against analytical solution

power.t.test(n = c(120, 256, 404)/2,
             delta = 0.4, sd = 1, 
             sig.level = 0.05)

     Two-sample t test power calculation 

              n = 60, 128, 202
          delta = 0.4
             sd = 1
      sig.level = 0.05
          power = 0.5843645, 0.8902603, 0.9798356
    alternative = two.sided

NOTE: n is number in *each* group

Calculator: Summary

  • (Trivial) result: Larger sample size, more power

  • General take-away: Using many values instead of few discrete factor levels can be an interesting alternative

    • if we know (assume) that there is a fixed functional relationship with the outcome
    • if we want to look at an outcome across the whole range of values
    • if we want to make predictions across the range of values
    • if we evaluate models which need more computational resources
    • Not so helpful if discretized after the simulation.

Questions?

A priori power calculation for Welch’s t-test

Getting started: Some preliminary thoughts

  • We know that effect size and total sample size are positively related to statistical power, so we are not that interested in their effects.

  • We know less about the role of differences in group sample sizes and group standard deviations, so this is what we want to investigate in the simulation.

  • We have to think more about how we define the effect size.

    • This is generally one of the harder tasks in a priori power calculation
    • and it becomes more complicated if simple “mean divided by SD” rules of thumb become more ambiguous.

Third simulation study: Power (again)

  1. Question: What is the goal of the simulation?

    • Understand roles of group sample sizes and group standard deviations for statistical power
  2. Quantities of interest: What is measured in the simulation?

    • Power (\(1 - \beta\)) of the designs using Welch’s t-test
  3. Evaluation strategy: How are the quantities assessed?

    • Comparison of statistical power to reject the null hypothesis when the alternative hypothesis is true
  4. Conditions: Which characteristics of the data generating model will be varied?

    • Group sample sizes, group standard deviations
  5. Data generating model: How are the data simulated?

    • Random draws from normal distributions with different group sizes and standard deviations, but fixed group mean differences and overall sample size.

First try: Adapt the simulation function

  • We use the same simulation function as before.
sim_ttest = function(n = 200, GR = 1, SDR = 1, M_diff = 0) {
  n1 = round(n / (GR + 1))
  n2 = round(n1 * GR)
  sd1 = 1
  sd2 = sd1 * SDR # sd2/sd1
  g1 = rnorm(n = n1, mean = 0, sd = sd1)
  g2 = rnorm(n = n2, mean = M_diff, sd = sd2)
  res = t.test(g1, g2)$p.value
  return(res)
}
sim_ttest(n = 300, M_diff = 0.3, GR = 0.5)
[1] 0.1476812

First try: Adapt the experimental conditions

  • Total sample size n = 300 and effect size M_diff = 0.3 are now fixed.

  • New between-simulation factors: GR and SDR

conditions = expand_grid(GR = c(0.5, 1, 2), 
                         SDR = c(0.5, 1, 2),
                         n = 300,
                         M_diff = 0.3) %>% 
  rowid_to_column(var = "condition") %>% 
  mutate(
    n1 = round(n / (GR + 1)),
    n2 = round(n1 * GR),
    sd1 = 1,
    sd2 = sd1 * SDR
    ) 
conditions %>% select(1:5)
# A tibble: 9 × 5
  condition    GR   SDR     n M_diff
      <int> <dbl> <dbl> <dbl>  <dbl>
1         1   0.5   0.5   300    0.3
2         2   0.5   1     300    0.3
3         3   0.5   2     300    0.3
4         4   1     0.5   300    0.3
5         5   1     1     300    0.3
6         6   1     2     300    0.3
7         7   2     0.5   300    0.3
8         8   2     1     300    0.3
9         9   2     2     300    0.3

This has some implications

  • For effect size interpretation:

    • If the SD ratio is not equal to 1, the pooled SD is not equal to 1 and the mean difference is not in units of the pooled SD (as in traditional d).

    • Instead, the mean difference is expressed in units of the control group SD.

    • Sensible definition for intervention trials with a passive control group: Estimate of the population variation without an intervention.

    • Maybe less sensible in other designs with more active controls (e.g., typical media effects experiments) or in observational studies (e.g., gender differences in some outcome).
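To make this concrete, a quick check of the implied overall SD (my addition, using the sub-population aggregation formula from the “Wikipedia to the rescue” slide below):

# Implied overall SD in the GR = 1, SDR = 2 condition
n1 = 150; n2 = 150; sd1 = 1; sd2 = 2; M_diff = 0.3
pop_sd = sqrt((n1 * sd1^2 + n2 * sd2^2) / (n1 + n2) +
                (n1 * n2) / (n1 + n2)^2 * M_diff^2)
pop_sd          # ~1.59
M_diff / pop_sd # ~0.19, much smaller than a nominal d of 0.3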

This has some implications

  • For substantive interpretation:

    • SDR = 0.5 assumes a treatment which not only increases the level of the outcome by M_diff, but also halves the variation in the outcome.

    • SDR = 2 assumes a treatment which not only increases the level of the outcome by M_diff, but also doubles the variation in the outcome.

    • Heterogeneous treatment effects!

First try: Run simulation experiment

set.seed(7913)
i = 10000
tic()
sims = map_dfr(1:i, ~ conditions) %>% # each condition 10,000 times
  rowid_to_column(var = "sim") %>%
  rowwise() %>%
  mutate(p.value = sim_ttest(n = n, M_diff = M_diff, GR = GR, SDR = SDR))
toc()
11.534 sec elapsed

First try: Results

sims %>% 
  group_by(GR, SDR) %>% 
  summarise(P_p05 = mean(p.value < 0.05)) %>% 
  ggplot(aes(factor(GR), P_p05, group = 1)) + 
  geom_point() + geom_line() +
  facet_wrap(vars(SDR), labeller = label_both) +
  labs(x = "Group size ratio", y = "Power")

First try: Results

sims %>% 
  group_by(n, M_diff, GR, SDR, n1, n2, sd1, sd2) %>% 
  summarise(P_p05 = mean(p.value < 0.05), .groups = "drop") %>% 
  mutate(implies = str_glue("n_c = {n1}, n_t = {n2}, sd_c = {round(sd1, 2)}, sd_t = {round(sd2, 2)}")) %>% 
  mutate(implies = fct_reorder(implies, P_p05)) %>% 
  ggplot(aes(P_p05, implies)) + geom_point(size = 3) +
  labs(x = "Power", y = "Condition") +
  theme(axis.text.y = element_text(size = 20))

First try: Summary

  • Larger SD in the treatment group reduces statistical power, smaller SD in the treatment group increases power.

  • Optimal sample size planning: Allocate the larger sample to the group with the larger SD.

  • If group SDs are equal, most power with equal group sample sizes.

Second try: Bring back a population SD of 1

  • We might not be satisfied with the trivial result that more variation decreases power.

  • We might be interested in a more conventional d-style scaling of the effect size.

  • We want a population SD of 1 while retaining the SD ratio design.

Wikipedia to the rescue

Standard deviations of non-overlapping (X ∩ Y = ∅) sub-populations can be aggregated as follows if the size (actual or relative to one another) and means of each are known:

\[ \sigma_{X\cup Y} = \sqrt{ \frac{N_X \sigma_X^2 + N_Y \sigma_Y^2}{N_X + N_Y} + \frac{N_X N_Y}{(N_X+N_Y)^2}(\mu_X - \mu_Y)^2 } \] and

\[ \sigma_Y = SDR * \sigma_X \]
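Substituting \(\sigma_Y = SDR \cdot \sigma_X\) into the first equation and solving for \(\sigma_X\) gives the expression implemented in the simulation function below:

\[ \sigma_X = \sqrt{ \frac{\left(\sigma_{X\cup Y}^2 - \frac{N_X N_Y}{(N_X+N_Y)^2}(\mu_X - \mu_Y)^2\right)(N_X + N_Y)}{N_X + N_Y \cdot SDR^2} } \]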

Second try: Adapt the simulation function

  • We add a new argument, pop_sd.

  • sd1 is computed based on pop_sd, GR (via n1 and n2), SDR, and M_diff.

sim_ttest = function(n = 200, pop_sd = 1, GR = 1, SDR = 1, M_diff = 0) {
  n1 = round(n / (GR + 1))
  n2 = round(n1 * GR)
  sd1 = sqrt(
    ((pop_sd^2 - ((n1 * n2) / (n1 + n2)^2) * M_diff^2) * (n1 + n2)) / 
      (n1 + n2 * SDR^2)
    )
  sd2 = sd1 * SDR # sd2/sd1
  g1 = rnorm(n = n1, mean = 0, sd = sd1)
  g2 = rnorm(n = n2, mean = M_diff, sd = sd2)
  res = t.test(g1, g2)$p.value
  return(res)
}
sim_ttest(n = 300, M_diff = 0.3, GR = 0.5)
[1] 0.01249033
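A quick sanity check (my addition) that the rescaled group SDs aggregate back to an overall SD of 1, here for GR = 0.5 and SDR = 2:

# n = 300, GR = 0.5 implies n1 = 200, n2 = 100
n1 = 200; n2 = 100; M_diff = 0.3; SDR = 2
sd1 = sqrt(((1 - (n1 * n2) / (n1 + n2)^2 * M_diff^2) * (n1 + n2)) /
             (n1 + n2 * SDR^2)) # 0.7
sd2 = sd1 * SDR                 # 1.4
sqrt((n1 * sd1^2 + n2 * sd2^2) / (n1 + n2) +
       (n1 * n2) / (n1 + n2)^2 * M_diff^2) # = 1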

Second try: Adapt the experimental conditions

  • Total sample size n = 300 and effect size M_diff = 0.3 are now fixed.

  • In addition, pop_sd is fixed at 1, returning to a d-style interpretation of the effect size.

  • Between-simulation factors: GR and SDR

conditions = expand_grid(GR = c(0.5, 1, 2), 
                         SDR = c(0.5, 1, 2),
                         pop_sd = 1,
                         n = 300,
                         M_diff = 0.3) %>% 
  rowid_to_column(var = "condition") %>% 
  mutate(
    n1 = round(n / (GR + 1)), n2 = round(n1 * GR),
    sd1 = sqrt(((pop_sd^2 - ((n1 * n2) / (n1 + n2)^2) * M_diff^2) * (n1 + n2)) / (n1 + n2 * SDR^2)),
    sd2 = sd1 * SDR
    ) 
conditions %>% select(1:6)
# A tibble: 9 × 6
  condition    GR   SDR pop_sd     n M_diff
      <int> <dbl> <dbl>  <dbl> <dbl>  <dbl>
1         1   0.5   0.5      1   300    0.3
2         2   0.5   1        1   300    0.3
3         3   0.5   2        1   300    0.3
4         4   1     0.5      1   300    0.3
5         5   1     1        1   300    0.3
6         6   1     2        1   300    0.3
7         7   2     0.5      1   300    0.3
8         8   2     1        1   300    0.3
9         9   2     2        1   300    0.3

Second try: Run simulation experiment

set.seed(39)
i = 10000
tic()
sims = map_dfr(1:i, ~ conditions) %>% # each condition 10,000 times
  rowid_to_column(var = "sim") %>%
  rowwise() %>%
  mutate(p.value = sim_ttest(n = n, M_diff = M_diff, GR = GR, SDR = SDR))
toc()
11.591 sec elapsed

Second try: Results

sims %>% 
  group_by(GR, SDR) %>% 
  summarise(P_p05 = mean(p.value < 0.05)) %>% 
  ggplot(aes(factor(GR), P_p05, group = 1)) + 
  geom_point() + geom_line() +
  facet_wrap(vars(SDR), labeller = label_both) +
  labs(x = "Group size ratio", y = "Power")

Second try: Results

sims %>% 
  group_by(n, M_diff, GR, SDR, n1, n2, sd1, sd2) %>% 
  summarise(P_p05 = mean(p.value < 0.05), .groups = "drop") %>% 
  mutate(implies = str_glue("n_c = {n1}, n_t = {n2}, sd_c = {round(sd1, 2)}, sd_t = {round(sd2, 2)}")) %>% 
  mutate(implies = fct_reorder(implies, P_p05)) %>% 
  ggplot(aes(P_p05, implies)) + geom_point(size = 3) +
  labs(x = "Power", y = "Condition") +
  theme(axis.text.y = element_text(size = 20))

Second try: Summary

  • Fixing the pooled SD to 1 makes the patterns much clearer.

  • If we assume different group SDs, we should allocate more observations to the group with more variation.

  • If we assume similar group SDs, we should plan with similar group sample sizes.

  • For planning experimental designs, this is a decision under uncertainty about treatment effect heterogeneity.

Simulation for a priori power calculation

What’s the power? Don’t expect an easy answer!

  • Simulation is a very flexible and powerful approach to a priori power calculation. We can, in principle, test many different design features and statistical methods.

  • However, a meaningful power calculation (not only, but in particular with simulation methods) requires deep understanding of all involved components.

  • Implementing simulation methods for a priori power calculation from scratch is revealing because it makes obvious just how much we need to know to arrive at a sensible answer.

Questions?

Group exercise 2

Exercise 2: Welch’s t-test for count data

Part 2: Power

  • We have shown in Part 1 of the exercise that the false positive rates of t-tests and Poisson regression are similar even if the true data generating process is drawing from a Poisson distribution.

  • In Part 2, we want to compare the statistical power of the methods across a range of plausible scenarios.

Exercise 2b: Power

  • We want to compare the statistical power of

    • Welch’s t-test and
    • Poisson regression with a dummy predictor
  • if the true data generating process is drawing from Poisson distributions with different rate parameters.

  • If necessary: Adapt the simulation function from Exercise 2a.

  • Choose a sensible set of parameter combinations for the Monte Carlo experiment.

Exercise 2b: Help

Expand if you need help.

Don’t expand if you want to figure it out by yourself.

Scroll down if you need more help on later steps.

Simulation function adapted from Exercise 2a

# We use almost the same simulation function, 
# but we omit Student's t-test, because we already showed its
# problems with unequal group sizes and group SDs
sim_ttest_glm = function(n = 200, GR = 1, lambda1 = 1, M_diff = 0) {
  n1 = round(n / (GR + 1))
  n2 = round(n1 * GR)
  lambda2 = lambda1 + M_diff
  g1 = rpois(n = n1, lambda = lambda1)
  g2 = rpois(n = n2, lambda = lambda2)
  Welch = t.test(g1, g2)$p.value
  GLM = glm(outcome ~ group,
            data = data.frame(outcome = c(g1, g2),
                              group = c(rep("g1", n1), rep("g2", n2))),
            family = poisson)
  res = lst(Welch, GLM = coef(summary(GLM))[2, 4])
  return(res)
}

Conditions I: Simple variation of differences between rate parameters

# It makes sense to vary group size ratio, lambda, and M_diff
# The positive relationship between n and power is trivial
conditions = expand_grid(GR = c(0.5, 1, 2), 
                         n = 200,
                         lambda1 = c(1, 10, 20),
                         M_diff = c(0.3, 0.5, 0.8)) %>% 
  rowid_to_column(var = "condition")

Conditions II: Variation of differences between rate parameters standardized by \(\sqrt{\lambda_1}\)

conditions = expand_grid(GR = c(0.5, 1, 2), 
                         n = 200,
                         lambda1 = c(1, 10, 20),
                         d = c(0.2, 0.4, 0.6)) %>% # standardized mean difference.
  mutate(M_diff = d * sqrt(lambda1)) %>% # absolute mean difference
  rowid_to_column(var = "condition")
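A possible run step before plotting (my sketch; the seed and number of runs are arbitrary):

set.seed(42)
i = 1000
sims = map_dfr(1:i, ~ conditions) %>% # each condition i times
  rowid_to_column(var = "sim") %>%
  rowwise() %>%
  mutate(p.value = list(sim_ttest_glm(n = n, GR = GR, lambda1 = lambda1,
                                      M_diff = M_diff))) %>%
  unnest_longer(p.value, indices_to = "method")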

Result plot: Power at \(p < .05\)

sims %>% 
  group_by(GR, lambda1, M_diff, method) %>% 
  summarise(P_p05 = mean(p.value < 0.05)) %>% 
  ggplot(aes(factor(GR), P_p05, color = method, group = method)) + 
  geom_point() + geom_line() +
  facet_grid(M_diff ~ lambda1, labeller = label_both) +
  labs(x = "Group size ratio", y = "Power")