set.seed(28)
pop_dist = c(0.55, 0.3, 0.1, 0.05)
k = length(pop_dist) # number of categories
categories = LETTERS[1:k] # names of the categories
names(pop_dist) = categories
pop_dist
   A    B    C    D
0.55 0.30 0.10 0.05
Data Simulation with Monte Carlo Methods
University of Hohenheim
Introduction & overview
Monte Carlo Simulation?
Proof by simulation — The Central Limit Theorem (CLT)
Errors and power — Torturing the t-test
Misclassification and bias — Messages mismeasured
Outlook: What’s next?
Univariate analysis of the misclassified variable
Misclassified variables as predictors
Misclassification: Measurement error in categorical variables
Units are assigned the “wrong” category.
Important, but often ignored issue in (computational and human) content analysis:
Biased estimates
Reduced power of statistical tests
\[ \Theta_A = \begin{pmatrix} \theta_{1|1} & \theta_{1|2} & \cdots & \theta_{1|k} \\ \theta_{2|1} & \theta_{2|2} & \cdots & \theta_{2|k} \\ \vdots & \vdots & \ddots & \vdots \\ \theta_{k|1} & \theta_{k|2} & \cdots & \theta_{k|k} \end{pmatrix} \]
Each entry \(\theta_{i|j}\) is the probability that category \(i\) is assigned given the true category \(j\).
Columns sum to 1.
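For example, with \(k = 2\) and equal error rates, an accuracy of 0.8 gives \(\Theta = \begin{pmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{pmatrix}\): a true member of category 2 is assigned category 1 with probability 0.2.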
Question: What is the goal of the simulation?
Quantities of interest: What is measured in the simulation?
Evaluation strategy: How are the quantities assessed?
Conditions: Which characteristics of the data generating model will be varied?
Data generating model: How are the data simulated?
How to (somewhat efficiently) implement misclassification?
How to evaluate the deviation of estimates from population values?
How to sample population distributions from a \(\sf{Dirichlet}\) distribution?
How to implement a randomly sampled factor in the simulation?
Sample units from population
Measure variable with misclassification error
n = 1000 # sample size
sample_true = sample(x = categories,
                     size = n,
                     replace = TRUE,
                     prob = pop_dist) %>%
  factor(levels = categories) %>% # if any categories not observed in the sample
  sort() # important later for efficient misclassification
freq_true = sample_true %>%
  table() %>%
  prop.table()
freq_true
.
A B C D
0.563 0.307 0.085 0.045
# Simple: Equal error rates, equal difficulties, one process
accuracy = 0.8
equal_error = (1 - accuracy) / (k - 1)
misclass_prob = matrix(equal_error, nrow = k, ncol = k,
                       dimnames = list(categories, categories))
diag(misclass_prob) = accuracy
misclass_prob
A B C D
A 0.80000000 0.06666667 0.06666667 0.06666667
B 0.06666667 0.80000000 0.06666667 0.06666667
C 0.06666667 0.06666667 0.80000000 0.06666667
D 0.06666667 0.06666667 0.06666667 0.80000000
tic()
sample_obs = table(sample_true) %>% # k samples
  imap(~ sample(categories,
                size = .x, replace = TRUE,
                prob = misclass_prob[, .y])) %>%
  unlist(use.names = FALSE) %>%
  factor(levels = categories)
freq_obs = sample_obs %>%
  table() %>%
  prop.table()
toc()
0.003 sec elapsed
.
A B C D
0.496 0.299 0.128 0.077
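For contrast, a naive per-unit implementation (a sketch, not from the original slides) calls sample() once per unit instead of once per category, which is much slower for large n; drawing on the sorted sample is what makes the k-samples trick above work:
sample_obs_naive = as.character(sample_true) %>%
  map_chr(~ sample(categories, size = 1, prob = misclass_prob[, .x])) %>%
  factor(levels = categories)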
sim_misclass = function(pop_dist = c(0.55, 0.3, 0.1, 0.05),
                        n = 1000,
                        accuracy = 0.8) {
  # Population
  k = length(pop_dist)
  categories = LETTERS[1:k]
  # Sample
  sample_true = sample(x = categories,
                       size = n,
                       replace = TRUE,
                       prob = pop_dist) %>%
    factor(levels = categories) %>%
    sort()
  freq_true = sample_true %>%
    table() %>%
    prop.table()
  # Misclassification
  equal_error = (1 - accuracy) / (k - 1)
  misclass_prob = matrix(equal_error, nrow = k, ncol = k,
                         dimnames = list(categories, categories))
  diag(misclass_prob) = accuracy
  sample_obs = table(sample_true) %>%
    imap(~ sample(categories,
                  size = .x, replace = TRUE,
                  prob = misclass_prob[, .y])) %>%
    unlist(use.names = FALSE) %>%
    factor(levels = categories)
  freq_obs = sample_obs %>%
    table() %>%
    prop.table()
  # Error summary
  rmse = sqrt(mean((pop_dist - freq_obs)^2))
  # Output
  out = lst(freq_true, freq_obs, rmse)
  out
}
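A single call illustrates the output structure (illustrative; the seed is arbitrary):
set.seed(1)
str(sim_misclass(n = 500, accuracy = 0.9))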
pop_dist must be wrapped into a list, because each condition is defined by the whole vector of proportions. n (sample size) is fixed. Between-simulation factor: accuracy.
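A quick demonstration of why the wrapping matters (a sketch, not part of the original code): without list(), expand_grid() would cross each of the four proportions as a separate scalar condition.
expand_grid(pop_dist = c(0.55, 0.3, 0.1, 0.05), accuracy = c(0.4, 1)) # 8 rows: one per single proportion
expand_grid(pop_dist = list(c(0.55, 0.3, 0.1, 0.05)), accuracy = c(0.4, 1)) # 2 rows: one per whole vector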
conditions = expand_grid(
  pop_dist = list(c(0.55, 0.3, 0.1, 0.05)),
  n = 1000,
  accuracy = seq(from = 0.4, to = 1, by = 0.1)
) %>%
  rowid_to_column(var = "condition")
conditions
# A tibble: 7 × 4
condition pop_dist n accuracy
<int> <list> <dbl> <dbl>
1 1 <dbl [4]> 1000 0.4
2 2 <dbl [4]> 1000 0.5
3 3 <dbl [4]> 1000 0.6
4 4 <dbl [4]> 1000 0.7
5 5 <dbl [4]> 1000 0.8
6 6 <dbl [4]> 1000 0.9
7 7 <dbl [4]> 1000 1
sims %>%
  ungroup() %>%
  unnest_wider(res) %>%
  unnest_longer(freq_obs, indices_to = "category") %>%
  group_by(accuracy, category) %>%
  summarise(Q = list(quantile(freq_obs, probs = c(0.25, 0.5, 0.75)))) %>%
  unnest_wider(Q) %>%
  ggplot(aes(`50%`, factor(accuracy),
             xmin = `25%`, xmax = `75%`, color = category)) +
  geom_pointrange() +
  labs(x = str_glue("Median and IQR of the proportions from {i} simulation runs per condition."),
       y = "accuracy")
Bias towards \(1/k\).
The lower the accuracy, the stronger the bias; substantial bias remains even with seemingly sufficient accuracy.
Bias is stronger for categories whose proportions are farther from \(1/k\).
Implies: the more uneven the distribution, the stronger the bias — but how much?
Add varying population distributions - but how?
One possible solution: a factor of several fixed distributions. Alternatively, as done here:
Only accuracy is varied between conditions.
pop_dist is sampled from rdirichlet(alpha = c(1,1,1,1)) for each simulation run.
i (number of simulations) and set.seed() are required for building the conditions.
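What do the Dirichlet draws look like? A minimal sketch (assuming rdirichlet() from the gtools package; extraDistr provides an equivalent function): each draw is one random probability vector over the four categories.
library(gtools)
set.seed(1)
draws = rdirichlet(3, alpha = c(1, 1, 1, 1)) # 3 draws, one per row
rowSums(draws)                               # each row sums to 1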
i = 5000
set.seed(39)
conditions = expand_grid(
  n = 1000,
  accuracy = seq(from = 0.4, to = 1, by = 0.2)
) %>%
  rowid_to_column(var = "condition")
conditions = map_dfr(1:i, ~ conditions) %>%
  rowid_to_column(var = "sim") %>%
  mutate(pop_dist = split(rdirichlet(nrow(.), c(1,1,1,1)), 1:nrow(.)))
conditions
# A tibble: 20,000 × 5
sim condition n accuracy pop_dist
<int> <int> <dbl> <dbl> <named list>
1 1 1 1000 0.4 <dbl [4]>
2 2 2 1000 0.6 <dbl [4]>
3 3 3 1000 0.8 <dbl [4]>
4 4 4 1000 1 <dbl [4]>
5 5 1 1000 0.4 <dbl [4]>
6 6 2 1000 0.6 <dbl [4]>
7 7 3 1000 0.8 <dbl [4]>
8 8 4 1000 1 <dbl [4]>
9 9 1 1000 0.4 <dbl [4]>
10 10 2 1000 0.6 <dbl [4]>
# … with 19,990 more rows
# ℹ Use `print(n = ...)` to see more rows
Deviation of each sampled pop_dist from equal proportions is quantified by a \(\chi^2\) statistic, computed from the proportions rescaled to counts (chisq.test(.x*100) below).
sims %>%
  ungroup() %>%
  unnest_wider(res) %>%
  mutate(chi2_pop_dist = map_dbl(pop_dist, ~ chisq.test(.x*100)$statistic)) %>%
  ggplot(aes(chi2_pop_dist, rmse, color = factor(accuracy))) +
  geom_smooth() +
  xlim(0, 155) +
  labs(x = "chi2_pop_dist",
       color = "accuracy",
       y = "RMSE",
       caption = "method = 'gam' and formula 'y ~ s(x, bs = 'cs')")
sims %>%
  ungroup() %>%
  unnest_wider(res) %>%
  mutate(chi2_pop_dist = map_dbl(pop_dist, ~ chisq.test(.x*100)$statistic)) %>%
  filter(round(chi2_pop_dist) %in% c(0, 25, 50, 75, 100, 125, 150)) %>%
  group_by(round(chi2_pop_dist)) %>%
  slice(1) %>%
  ungroup() %>%
  arrange(chi2_pop_dist) %>%
  select(chi2_pop_dist, pop_dist) %>%
  mutate(pop_dist = map(pop_dist, sort, decreasing = TRUE)) %>%
  unnest_wider(pop_dist) %>%
  kable(digits = 2, col.names = c("chi2_pop_dist", LETTERS[1:4]))
| chi2_pop_dist |    A |    B |    C |    D |
|--------------:|-----:|-----:|-----:|-----:|
|          0.11 | 0.26 | 0.26 | 0.25 | 0.24 |
|         24.58 | 0.40 | 0.34 | 0.14 | 0.12 |
|         50.27 | 0.53 | 0.25 | 0.15 | 0.06 |
|         74.82 | 0.62 | 0.17 | 0.12 | 0.09 |
|         99.68 | 0.68 | 0.16 | 0.10 | 0.06 |
|        124.52 | 0.73 | 0.13 | 0.07 | 0.06 |
|        150.11 | 0.77 | 0.16 | 0.04 | 0.03 |
The deviation of the population distribution from equal proportions is strongly related to error magnitude: more unequal proportions are estimated less accurately.
There seems to be a non-linear interaction between population distribution and accuracy, such that the consequences of lower accuracy are more severe for more unequal distributions.
Assignment of random values at each simulation run as an alternative.
Also an option for accuracy (e.g., runif(n, min = 0.4, max = 1)) or sample size (e.g., sample(x = 300:2000, size = n, replace = TRUE); see Section 4). A minimal sketch follows below.
Most helpful if we want to look at an outcome across the whole range of values; not so helpful if the values are discretized after the simulation.
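The sketch mentioned above, replacing the between-condition factor with per-run draws (the bounds and ranges are the illustrative values from the text, not the original code):
i = 5000
set.seed(39)
conditions = tibble(sim = 1:i) %>%
  mutate(accuracy = runif(i, min = 0.4, max = 1),             # random accuracy per run
         n = sample(x = 300:2000, size = i, replace = TRUE))  # random sample size per run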
Univariate analysis of the misclassified variable
Misclassified variables as predictors
Question: What is the goal of the simulation?
Quantities of interest: What is measured in the simulation?
Observed group differences and their ratios to the true group differences
Statistical power to detect the difference with \(p < .05\) in Welch’s t-test
Evaluation strategy: How are the quantities assessed?
Inspection of the deviation between true and observed differences and their ratios across conditions
Comparison of statistical power to reject the null hypothesis when the alternative hypothesis is true
Conditions: Which characteristics of the data generating model will be varied?
Data generating model: How are the data simulated?
Grouping variable: Random draws from a categorical distribution; subsequent misclassification of the sampled units
Outcome: Random draws from Poisson distributions with different \(\lambda\) values
Sample units from population
Measure variable with misclassification error
Sample outcome from population
n = 1000 # sample size
sample_true = sample(x = categories,
                     size = n,
                     replace = TRUE,
                     prob = pop_dist) %>%
  factor(levels = categories) %>% # if any categories not observed in the sample
  sort() # important later for efficient misclassification
freq_true = sample_true %>%
  table() %>%
  prop.table()
freq_true
.
A B C D
0.568 0.290 0.100 0.042
lambdas = c(21:24)
names(lambdas) = categories
outcomes = table(sample_true) %>%
  imap(~ rpois(n = .x, lambda = lambdas[.y])) %>%
  unlist(use.names = FALSE)
table(sample_true, outcomes)
outcomes
sample_true 8 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
A 1 2 8 4 7 16 24 33 41 36 43 56 50 39 44 59 28 24 17 12 9 6
B 0 0 0 0 7 7 8 12 14 16 22 19 19 30 27 20 19 16 17 8 6 9
C 0 0 0 0 2 3 2 3 6 2 8 6 4 6 7 7 6 5 7 6 10 2
D 0 0 0 0 0 0 1 1 4 1 2 2 2 4 2 4 3 4 4 3 1 1
outcomes
sample_true 31 32 33 34 35 42
A 2 2 2 2 0 1
B 7 0 2 4 1 0
C 3 2 3 0 0 0
D 1 0 1 1 0 0
# Simple: Equal error rates, equal difficulties, one process
accuracy = 0.8
equal_error = (1 - accuracy) / (k - 1)
misclass_prob = matrix(equal_error, nrow = k, ncol = k,
                       dimnames = list(categories, categories))
diag(misclass_prob) = accuracy
misclass_prob
A B C D
A 0.80000000 0.06666667 0.06666667 0.06666667
B 0.06666667 0.80000000 0.06666667 0.06666667
C 0.06666667 0.06666667 0.80000000 0.06666667
D 0.06666667 0.06666667 0.06666667 0.80000000
tic()
sample_obs = table(sample_true) %>% # k samples
  imap(~ sample(categories,
                size = .x, replace = TRUE,
                prob = misclass_prob[, .y])) %>%
  unlist(use.names = FALSE) %>%
  factor(levels = categories)
freq_obs = sample_obs %>%
  table() %>%
  prop.table()
toc()
0.003 sec elapsed
.
A B C D
0.476 0.287 0.142 0.095
sim_misclass_test = function(pop_dist = c(0.55, 0.3, 0.1, 0.05),
                             lambdas = c(3:6),
                             n = 1000,
                             accuracy = 0.8) {
  # Population
  k = length(pop_dist)
  categories = LETTERS[1:k]
  names(lambdas) = categories
  # True sample
  sample_true = sample(x = categories,
                       size = n,
                       replace = TRUE,
                       prob = pop_dist) %>%
    factor(levels = categories) %>%
    sort()
  freq_true = sample_true %>%
    table() %>%
    prop.table()
  outcomes = table(sample_true) %>%
    imap(~ rpois(n = .x, lambda = lambdas[.y])) %>%
    unlist(use.names = FALSE)
  # Misclassification probabilities
  equal_error = (1 - accuracy) / (k - 1)
  misclass_prob = matrix(equal_error, nrow = k, ncol = k,
                         dimnames = list(categories, categories))
  diag(misclass_prob) = accuracy
  # Misclassification
  sample_obs = table(sample_true) %>%
    imap(~ sample(categories,
                  size = .x, replace = TRUE,
                  prob = misclass_prob[, .y])) %>%
    unlist(use.names = FALSE) %>%
    factor(levels = categories)
  freq_obs = sample_obs %>%
    table() %>%
    prop.table()
  # Model
  ttests_obs = categories[2:k] %>%
    map(~ t.test(outcomes[sample_obs == "A"],
                 outcomes[sample_obs == .x])) %>%
    set_names(str_c("A_", categories[2:k]))
  # Error summary for categories
  rmse = sqrt(mean((pop_dist - freq_obs)^2))
  # Output
  out = lst(freq_true, freq_obs, rmse,
            p_obs = ttests_obs %>%
              map_dbl(~ .x$p.value),
            diff_obs = ttests_obs %>%
              map_dbl(~ diff(.x$estimate))
  )
  out
}
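A single call shows the extended output (illustrative; the seed is arbitrary):
set.seed(2)
res = sim_misclass_test()
res$p_obs    # p-values of the Welch tests: A vs. B, A vs. C, A vs. D
res$diff_obs # observed mean differences for the same comparisons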
lambdas is a named list with two sets of values. The group differences are the same in both conditions, but the variances differ strongly (for a Poisson distribution, the variance equals \(\lambda\)); a quick check follows below.
Between-simulation factor: accuracy, varied over 4 values.
pop_dist and n (sample size) are fixed.
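The quick check mentioned above: a Poisson variable's standard deviation is \(\sqrt{\lambda}\), so the two sets imply very different noise levels around identical group differences.
sqrt(1:4)   # SDs in the start_1 condition: 1.00 1.41 1.73 2.00
sqrt(21:24) # SDs in the start_21 condition: 4.58 4.69 4.80 4.90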
conditions = expand_grid(
  pop_dist = list(c(0.55, 0.3, 0.1, 0.05)),
  lambdas = list(start_1 = c(1:4),
                 start_21 = c(21:24)),
  n = 1000,
  accuracy = seq(from = 0.4, to = 1, by = 0.2)
) %>%
  rowid_to_column(var = "condition")
conditions
# A tibble: 8 × 5
condition pop_dist lambdas n accuracy
<int> <list> <named list> <dbl> <dbl>
1 1 <dbl [4]> <int [4]> 1000 0.4
2 2 <dbl [4]> <int [4]> 1000 0.6
3 3 <dbl [4]> <int [4]> 1000 0.8
4 4 <dbl [4]> <int [4]> 1000 1
5 5 <dbl [4]> <int [4]> 1000 0.4
6 6 <dbl [4]> <int [4]> 1000 0.6
7 7 <dbl [4]> <int [4]> 1000 0.8
8 8 <dbl [4]> <int [4]> 1000 1
sims %>%
  ungroup() %>%
  unnest_wider(res) %>%
  unnest_longer(diff_obs, indices_to = "difference") %>%
  mutate(lambdas = names(lambdas)) %>%
  group_by(accuracy, difference, lambdas) %>%
  summarise(Q = list(quantile(diff_obs, probs = c(0.25, 0.5, 0.75)))) %>%
  unnest_wider(Q) %>%
  ggplot(aes(`50%`, factor(accuracy),
             xmin = `25%`, xmax = `75%`, color = difference)) +
  geom_pointrange() +
  facet_wrap(vars(lambdas)) +
  labs(y = "accuracy",
       x = str_glue("Median and IQR of the mean differences from {i} simulation runs per condition."))
sims %>%
  ungroup() %>%
  unnest_wider(res) %>%
  unnest_longer(diff_obs, indices_to = "difference") %>%
  mutate(
    diff_true = case_when(
      difference == "A_B" ~ 1L,
      difference == "A_C" ~ 2L,
      difference == "A_D" ~ 3L
    ),
    recovered = diff_obs / diff_true,
    lambdas = names(lambdas)
  ) %>%
  select(accuracy, difference, lambdas, diff_obs, diff_true, recovered) %>%
  group_by(accuracy, difference, lambdas) %>%
  summarise(Q = list(quantile(recovered, probs = c(0.25, 0.5, 0.75)))) %>%
  unnest_wider(Q) %>%
  ggplot(aes(`50%`, factor(accuracy),
             xmin = `25%`, xmax = `75%`, color = difference)) +
  geom_pointrange() +
  facet_wrap(vars(lambdas)) +
  labs(y = "accuracy",
       x = str_glue("Median and IQR of the ratio of observed to true difference\nfrom {i} simulation runs per condition."))
sims %>%
  ungroup() %>%
  mutate(lambdas = names(lambdas)) %>%
  unnest_wider(res) %>%
  unnest_longer(p_obs, indices_to = "difference") %>%
  group_by(accuracy, difference, lambdas) %>%
  summarise(P_p05 = mean(p_obs < 0.05)) %>%
  ggplot(aes(accuracy, P_p05, color = difference)) +
  geom_point() + geom_line() +
  facet_wrap(vars(lambdas)) +
  labs(y = "Power at p < .05")
Misclassification of the grouping variable attenuates the group differences and reduces statistical power.
Even a seemingly ok-ish accuracy of .8 can, in some instances, reduce the observed differences to 50% of the true difference.
Higher within-group variance obviously reduces statistical power, but it does not seem to be related to the average bias of the observed differences.
Change the first simulation so that the misclassification probabilities allow for different patterns of accuracies and errors: varying difficulties of the categories, systematic confusion of some categories.
Investigate the consequences of misclassification error for an analysis of two binary variables that were both measured with misclassification error.
Manually defined misclassification matrix with varying difficulty and systematic confusion
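One way such a matrix could look (a sketch with illustrative values, not the rendered solution): here C is hard to measure and is systematically confused with A, and D is also harder than A and B.
# Columns = true category, rows = assigned category; columns must sum to 1
misclass_prob = matrix(c(
  0.90, 0.05, 0.25, 0.10,  # assigned A
  0.04, 0.85, 0.05, 0.10,  # assigned B
  0.03, 0.05, 0.60, 0.20,  # assigned C
  0.03, 0.05, 0.10, 0.60), # assigned D
  nrow = 4, byrow = TRUE,
  dimnames = list(LETTERS[1:4], LETTERS[1:4]))
colSums(misclass_prob) # sanity check: all columns sum to 1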
Adapted simulation function
# Misclassification matrix passed as an argument (no default: a self-referencing
# default like misclass_prob = misclass_prob errors when evaluated)
sim_misclass = function(pop_dist = c(0.55, 0.3, 0.1, 0.05),
                        n = 1000,
                        misclass_prob) {
  # Population
  k = length(pop_dist)
  categories = LETTERS[1:k]
  # Sample
  sample_true = sample(x = categories,
                       size = n,
                       replace = TRUE,
                       prob = pop_dist) %>%
    factor(levels = categories) %>%
    sort()
  freq_true = sample_true %>%
    table() %>%
    prop.table()
  # Misclassification
  sample_obs = table(sample_true) %>%
    imap(~ sample(categories,
                  size = .x, replace = TRUE,
                  prob = misclass_prob[, .y])) %>%
    unlist(use.names = FALSE) %>%
    factor(levels = categories)
  freq_obs = sample_obs %>%
    table() %>%
    prop.table()
  # Error summary
  rmse = sqrt(mean((pop_dist - freq_obs)^2))
  # Output
  out = lst(freq_true, freq_obs, rmse)
  out
}
Condition
Run simulation
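The run step is collapsed here; a minimal sketch of one way to produce the sims object the plot below expects (i, the seed, and the rowwise list-column pattern are assumptions mirroring the earlier simulations; misclass_prob is the manually defined matrix):
i = 1000
set.seed(85)
sims = map_dfr(1:i, ~ conditions) %>%
  rowwise() %>%
  mutate(res = list(sim_misclass(pop_dist = unlist(pop_dist),
                                 n = n,
                                 misclass_prob = misclass_prob)))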
Result plot
sims %>%
  ungroup() %>%
  unnest_wider(res) %>%
  unnest_longer(freq_obs, indices_to = "category") %>%
  group_by(category) %>%
  summarise(Q = list(quantile(freq_obs, probs = c(0.25, 0.5, 0.75)))) %>%
  unnest_wider(Q) %>%
  ggplot(aes(`50%`, fct_rev(category),
             xmin = `25%`, xmax = `75%`)) +
  geom_pointrange() +
  geom_vline(xintercept = unlist(conditions$pop_dist), color = "red", linetype = 2) +
  labs(x = str_glue("Median and IQR of the proportions from {i} simulation runs per condition.\nRed lines are shares in the population."),
       y = "Category")
Check out the step-by-step explanation in the rendered solution or the R code.