# Sample Size Calculation with R Extended

# Sample Size Caluclation in R… Extended!

This is an extension from a post I saw here. When thinking about study designs for a trial that we want to conduct for a client, one of the things that we need to think about is how many patients we would need to recruit to see the effect that we propose to see? We’re going to walk through a couple of scenarios and see how we need can use R to get at the sample size & power question when desgining our studies.

# Scenario 1: One Sample Mean T-Test

There aren’t many scenarios where we want to test one population and if they have a significantly differing values of a parameter of interest from a null value. But let’s start with an easy example first.

Let’s say we’re interested determining if the average hospital cost of patients with hemophilia is less than $80,000 a year. We collect trial data. The data is shown below.

```
df_s1 <- data.frame(
cost_hosp = c(70000,80000,2000,13000,140000,13000,24000,44000,87000,30000)
)
# load library to summarize data
library('pacman')
p_load(skimr,dplyr)
df_s1 %>%
skim()
```

Name | Piped data |

Number of rows | 10 |

Number of columns | 1 |

_______________________ | |

Column type frequency: | |

numeric | 1 |

________________________ | |

Group variables | None |

**Variable type: numeric**

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|

cost_hosp | 0 | 1 | 50300 | 43361.66 | 2000 | 15750 | 37000 | 77500 | 140000 | ▇▃▃▂▂ |

Now that we have a sense of the data, we need to extract the effect size in order to estimate the number of patients we need to survey/collect for our study. The equation for effect size is as below

\[ \text{effect size} = \frac{\text{mean}_{\text{sample}} - \text{mean}_{\text{hypothesized}}}{\text{standard deviation}_{\text{sample}}} \] Below is the R code for calculating the effect size

```
effsize <- (mean(df_s1$cost_hosp) - 80000) / sd(df_s1$cost_hosp)
effsize
```

`## [1] -0.6849369`

Now we can use the `{pwr}`

package to estimate the sample size needed. We’re going to assume an of 0.05 and power of 0.80. The hypothesis of interest is whether the estimated cost from our sample is smaller than $80,000.

```
p_load(pwr)
pwr.t.test(d = effsize, sig.level =0.05, power=0.80, type = "one.sample", alternative= "less")
```

```
##
## One-sample t test power calculation
##
## n = 14.63024
## d = -0.6849369
## sig.level = 0.05
## power = 0.8
## alternative = less
```

Given that we see the number going over 14, we need to round up and collect information from 15 patients.

Now let’s say we’re interested in instances where different countries have different mean costs and therefore we’d be interested to see how our sample sizes need to pan out in different countries. See the data below:

```
df_s1b <- data.frame(
cost_hosp_usa = c(70000,80000,2000,13000,140000,13000,24000,44000,87000,30000),
cost_hosp_china = c(20000,24000,2300,15000,40000,6300,9400,10000,27000,23000),
cost_hosp_korea = c(30000,40000,3000,10000,100000,3500,14000,23000,87000,30000)
)
df_s1b %>%
skim()
```

Name | Piped data |

Number of rows | 10 |

Number of columns | 3 |

_______________________ | |

Column type frequency: | |

numeric | 3 |

________________________ | |

Group variables | None |

**Variable type: numeric**

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|

cost_hosp_usa | 0 | 1 | 50300 | 43361.66 | 2000 | 15750 | 37000 | 77500 | 140000 | ▇▃▃▂▂ |

cost_hosp_china | 0 | 1 | 17700 | 11350.18 | 2300 | 9550 | 17500 | 23750 | 40000 | ▇▅▇▂▂ |

cost_hosp_korea | 0 | 1 | 34050 | 33700.02 | 3000 | 11000 | 26500 | 37500 | 100000 | ▇▇▁▁▃ |

Let’s create a effect size df now

```
effsizedf <- data.frame(
es = c((mean(df_s1b$cost_hosp_usa) - 80000) / sd(df_s1b$cost_hosp_usa) , (mean(df_s1b$cost_hosp_china) - 80000) / sd(df_s1b$cost_hosp_china), (mean(df_s1b$cost_hosp_korea) - 80000) / sd(df_s1b$cost_hosp_korea)),
country = c("USA","China","Korea")
)
effsizedf
```

```
## es country
## 1 -0.6849369 USA
## 2 -5.4888980 China
## 3 -1.3635005 Korea
```

Using the `effsizedf`

we’re going to create another `data.frame`

that will give us the estimated sample size to arrive at the effect estimate. We’re going to use`{purrr}`

to apply the `pwr.t.test`

function across the rows of the `effsizedf`

.

```
p_load(purrr)
ss_s1 <- effsizedf %>%
pmap_dfr(function(es, country){
data.frame(
country = country,
es = es,
n = ceiling(pwr.t.test(d = es, sig.level =0.05, power=0.80, type = "one.sample", alternative= "less")$n)
)
})
ss_s1
```

```
## country es n
## 1 USA -0.6849369 15
## 2 China -5.4888980 3
## 3 Korea -1.3635005 5
```

To make it easier on the eyes let’s visualize it using `{ggplot2}`

.

```
p_load(ggplot2)
p<-ggplot(data=ss_s1, aes(x=country, y=n, fill=country, label=n)) +
geom_bar(stat="identity") +
geom_text(nudge_y = 1) +
theme_light() +
theme(legend.position = "none") +
xlab("") + ylab("Number of Patients Needed")
p
```

So there you have it. As the importance of real-world evidence is continuing to grow and emphasis on generating value continues to rise, I hope more people will be able to utilize codes written in R (or python or SAS) when designing real-world studies.

I’ll be editing this post from time to time to add more power & sample size calculations for different tests. Thank you for reading and happy coding!