Replacing missing values in R
Problem
So I’m a regular visitor of the r/rstats
subreddit and recently there was a post about replacing missing values using a certain logic.
Specifically here is the problem:
Below is a dataframe where each row represents a city and the idea is to fill the missing unemployment rate. The OP (original poster) wanted to fill in the NA
columns based on the mean value of the unemployment rate of the same State
and Size
. So let’s start with an example dataframe
NOTE: You can find the original code here.
The reason I’m putting all the code together is to show the different ways R
can be used to solve the problem and also assess which route is the fastest.
library(dplyr)
unemployment_rate_vals <- c(0.01, 0.17, 0.19, NA, 0.21, 0.14, 0.02, NA, 0.26, 0.27,
0.21, 0.28, 0.23, 0.16, 0.1, NA, 0.23, 0.03, 0.11)
state_vals <- c("KC", "WA", "CA", "KC", "WA", "KC", "CA", "CA", "WA", "CA", "KC",
"CA", "CA", "KC", "CA", "KC", "KC", "CA", "WA")
size_vals <- c("Big", "Medium", "Big", "Big", "Medium", "Small", "Big", "Medium",
"Medium", "Big", "Small", "Medium", "Medium", "Big", "Medium", "Big", "Big",
"Big", "Small")
df <- tibble(unemployment_rate = unemployment_rate_vals, state = state_vals, size = size_vals)
df
# A tibble: 19 x 3
unemployment_rate state size
<dbl> <chr> <chr>
1 0.01 KC Big
2 0.17 WA Medium
3 0.19 CA Big
4 NA KC Big
5 0.21 WA Medium
6 0.14 KC Small
7 0.02 CA Big
8 NA CA Medium
9 0.26 WA Medium
10 0.27 CA Big
11 0.21 KC Small
12 0.28 CA Medium
13 0.23 CA Medium
14 0.16 KC Big
15 0.1 CA Medium
16 NA KC Big
17 0.23 KC Big
18 0.03 CA Big
19 0.11 WA Small
Solutions
There are 3 ways people have coded the solution and I’ll briefly talk about them
Tidyverse way
The tidyverse syntax is the most intuitive for me so I’ll start with it
f1 <- function(df, state, size, unemployment_rate) {
state <- enquo(state)
size <- enquo(size)
unemployment_rate <- enquo(unemployment_rate)
df %>% group_by(!!state, !!size) %>% mutate(unemployment_rate = ifelse(is.na(!!unemployment_rate),
mean(!!unemployment_rate, na.rm = TRUE), !!unemployment_rate))
}
res <- f1(df, state, size, unemployment_rate)
res
# A tibble: 19 x 3
# Groups: state, size [6]
unemployment_rate state size
<dbl> <chr> <chr>
1 0.01 KC Big
2 0.17 WA Medium
3 0.19 CA Big
4 0.133 KC Big
5 0.21 WA Medium
6 0.14 KC Small
7 0.02 CA Big
8 0.203 CA Medium
9 0.26 WA Medium
10 0.27 CA Big
11 0.21 KC Small
12 0.28 CA Medium
13 0.23 CA Medium
14 0.16 KC Big
15 0.1 CA Medium
16 0.133 KC Big
17 0.23 KC Big
18 0.03 CA Big
19 0.11 WA Small
There are some rlang
based coding going on due to my desire to use less quotes in function arguments (look into NSE if you’re interested!). Basically, the state
, size
, and unemployment_rate
arguments should be unquoted.
Using plyr
The plyr
syntax isn’t bad but not as intuitive compared to the tidyverse
way
library(plyr)
f2 <- function(df, state = "state", size = "size") {
df <- ddply(df, c(state, size), function(a) {
a$unemployment_rate[is.na(a$unemployment_rate)] <- mean(a$unemployment_rate,
na.rm = TRUE)
a
})
df
}
res <- f2(df, "state", "size")
res
unemployment_rate state size
1 0.1900000 CA Big
2 0.0200000 CA Big
3 0.2700000 CA Big
4 0.0300000 CA Big
5 0.2033333 CA Medium
6 0.2800000 CA Medium
7 0.2300000 CA Medium
8 0.1000000 CA Medium
9 0.0100000 KC Big
10 0.1333333 KC Big
11 0.1600000 KC Big
12 0.1333333 KC Big
13 0.2300000 KC Big
14 0.1400000 KC Small
15 0.2100000 KC Small
16 0.1700000 WA Medium
17 0.2100000 WA Medium
18 0.2600000 WA Medium
19 0.1100000 WA Small
Also, the order of the original row isn’t preserved
For loop
f3 <- function(df) {
for (a in unique(df$state)) {
for (b in unique(df$size)) {
df$unemployment_rate[df$state == a & df$size == b & is.na(df$unemployment_rate)] <- mean(df$unemployment_rate[df$state ==
a & df$size == b], na.rm = TRUE)
}
}
df
}
res <- f3(df)
res
# A tibble: 19 x 3
unemployment_rate state size
<dbl> <chr> <chr>
1 0.01 KC Big
2 0.17 WA Medium
3 0.19 CA Big
4 0.133 KC Big
5 0.21 WA Medium
6 0.14 KC Small
7 0.02 CA Big
8 0.203 CA Medium
9 0.26 WA Medium
10 0.27 CA Big
11 0.21 KC Small
12 0.28 CA Medium
13 0.23 CA Medium
14 0.16 KC Big
15 0.1 CA Medium
16 0.133 KC Big
17 0.23 KC Big
18 0.03 CA Big
19 0.11 WA Small
The for-loop logic takes the rows where rows where is.na(unemployment_rate) == TRUE
are replaced with the mean(..., na.rm=TRUE)
. It would be interesting to see how fast each of the codes compute.
Benchmark
library("microbenchmark")
microbenchmark(f1(df, state, size, unemployment_rate), f2(df, "state", "size"), f3(df))
Unit: microseconds
expr min lq mean median
f1(df, state, size, unemployment_rate) 999.774 1236.9305 1516.320 1370.872
f2(df, "state", "size") 2829.943 3027.1130 3344.944 3118.802
f3(df) 338.996 393.4035 460.559 417.402
uq max neval
1453.959 14197.972 100
3245.434 10388.833 100
437.535 5107.095 100
Increase N
unemployment_rate_vals <- c(0.01, 0.17, 0.19, NA, 0.21, 0.14, 0.02, NA, 0.26, 0.27,
0.21, 0.28, 0.23, 0.16, 0.1, NA, 0.23, 0.03, 0.11)
n = 1000
unemployment_rate_vals <- sample(c(NA, round(runif(1, 0.01, 0.99), 2)), n, TRUE)
state_vals <- sample(c("KC", "WA", "CA"), n, replace = TRUE)
size_vals <- sample(c("Big", "Medium", "Small"), n, replace = TRUE)
df <- tibble(unemployment_rate = unemployment_rate_vals, state = state_vals, size = size_vals)
library("microbenchmark")
microbenchmark(f1(df, state, size, unemployment_rate), f2(df, "state", "size"), f3(df))
Unit: milliseconds
expr min lq mean median
f1(df, state, size, unemployment_rate) 1.250130 1.542898 1.718041 1.697982
f2(df, "state", "size") 4.795270 4.919377 5.230602 5.032977
f3(df) 1.131887 1.199447 1.337793 1.243662
uq max neval
1.761445 7.625083 100
5.147187 11.135417 100
1.322230 6.956421 100
Conclusion
You’d think that the dplyr
way would produce the fastest result but sometimes the grouping under the hood is slower than the nested for loop. For small to medium problems like these, it’s a good idea to use intuitive coding that takes less development time. I’d go with dplyr
but if there are instances where short codes like this is used consistently, it might be better to opt for a faster running code. For me, I will think about what’s most important (e.g. production run-time vs. development time) then tailor my analysis accordingly.
Again, I hope you found this post helpful. Thanks for reading!