Replacing missing values in R

Problem

So I’m a regular visitor of the r/rstats subreddit and recently there was a post about replacing missing values using a certain logic.

Specifically here is the problem:

Below is a dataframe where each row represents a city and the idea is to fill the missing unemployment rate. The OP (original poster) wanted to fill in the NA columns based on the mean value of the unemployment rate of the same State and Size. So let’s start with an example dataframe

NOTE: You can find the original code here.

The reason I’m putting all the code together is to show the different ways R can be used to solve the problem and also assess which route is the fastest.

library(dplyr)

unemployment_rate_vals <- c(0.01, 0.17, 0.19, NA, 0.21, 0.14, 0.02, NA, 0.26, 0.27, 
    0.21, 0.28, 0.23, 0.16, 0.1, NA, 0.23, 0.03, 0.11)

state_vals <- c("KC", "WA", "CA", "KC", "WA", "KC", "CA", "CA", "WA", "CA", "KC", 
    "CA", "CA", "KC", "CA", "KC", "KC", "CA", "WA")

size_vals <- c("Big", "Medium", "Big", "Big", "Medium", "Small", "Big", "Medium", 
    "Medium", "Big", "Small", "Medium", "Medium", "Big", "Medium", "Big", "Big", 
    "Big", "Small")

df <- tibble(unemployment_rate = unemployment_rate_vals, state = state_vals, size = size_vals)

df
# A tibble: 19 x 3
   unemployment_rate state size  
               <dbl> <chr> <chr> 
 1              0.01 KC    Big   
 2              0.17 WA    Medium
 3              0.19 CA    Big   
 4             NA    KC    Big   
 5              0.21 WA    Medium
 6              0.14 KC    Small 
 7              0.02 CA    Big   
 8             NA    CA    Medium
 9              0.26 WA    Medium
10              0.27 CA    Big   
11              0.21 KC    Small 
12              0.28 CA    Medium
13              0.23 CA    Medium
14              0.16 KC    Big   
15              0.1  CA    Medium
16             NA    KC    Big   
17              0.23 KC    Big   
18              0.03 CA    Big   
19              0.11 WA    Small 

Solutions

There are 3 ways people have coded the solution and I’ll briefly talk about them

Tidyverse way

The tidyverse syntax is the most intuitive for me so I’ll start with it

f1 <- function(df, state, size, unemployment_rate) {
    state <- enquo(state)
    size <- enquo(size)
    unemployment_rate <- enquo(unemployment_rate)
    
    df %>% group_by(!!state, !!size) %>% mutate(unemployment_rate = ifelse(is.na(!!unemployment_rate), 
        mean(!!unemployment_rate, na.rm = TRUE), !!unemployment_rate))
}
res <- f1(df, state, size, unemployment_rate)
res
# A tibble: 19 x 3
# Groups:   state, size [6]
   unemployment_rate state size  
               <dbl> <chr> <chr> 
 1             0.01  KC    Big   
 2             0.17  WA    Medium
 3             0.19  CA    Big   
 4             0.133 KC    Big   
 5             0.21  WA    Medium
 6             0.14  KC    Small 
 7             0.02  CA    Big   
 8             0.203 CA    Medium
 9             0.26  WA    Medium
10             0.27  CA    Big   
11             0.21  KC    Small 
12             0.28  CA    Medium
13             0.23  CA    Medium
14             0.16  KC    Big   
15             0.1   CA    Medium
16             0.133 KC    Big   
17             0.23  KC    Big   
18             0.03  CA    Big   
19             0.11  WA    Small 

There are some rlang based coding going on due to my desire to use less quotes in function arguments (look into NSE if you’re interested!). Basically, the state, size, and unemployment_rate arguments should be unquoted.

Using plyr

The plyr syntax isn’t bad but not as intuitive compared to the tidyverse way

library(plyr)
f2 <- function(df, state = "state", size = "size") {
    df <- ddply(df, c(state, size), function(a) {
        a$unemployment_rate[is.na(a$unemployment_rate)] <- mean(a$unemployment_rate, 
            na.rm = TRUE)
        a
    })
    df
}
res <- f2(df, "state", "size")
res
   unemployment_rate state   size
1          0.1900000    CA    Big
2          0.0200000    CA    Big
3          0.2700000    CA    Big
4          0.0300000    CA    Big
5          0.2033333    CA Medium
6          0.2800000    CA Medium
7          0.2300000    CA Medium
8          0.1000000    CA Medium
9          0.0100000    KC    Big
10         0.1333333    KC    Big
11         0.1600000    KC    Big
12         0.1333333    KC    Big
13         0.2300000    KC    Big
14         0.1400000    KC  Small
15         0.2100000    KC  Small
16         0.1700000    WA Medium
17         0.2100000    WA Medium
18         0.2600000    WA Medium
19         0.1100000    WA  Small

Also, the order of the original row isn’t preserved

For loop

f3 <- function(df) {
    for (a in unique(df$state)) {
        for (b in unique(df$size)) {
            df$unemployment_rate[df$state == a & df$size == b & is.na(df$unemployment_rate)] <- mean(df$unemployment_rate[df$state == 
                a & df$size == b], na.rm = TRUE)
        }
    }
    df
}
res <- f3(df)
res
# A tibble: 19 x 3
   unemployment_rate state size  
               <dbl> <chr> <chr> 
 1             0.01  KC    Big   
 2             0.17  WA    Medium
 3             0.19  CA    Big   
 4             0.133 KC    Big   
 5             0.21  WA    Medium
 6             0.14  KC    Small 
 7             0.02  CA    Big   
 8             0.203 CA    Medium
 9             0.26  WA    Medium
10             0.27  CA    Big   
11             0.21  KC    Small 
12             0.28  CA    Medium
13             0.23  CA    Medium
14             0.16  KC    Big   
15             0.1   CA    Medium
16             0.133 KC    Big   
17             0.23  KC    Big   
18             0.03  CA    Big   
19             0.11  WA    Small 

The for-loop logic takes the rows where rows where is.na(unemployment_rate) == TRUE are replaced with the mean(..., na.rm=TRUE). It would be interesting to see how fast each of the codes compute.

Benchmark

library("microbenchmark")
microbenchmark(f1(df, state, size, unemployment_rate), f2(df, "state", "size"), f3(df))
Unit: microseconds
                                   expr      min        lq     mean   median
 f1(df, state, size, unemployment_rate)  999.774 1236.9305 1516.320 1370.872
                f2(df, "state", "size") 2829.943 3027.1130 3344.944 3118.802
                                 f3(df)  338.996  393.4035  460.559  417.402
       uq       max neval
 1453.959 14197.972   100
 3245.434 10388.833   100
  437.535  5107.095   100

Increase N

unemployment_rate_vals <- c(0.01, 0.17, 0.19, NA, 0.21, 0.14, 0.02, NA, 0.26, 0.27, 
    0.21, 0.28, 0.23, 0.16, 0.1, NA, 0.23, 0.03, 0.11)

n = 1000
unemployment_rate_vals <- sample(c(NA, round(runif(1, 0.01, 0.99), 2)), n, TRUE)
state_vals <- sample(c("KC", "WA", "CA"), n, replace = TRUE)
size_vals <- sample(c("Big", "Medium", "Small"), n, replace = TRUE)
df <- tibble(unemployment_rate = unemployment_rate_vals, state = state_vals, size = size_vals)
library("microbenchmark")
microbenchmark(f1(df, state, size, unemployment_rate), f2(df, "state", "size"), f3(df))
Unit: milliseconds
                                   expr      min       lq     mean   median
 f1(df, state, size, unemployment_rate) 1.250130 1.542898 1.718041 1.697982
                f2(df, "state", "size") 4.795270 4.919377 5.230602 5.032977
                                 f3(df) 1.131887 1.199447 1.337793 1.243662
       uq       max neval
 1.761445  7.625083   100
 5.147187 11.135417   100
 1.322230  6.956421   100

Conclusion

You’d think that the dplyr way would produce the fastest result but sometimes the grouping under the hood is slower than the nested for loop. For small to medium problems like these, it’s a good idea to use intuitive coding that takes less development time. I’d go with dplyr but if there are instances where short codes like this is used consistently, it might be better to opt for a faster running code. For me, I will think about what’s most important (e.g. production run-time vs. development time) then tailor my analysis accordingly.

Again, I hope you found this post helpful. Thanks for reading!

Avatar
Chong H. Kim
Health Economics & Outcomes Researcher

My research interests include real-world evidence/observation research, cost-effectiveness analysis, predictive modeling, and spatial statistics.

Related