Easy data cleaning with the janitor package

I demonstrate a couple of functions from the janitor package I find quite useful
Author

Albert Rapp

Published

May 26, 2024

The janitor package contains only a little number of functions but nevertheless it is surprisingly convenient to use them. I never really fully appreciated its functionality until I took a look into the documentation. So let’s dive into this package. You can find a video version of this blog post at

INSERT VIDEO HERE

Clean column names

As everyone working with data knows, data sets rarely come in a clean format. Often, the necessary cleaning process already starts with the column names. Here, take this data set from TidyTuesday, week 41.

library(tidyverse) 
nurses <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-05/nurses.csv')
names(nurses)
##  [1] "State"                                          
##  [2] "Year"                                           
##  [3] "Total Employed RN"                              
##  [4] "Employed Standard Error (%)"                    
##  [5] "Hourly Wage Avg"                                
##  [6] "Hourly Wage Median"                             
##  [7] "Annual Salary Avg"                              
##  [8] "Annual Salary Median"                           
##  [9] "Wage/Salary standard error (%)"                 
## [10] "Hourly 10th Percentile"                         
## [11] "Hourly 25th Percentile"                         
## [12] "Hourly 75th Percentile"                         
## [13] "Hourly 90th Percentile"                         
## [14] "Annual 10th Percentile"                         
## [15] "Annual 25th Percentile"                         
## [16] "Annual 75th Percentile"                         
## [17] "Annual 90th Percentile"                         
## [18] "Location Quotient"                              
## [19] "Total Employed (National)_Aggregate"            
## [20] "Total Employed (Healthcare, National)_Aggregate"
## [21] "Total Employed (Healthcare, State)_Aggregate"   
## [22] "Yearly Total Employed (State)_Aggregate"

These column names are intuitively easy to understand but not necessarily easy to process by code. There are too many white spaces and other special characters. Therefore, I accompany most data inputs by clean_names() from the janitor package.

library(janitor)
nurses |> 
  clean_names() |> 
  names()
##  [1] "state"                                       
##  [2] "year"                                        
##  [3] "total_employed_rn"                           
##  [4] "employed_standard_error_percent"             
##  [5] "hourly_wage_avg"                             
##  [6] "hourly_wage_median"                          
##  [7] "annual_salary_avg"                           
##  [8] "annual_salary_median"                        
##  [9] "wage_salary_standard_error_percent"          
## [10] "hourly_10th_percentile"                      
## [11] "hourly_25th_percentile"                      
## [12] "hourly_75th_percentile"                      
## [13] "hourly_90th_percentile"                      
## [14] "annual_10th_percentile"                      
## [15] "annual_25th_percentile"                      
## [16] "annual_75th_percentile"                      
## [17] "annual_90th_percentile"                      
## [18] "location_quotient"                           
## [19] "total_employed_national_aggregate"           
## [20] "total_employed_healthcare_national_aggregate"
## [21] "total_employed_healthcare_state_aggregate"   
## [22] "yearly_total_employed_state_aggregate"

Did you see what happened? White spaces were converted to _ and parentheses were removed. Even the % signs were converted to percent. Now, these labels are easy to understand AND process by code. This does not mean that you are finished cleaning but at least now the columns are more accessible.

Remove empty and or constant columns and rows

Data sets with empty or superfluous rows or columns are not a rare sighting. This is especially true if you work with Excel files because there will be a lot of empty cells. Take a look at the dirty Excel data set from janitor’s GitHub page. It looks like this when you open it with Excel.

Taking a look just at this picture we may notice a couple of things.

  • First, Jason Bourne is teaching at a school. I guess being a trained assassin qualifies him to teach physical education. Also - and this is just a hunch - undercover work likely earned him his “Theater” certification.

  • Second, the header above the actual table will be annoying, so we must skip the first line when we read the data set.

  • Third, the column names are not ideal but we know how to deal with that by now.

  • Fourth, there are empty rows and columns we can get rid of.

  • Fifth, there is a column that contains only ‘YES’. Therefore it contains no information at all and can be removed.

So, let us read and clean the data. The janitor package will help us with remove_empty() and remove_constant().

xl_file <- readxl::read_excel('dirty_data.xlsx', skip = 1) |> 
  clean_names() |>
  remove_empty() |> 
  remove_constant()
xl_file
## # A tibble: 12 × 9
##    first_name   last_name employee_status subject    hire_date percent_allocated
##    <chr>        <chr>     <chr>           <chr>          <dbl>             <dbl>
##  1 Jason        Bourne    Teacher         PE             39690              0.75
##  2 Jason        Bourne    Teacher         Drafting       43479              0.25
##  3 Alicia       Keys      Teacher         Music          37118              1   
##  4 Ada          Lovelace  Teacher         <NA>           38572              1   
##  5 Desus        Nice      Administration  Dean           42791              1   
##  6 Chien-Shiung Wu        Teacher         Physics        11037              0.5 
##  7 Chien-Shiung Wu        Teacher         Chemistry      11037              0.5 
##  8 James        Joyce     Teacher         English        36423              0.5 
##  9 Hedy         Lamarr    Teacher         Science        27919              0.5 
## 10 Carlos       Boozer    Coach           Basketball     42221             NA   
## 11 Young        Boozer    Coach           <NA>           34700             NA   
## 12 Micheal      Larsen    Teacher         English        40071              0.8 
## # ℹ 3 more variables: full_time <chr>, certification_9 <chr>,
## #   certification_10 <chr>

Here, remove_empty() defaulted to remove, both, rows and colums. If we wish, we can change that by setting e.g. which = 'rows'.

Now, we may also want to see the hire_data in a sensible format. For example, in this dirty data set, Jason Bourne was hired on 39690. Luckily, our janitor can make sense of it all.

xl_file |> 
  mutate(hire_date = excel_numeric_to_date(hire_date))
## # A tibble: 12 × 9
##    first_name   last_name employee_status subject   hire_date  percent_allocated
##    <chr>        <chr>     <chr>           <chr>     <date>                 <dbl>
##  1 Jason        Bourne    Teacher         PE        2008-08-30              0.75
##  2 Jason        Bourne    Teacher         Drafting  2019-01-14              0.25
##  3 Alicia       Keys      Teacher         Music     2001-08-15              1   
##  4 Ada          Lovelace  Teacher         <NA>      2005-08-08              1   
##  5 Desus        Nice      Administration  Dean      2017-02-25              1   
##  6 Chien-Shiung Wu        Teacher         Physics   1930-03-20              0.5 
##  7 Chien-Shiung Wu        Teacher         Chemistry 1930-03-20              0.5 
##  8 James        Joyce     Teacher         English   1999-09-20              0.5 
##  9 Hedy         Lamarr    Teacher         Science   1976-06-08              0.5 
## 10 Carlos       Boozer    Coach           Basketba… 2015-08-05             NA   
## 11 Young        Boozer    Coach           <NA>      1995-01-01             NA   
## 12 Micheal      Larsen    Teacher         English   2009-09-15              0.8 
## # ℹ 3 more variables: full_time <chr>, certification_9 <chr>,
## #   certification_10 <chr>

Rounding

To my surprise shock, R uses some unexpected rounding rule. In my world, whenever a number ends in .5, standard rounding would round up. Apparently, R uses something called banker’s rounding that in these cases rounds towards the next even number. Take a look.

round(seq(0.5, 4.5, 1))
## [1] 0 2 2 4 4

I would expect that the rounded vector contains the integers from one to five. Thankfully, janitor offers a convenient rounding function.

round_half_up(seq(0.5, 4.5, 1))
## [1] 1 2 3 4 5

Ok, so that gives us a new function for rounding towards integers. But what is really convenient is that janitor can round_to_fractions.

round_to_fraction(seq(0.5, 2.0, 0.13), denominator = 4)
##  [1] 0.50 0.75 0.75 1.00 1.00 1.25 1.25 1.50 1.50 1.75 1.75 2.00

Here, I rounded the numbers to the next quarters (denominator = 4) but of course any fraction is possible. You can now live the dream of rounding towards arbitrary fractions.

Find matches in multiple characteristics

In my opinion, the get_dupes() function is really powerful. It allows us to find “similar” observations in a data set based on certain characteristics. For example, the starwars data set from dplyr contains a lot of informationon characters from the Star Wars movies. Possibly, we want to find out which characters are similar w.r.t. to certain traits.

starwars |> 
  get_dupes(eye_color, hair_color, skin_color, sex, homeworld) |> 
  select(1:8)
## # A tibble: 6 × 8
##   eye_color hair_color skin_color sex    homeworld dupe_count name        height
##   <chr>     <chr>      <chr>      <chr>  <chr>          <int> <chr>        <int>
## 1 blue      black      yellow     female Mirial             2 Luminara U…    170
## 2 blue      black      yellow     female Mirial             2 Barriss Of…    166
## 3 blue      blond      fair       male   Tatooine           2 Luke Skywa…    172
## 4 blue      blond      fair       male   Tatooine           2 Anakin Sky…    188
## 5 brown     brown      light      female Naboo              2 Padmé Amid…    185
## 6 brown     brown      light      female Naboo              2 Dormé          165

So, Luke and Anakin Skywalker are similar to one another. Who would have thought that. Sadly, I don’t enough about Star Wars to know whether the other matches are similarly “surprising”. In any case, the point here is that we can easily find matches according to arbitrarily many characteristics. Conveniently, these characteristics are the first columns of the new output and we get a dupe_count.

Count all the combinations with tabyl()

I frequently find myself counting how many different things a column in a data set consists of. Let’s look at the mpg data set from the ggplot2 package.

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows

There, we can count how many different car classes are available in the data set.

mpg |> 
  count(class, sort = TRUE)
## # A tibble: 7 × 2
##   class          n
##   <chr>      <int>
## 1 suv           62
## 2 compact       47
## 3 midsize       41
## 4 subcompact    35
## 5 pickup        33
## 6 minivan       11
## 7 2seater        5

But that’s rarely where I stop. Most of the time, I also want to get the percentages.

mpg |> 
  count(class, sort = TRUE) |> 
  mutate(prop = n / sum(n))
## # A tibble: 7 × 3
##   class          n   prop
##   <chr>      <int>  <dbl>
## 1 suv           62 0.265 
## 2 compact       47 0.201 
## 3 midsize       41 0.175 
## 4 subcompact    35 0.150 
## 5 pickup        33 0.141 
## 6 minivan       11 0.0470
## 7 2seater        5 0.0214

With tabyl() from {janitor} there’s a much faster way to do all of that in one go.

mpg |> 
  tabyl(class) 
##       class  n    percent
##     2seater  5 0.02136752
##     compact 47 0.20085470
##     midsize 41 0.17521368
##     minivan 11 0.04700855
##      pickup 33 0.14102564
##  subcompact 35 0.14957265
##         suv 62 0.26495726

The only drawback this has is that we do not get a tibble returned but a data frame. I do like the tibble output much more though. You can enforce that with as_tibble().

mpg |> 
  tabyl(class) |> 
  as_tibble()
## # A tibble: 7 × 3
##   class          n percent
##   <chr>      <int>   <dbl>
## 1 2seater        5  0.0214
## 2 compact       47  0.201 
## 3 midsize       41  0.175 
## 4 minivan       11  0.0470
## 5 pickup        33  0.141 
## 6 subcompact    35  0.150 
## 7 suv           62  0.265

It’s tedious to include that every time you count something though. So if you want to do more tabyl() calls, you might want to override the print() defaults for tabyl objects.

print.tabyl <- function(x, ...) {
  tib <- x |> 
    as_tibble(.name_repair = 'minimal')
  
  if ('n' %in% colnames(tib)) {
    tib |> 
      arrange(desc(n)) |> 
      print()
  } else {
    tib |> 
    janitor::clean_names() |> 
    print()
  }
}
  
mpg |> 
  tabyl(class) 
##       class  n    percent
##     2seater  5 0.02136752
##     compact 47 0.20085470
##     midsize 41 0.17521368
##     minivan 11 0.04700855
##      pickup 33 0.14102564
##  subcompact 35 0.14957265
##         suv 62 0.26495726

Tabyl works with vectors too

The nice thing about tabyl() is that it works the same for vectors.

fruits <- sample(fruit, size = 125, replace = TRUE)
tabyl(fruits)
##             fruits n percent
##            apricot 3   0.024
##            avocado 2   0.016
##             banana 1   0.008
##        bell pepper 1   0.008
##           bilberry 1   0.008
##       blood orange 2   0.016
##          blueberry 2   0.016
##        boysenberry 1   0.008
##         breadfruit 2   0.016
##       canary melon 3   0.024
##         cantaloupe 3   0.024
##          cherimoya 4   0.032
##             cherry 1   0.008
##         clementine 2   0.016
##         cloudberry 1   0.008
##            coconut 2   0.016
##          cranberry 2   0.016
##            currant 2   0.016
##             damson 1   0.008
##               date 2   0.016
##        dragonfruit 2   0.016
##             durian 2   0.016
##           eggplant 2   0.016
##         elderberry 1   0.008
##             feijoa 4   0.032
##         goji berry 4   0.032
##         gooseberry 4   0.032
##              grape 1   0.008
##         grapefruit 3   0.024
##              guava 4   0.032
##           honeydew 1   0.008
##        huckleberry 1   0.008
##          jackfruit 1   0.008
##             jujube 2   0.016
##         kiwi fruit 3   0.024
##            kumquat 1   0.008
##               lime 2   0.016
##             loquat 1   0.008
##             lychee 1   0.008
##          mandarine 1   0.008
##           mulberry 1   0.008
##          nectarine 3   0.024
##                nut 1   0.008
##              olive 1   0.008
##             orange 3   0.024
##             pamelo 1   0.008
##       passionfruit 3   0.024
##              peach 1   0.008
##          persimmon 2   0.016
##           physalis 3   0.024
##          pineapple 1   0.008
##               plum 1   0.008
##        pomegranate 2   0.016
##             pomelo 1   0.008
##  purple mangosteen 3   0.024
##             quince 2   0.016
##             raisin 1   0.008
##           rambutan 1   0.008
##          raspberry 1   0.008
##         rock melon 2   0.016
##        salal berry 3   0.024
##         star fruit 2   0.016
##          tamarillo 3   0.024
##          tangerine 3   0.024
##         ugli fruit 1   0.008

And with two-way & three-way counts too

Even cooler, tabyl() can generate two-way and three-way counts for you. In the case of three-way counts, you will get a list of tibbles.

mpg |> 
  tabyl(class, trans)
##       class auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5)
##     2seater        0        0        1        0        0        0        0
##     compact        2        1        8        4        0        2        2
##     midsize        3        0       14        5        0        1        0
##     minivan        0        1        8        0        2        0        0
##      pickup        0        0       12        8        0        0        0
##  subcompact        0        0       11        4        0        0        0
##         suv        0        0       29       18        4        0        1
##  auto(s6) manual(m5) manual(m6)
##         1          0          3
##         5         18          5
##         6          9          3
##         0          0          0
##         0          8          5
##         1         16          3
##         3          7          0
  
mpg |> 
  tabyl(class, trans, fl)
## $c
##       class auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5)
##     2seater        0        0        0        0        0        0        0
##     compact        0        0        0        0        0        0        0
##     midsize        0        0        0        0        0        0        0
##     minivan        0        0        0        0        0        0        0
##      pickup        0        0        0        0        0        0        0
##  subcompact        0        0        0        1        0        0        0
##         suv        0        0        0        0        0        0        0
##  auto(s6) manual(m5) manual(m6)
##         0          0          0
##         0          0          0
##         0          0          0
##         0          0          0
##         0          0          0
##         0          0          0
##         0          0          0
## 
## $d
##       class auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5)
##     2seater        0        0        0        0        0        0        0
##     compact        0        0        0        0        0        0        0
##     midsize        0        0        0        0        0        0        0
##     minivan        0        0        0        0        0        0        0
##      pickup        0        0        0        0        0        0        0
##  subcompact        0        0        1        0        0        0        0
##         suv        0        0        1        1        0        0        0
##  auto(s6) manual(m5) manual(m6)
##         0          0          0
##         0          1          0
##         0          0          0
##         0          0          0
##         0          0          0
##         0          1          0
##         0          0          0
## 
## $e
##       class auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5)
##     2seater        0        0        0        0        0        0        0
##     compact        0        0        0        0        0        0        0
##     midsize        0        0        0        0        0        0        0
##     minivan        0        0        1        0        0        0        0
##      pickup        0        0        0        2        0        0        0
##  subcompact        0        0        0        0        0        0        0
##         suv        0        0        2        2        0        0        0
##  auto(s6) manual(m5) manual(m6)
##         0          0          0
##         0          0          0
##         0          0          0
##         0          0          0
##         0          0          1
##         0          0          0
##         0          0          0
## 
## $p
##       class auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5)
##     2seater        0        0        1        0        0        0        0
##     compact        2        0        0        4        0        1        0
##     midsize        2        0        1        3        0        1        0
##     minivan        0        0        0        0        0        0        0
##      pickup        0        0        0        0        0        0        0
##  subcompact        0        0        0        0        0        0        0
##         suv        0        0        4        2        0        0        1
##  auto(s6) manual(m5) manual(m6)
##         1          0          3
##         4          5          5
##         4          2          2
##         0          0          0
##         0          0          0
##         0          1          2
##         0          1          0
## 
## $r
##       class auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5)
##     2seater        0        0        0        0        0        0        0
##     compact        0        1        8        0        0        1        2
##     midsize        1        0       13        2        0        0        0
##     minivan        0        1        7        0        2        0        0
##      pickup        0        0       12        6        0        0        0
##  subcompact        0        0       10        3        0        0        0
##         suv        0        0       22       13        4        0        0
##  auto(s6) manual(m5) manual(m6)
##         0          0          0
##         1         12          0
##         2          7          1
##         0          0          0
##         0          8          4
##         1         14          1
##         3          6          0

Style your tabyls

The nice thing with tabyls is that you can customize them through adorn_*() functions. For example, you can add totals or make the percent labels nicer.

mpg |> 
  tabyl(class) |> 
  # Adds a total row
  adorn_totals() |> 
  # Makes the percent columns look nicer
  adorn_pct_formatting()
##       class   n percent
##     2seater   5    2.1%
##     compact  47   20.1%
##     midsize  41   17.5%
##     minivan  11    4.7%
##      pickup  33   14.1%
##  subcompact  35   15.0%
##         suv  62   26.5%
##       Total 234  100.0%

Or you could also use counts and percentages in two-way counts.

mpg |> 
  tabyl(class, trans) |> 
  # Adds a total row
  adorn_totals() |> 
  # Turns counts into percentages
  adorn_percentages() |> 
  # Makes percentages look nice
  adorn_pct_formatting() |> 
  # includes the counts (n) again
  adorn_ns() |> 
  # Adds a title row
  adorn_title()  
##                trans                                                           
##       class auto(av) auto(l3)   auto(l4)   auto(l5)  auto(l6) auto(s4) auto(s5)
##     2seater 0.0% (0) 0.0% (0) 20.0%  (1)  0.0%  (0)  0.0% (0) 0.0% (0) 0.0% (0)
##     compact 4.3% (2) 2.1% (1) 17.0%  (8)  8.5%  (4)  0.0% (0) 4.3% (2) 4.3% (2)
##     midsize 7.3% (3) 0.0% (0) 34.1% (14) 12.2%  (5)  0.0% (0) 2.4% (1) 0.0% (0)
##     minivan 0.0% (0) 9.1% (1) 72.7%  (8)  0.0%  (0) 18.2% (2) 0.0% (0) 0.0% (0)
##      pickup 0.0% (0) 0.0% (0) 36.4% (12) 24.2%  (8)  0.0% (0) 0.0% (0) 0.0% (0)
##  subcompact 0.0% (0) 0.0% (0) 31.4% (11) 11.4%  (4)  0.0% (0) 0.0% (0) 0.0% (0)
##         suv 0.0% (0) 0.0% (0) 46.8% (29) 29.0% (18)  6.5% (4) 0.0% (0) 1.6% (1)
##       Total 2.1% (5) 0.9% (2) 35.5% (83) 16.7% (39)  2.6% (6) 1.3% (3) 1.3% (3)
##                                  
##    auto(s6) manual(m5) manual(m6)
##  20.0%  (1)  0.0%  (0) 60.0%  (3)
##  10.6%  (5) 38.3% (18) 10.6%  (5)
##  14.6%  (6) 22.0%  (9)  7.3%  (3)
##   0.0%  (0)  0.0%  (0)  0.0%  (0)
##   0.0%  (0) 24.2%  (8) 15.2%  (5)
##   2.9%  (1) 45.7% (16)  8.6%  (3)
##   4.8%  (3) 11.3%  (7)  0.0%  (0)
##   6.8% (16) 24.8% (58)  8.1% (19)

Conclusion

Sweeeet! We learned a whole lot of convenient helper functions from the janitor package. I hope you enjoyed this little tutorial. Have a great day and see you next time. And if you found this helpful, here are some other ways I can help you:


Stay in touch

If you enjoyed this post, then you may also like my weekly 3-minute newsletter. Every week, I share insights on data visualization, statistics and Shiny web app development. Reading time: 3 minutes or less. You can check it out via this link.

You can also support my work with a coffee