The 6 Most Fundamental Functions for Data Cleaning with R

In this blog post, I’m explaing the most fundamental building blocks for data cleaning in the Tidyverse.
Author

Albert Rapp

Published

June 16, 2024

In this blog post I’m going to show you the six most fundamental functions for your data cleaning journey. And as always, you can watch the video version of this blog post on YouTube:

Getting started

All of these functions come from the tidyverse. So this is why we are going to load that package.

library(tidyverse)

And of course, we’re going to need some data, and this is where the palmerpenguins package comes in. We can either load the palmerpenguins package like we did with the tidyverse, or we can just access the penguins data set via its full name, namely the palmerpenguins package name followed by :: followed by the data set name.

penguins <- palmerpenguins::penguins
penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>

Count all the things with count()

Once we have that data set, we can pass it to the count() function. Once we do that, we see in the console that we get a count.

penguins |> 
  count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   344

But this output right now isn’t particularly useful because it just shows us how many things are in this data set. You see, count() becomes way more useful when you also add a column name from the dataset inside of count(). That way, we could count how many different species are in the data set.

penguins |> 
  count(species)
## # A tibble: 3 × 2
##   species       n
##   <fct>     <int>
## 1 Adelie      152
## 2 Chinstrap    68
## 3 Gentoo      124

So that’s how we know how many Adelie, Gentoo and Chinstrap penguins are there. Furthermore, we can also sort the rows in this data set by also adding the sort = TRUE argument inside of count(). This is particularly helpful if you have a data set that has a whole bunch of different things inside one column and you want to get the most frequent ones at the top.

penguins |> 
  count(species, sort = TRUE)
## # A tibble: 3 × 2
##   species       n
##   <fct>     <int>
## 1 Adelie      152
## 2 Gentoo      124
## 3 Chinstrap    68

And we could even count combinations of different columns.

penguins |> 
  count(species, island, sort = TRUE)
## # A tibble: 5 × 3
##   species   island        n
##   <fct>     <fct>     <int>
## 1 Gentoo    Biscoe      124
## 2 Chinstrap Dream        68
## 3 Adelie    Dream        56
## 4 Adelie    Torgersen    52
## 5 Adelie    Biscoe       44

Let me note that count() is super useful because obviously you can count things, but also I use it all of the time as intermediate steps when I filter or rearrange some part of the data. This helps me to check that the count of what is now inside of a data set matches what I expect after some data operation.

Get the right columns with select()

Next let us talk about the select() function. This function helps you select columns from your data set. Say, you have a data set with a whole bunch of columns and you want to focus on a subset of that. Then you can just

  • take your data,
  • pass it to the select() function, and
  • list all of the things that you want to select.
penguins |> 
  select(flipper_length_mm, bill_length_mm, species)
## # A tibble: 344 × 3
##    flipper_length_mm bill_length_mm species
##                <int>          <dbl> <fct>  
##  1               181           39.1 Adelie 
##  2               186           39.5 Adelie 
##  3               195           40.3 Adelie 
##  4                NA           NA   Adelie 
##  5               193           36.7 Adelie 
##  6               190           39.3 Adelie 
##  7               181           38.9 Adelie 
##  8               195           39.2 Adelie 
##  9               193           34.1 Adelie 
## 10               190           42   Adelie 
## # ℹ 334 more rows

That’s was pretty easy, right? Here, I kind of want to leave it at this stage. But be aware that there are a whole bunch of advanced tricks that you can use in select() to not have to list all of the things.

Instead, you can just describe the things that you want. For example, you can get all the columns

  • that have a particular text in their name (like _mm in the penguins dataset) or
  • that are of specific type (e.g. numeric)

In these cases, you can use a couple of tricks in combination with select() that make your life much easier. But here I really want to focus on the fundamentals. So let’s ignore those advanced tricks for now.

Reduce the rows to specific observations with filter()

The next essential function is filter(). It is like select() but for rows. Using that function you can filter your data according to specific criteria. That way, you get all of the columns of the data that you pass to filter(), but you get only the rows that match a specific criteria.

For example, we can

  • take our penguins data,
  • pass it to the filter() function, and
  • state that we want to get only the penguins whose bill length is larger than 55 millimeters.
penguins |> 
  filter(bill_length_mm > 55)
## # A tibble: 5 × 8
##   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
## 1 Gentoo    Biscoe           59.6          17                 230        6050
## 2 Gentoo    Biscoe           55.9          17                 228        5600
## 3 Gentoo    Biscoe           55.1          16                 230        5850
## 4 Chinstrap Dream            58            17.8               181        3700
## 5 Chinstrap Dream            55.8          19.8               207        4000
## # ℹ 2 more variables: sex <fct>, year <int>

I’ll explain how to read that code in a second (in case you don’t find it intuitive.) Let me add one more example first, though. For instance, we could also use the between() function in combination with filter() to find the penguins that have a bill length larger than 55 but smaller than 58.

penguins |> 
  filter(between(bill_length_mm, 55, 58))
## # A tibble: 4 × 8
##   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
## 1 Gentoo    Biscoe           55.9          17                 228        5600
## 2 Gentoo    Biscoe           55.1          16                 230        5850
## 3 Chinstrap Dream            58            17.8               181        3700
## 4 Chinstrap Dream            55.8          19.8               207        4000
## # ℹ 2 more variables: sex <fct>, year <int>

Okay, here you might wonder how to understand this thing here. So, let me give you an insight of how filter() function does things.

Basically, whenever you describe a condition inside of filter(), then what happens is that it takes the column that you used and applies the command to get a vector full of TRUE and FALSE. Let’s try this out manually to see what happens. Let’s grab the bill lengths.

bill_length_mm
## Error in eval(expr, envir, enclos): object 'bill_length_mm' not found

As you can see, we can’t just access the data column via its name outside of the filter() function. R simply doesn’t know where this column name is coming from. Inside of the filter() function R knows to look at the data set. But outside of filter() you will have to tell R that this thing comes from our penguins data set. You do that by via the data set name and the $ operator.

penguins$bill_length_mm
##   [1] 39.1 39.5 40.3   NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6
##  [16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5
##  [31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0
##  [46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6
##  [61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5
##  [76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9
##  [91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0 37.7 37.8 37.9
## [106] 39.7 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1
## [121] 36.2 37.7 40.2 41.4 35.2 40.6 38.8 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1
## [136] 41.1 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 36.0 37.8
## [151] 36.0 41.5 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5
## [166] 48.4 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9 46.1 44.5 47.8
## [181] 48.2 50.0 47.3 42.8 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3
## [196] 49.6 50.5 43.6 45.5 50.5 44.9 45.2 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5
## [211] 43.2 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2
## [226] 46.5 46.4 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3
## [241] 47.5 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1 48.5 55.9 47.2
## [256] 49.1 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8
## [271] 47.2   NA 46.8 50.4 45.2 49.9 46.5 50.0 51.3 45.4 52.7 45.2 46.1 51.3 46.0
## [286] 51.3 46.6 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6
## [301] 46.7 52.0 50.5 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9
## [316] 53.5 49.0 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4 45.7 50.7
## [331] 42.5 52.2 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2

So now you see that we get a vector full of the length that we want to look at. Now, what happens inside of filter() is that this vector is applied according to what you specified. In the first example, we just checked is this thing larger than 55. Here’s what that results in.

penguins$bill_length_mm > 55
##   [1] FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

Now you see that what you get is a vector full of TRUEs and FALSEs. And as you can probably guess:

  • TRUE corresponds to the occasions where the bill length was indeed larger than 55, and
  • FALSE corresponds to the cases where this wasn’t the case.

Really, what you put inside of filter() are operations that at the end of the day deliver TRUEs and FALSEs. Then, filter() just takes the rows that correspond to TRUE. You could simulate this by taking these TRUEs and FALSEs that we’ve just captured manually and sticking that into the which() function.

which(penguins$bill_length_mm > 55)
## [1] 186 254 268 294 340

This gives you all of the row numbers for which you had a TRUE, and these are the rows that filter() will return.

Now why did I explain this in so much detail when it was really a simple operation? The thing is, I want you to understand this as things become more complicated. For example, with our second example using the between() function, you might wonder why this was valid syntax. Basically, you have to understand what between() does. So here’s an example.

between(c(54, 55, 56, 57, 58), 55, 57)
## [1] FALSE  TRUE  TRUE  TRUE FALSE

As you can see, between() is a function that does exactly what we just talked about, i.e. it returns TRUEs and FALSEs. So really, this was just a different operation that has some slightly other syntax. But the syntax doesn’t matter to us as long as at the end of the day this operation that we use inside of filter() will return TRUEs and FALSEs so that filter() can do its thing.

Calculating and transforming columns with mutate()

Now let’s compute new things from our data. For example, we can compute a new column that takes the bill lengths that we’ve just looked at and scales them. Scaling is a technique used to transform the values of a variable so that they are centered around zero. This transformation allows you to get a better understanding of whether a value is high or low relative to the other values in the dataset (even if you are not familiar with the specific units of measurement.)

For example, if you scale the bill length variable, a scaled value of

  • 0 would indicate an average bill length,
  • while a positive value would indicate an above-average bill length, and
  • a negative value would indicate a below-average bill length.

And to do this scaling, we need the mutate() function. It allows you to modify existing columns or create new columns in a dataset. Here’s how it works:

  1. Pass your data set to the mutate() function.

  2. Specify the column name and set it equal to the desired transformation.

penguins |> 
  mutate(bill_length_mm = scale(bill_length_mm)[,1]) |> 
  select(bill_length_mm)
## # A tibble: 344 × 1
##    bill_length_mm
##             <dbl>
##  1         -0.883
##  2         -0.810
##  3         -0.663
##  4         NA    
##  5         -1.32 
##  6         -0.847
##  7         -0.920
##  8         -0.865
##  9         -1.80 
## 10         -0.352
## # ℹ 334 more rows

And via the same strategy, we can also create a new column. We just have to assign a new column name.

penguins |> 
  mutate(scaled_bill_length = scale(bill_length_mm)[,1]) |> 
  select(bill_length_mm, scaled_bill_length)
## # A tibble: 344 × 2
##    bill_length_mm scaled_bill_length
##             <dbl>              <dbl>
##  1           39.1             -0.883
##  2           39.5             -0.810
##  3           40.3             -0.663
##  4           NA               NA    
##  5           36.7             -1.32 
##  6           39.3             -0.847
##  7           38.9             -0.920
##  8           39.2             -0.865
##  9           34.1             -1.80 
## 10           42               -0.352
## # ℹ 334 more rows

Behind the scenes, mutate() takes the specified column…

penguins$bill_length_mm
##   [1] 39.1 39.5 40.3   NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6
##  [16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5
##  [31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0
##  [46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6
##  [61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5
##  [76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9
##  [91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0 37.7 37.8 37.9
## [106] 39.7 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1
## [121] 36.2 37.7 40.2 41.4 35.2 40.6 38.8 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1
## [136] 41.1 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 36.0 37.8
## [151] 36.0 41.5 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5
## [166] 48.4 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9 46.1 44.5 47.8
## [181] 48.2 50.0 47.3 42.8 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3
## [196] 49.6 50.5 43.6 45.5 50.5 44.9 45.2 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5
## [211] 43.2 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2
## [226] 46.5 46.4 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3
## [241] 47.5 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1 48.5 55.9 47.2
## [256] 49.1 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8
## [271] 47.2   NA 46.8 50.4 45.2 49.9 46.5 50.0 51.3 45.4 52.7 45.2 46.1 51.3 46.0
## [286] 51.3 46.6 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6
## [301] 46.7 52.0 50.5 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9
## [316] 53.5 49.0 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4 45.7 50.7
## [331] 42.5 52.2 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2

…performs the requested transformation,

scale(penguins$bill_length_mm)[,1]
##   [1] -0.88320467 -0.80993901 -0.66340769          NA -1.32279862 -0.84657184
##   [7] -0.91983750 -0.86488825 -1.79902541 -0.35202864 -1.12131806 -1.12131806
##  [13] -0.51687637 -0.97478674 -1.70744334 -1.34111504 -0.95647033 -0.26044656
##  [19] -1.74407616  0.38062795 -1.12131806 -1.13963448 -1.46932994 -1.04805240
##  [25] -0.93815391 -1.57922843 -0.60845845 -0.62677486 -1.10300165 -0.62677486
##  [31] -0.80993901 -1.23121655 -0.80993901 -0.55350920 -1.37774787 -0.86488825
##  [37] -0.93815391 -0.31539581 -1.15795089 -0.75498976 -1.35943145 -0.57182562
##  [43] -1.45101353  0.03261607 -1.26784938 -0.79162259 -0.51687637 -1.17626731
##  [49] -1.45101353 -0.29707939 -0.79162259 -0.70004052 -1.63417768 -0.35202864
##  [55] -1.72575975 -0.46192713 -0.90152108 -0.60845845 -1.35943145 -1.15795089
##  [61] -1.50596277 -0.48024354 -1.15795089 -0.51687637 -1.37774787 -0.42529430
##  [67] -1.54259560 -0.51687637 -1.46932994 -0.38866147 -1.90892390 -0.77330618
##  [73] -0.79162259  0.34399512 -1.54259560 -0.20549732 -0.55350920 -1.23121655
##  [79] -1.41438070 -0.33371222 -1.70744334 -0.18718091 -1.32279862 -1.61586126
##  [85] -1.21290014 -0.48024354 -1.39606428 -1.28616579 -1.02973599 -0.91983750
##  [91] -1.50596277 -0.51687637 -1.81734182 -0.79162259 -1.41438070 -0.57182562
##  [97] -1.06636882 -0.66340769 -1.98218956 -0.13223166 -1.63417768 -0.53519279
## [103] -1.13963448 -1.12131806 -1.10300165 -0.77330618 -0.97478674 -1.04805240
## [109] -1.06636882 -0.13223166 -1.06636882  0.30736229 -0.77330618 -0.31539581
## [115] -0.79162259 -0.22381374 -0.97478674 -1.21290014 -1.50596277 -0.51687637
## [121] -1.41438070 -1.13963448 -0.68172411 -0.46192713 -1.59754485 -0.60845845
## [127] -0.93815391 -0.44361071 -0.90152108  0.03261607 -0.99310316 -0.15054808
## [133] -1.30448221 -1.17626731 -1.06636882 -0.51687637 -1.52427919 -0.68172411
## [139] -1.26784938 -0.77330618 -0.68172411 -0.60845845 -2.16535371 -0.59014203
## [145] -1.21290014 -0.90152108 -0.86488825 -1.34111504 -1.45101353 -1.12131806
## [151] -1.45101353 -0.44361071  0.39894437  1.11328455  0.87517115  1.11328455
## [157]  0.67369059  0.47221003  0.27072946  0.50884286 -0.11391525  0.52715927
## [163] -0.55350920  0.93012040  0.28904588  0.82022191  0.34399512  0.98506964
## [169] -0.35202864  0.96675323  0.41726078  0.87517115  1.14991738  0.21578022
## [175]  0.47221003  0.43557720 -0.18718091  0.39894437  0.10588173  0.71032342
## [181]  0.78358908  1.11328455  0.61874135 -0.20549732  0.21578022  2.87166037
## [187]  0.94843681  0.82022191 -0.24213015  0.08756532  0.01429966  0.87517115
## [193] -0.22381374  1.04001889  0.25241305  1.04001889  1.20486662 -0.05896600
## [199]  0.28904588  1.20486662  0.17914739  0.23409663  0.49052644  0.83853832
## [205]  0.21578022  1.13160096  0.47221003  0.19746381 -0.02233317  0.28904588
## [211] -0.13223166  1.18655021  0.25241305  0.41726078  0.32567871  1.90089038
## [217]  0.34399512  1.07665172  0.41726078  1.02170247 -0.07728242  1.24149945
## [223]  0.69200701  0.45389361  0.78358908  0.47221003  0.45389361  0.85685474
## [229]  0.65537418  1.31476511  0.23409663  0.23409663  0.94843681  1.57119492
## [235]  0.63705776  1.11328455  0.17914739  1.25981586 -0.09559883  1.35139794
## [241]  0.65537418  1.49792926  0.65537418  1.51624567  0.28904588  1.02170247
## [247]  0.10588173  1.25981586  1.00338606  0.54547569  0.82022191  1.31476511
## [253]  0.83853832  2.19395302  0.60042493  0.94843681  0.61874135  0.52715927
## [259] -0.40697788  1.73604265 -0.11391525  0.76527266  1.20486662  1.07665172
## [265] -0.07728242  1.38803077  0.41726078  2.04742170  0.10588173  0.89348757
## [271]  0.60042493          NA  0.52715927  1.18655021  0.23409663  1.09496813
## [277]  0.47221003  1.11328455  1.35139794  0.27072946  1.60782775  0.23409663
## [283]  0.39894437  1.35139794  0.38062795  1.35139794  0.49052644  1.42466360
## [289]  0.56379210  1.47961284  0.36231154  1.20486662  1.16823379  2.57859773
## [295]  0.45389361  0.96675323 -0.27876298  0.83853832 -0.13223166  1.22318303
## [301]  0.50884286  1.47961284  1.20486662  1.02170247  0.45389361  1.62614416
## [307] -0.55350920  1.88257397 -0.26044656  1.29644869  1.05833530  0.65537418
## [313]  0.67369059  1.47961284  0.54547569  1.75435906  0.93012040  0.41726078
## [319]  1.27813228  0.28904588  1.27813228  1.25981586  1.13160096  0.93012040
## [325]  1.38803077  1.07665172  0.76527266  1.36971435  0.32567871  1.24149945
## [331] -0.26044656  1.51624567  0.23409663  0.98506964  1.14991738  0.30736229
## [337]  1.46129643  0.52715927  0.32567871  2.17563660 -0.07728242  1.04001889
## [343]  1.25981586  1.14991738

and then assigns that vector do the desired column name.

Reduce your data to key quantities with summarize()

The summarize() function in R allows you to summarize your data into single values. It works similarly to mutate(), but instead of adding new columns, it reduces the data set to a single row. For example, you can use summarize() to compute the mean of the bill_length_mm and flipper_length_mm columns

penguins |> 
  summarize(
    mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
    mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE)
  )
## # A tibble: 1 × 2
##   mean_bill_length mean_flipper_length
##              <dbl>               <dbl>
## 1             43.9                201.

Note that we use na.rm = TRUE to remove any missing values before computing the means. And if you want to repeat the calculations for different subsets of your data, such as different species of penguins, you can use the .by argument:

penguins |> 
  summarize(
    mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
    mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    .by = species
  )
## # A tibble: 3 × 3
##   species   mean_bill_length mean_flipper_length
##   <fct>                <dbl>               <dbl>
## 1 Adelie                38.8                190.
## 2 Gentoo                47.5                217.
## 3 Chinstrap             48.8                196.

This will repeat the calculations separately for each species of penguin.

Sort your data with arrange()

The arrange() function in R allows you to sort the rows of your dataset based on one or more columns. By default, it sorts in ascending order.

penguins |> 
  summarize(
    mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
    mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    .by = species
  ) |> 
  arrange(mean_flipper_length)
## # A tibble: 3 × 3
##   species   mean_bill_length mean_flipper_length
##   <fct>                <dbl>               <dbl>
## 1 Adelie                38.8                190.
## 2 Chinstrap             48.8                196.
## 3 Gentoo                47.5                217.

But you can use the desc() function to sort in descending order:

penguins |> 
  summarize(
    mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
    mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
    .by = species
  )  |> 
  arrange(desc(mean_flipper_length))
## # A tibble: 3 × 3
##   species   mean_bill_length mean_flipper_length
##   <fct>                <dbl>               <dbl>
## 1 Gentoo                47.5                217.
## 2 Chinstrap             48.8                196.
## 3 Adelie                38.8                190.

Conclusion

In this blog post/video, we covered the 6 most fundamental functions for data cleaning with R. These functions provide a solid foundation for transforming and manipulating data in R.

If you enjoyed this content and want to learn more advanced data cleaning techniques, be sure to check out my Data Cleaning with R Master Class. In this master class, we

  • dive deeper into these functions,
  • learn to clean specific data formats (such as Excel files),
  • handling text data, and
  • deal with date formats.

Be sure to check out the course. And if you found this helpful, here are some other ways I can help you:


Enjoyed this blog post?

Here are three other ways I can help you:

3 Minutes Wednesdays

Every week, I share bite-sized R tips & tricks. Reading time less than 3 minutes. Delivered straight to your inbox. You can sign up for free weekly tips online.

Data Cleaning With R Master Class

This in-depth video course teaches you everything you need to know about becoming better & more efficient at cleaning up messy data. This includes Excel & JSON files, text data and working with times & dates. If you want to get better at data cleaning, check out the course page.

Insightful Data Visualizations for "Uncreative" R Users

This video course teaches you how to leverage {ggplot2} to make charts that communicate effectively without being a design expert. Course information can be found on the course page.