# The 6 Most Fundamental Functions for Data Cleaning with R

In this blog post, I’m explaing the most fundamental building blocks for data cleaning in the Tidyverse.
Author

Albert Rapp

Published

June 16, 2024

In this blog post I’m going to show you the six most fundamental functions for your data cleaning journey. And as always, you can watch the video version of this blog post on YouTube:

## Getting started

All of these functions come from the tidyverse. So this is why we are going to load that package.

``library(tidyverse)``

And of course, we’re going to need some data, and this is where the palmerpenguins package comes in. We can either load the palmerpenguins package like we did with the tidyverse, or we can just access the penguins data set via its full name, namely the palmerpenguins package name followed by `::` followed by the data set name.

``````penguins <- palmerpenguins::penguins
penguins
## # A tibble: 344 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # ℹ 334 more rows
## # ℹ 2 more variables: sex <fct>, year <int>``````

## Count all the things with `count()`

Once we have that data set, we can pass it to the `count()` function. Once we do that, we see in the console that we get a count.

``````penguins |>
count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   344``````

But this output right now isn’t particularly useful because it just shows us how many things are in this data set. You see, `count()` becomes way more useful when you also add a column name from the dataset inside of `count()`. That way, we could count how many different species are in the data set.

``````penguins |>
count(species)
## # A tibble: 3 × 2
##   species       n
##   <fct>     <int>
## 2 Chinstrap    68
## 3 Gentoo      124``````

So that’s how we know how many Adelie, Gentoo and Chinstrap penguins are there. Furthermore, we can also sort the rows in this data set by also adding the `sort = TRUE` argument inside of `count()`. This is particularly helpful if you have a data set that has a whole bunch of different things inside one column and you want to get the most frequent ones at the top.

``````penguins |>
count(species, sort = TRUE)
## # A tibble: 3 × 2
##   species       n
##   <fct>     <int>
## 2 Gentoo      124
## 3 Chinstrap    68``````

And we could even count combinations of different columns.

``````penguins |>
count(species, island, sort = TRUE)
## # A tibble: 5 × 3
##   species   island        n
##   <fct>     <fct>     <int>
## 1 Gentoo    Biscoe      124
## 2 Chinstrap Dream        68

Let me note that `count()` is super useful because obviously you can count things, but also I use it all of the time as intermediate steps when I filter or rearrange some part of the data. This helps me to check that the count of what is now inside of a data set matches what I expect after some data operation.

## Get the right columns with `select()`

Next let us talk about the `select()` function. This function helps you select columns from your data set. Say, you have a data set with a whole bunch of columns and you want to focus on a subset of that. Then you can just

• pass it to the `select()` function, and
• list all of the things that you want to select.
``````penguins |>
select(flipper_length_mm, bill_length_mm, species)
## # A tibble: 344 × 3
##    flipper_length_mm bill_length_mm species
##                <int>          <dbl> <fct>
## # ℹ 334 more rows``````

That’s was pretty easy, right? Here, I kind of want to leave it at this stage. But be aware that there are a whole bunch of advanced tricks that you can use in `select()` to not have to list all of the things.

Instead, you can just describe the things that you want. For example, you can get all the columns

• that have a particular text in their name (like `_mm` in the penguins dataset) or
• that are of specific type (e.g. numeric)

In these cases, you can use a couple of tricks in combination with `select()` that make your life much easier. But here I really want to focus on the fundamentals. So let’s ignore those advanced tricks for now.

## Reduce the rows to specific observations with `filter()`

The next essential function is `filter()`. It is like `select()` but for rows. Using that function you can filter your data according to specific criteria. That way, you get all of the columns of the data that you pass to `filter()`, but you get only the rows that match a specific criteria.

For example, we can

• take our penguins data,
• pass it to the `filter()` function, and
• state that we want to get only the penguins whose bill length is larger than 55 millimeters.
``````penguins |>
filter(bill_length_mm > 55)
## # A tibble: 5 × 8
##   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
## 1 Gentoo    Biscoe           59.6          17                 230        6050
## 2 Gentoo    Biscoe           55.9          17                 228        5600
## 3 Gentoo    Biscoe           55.1          16                 230        5850
## 4 Chinstrap Dream            58            17.8               181        3700
## 5 Chinstrap Dream            55.8          19.8               207        4000
## # ℹ 2 more variables: sex <fct>, year <int>``````

I’ll explain how to read that code in a second (in case you don’t find it intuitive.) Let me add one more example first, though. For instance, we could also use the `between()` function in combination with `filter()` to find the penguins that have a bill length larger than 55 but smaller than 58.

``````penguins |>
filter(between(bill_length_mm, 55, 58))
## # A tibble: 4 × 8
##   species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
## 1 Gentoo    Biscoe           55.9          17                 228        5600
## 2 Gentoo    Biscoe           55.1          16                 230        5850
## 3 Chinstrap Dream            58            17.8               181        3700
## 4 Chinstrap Dream            55.8          19.8               207        4000
## # ℹ 2 more variables: sex <fct>, year <int>``````

Okay, here you might wonder how to understand this thing here. So, let me give you an insight of how `filter()` function does things.

Basically, whenever you describe a condition inside of `filter()`, then what happens is that it takes the column that you used and applies the command to get a vector full of `TRUE` and `FALSE`. Let’s try this out manually to see what happens. Let’s grab the bill lengths.

``````bill_length_mm

As you can see, we can’t just access the data column via its name outside of the `filter()` function. R simply doesn’t know where this column name is coming from. Inside of the `filter()` function R knows to look at the data set. But outside of `filter()` you will have to tell R that this thing comes from our penguins data set. You do that by via the data set name and the `\$` operator.

``````penguins\$bill_length_mm
##   [1] 39.1 39.5 40.3   NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6
##  [16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5
##  [31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0
##  [46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6
##  [61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5
##  [76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9
##  [91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0 37.7 37.8 37.9
## [106] 39.7 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1
## [121] 36.2 37.7 40.2 41.4 35.2 40.6 38.8 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1
## [136] 41.1 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 36.0 37.8
## [151] 36.0 41.5 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5
## [166] 48.4 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9 46.1 44.5 47.8
## [181] 48.2 50.0 47.3 42.8 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3
## [196] 49.6 50.5 43.6 45.5 50.5 44.9 45.2 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5
## [211] 43.2 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2
## [226] 46.5 46.4 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3
## [241] 47.5 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1 48.5 55.9 47.2
## [256] 49.1 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8
## [271] 47.2   NA 46.8 50.4 45.2 49.9 46.5 50.0 51.3 45.4 52.7 45.2 46.1 51.3 46.0
## [286] 51.3 46.6 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6
## [301] 46.7 52.0 50.5 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9
## [316] 53.5 49.0 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4 45.7 50.7
## [331] 42.5 52.2 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2``````

So now you see that we get a vector full of the length that we want to look at. Now, what happens inside of `filter()` is that this vector is applied according to what you specified. In the first example, we just checked is this thing larger than 55. Here’s what that results in.

``````penguins\$bill_length_mm > 55
##   [1] FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE    NA FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE``````

Now you see that what you get is a vector full of `TRUE`s and `FALSE`s. And as you can probably guess:

• `TRUE` corresponds to the occasions where the bill length was indeed larger than 55, and
• `FALSE` corresponds to the cases where this wasn’t the case.

Really, what you put inside of `filter()` are operations that at the end of the day deliver `TRUE`s and `FALSE`s. Then, `filter()` just takes the rows that correspond to `TRUE`. You could simulate this by taking these `TRUE`s and `FALSE`s that we’ve just captured manually and sticking that into the `which()` function.

``````which(penguins\$bill_length_mm > 55)
## [1] 186 254 268 294 340``````

This gives you all of the row numbers for which you had a `TRUE`, and these are the rows that `filter()` will return.

Now why did I explain this in so much detail when it was really a simple operation? The thing is, I want you to understand this as things become more complicated. For example, with our second example using the `between()` function, you might wonder why this was valid syntax. Basically, you have to understand what `between()` does. So here’s an example.

``````between(c(54, 55, 56, 57, 58), 55, 57)
## [1] FALSE  TRUE  TRUE  TRUE FALSE``````

As you can see, `between()` is a function that does exactly what we just talked about, i.e. it returns `TRUE`s and `FALSE`s. So really, this was just a different operation that has some slightly other syntax. But the syntax doesn’t matter to us as long as at the end of the day this operation that we use inside of `filter()` will return `TRUE`s and `FALSE`s so that `filter()` can do its thing.

## Calculating and transforming columns with `mutate()`

Now let’s compute new things from our data. For example, we can compute a new column that takes the bill lengths that we’ve just looked at and scales them. Scaling is a technique used to transform the values of a variable so that they are centered around zero. This transformation allows you to get a better understanding of whether a value is high or low relative to the other values in the dataset (even if you are not familiar with the specific units of measurement.)

For example, if you scale the bill length variable, a scaled value of

• 0 would indicate an average bill length,
• while a positive value would indicate an above-average bill length, and
• a negative value would indicate a below-average bill length.

And to do this scaling, we need the `mutate()` function. It allows you to modify existing columns or create new columns in a dataset. Here’s how it works:

1. Pass your data set to the `mutate()` function.

2. Specify the column name and set it equal to the desired transformation.

``````penguins |>
mutate(bill_length_mm = scale(bill_length_mm)[,1]) |>
select(bill_length_mm)
## # A tibble: 344 × 1
##    bill_length_mm
##             <dbl>
##  1         -0.883
##  2         -0.810
##  3         -0.663
##  4         NA
##  5         -1.32
##  6         -0.847
##  7         -0.920
##  8         -0.865
##  9         -1.80
## 10         -0.352
## # ℹ 334 more rows``````

And via the same strategy, we can also create a new column. We just have to assign a new column name.

``````penguins |>
mutate(scaled_bill_length = scale(bill_length_mm)[,1]) |>
select(bill_length_mm, scaled_bill_length)
## # A tibble: 344 × 2
##    bill_length_mm scaled_bill_length
##             <dbl>              <dbl>
##  1           39.1             -0.883
##  2           39.5             -0.810
##  3           40.3             -0.663
##  4           NA               NA
##  5           36.7             -1.32
##  6           39.3             -0.847
##  7           38.9             -0.920
##  8           39.2             -0.865
##  9           34.1             -1.80
## 10           42               -0.352
## # ℹ 334 more rows``````

Behind the scenes, `mutate()` takes the specified column…

``````penguins\$bill_length_mm
##   [1] 39.1 39.5 40.3   NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6
##  [16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5
##  [31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0
##  [46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6
##  [61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5
##  [76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9
##  [91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0 37.7 37.8 37.9
## [106] 39.7 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1
## [121] 36.2 37.7 40.2 41.4 35.2 40.6 38.8 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1
## [136] 41.1 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 36.0 37.8
## [151] 36.0 41.5 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5
## [166] 48.4 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9 46.1 44.5 47.8
## [181] 48.2 50.0 47.3 42.8 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3
## [196] 49.6 50.5 43.6 45.5 50.5 44.9 45.2 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5
## [211] 43.2 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2
## [226] 46.5 46.4 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3
## [241] 47.5 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1 48.5 55.9 47.2
## [256] 49.1 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8
## [271] 47.2   NA 46.8 50.4 45.2 49.9 46.5 50.0 51.3 45.4 52.7 45.2 46.1 51.3 46.0
## [286] 51.3 46.6 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6
## [301] 46.7 52.0 50.5 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9
## [316] 53.5 49.0 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4 45.7 50.7
## [331] 42.5 52.2 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2``````

…performs the requested transformation,

``````scale(penguins\$bill_length_mm)[,1]
##   [1] -0.88320467 -0.80993901 -0.66340769          NA -1.32279862 -0.84657184
##   [7] -0.91983750 -0.86488825 -1.79902541 -0.35202864 -1.12131806 -1.12131806
##  [13] -0.51687637 -0.97478674 -1.70744334 -1.34111504 -0.95647033 -0.26044656
##  [19] -1.74407616  0.38062795 -1.12131806 -1.13963448 -1.46932994 -1.04805240
##  [25] -0.93815391 -1.57922843 -0.60845845 -0.62677486 -1.10300165 -0.62677486
##  [31] -0.80993901 -1.23121655 -0.80993901 -0.55350920 -1.37774787 -0.86488825
##  [37] -0.93815391 -0.31539581 -1.15795089 -0.75498976 -1.35943145 -0.57182562
##  [43] -1.45101353  0.03261607 -1.26784938 -0.79162259 -0.51687637 -1.17626731
##  [49] -1.45101353 -0.29707939 -0.79162259 -0.70004052 -1.63417768 -0.35202864
##  [55] -1.72575975 -0.46192713 -0.90152108 -0.60845845 -1.35943145 -1.15795089
##  [61] -1.50596277 -0.48024354 -1.15795089 -0.51687637 -1.37774787 -0.42529430
##  [67] -1.54259560 -0.51687637 -1.46932994 -0.38866147 -1.90892390 -0.77330618
##  [73] -0.79162259  0.34399512 -1.54259560 -0.20549732 -0.55350920 -1.23121655
##  [79] -1.41438070 -0.33371222 -1.70744334 -0.18718091 -1.32279862 -1.61586126
##  [85] -1.21290014 -0.48024354 -1.39606428 -1.28616579 -1.02973599 -0.91983750
##  [91] -1.50596277 -0.51687637 -1.81734182 -0.79162259 -1.41438070 -0.57182562
##  [97] -1.06636882 -0.66340769 -1.98218956 -0.13223166 -1.63417768 -0.53519279
## [103] -1.13963448 -1.12131806 -1.10300165 -0.77330618 -0.97478674 -1.04805240
## [109] -1.06636882 -0.13223166 -1.06636882  0.30736229 -0.77330618 -0.31539581
## [115] -0.79162259 -0.22381374 -0.97478674 -1.21290014 -1.50596277 -0.51687637
## [121] -1.41438070 -1.13963448 -0.68172411 -0.46192713 -1.59754485 -0.60845845
## [127] -0.93815391 -0.44361071 -0.90152108  0.03261607 -0.99310316 -0.15054808
## [133] -1.30448221 -1.17626731 -1.06636882 -0.51687637 -1.52427919 -0.68172411
## [139] -1.26784938 -0.77330618 -0.68172411 -0.60845845 -2.16535371 -0.59014203
## [145] -1.21290014 -0.90152108 -0.86488825 -1.34111504 -1.45101353 -1.12131806
## [151] -1.45101353 -0.44361071  0.39894437  1.11328455  0.87517115  1.11328455
## [157]  0.67369059  0.47221003  0.27072946  0.50884286 -0.11391525  0.52715927
## [163] -0.55350920  0.93012040  0.28904588  0.82022191  0.34399512  0.98506964
## [169] -0.35202864  0.96675323  0.41726078  0.87517115  1.14991738  0.21578022
## [175]  0.47221003  0.43557720 -0.18718091  0.39894437  0.10588173  0.71032342
## [181]  0.78358908  1.11328455  0.61874135 -0.20549732  0.21578022  2.87166037
## [187]  0.94843681  0.82022191 -0.24213015  0.08756532  0.01429966  0.87517115
## [193] -0.22381374  1.04001889  0.25241305  1.04001889  1.20486662 -0.05896600
## [199]  0.28904588  1.20486662  0.17914739  0.23409663  0.49052644  0.83853832
## [205]  0.21578022  1.13160096  0.47221003  0.19746381 -0.02233317  0.28904588
## [211] -0.13223166  1.18655021  0.25241305  0.41726078  0.32567871  1.90089038
## [217]  0.34399512  1.07665172  0.41726078  1.02170247 -0.07728242  1.24149945
## [223]  0.69200701  0.45389361  0.78358908  0.47221003  0.45389361  0.85685474
## [229]  0.65537418  1.31476511  0.23409663  0.23409663  0.94843681  1.57119492
## [235]  0.63705776  1.11328455  0.17914739  1.25981586 -0.09559883  1.35139794
## [241]  0.65537418  1.49792926  0.65537418  1.51624567  0.28904588  1.02170247
## [247]  0.10588173  1.25981586  1.00338606  0.54547569  0.82022191  1.31476511
## [253]  0.83853832  2.19395302  0.60042493  0.94843681  0.61874135  0.52715927
## [259] -0.40697788  1.73604265 -0.11391525  0.76527266  1.20486662  1.07665172
## [265] -0.07728242  1.38803077  0.41726078  2.04742170  0.10588173  0.89348757
## [271]  0.60042493          NA  0.52715927  1.18655021  0.23409663  1.09496813
## [277]  0.47221003  1.11328455  1.35139794  0.27072946  1.60782775  0.23409663
## [283]  0.39894437  1.35139794  0.38062795  1.35139794  0.49052644  1.42466360
## [289]  0.56379210  1.47961284  0.36231154  1.20486662  1.16823379  2.57859773
## [295]  0.45389361  0.96675323 -0.27876298  0.83853832 -0.13223166  1.22318303
## [301]  0.50884286  1.47961284  1.20486662  1.02170247  0.45389361  1.62614416
## [307] -0.55350920  1.88257397 -0.26044656  1.29644869  1.05833530  0.65537418
## [313]  0.67369059  1.47961284  0.54547569  1.75435906  0.93012040  0.41726078
## [319]  1.27813228  0.28904588  1.27813228  1.25981586  1.13160096  0.93012040
## [325]  1.38803077  1.07665172  0.76527266  1.36971435  0.32567871  1.24149945
## [331] -0.26044656  1.51624567  0.23409663  0.98506964  1.14991738  0.30736229
## [337]  1.46129643  0.52715927  0.32567871  2.17563660 -0.07728242  1.04001889
## [343]  1.25981586  1.14991738``````

and then assigns that vector do the desired column name.

## Reduce your data to key quantities with `summarize()`

The `summarize()` function in R allows you to summarize your data into single values. It works similarly to `mutate()`, but instead of adding new columns, it reduces the data set to a single row. For example, you can use `summarize()` to compute the mean of the `bill_length_mm` and `flipper_length_mm` columns

``````penguins |>
summarize(
mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE)
)
## # A tibble: 1 × 2
##   mean_bill_length mean_flipper_length
##              <dbl>               <dbl>
## 1             43.9                201.``````

Note that we use `na.rm = TRUE` to remove any missing values before computing the means. And if you want to repeat the calculations for different subsets of your data, such as different species of penguins, you can use the `.by` argument:

``````penguins |>
summarize(
mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
.by = species
)
## # A tibble: 3 × 3
##   species   mean_bill_length mean_flipper_length
##   <fct>                <dbl>               <dbl>
## 2 Gentoo                47.5                217.
## 3 Chinstrap             48.8                196.``````

This will repeat the calculations separately for each species of penguin.

## Sort your data with `arrange()`

The `arrange()` function in R allows you to sort the rows of your dataset based on one or more columns. By default, it sorts in ascending order.

``````penguins |>
summarize(
mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
.by = species
) |>
arrange(mean_flipper_length)
## # A tibble: 3 × 3
##   species   mean_bill_length mean_flipper_length
##   <fct>                <dbl>               <dbl>
## 2 Chinstrap             48.8                196.
## 3 Gentoo                47.5                217.``````

But you can use the `desc()` function to sort in descending order:

``````penguins |>
summarize(
mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
.by = species
)  |>
arrange(desc(mean_flipper_length))
## # A tibble: 3 × 3
##   species   mean_bill_length mean_flipper_length
##   <fct>                <dbl>               <dbl>
## 1 Gentoo                47.5                217.
## 2 Chinstrap             48.8                196.

## Conclusion

In this blog post/video, we covered the 6 most fundamental functions for data cleaning with R. These functions provide a solid foundation for transforming and manipulating data in R.

If you enjoyed this content and want to learn more advanced data cleaning techniques, be sure to check out my Data Cleaning with R Master Class. In this master class, we

• dive deeper into these functions,
• learn to clean specific data formats (such as Excel files),
• handling text data, and
• deal with date formats.

Be sure to check out the course. And if you found this helpful, here are some other ways I can help you:

### Enjoyed this blog post?

This video course teaches you how to leverage `{ggplot2}` to make charts that communicate effectively without being a design expert. Course information can be found on the course page.