The janitor package contains only a small number of functions, but it is nevertheless
surprisingly convenient.
I never fully appreciated its functionality until I took a look at the documentation.
Of course, other packages can achieve the same things too, but
janitor makes a lot of tasks easy.
Thus, here is a little showcase.
Clean column names
As everyone working with data knows, data sets rarely come in a clean format. Often, the necessary cleaning process already starts with the column names. Here, take this data set from TidyTuesday, week 41.
These column names are intuitively easy to understand but not necessarily easy to process
by code as there are white spaces and other special characters.
Therefore, I accompany most data input with
clean_names() from the janitor package.
Did you see what happened?
White spaces were converted to
_ and parentheses were removed.
Even the % signs were converted to percent.
Now, these labels are easy to understand AND process by code.
This does not mean that you are finished cleaning but at least now the columns
are more accessible.
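To make this concrete, here is a minimal sketch on a tiny made-up data frame. The column names only mimic the style of messy headers like the ones above and the values are placeholders, so treat it as an illustration rather than the actual TidyTuesday data.

```r
library(janitor)

# Hypothetical data frame: the names imitate messy headers, the values are made up
messy <- data.frame(
  "State" = c("A", "B"),
  "Total Employed RN" = c(100, 200),
  "Employed Standard Error (%)" = c(1.5, 2.5),
  check.names = FALSE  # keep the messy names exactly as written
)

# clean_names() turns the headers into snake_case names
cleaned <- clean_names(messy)
names(cleaned)
```

The resulting names contain only lowercase letters, digits and underscores, which makes them much easier to type in later code.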
Remove empty and/or constant columns and rows
Data sets that come with empty or superfluous rows or columns are not a rare sight. This is especially true if you work with Excel files because there will be a lot of empty cells. Take a look at the dirty Excel data set from janitor’s GitHub page. It looks like this when you open it with Excel.
Taking a look just at this picture we may notice a couple of things.
First, Jason Bourne is teaching at a school. I guess being a trained assassin qualifies him to teach physical education. Also - and this is just a hunch - undercover work likely earned him his “Theater” certification.
Second, the header above the actual table will be annoying, so we must skip the first line when we read the data set.
Third, the column names are not ideal but we know how to deal with that by now.
Fourth, there are empty rows and columns we can get rid of.
Fifth, there is a column that contains only ‘YES’. Therefore it contains no information at all and can be removed.
So, let us read and clean the data.
Here, the janitor package will help us with remove_empty() and remove_constant().
By default, remove_empty() removes both rows and columns.
If we wish, we can change that by setting e.g.
which = 'rows'.
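Since the Excel file itself is not reproduced here, the following sketch uses a small hypothetical data frame that imitates the problems listed above: an empty row, an empty column and a column that only ever says 'YES'. All names and values other than Jason Bourne are made up for illustration.

```r
library(janitor)

# Hypothetical stand-in for the dirty Excel data
dirty <- data.frame(
  name          = c("Jason Bourne", NA, "Jane Doe"),
  certification = c("Theater", NA, "Music"),
  empty_col     = c(NA, NA, NA),    # a completely empty column
  active        = c("YES", NA, "YES") # constant apart from the empty row
)

# Drop rows and columns that contain nothing but NA
no_empty <- remove_empty(dirty, which = c("rows", "cols"))

# Drop columns that hold the same value in every row
cleaned <- remove_constant(no_empty)
cleaned
```

Spelling out the which argument avoids relying on the default, and it makes it easy to switch to 'rows' or 'cols' only.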
Now, we may also want to see the
hire_date in a sensible format.
For example, in this dirty data set, Jason Bourne's hire date does not show up as a readable date.
Luckily, janitor can make sense of it all.
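As a rough sketch of how this works, assume the date arrives as an Excel serial number, which is what often happens when Excel dates are read as plain numbers. janitor's excel_numeric_to_date() converts such a number into a proper Date. The serial number below is a hypothetical example value, not one taken from the data set.

```r
library(janitor)

# Hypothetical Excel serial number (days counted from Excel's origin)
serial <- 36526

hire_date <- excel_numeric_to_date(serial)
format(hire_date)
# [1] "2000-01-01"
```

In a real pipeline the same call would typically be applied to the whole column, for example inside mutate().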
Surprise: R uses a somewhat unexpected rounding rule.
In my world, whenever a number ends in
.5, standard rounding would round up.
Apparently, R uses something called banker’s rounding, which in these cases
rounds to the nearest even number.
Take a look.
I would expect that the rounded vector contains the integers from one to five.
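Here is a small base R sketch of that behavior. The input vector is my own choice for illustration: every value ends in .5 and can be represented exactly as a floating point number.

```r
# Values that all end in .5
x <- seq(0.5, 4.5, by = 1)

rounded <- round(x)
rounded
# [1] 0 2 2 4 4
```

Instead of the integers from one to five, every value lands on the nearest even number.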
janitor offers a convenient rounding function.
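A minimal sketch with janitor's round_half_up(), again on a vector I picked for illustration:

```r
library(janitor)

x <- seq(0.5, 4.5, by = 1)

# Rounds halves away from zero, the way most of us learned it in school
rounded_up <- round_half_up(x)
rounded_up
# [1] 1 2 3 4 5
```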
Ok, so that gives us a new function for rounding towards integers.
But what is really convenient is that janitor can also round to fractions with round_to_fraction().
Here, I rounded the numbers to the nearest quarter (
denominator = 4) but of course
any fraction is possible.
You can now live the dream of rounding towards arbitrary fractions.
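For illustration, here is a short sketch of rounding to quarters with round_to_fraction(). The input values are made up.

```r
library(janitor)

# Hypothetical values to be rounded to the nearest quarter
x <- c(0.1, 0.3, 0.6, 0.9)

quarters <- round_to_fraction(x, denominator = 4)
quarters
# [1] 0.00 0.25 0.50 1.00
```

Changing denominator to, say, 8 would round to eighths instead.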
Find matches in multiple characteristics
In my opinion, the
get_dupes() function is really powerful.
It allows us to find “similar” observations in a data set based on certain characteristics.
For example, the
starwars data set from
dplyr contains a lot of information
on characters from the Star Wars movies.
Possibly, we want to find out which characters are similar w.r.t. certain traits.
So, Luke and Anakin Skywalker are similar to one another.
Who would have thought that.
Sadly, I don’t know enough about Star Wars to judge the other matches.
In any case, the point here is that we can easily find matches according to
arbitrarily many characteristics.
Conveniently, these characteristics are the first columns of the new output and
we also get a dupe_count column.
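A sketch of such a search could look like this. The particular traits (eye color, hair color, skin color, sex and homeworld) are an assumption on my part, chosen only to illustrate the idea.

```r
library(dplyr)
library(janitor)

# Keep only the columns needed for this illustration
traits <- select(starwars, name, eye_color, hair_color, skin_color, sex, homeworld)

# Find characters that agree on all of the listed traits
# (the choice of traits is an assumption made for this sketch)
dupes <- get_dupes(traits, eye_color, hair_color, skin_color, sex, homeworld)
dupes
```

The output lists the chosen traits first, followed by the number of matches per group and the remaining columns.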
Alright, this concludes our little showcase.
In the janitor package, there is another set of tabyl() functions.
These are meant to improve base R’s table() function.
Since I rarely use that function I did not include it, but if you use table() frequently,
then you should definitely check out tabyl().
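If you want a quick first impression, here is a minimal sketch that counts one variable both ways. It uses the built-in mtcars data set, which is my choice for illustration.

```r
library(janitor)

# Base R: a named table of counts
table(mtcars$cyl)

# janitor: a data frame with counts and percentages
cyl_counts <- tabyl(mtcars, cyl)
cyl_counts
```

Because the result of tabyl() is a data frame, it can be passed straight on to the usual data wrangling functions.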