So, I found a great video from Storytelling with Data (SWD). In this video, a data storyteller demonstrates how a dataviz that does not demonstrate a clear story can be improved. Let’s take a look at the dataviz but, first, here’s the data.
This data set contains a lot of accuracy and error rates from different (anonymous) warehouses.
Additionally, there are “null rates”.
These are likely related to data quality issues.
Furthermore, this data set is apparently taken from a client the data storytellers helped.
In any case, here is a ggplot2 recreation of the client’s initial plot.
Note that the plot does not match exactly but it’s close enough to get the gist.
As it is right now, the plot shows data. But what is the message of this dataviz? To make the message more explicit, the plot is transformed during the course of the video. Take a look at what story the exact same data can tell.
From reading the SWD book, I know that some of the techniques that were used in this picture can be used in many settings. Therefore, I decided to document the steps I took to recreate the dataviz with ggplot.
I tried to make this documentation as accessible as possible. Consequently, if you are already quite familiar with how to customize a ggplot’s details, then some of the explanations or references may be superfluous. Feel free to skip them. That being said, let’s transform the plot.
Flip the axes for long names
Although it is not really an issue here, warehouses or other places might be more identifiable by a (long) name rather than an ID.
To make sure that these names are legible, show them on the y-axis.
When I first learned ggplot, there was the layer coord_flip() to do that job for us.
Nowadays, though, you can often avoid coord_flip() because a lot of geoms already understand what you mean when you map categorical data to the y-aesthetic.
But make sure that ggplot knows that you mean categorical data (especially if the labels are numerical like here).
Notice that I used the group- instead of the fill-aesthetic because I only need grouping.
Also, it is always a good idea to avoid excessive use of colors.
This will allow us to emphasize parts of our story with colors later on.
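As a rough sketch of this step, here is how the mapping could look. The data frame rates is a made-up stand-in (the real warehouse numbers are not reproduced here), so all names and values are assumptions for illustration only.

```r
library(ggplot2)

# Hypothetical stand-in for the warehouse data (values are made up)
rates <- data.frame(
  id   = rep(c(101, 102, 103), each = 3),
  type = rep(c("accuracy", "null", "error"), times = 3),
  rate = c(0.90, 0.04, 0.06,
           0.80, 0.05, 0.15,
           0.85, 0.10, 0.05)
)

# The level order of the grouping variable controls the stacking order
rates$type <- factor(rates$type, levels = c("error", "null", "accuracy"))

p <- ggplot(rates, aes(x = rate, y = factor(id), group = type)) +
  # factor(id) tells ggplot that the numeric IDs are categories;
  # group only splits the bars, so no colors are spent yet
  geom_col(colour = "white")
```

Because a continuous variable sits on x and a categorical one on y, geom_col() draws horizontal bars without any coord_flip().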
Add reference points
Another good idea is to put your data into perspective. To do so, include a reference point. This can be a summary statistic like the average error rate. For more great demonstrations of reference points you can also check out the evolution of a ggplot by Cédric Scherer.
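One way to add such a reference point is to compute the averages and append them as an extra row group. This is only a sketch on made-up data: the data frame rates, the label "ALL" and all values are assumptions.

```r
library(ggplot2)

# Hypothetical stand-in for the warehouse data (values are made up)
rates <- data.frame(
  id   = rep(c("101", "102", "103"), each = 3),
  type = rep(c("accuracy", "null", "error"), times = 3),
  rate = c(0.90, 0.04, 0.06,
           0.80, 0.05, 0.15,
           0.85, 0.10, 0.05)
)

# Average rate per category across all warehouses
avg <- aggregate(rate ~ type, data = rates, FUN = mean)
avg$id <- "ALL"

# Append the averages as an extra "warehouse" that serves as reference point
rates_with_ref <- rbind(rates, avg[, c("id", "type", "rate")])

p <- ggplot(rates_with_ref, aes(x = rate, y = id, group = type)) +
  geom_col(colour = "white")
```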
Order your data
To allow your reader to gain a quick overview, put your data into some form of sensible ordering.
This eases the burden of having to make sense of what the visual shows.
Also, notice that we already did part of that.
See, with the order of the levels in the group aesthetic, we influenced the ordering of the stacked bars.
Here, we made sure that the important quantities start at the left and right edges, respectively.
Why is that helpful, you ask? Well, the bars that start on the left all start at the same reference point. Therefore, comparisons are quite easy for these bars. The same holds true for the right edge. Consequently, it is best that we reserve these VIP seats for the important data. Check out what happens if I were to put the accuracy in the middle.
Now, we can’t really make out which warehouses have a higher accuracy.
Given that the accuracy is likely something we care about, this is bad.
But we can change the order even more.
For instance, we can also order the bars by error rate.
For this, fct_reorder() from the forcats package is our friend.
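Here is a minimal sketch of that reordering on made-up data. The helper column error_rate and all values are assumptions, not the original code.

```r
library(ggplot2)
library(forcats)

# Hypothetical stand-in for the warehouse data (values are made up)
rates <- data.frame(
  id   = rep(c("101", "102", "103"), each = 3),
  type = rep(c("accuracy", "null", "error"), times = 3),
  rate = c(0.90, 0.04, 0.06,
           0.80, 0.05, 0.15,
           0.85, 0.10, 0.05)
)

# Give every row of a warehouse the error rate we want to sort by
errors <- rates[rates$type == "error", ]
rates$error_rate <- errors$rate[match(rates$id, errors$id)]

# Reorder the warehouse IDs by error rate (ascending levels);
# on a discrete y-axis the last level is drawn at the top
rates$id <- fct_reorder(factor(rates$id), rates$error_rate)

p <- ggplot(rates, aes(x = rate, y = id, group = type)) +
  geom_col(colour = "white")
```

With these made-up numbers, warehouse 102 has the highest error rate and therefore ends up at the top of the plot.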
Highlight your story points
Next, it’s time to highlight your story points.
This can be done with the gghighlight package as I have demonstrated in another blog post.
Alternatively, we can set the colors manually.
The latter approach gave me the best results in this case, so we’ll go with that.
But I am still a big fan of gghighlight, so don’t discard its power just yet.
Notice how your eyes are immediately drawn to the intended region.
That’s the power of colors!
Also, note that setting the colors manually like this worked because geom_col() is vectorized.
This is not always the case.
In these instances, you may find that functional programming solves your problem.
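To illustrate the manual approach, here is a sketch that passes one color per data row to geom_col(). The data, the threshold of 0.1 and the color codes are all assumptions, and the vector is assumed to be in the same row order as the data.

```r
library(ggplot2)

# Hypothetical stand-in for the warehouse data (values are made up)
rates <- data.frame(
  id   = rep(c("101", "102", "103"), each = 3),
  type = rep(c("accuracy", "null", "error"), times = 3),
  rate = c(0.90, 0.04, 0.06,
           0.80, 0.05, 0.15,
           0.85, 0.10, 0.05)
)
rates$type <- factor(rates$type, levels = c("error", "null", "accuracy"))

# One color per row: highlight large error rates, mute everything else
bar_colors <- ifelse(
  rates$type == "error" & rates$rate > 0.1,
  "#e36c33",
  "grey80"
)

p <- ggplot(rates, aes(x = rate, y = id, group = type)) +
  # fill is set outside of aes(), so the vector is used as-is
  geom_col(fill = bar_colors, colour = "white")
```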
Remove axes expansion and allow drawing outside of grid
Did you notice that there is still some clutter in the plot?
Removing clutter from a plot is a central element of the SWD look.
Personally, I like this approach.
So, let’s get down to the essentials and remove what does not need to be there.
In this case, there are still (faint) horizontal lines behind each bar.
Furthermore, the default axis expansion causes the warehouse IDs to be slightly removed from the bars.
We change that by formatting the coordinate system with coord_cartesian().
Here, we turned off the expansion to avoid wasting white space.
Now, the IDs are at their designated place and we do not see lines from their names to the bars anymore.
If you want even more control over the space expansion, you can leave expand = T and modify the expansion for each axis with scale_*_continuous() and the expansion() function.
Check out Christian Burkhart’s neat cheatsheet that teaches you everything you need to understand expansions.
On an unrelated note, you may wonder why I set clip = 'off'.
This little secret will be revealed soon.
For now, just know that it allows you to draw geoms outside the regular panel.
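A sketch of both options on made-up data follows. It assumes that the coordinate system in question is coord_cartesian(), and the expansion values are arbitrary examples.

```r
library(ggplot2)

# Hypothetical stand-in for the warehouse data (values are made up)
rates <- data.frame(
  id   = rep(c("101", "102", "103"), each = 3),
  type = rep(c("accuracy", "null", "error"), times = 3),
  rate = c(0.90, 0.04, 0.06,
           0.80, 0.05, 0.15,
           0.85, 0.10, 0.05)
)

base_plot <- ggplot(rates, aes(x = rate, y = id, group = type)) +
  geom_col(colour = "white")

# Option 1: no expansion at all, and geoms may be drawn outside the panel
coord <- coord_cartesian(expand = FALSE, clip = "off")
p <- base_plot + coord

# Option 2: keep the coord's default expansion and tune a single axis,
# here no extra space on the left and 5% on the right of the x-axis
p_alt <- base_plot +
  scale_x_continuous(expand = expansion(mult = c(0, 0.05)))
```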
Move and format axes
You may have noticed that the x-axis in the finished plot is at the top of the panel rather than at the bottom. While that is unusual, it helps the reader to get straight to the point as the data is in view earlier. This assumes that the eyes of a typical dataviz reader will first look at the top left corner and then zigzag downwards.
In ggplot2, moving the axes and setting the break points happens in a scale layer.
It is here where we use the scales::percent() function to transform the axes labels.
Additionally, changing labels happens in labs() and the remaining axes and text changes happen in theme().
Notice that we have customized the theme elements via element_*() functions.
Basically, each element type like “line”, “rect”, “text”, etc. has its own element_*() function, and the theme() function expects attributes to be changed using these.
If you are unfamiliar with this concept, maybe the corresponding part in my YARDS lecture notes will help you.
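Put together, these pieces could look roughly like the sketch below. The data, the axis titles, the break points and the colors are placeholders I made up for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the warehouse data (values are made up)
rates <- data.frame(
  id   = rep(c("101", "102", "103"), each = 3),
  type = rep(c("accuracy", "null", "error"), times = 3),
  rate = c(0.90, 0.04, 0.06,
           0.80, 0.05, 0.15,
           0.85, 0.10, 0.05)
)

p <- ggplot(rates, aes(x = rate, y = id, group = type)) +
  geom_col(colour = "white") +
  # Scale layer: move the axis to the top, set breaks, format as percent
  scale_x_continuous(
    position = "top",
    breaks = seq(0, 1, by = 0.2),
    labels = scales::percent
  ) +
  # Placeholder axis titles
  labs(x = "Rate", y = "Warehouse ID") +
  theme_minimal() +
  # Each theme element is changed through the matching element_*() function
  theme(
    panel.grid = element_blank(),
    axis.line.x = element_line(colour = "grey40"),
    axis.text = element_text(colour = "grey40"),
    axis.title = element_text(colour = "grey40")
  )
```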
Aligning plot elements, e.g. labels, to form clean lines is another major aspect of the SWD look.
Before I read about it, I did not even notice it but once you see it you cannot go back.
Basically, plots feel “more harmonious” if there are clear (not necessarily drawn) lines like with the left and right edge of the stacked bars.
But this concept does not stop with the bars and can be used for the labels too.
Let’s demonstrate that by moving the labels with more theme() adjustments.
Once again, the design enforces that important information like what’s on an axis is in the top left corner.
This was done by changing the hjust and vjust values.
In this case hjust = 0 corresponds to left-justified whereas hjust = 1 corresponds to right-justified.
vjust works similarly.
For more details w.r.t. hjust and vjust, check out this stackoverflow answer that gives you everything that you need in one visual.
For your convenience, here is a slightly changed form of that visual.
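As a small sketch of the justification settings on made-up data (the axis titles are placeholders): the x-axis title is pushed to the left end of its axis, and the y-axis title to the upper end of its axis.

```r
library(ggplot2)

# Hypothetical stand-in for the warehouse data (values are made up)
rates <- data.frame(
  id   = rep(c("101", "102", "103"), each = 3),
  type = rep(c("accuracy", "null", "error"), times = 3),
  rate = c(0.90, 0.04, 0.06,
           0.80, 0.05, 0.15,
           0.85, 0.10, 0.05)
)

p <- ggplot(rates, aes(x = rate, y = id, group = type)) +
  geom_col(colour = "white") +
  labs(x = "Rate", y = "Warehouse ID") +
  theme(
    # hjust = 0: left-justified along the x-axis
    axis.title.x = element_text(hjust = 0),
    # the y-axis title is rotated, so hjust = 1 moves it to the top
    axis.title.y = element_text(hjust = 1)
  )
```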
But once you start aligning the axes titles, you notice that the 0% and 100% labels fall outside the grid.
We could try to set hjust in theme() but sadly this is not vectorized.
All hjust values must be the same.
That’s not bueno.
Therefore, I drew the axes labels manually with annotate() but make sure that you remove the existing axis labels first.
Also, now you know why we had to set clip = 'off' earlier.
The axes labels are outside of the regular panel.
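Here is a sketch of how drawing the two outer axis labels by hand could look. The data, the y-position of 3.7 and the choice to drop the automatic labels via labels = NULL are my assumptions, not necessarily what the original code did.

```r
library(ggplot2)

# Hypothetical stand-in for the warehouse data (values are made up)
rates <- data.frame(
  id   = rep(c("101", "102", "103"), each = 3),
  type = rep(c("accuracy", "null", "error"), times = 3),
  rate = c(0.90, 0.04, 0.06,
           0.80, 0.05, 0.15,
           0.85, 0.10, 0.05)
)

p <- ggplot(rates, aes(x = rate, y = id, group = type)) +
  geom_col(colour = "white") +
  # Drop the automatic x-axis labels; they are drawn by hand below
  scale_x_continuous(labels = NULL) +
  # Fix the panel range so the annotation does not stretch it, and turn
  # clipping off so that text outside the panel stays visible
  coord_cartesian(ylim = c(0.5, 3.5), expand = FALSE, clip = "off") +
  # annotate() accepts one hjust per label: left-align 0%, right-align 100%
  annotate(
    "text",
    x = c(0, 1),
    y = 3.7,
    label = c("0%", "100%"),
    hjust = c(0, 1),
    vjust = 0
  )
```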
Add text labels
The same trick can be used to add the category description (accuracy, null, error) to the right top corner and label the highlighted bars.
For the latter part, we simply extract the corresponding rows from our data and use that in conjunction with a text layer such as geom_text().
Notice that I used an hjust value greater than 1 here to add some white space on the right side of the labels.
Otherwise, the percent sign will be too close to the bar’s edge.
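The labeling of the highlighted bars could be sketched like this. I am assuming geom_text() as the text layer, and the data and the 0.1 threshold are made up.

```r
library(ggplot2)

# Hypothetical stand-in for the warehouse data (values are made up)
rates <- data.frame(
  id   = rep(c("101", "102", "103"), each = 3),
  type = rep(c("accuracy", "null", "error"), times = 3),
  rate = c(0.90, 0.04, 0.06,
           0.80, 0.05, 0.15,
           0.85, 0.10, 0.05)
)
rates$type <- factor(rates$type, levels = c("error", "null", "accuracy"))

# Extract only the rows that belong to the highlighted bars
highlight <- rates[rates$type == "error" & rates$rate > 0.1, ]

p <- ggplot(rates, aes(x = rate, y = id, group = type)) +
  geom_col(colour = "white") +
  geom_text(
    data = highlight,
    # The error segment is stacked last, so it ends at the right edge (x = 1)
    aes(x = 1, label = scales::percent(rate)),
    # hjust slightly above 1 leaves a small gap to the bar's edge
    hjust = 1.1,
    colour = "white"
  )
```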
Next, we add the category descriptions.
This is a bit more tricky, though, because we want to highlight a word too.
So, we will add a richtext label as described in my previous blog post.
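For illustration, here is a minimal standalone sketch of such a rich text label with geom_richtext() from ggtext. The label text, the color and the position are placeholders, and the original post may have used a different call.

```r
library(ggplot2)
library(ggtext)

# One label whose first word is colored and bold via HTML/Markdown
labels <- data.frame(
  x = 1,
  y = 1,
  label = "<span style='color:#e36c33'>**ERROR**</span> | NULL | ACCURACY"
)

p <- ggplot(labels, aes(x = x, y = y, label = label)) +
  geom_richtext(
    hjust = 1,          # right-align the label at x
    fill = NA,          # no background box
    label.color = NA,   # no box outline
    colour = "grey40"   # color of the words that are not highlighted
  )
```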
Add story text
Now that the bar plot is finished we can work on the story text.
For that, we create another plot that contains only the text.
Later on, we will combine both of our plots with the patchwork package.
There are no really new techniques here, so let’s get straight to the code.
Add main message as new title and subtitle
As I said before, we will put the two plots together with patchwork.
If you have never dealt with patchwork, feel free to check out my short intro to patchwork.
Putting the plots together gives us another opportunity:
We can now set additional titles and subtitles of the whole plot.
Use these to add the main message of your plot.
But make sure that there is enough white space around them by setting the title margins in theme().
Otherwise, your plot will feel “too full”.
Adding spacing is achieved through the margin() function in element_text().
Though, in this case we use element_markdown() which works exactly the same but enables Markdown syntax like using asterisks for bold texts.
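Here is a rough sketch of how the combination could look. The two plots are simplified stand-ins, and the titles, widths and margin values are made up.

```r
library(ggplot2)
library(patchwork)
library(ggtext)

# Hypothetical stand-in for the warehouse data (values are made up)
rates <- data.frame(
  id   = rep(c("101", "102", "103"), each = 3),
  type = rep(c("accuracy", "null", "error"), times = 3),
  rate = c(0.90, 0.04, 0.06,
           0.80, 0.05, 0.15,
           0.85, 0.10, 0.05)
)

bar_plot <- ggplot(rates, aes(x = rate, y = id, group = type)) +
  geom_col(colour = "white")

# A plot that contains nothing but text
text_plot <- ggplot() +
  annotate("text", x = 0, y = 0, label = "Story text goes here") +
  theme_void()

combined <- bar_plot + text_plot +
  plot_layout(widths = c(2, 1)) +
  plot_annotation(
    title = "**Main message** of the plot",
    subtitle = "Supporting detail",
    theme = theme(
      # element_markdown() renders the asterisks as bold text;
      # margin() adds white space below the title and subtitle (in pt)
      plot.title = element_markdown(margin = margin(b = 10)),
      plot.subtitle = element_markdown(margin = margin(b = 20))
    )
  )
```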
Get the sizes right
In the last plot, I cheated. I gave you the correct code I used to generate the picture. But I did not execute it. Instead, I only displayed the code and then showed you the (imported) picture from the start of this blog post. Why did I do this? Because getting the sizes right sucks!
If you have dealt with ggplot enough, then you will know that text sizes are often set in absolute rather than in relative terms. Therefore, if you make the bar plot smaller in width (like we did), then the bars may be appropriately scaled to the new width but, more often than not, the texts are not. In this case, this led to way too large fonts as beautifully demonstrated in Christophe Nicault’s helpful blog post.
So, how do you avoid this? First off, choose font sizes last (choose the font itself first, though). This will save you a lot of repetitive work when you change the alignment in your plot. But this tip will only get you so far, because you have to fix some sizes in between to get a feeling for the visualization you are trying to create.
Therefore, try to get your canvas into an appropriate size first.
I try to do this by using the camcorder package at the start of my visualization process.
This will ensure that my plots are saved as a png-file with predetermined dimensions and the resulting file is displayed in the Viewer pane in RStudio (as opposed to the Plots pane).
For example, at the start of working on this visualization I called gg_record().
This made getting the sizes right for my final output somewhat easier because the canvas size remains the same throughout the process.
Though be sure to call library(ggtext) before gg_record(), or make sure that you call gg_record() again if you add ggtext only later.
Otherwise, your plots will revert back to being displayed in the Plots pane (with relative sizing).
Finally, if you want to use camcorder in conjunction with showtext, then be sure that showtext knows what dpi value you chose when calling gg_record().
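Such a setup could look like the sketch below. The output folder, the dimensions and the dpi value are arbitrary choices of mine, not the ones used for this post.

```r
library(camcorder)
library(showtext)

# Hypothetical settings; adjust them to your target output
dpi <- 300
rec_dir <- file.path(tempdir(), "recording")
dir.create(rec_dir, showWarnings = FALSE)

# From now on, every printed ggplot is saved as a png with these dimensions
gg_record(
  dir = rec_dir,
  device = "png",
  width = 16,
  height = 9,
  units = "cm",
  dpi = dpi
)

# Tell showtext about the same dpi so that font sizes match the canvas
showtext_opts(dpi = dpi)
showtext_auto()
```

Keeping the dpi in a single variable makes it harder for the two packages to drift apart.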
Alright, that concludes this somewhat long blog post. I hope that you enjoyed it and learned something valuable. If you did, feel free to leave a comment. Also, you can stay in touch with my work by subscribing to my RSS feed or following me on Twitter.