# Alternative ways to visualize correlations

I recently saw a nice thread on Twitter and wanted to chime in. Its author suggests using bar charts instead of colored matrices to visualize correlations. Let’s try that with `{ggplot2}`. Additionally, I will suggest a few ideas of my own.

## Using geom_tile() to visualize correlation matrices

Let’s start with the basic plot, i.e. we

- pick a data set,
- compute correlations between variables,
- visualize the correlations with `geom_tile()`.

Our data set will be the Ames housing data set from `{modeldata}`. Since this data set contains many variables, we’ll pick just a few of them. Visualizing the correlations of many more variables is technically possible with larger images, but it wouldn’t teach us anything new about the visualization process.

Also, for a real visual, we should probably relabel the variable names into something human-readable. But for simplicity, I skip that step in this demo.

```
library(tidyverse)
data(ames, package = 'modeldata')

# Clean the column names, then keep the first ten numeric variables
ames_numeric <- ames %>%
  janitor::clean_names() %>%
  select(where(is.numeric)) %>%
  select(1:10)
ames_numeric %>% print(n = 5)
```

```
# A tibble: 2,930 × 10
lot_frontage lot_area year_b…¹ year_…² mas_v…³ bsmt_…⁴ bsmt_…⁵ bsmt_…⁶ total…⁷
<dbl> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 141 31770 1960 1960 112 2 0 441 1080
2 80 11622 1961 1961 0 6 144 270 882
3 81 14267 1958 1958 108 1 0 406 1329
4 93 11160 1968 1968 0 1 0 1045 2110
5 74 13830 1997 1998 0 3 0 137 928
# … with 2,925 more rows, 1 more variable: first_flr_sf <int>, and abbreviated
# variable names ¹year_built, ²year_remod_add, ³mas_vnr_area, ⁴bsmt_fin_sf_1,
# ⁵bsmt_fin_sf_2, ⁶bsmt_unf_sf, ⁷total_bsmt_sf
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
```

Next, we’re going to compute the correlation matrix with `cor()`. Then, we turn the resulting matrix into a tibble (keeping the row names) and pivot the tibble to rearrange the data.

```
# Long format: one row per ordered pair of the 10 variables (100 rows)
# with the columns variable1, variable2 and correlation
correlations <- cor(ames_numeric) %>%
  as_tibble(rownames = 'variable1') %>%
  pivot_longer(cols = -1, names_to = 'variable2', values_to = 'correlation')
correlations %>% print(n = 5)
```
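To make the reshaping step concrete, here is a minimal sketch on a smaller example. It uses three columns of the built-in `mtcars` data, which is my own choice for illustration and not part of the Ames workflow. The idea is only to show how a square correlation matrix turns into one row per pair of variables.

```r
library(tidyverse)

# Toy example on the built-in mtcars data (an assumption for illustration,
# not the Ames data used in this post)
toy_cor <- cor(mtcars[, c('mpg', 'hp', 'wt')])

# Keep the row names as a column, then pivot all other columns into rows
toy_long <- toy_cor %>%
  as_tibble(rownames = 'variable1') %>%
  pivot_longer(cols = -1, names_to = 'variable2', values_to = 'correlation')

dim(toy_cor)   # the square matrix: 3 x 3
dim(toy_long)  # the long tibble: 9 rows, 3 columns
```

Every cell of the matrix becomes one row, so a matrix with `k` variables yields `k^2` rows, including the diagonal and both orderings of each pair.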

Now, this is a tidy format: one row per pair of variables. It’s majestic and `{ggplot2}` will love it. Visualizing the correlation matrix is now only a matter of using `geom_tile()`. Unfortunately, we will have to tilt the labels of the x-axis. I usually dislike this move, but with this visual I don’t think there is much we can do about it.

```
correlations %>%
  ggplot(aes(variable1, variable2, fill = correlation)) +
  geom_tile(col = 'white') +
  # NULL removes the axis titles
  labs(x = NULL, y = NULL) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

## Use geom_col() to visualize correlation matrices

Next, let us try visualizing the correlations with bars instead of colored tiles. For that, we need to create a new variable that describes the pairs of all variables.

Notice that we do not want to use the same pair twice. This can happen if we use “both” pairs `A + B` and `B + A`. To avoid that, I have set all entries of the correlation matrix that are not in the lower triangle to zero. Here’s a sub-matrix so that you know what our matrix looks like in principle.

```
# lower.tri() is TRUE strictly below the diagonal, so everything else becomes 0
triangle_correlations <- cor(ames_numeric) * lower.tri(cor(ames_numeric))
triangle_correlations[1:4, 1:4]
```

```
               lot_frontage   lot_area year_built year_remod_add
lot_frontage     0.00000000 0.00000000  0.0000000              0
lot_area         0.13686214 0.00000000  0.0000000              0
year_built       0.02613050 0.02325850  0.0000000              0
year_remod_add   0.06950923 0.02168222  0.6120953              0
```
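If the masking trick feels opaque, the following minimal sketch shows what `lower.tri()` returns and how many pairs survive. It again uses three columns of the built-in `mtcars` data, which is an assumption for illustration only.

```r
# Toy example on the built-in mtcars data (illustration only)
toy_cor <- cor(mtcars[, c('mpg', 'hp', 'wt')])

# Logical matrix: TRUE strictly below the diagonal, FALSE elsewhere
mask <- lower.tri(toy_cor)
mask

# Multiplying by the mask treats TRUE as 1 and FALSE as 0
toy_triangle <- toy_cor * mask
toy_triangle

# Number of unique pairs that remain
sum(mask)
choose(ncol(toy_cor), 2)
```

The mask keeps exactly `choose(k, 2)` entries for `k` variables, so ten variables give 45 unique pairs. One caveat of encoding "not a pair" as zero: a pair whose correlation happens to be exactly zero would be indistinguishable from a masked entry and would be dropped by a filter on nonzero values.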

Now we can

- transform the matrix to a tibble like before
- filter out zero correlations (to avoid duplicate pairs)
- construct pair labels
- order pairs by their correlation

```
correlations <- triangle_correlations %>%
  as_tibble(rownames = 'variable1') %>%
  pivot_longer(cols = -1, names_to = 'variable2', values_to = 'correlation') %>%
  # Drop the zeroed-out entries (diagonal and upper triangle)
  filter(abs(correlation) > 0) %>%
  mutate(
    pair = paste(variable1, variable2, sep = ' + '),
    # Order the pair labels by correlation so the bars come out sorted
    pair = fct_reorder(pair, correlation)
  )
correlations %>% print(n = 5)
```

```
# A tibble: 45 × 4
variable1 variable2 correlation pair
<chr> <chr> <dbl> <fct>
1 lot_area lot_frontage 0.137 lot_area + lot_frontage
2 year_built lot_frontage 0.0261 year_built + lot_frontage
3 year_built lot_area 0.0233 year_built + lot_area
4 year_remod_add lot_frontage 0.0695 year_remod_add + lot_frontage
5 year_remod_add lot_area 0.0217 year_remod_add + lot_area
# … with 40 more rows
# ℹ Use `print(n = ...)` to see more rows
```
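The printed rows look unsorted because `fct_reorder()` changes the order of the factor levels, not the order of the rows. Here is a minimal sketch on a hypothetical three-row tibble (the pair names and values are made up for illustration) that shows the effect.

```r
library(tidyverse)

# Hypothetical pairs and correlations, for illustration only
toy <- tibble(
  pair = c('a + b', 'a + c', 'b + c'),
  correlation = c(0.6, -0.3, 0.1)
) %>%
  mutate(pair = fct_reorder(pair, correlation))

# Row order is unchanged ...
toy$pair

# ... but the levels now run from lowest to highest correlation
levels(toy$pair)
```

Since `{ggplot2}` places a discrete axis in level order, this is what sorts the bars in the plots that follow.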

This is easy to visualize with `geom_col()`. Notice that I do not use different colors for the bars. That’s because I think that it’s unnecessary in this visual. In the matrix plot, the color gradient served a purpose. Here, the bar length already encodes the correlation, so a gradient would add little.

```
# Pick the middle color of a three-color mako palette
my_col <- viridisLite::mako(3)[2]
correlations %>%
  ggplot(aes(x = correlation, y = pair)) +
  geom_col(fill = my_col) +
  labs(x = 'Correlation')
```

### Move labels to bars

I think there is some room for improvement in the previous plot. For starters, I am not so happy that the reader always has to follow a line from label to bar. That’s why I would move the labels next to the bars. Then, we can also get rid of many grid lines. While we’re at it, let’s make the remaining grid lines lighter.

```
nudge_x <- 0.01
text_size <- 2.5
grid_color <- 'grey80'
text_color <- 'grey40'

correlations %>%
  ggplot(aes(y = pair, x = correlation)) +
  geom_col(fill = my_col) +
  geom_text(
    aes(x = 0, label = pair),
    size = text_size,
    # Change horizontal justification based on correlation value
    # This moves labels to left or right of the bars
    hjust = if_else(correlations$correlation > 0, 1, 0),
    # Same trick just for moving the labels a tiny bit
    nudge_x = if_else(correlations$correlation > 0, -nudge_x, nudge_x)
  ) +
  scale_y_discrete(breaks = NULL) +
  labs(x = 'Correlation', y = element_blank()) +
  theme_minimal() +
  theme(
    # Note: from ggplot2 3.4.0 on, `linewidth` replaces `size` for lines
    panel.grid = element_line(size = 0.25, linetype = 2, color = grid_color)
  )
```
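The label placement above relies on vectors computed from `correlations$correlation`, which must stay in the same row order as the plotted data. As a hedged alternative, the sketch below stores the label position and justification as columns and maps them inside `aes()`. The toy data and the column names `label_x` and `label_hjust` are my own assumptions for illustration.

```r
library(tidyverse)

# Hypothetical pairs and correlations, for illustration only
toy <- tibble(
  pair = c('a + b', 'a + c', 'b + c'),
  correlation = c(0.6, -0.3, 0.1)
) %>%
  mutate(pair = fct_reorder(pair, correlation))

nudge <- 0.01

# Positive bars grow to the right, so their labels sit left of zero
# (right-aligned, hjust = 1). Negative bars get the mirror image.
label_layout <- toy %>%
  mutate(
    label_x = if_else(correlation > 0, -nudge, nudge),
    label_hjust = if_else(correlation > 0, 1, 0)
  )

p <- ggplot(label_layout, aes(x = correlation, y = pair)) +
  geom_col() +
  geom_text(aes(x = label_x, label = pair, hjust = label_hjust)) +
  scale_y_discrete(breaks = NULL)
p
```

Keeping the layout in the data means it travels with the rows if they are filtered or reordered later. Whether that is worth the two extra columns is a matter of taste.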

## Use lollipops to visualize correlation matrices

Bar charts use quite a lot of ink. And I think in this case we can do with a little bit less ink. Instead of `geom_col()`, let us use `geom_segment()` and `geom_point()`. That’s how you build a lollipop chart!

```
segment_size <- 0.75

correlations %>%
  ggplot(aes(y = pair, x = correlation)) +
  # The point is the lollipop head, the segment is the stick back to zero
  geom_point(col = my_col) +
  geom_segment(aes(xend = 0, yend = pair), col = my_col, size = segment_size) +
  geom_text(
    aes(x = 0, label = pair),
    size = text_size,
    hjust = if_else(correlations$correlation > 0, 1, 0),
    nudge_x = if_else(correlations$correlation > 0, -nudge_x, nudge_x)
  ) +
  scale_y_discrete(breaks = NULL) +
  labs(x = 'Correlation', y = element_blank()) +
  theme_minimal() +
  theme(
    panel.grid = element_line(size = 0.25, linetype = 2, color = grid_color)
  )
```

### Highlight specific variable pairs

Finally, let me give one more reason why I didn’t use a color gradient so far. That’s because I can now highlight **selected** variable pairs.

I think the text labels clutter the image quite a lot. If we need to show all the variables this cannot be avoided. But what if we only care about certain variables or certain relationships? Then, we can highlight these and grey out everything else.

For demo purposes, I have randomly sampled a few variable pairs. Let’s highlight these.

```
# Random sample; call set.seed() first for a reproducible selection
highlight <- correlations %>%
  slice_sample(n = 5) %>%
  pull(pair)
highlight_color <- thematic::okabe_ito(6)[1]
unhighlight_color <- 'grey40'

correlations %>%
  ggplot(aes(y = pair, x = correlation)) +
  geom_point(
    col = if_else(correlations$pair %in% highlight, highlight_color, my_col)
  ) +
  geom_segment(
    aes(xend = 0, yend = pair),
    col = if_else(correlations$pair %in% highlight, highlight_color, my_col),
    size = if_else(correlations$pair %in% highlight, segment_size + 0.3, segment_size)
  ) +
  geom_text(
    aes(x = 0, label = pair),
    col = if_else(correlations$pair %in% highlight, highlight_color, unhighlight_color),
    fontface = if_else(correlations$pair %in% highlight, 'bold', 'plain'),
    hjust = if_else(correlations$correlation > 0, 1, 0),
    nudge_x = if_else(correlations$correlation > 0, -nudge_x, nudge_x),
    size = text_size
  ) +
  scale_y_discrete(breaks = NULL) +
  labs(x = 'Correlation', y = element_blank()) +
  theme_minimal() +
  theme(
    panel.grid = element_line(size = 0.25, linetype = 2, color = grid_color)
  )
```
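The highlighting logic repeats the same `%in% highlight` test in every layer. One possible variation, sketched below on made-up data, computes a single logical column and lets a manual color scale do the rest. The toy tibble, the column name `is_highlighted`, the seed and the two colors are assumptions for illustration, not the values used above.

```r
library(tidyverse)

# Fix the seed so the random selection is reproducible
set.seed(42)

# Hypothetical pairs and correlations, for illustration only
toy <- tibble(
  pair = c('a + b', 'a + c', 'b + c', 'b + d'),
  correlation = c(0.6, -0.3, 0.1, 0.4)
) %>%
  mutate(pair = fct_reorder(pair, correlation))

highlight <- toy %>%
  slice_sample(n = 2) %>%
  pull(pair)

# One flag column replaces the repeated `%in% highlight` checks
toy_flagged <- toy %>%
  mutate(is_highlighted = pair %in% highlight)

p <- ggplot(toy_flagged, aes(x = correlation, y = pair, color = is_highlighted)) +
  geom_segment(aes(xend = 0, yend = pair)) +
  geom_point() +
  scale_color_manual(
    values = c(`TRUE` = '#E69F00', `FALSE` = 'grey40'),
    guide = 'none'
  )
p
```

In a real visual you would pick the highlighted pairs deliberately instead of sampling them. The flag column works the same way in that case.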

## Conclusion

I like to think that the last plot is a neat way to visualize correlations. What do you think? Feel free to let me know in the comments.

Also, if you have any questions, let me know via mail or in the comments. And don’t forget to stay in touch via my Newsletter, Twitter or my RSS feed. See you next time!