Alternative ways to visualize correlations

Visualization

We explore alternative correlation matrix plots.

Author

Albert Rapp

Published

August 1, 2022

I recently saw a nice thread on Twitter and I wanted to chime in. The author of said thread suggests using bar charts instead of colored matrices to visualize correlations. Let’s try that with {ggplot2}. Additionally, I will suggest a few ideas of my own.

Using geom_tile() to visualize correlation matrices

Let’s start with the basic plot, i.e. we

pick a data set,
compute correlations between variables, and
visualize correlations with geom_tile().

Our data set will be the Ames housing data set from {modeldata}. Since this data set contains many variables, we’ll just pick a few of them. Otherwise, we’d have to work hard to visualize the correlations of MANY variables. Technically, that’s possible with large images but it won’t give us any more insights into the visualization process.

Also, for a real visual, we should probably relabel the variable names into something human-readable. But for simplicity, I skip that step in this demo.

library(tidyverse)

ames_numeric <- modeldata::ames %>% 
  janitor::clean_names() %>% 
  select(where(is.numeric)) %>% 
  select(1:10)
ames_numeric %>% print(n = 5)

# A tibble: 2,930 × 10
  lot_frontage lot_area year_built year_remod_add mas_vnr_area bsmt_fin_sf_1
         <dbl>    <int>      <int>          <int>        <dbl>         <dbl>
1          141    31770       1960           1960          112             2
2           80    11622       1961           1961            0             6
3           81    14267       1958           1958          108             1
4           93    11160       1968           1968            0             1
5           74    13830       1997           1998            0             3
# ℹ 2,925 more rows
# ℹ 4 more variables: bsmt_fin_sf_2 <dbl>, bsmt_unf_sf <dbl>,
#   total_bsmt_sf <dbl>, first_flr_sf <int>

Next, we’re going to compute the correlation matrix with cor(). Then, we turn the resulting matrix into a tibble (keeping the row names as a column) and pivot the tibble into a long format with one row per variable pair.

correlations <- cor(ames_numeric) %>% 
  as_tibble(rownames = 'variable1') %>% 
  pivot_longer(cols = -1, names_to = 'variable2', values_to = 'correlation')
correlations %>% print(n = 5)
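
If you want to see the reshaping step in isolation, here is a minimal sketch on a toy example. It assumes the built-in mtcars data set as a stand-in for the Ames data, and the names toy and toy_correlations are made up for this illustration.

```r
library(tibble)
library(tidyr)

# Toy data (assumption): three numeric columns from the built-in mtcars data set
toy <- mtcars[, c("mpg", "hp", "wt")]

toy_correlations <- cor(toy) |>
  # Keep the matrix row names as a regular column called variable1
  as_tibble(rownames = "variable1") |>
  # Turn every other column into one row per (variable1, variable2) pair
  pivot_longer(cols = -1, names_to = "variable2", values_to = "correlation")

toy_correlations
```

With three variables, the 3 x 3 matrix becomes a tibble with nine rows and three columns, which is the shape geom_tile() expects.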

Now, this is a tidy format. It’s majestic and {ggplot2} will love it. Visualizing the correlation matrix is now only a matter of using geom_tile(). Unfortunately, we will have to tilt the labels of the x-axis. I usually dislike this move but with this visual I don’t think there is much we can do about it.

correlations %>% 
  ggplot(aes(variable1, variable2)) +
  geom_tile(aes(fill = correlation)) +
  geom_text(aes(label = correlation |> round(2))) +
  labs(x = element_blank(), y = element_blank()) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_gradient2()

Use geom_col() to visualize correlation matrices

Next, let us try visualizing the correlations with bars instead of colored tiles. For that, we need to create a new variable that describes the pairs of all variables.

Notice that we do not want to use the same pair twice. This can happen if we use both pairs A + B and B + A. To avoid that, I have set all entries of the correlation matrix that are not in the lower triangle of the matrix to zero. This also zeroes the diagonal, so the trivial correlation of each variable with itself disappears as well. Here’s a sub-matrix so that you can see what our matrix looks like in principle.

triangle_correlations <- cor(ames_numeric) * lower.tri(cor(ames_numeric))
triangle_correlations[1:4, 1:4]

               lot_frontage   lot_area year_built year_remod_add
lot_frontage     0.00000000 0.00000000  0.0000000              0
lot_area         0.13686214 0.00000000  0.0000000              0
year_built       0.02613050 0.02325850  0.0000000              0
year_remod_add   0.06950923 0.02168222  0.6120953              0

Now we can

transform the matrix to a tibble like before
filter out zero correlations (to avoid duplicate pairs)
construct pair labels
order pairs by their correlation

correlations <- triangle_correlations %>% 
  as_tibble(rownames = 'variable1') %>% 
  pivot_longer(cols = -1, names_to = 'variable2', values_to = 'correlation') %>% 
  filter(abs(correlation) > 0) %>% 
  mutate(
    pair = paste(variable1, variable2, sep = ' + '),
    pair = fct_reorder(pair, correlation)
  )
correlations %>% print(n = 5)

# A tibble: 45 × 4
  variable1      variable2    correlation pair                         
  <chr>          <chr>              <dbl> <fct>                        
1 lot_area       lot_frontage      0.137  lot_area + lot_frontage      
2 year_built     lot_frontage      0.0261 year_built + lot_frontage    
3 year_built     lot_area          0.0233 year_built + lot_area        
4 year_remod_add lot_frontage      0.0695 year_remod_add + lot_frontage
5 year_remod_add lot_area          0.0217 year_remod_add + lot_area    
# ℹ 40 more rows
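
One caveat of the zero trick is that a pair whose correlation happens to be exactly zero would be filtered out too. As a hedged alternative, here is a minimal sketch that blanks out the unwanted entries with NA instead and lets pivot_longer() drop them. It runs on toy data from mtcars, and the names toy, cor_matrix and toy_pairs are made up for this illustration.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Toy data (assumption): four numeric columns from the built-in mtcars data set
toy <- mtcars[, c("mpg", "hp", "wt", "qsec")]

cor_matrix <- cor(toy)
# Blank out the diagonal and the upper triangle with NA instead of 0
cor_matrix[!lower.tri(cor_matrix)] <- NA

toy_pairs <- cor_matrix |>
  as_tibble(rownames = "variable1") |>
  pivot_longer(
    cols = -1,
    names_to = "variable2",
    values_to = "correlation",
    values_drop_na = TRUE # drops exactly the entries we blanked out
  ) |>
  mutate(pair = paste(variable1, variable2, sep = " + "))

toy_pairs
```

Four variables give six unique pairs, and each of them shows up exactly once.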

This is easy to visualize with geom_col(). Notice that I do not use different colors for the bars. That’s because I think that it’s unnecessary in this visual. In the matrix plot, the color gradient was the only thing that encoded the correlation. Here, the bar length already does that, so a gradient would add little.

my_col <- viridisLite::mako(3)[2]
correlations %>% 
  ggplot(aes(x = correlation, y = pair)) + 
  geom_col(fill = my_col) +
  labs(x = 'Correlation')

Move labels to bars

I think there is some room for improvement in the previous plot. For starters, I am not so happy that the reader always has to follow a line from label to bar. That’s why I would move the labels next to the bars. Then, we can also get rid of many grid lines. While we’re at it, let’s make the remaining grid lines lighter.

nudge_x <- 0.01
text_size <- 2.5
grid_color <-  'grey80'
text_color <- 'grey40'

correlations  %>% 
  ggplot(aes(y = pair, x = correlation)) +
  geom_col(fill = my_col) +
  geom_text(
    aes(x = 0, label = pair),
    size = text_size,
    # Change horizontal justification based on correlation value
    # This moves labels to left or right of the bars
    hjust = if_else(correlations$correlation > 0, 1, 0),
    # Same trick just for moving the labels a tiny bit
    nudge_x = if_else(correlations$correlation > 0, -nudge_x, nudge_x)
  ) +
  scale_y_discrete(breaks = NULL) +
  labs(x = 'Correlation', y = element_blank()) +
  theme_minimal() +
  theme(
    # `linewidth` replaces `size` for line elements as of ggplot2 3.4.0
    panel.grid = element_line(linewidth = 0.25, linetype = 2, color = grid_color)
  )
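
The hjust and nudge_x vectors above are computed from correlations$correlation outside of aes(), so they only line up with the bars as long as the plotted data is exactly the correlations tibble in the same row order. Here is a minimal sketch of an alternative, assuming you are fine with storing the label position in the data itself. The toy data and the column names label_hjust and label_x are made up for this illustration.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data standing in for the `correlations` tibble
toy_pairs <- tibble::tibble(
  pair = c("a + b", "a + c", "b + c"),
  correlation = c(0.6, -0.3, 0.1)
) |>
  mutate(
    pair = forcats::fct_reorder(pair, correlation),
    # Right-align labels of positive bars, left-align labels of negative bars
    label_hjust = if_else(correlation > 0, 1, 0),
    # Put each label slightly on the other side of the zero line
    label_x = if_else(correlation > 0, -0.01, 0.01)
  )

p <- toy_pairs |>
  ggplot(aes(x = correlation, y = pair)) +
  geom_col() +
  # hjust is a regular aesthetic of geom_text(), so it can be mapped per row
  geom_text(aes(x = label_x, label = pair, hjust = label_hjust), size = 2.5) +
  scale_y_discrete(breaks = NULL)

p
```

Because the justification travels with each row, filtering or reordering the data inside the pipe cannot shift labels to the wrong side.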

Use lollipops to visualize correlation matrices

Bar charts use quite a lot of ink. And I think in this case we can do with a little bit less ink. Instead of geom_col(), let us use geom_segment() and geom_point(). That’s how you build a lollipop chart!

segment_size <- 0.75

correlations  %>% 
  ggplot(aes(y = pair, x = correlation)) +
  geom_point(col = my_col) +
  # `linewidth` replaces `size` for lines as of ggplot2 3.4.0
  geom_segment(aes(xend = 0, yend = pair), col = my_col, linewidth = segment_size) +
  geom_text(
    aes(x = 0, label = pair),
    size = text_size,
    hjust = if_else(correlations$correlation > 0, 1, 0),
    nudge_x = if_else(correlations$correlation > 0, -nudge_x, nudge_x)
  ) +
  scale_y_discrete(breaks = NULL) +
  labs(x = 'Correlation', y = element_blank()) +
  theme_minimal() +
  theme(
    panel.grid = element_line(linewidth = 0.25, linetype = 2, color = grid_color)
  )

Highlight specific variable pairs

Finally, let me give one more reason why I didn’t use a color gradient so far: it leaves color free to highlight selected variable pairs.

I think the text labels clutter the image quite a lot. If we need to show all the variables this cannot be avoided. But what if we only care about certain variables or certain relationships? Then, we can highlight these and grey out everything else.

For demo purposes, I have randomly sampled a few variable pairs. Let’s highlight these.

highlight <- correlations %>% 
  slice_sample(n = 5) %>% 
  pull(pair)

highlight_color <- thematic::okabe_ito(6)[1]
unhighlight_color <- 'grey40'

correlations %>% 
  ggplot(aes(y = pair, x = correlation)) +
  geom_point(
    col = if_else(correlations$pair %in% highlight, highlight_color, my_col)
  ) +
  geom_segment(
    aes(xend = 0, yend = pair), 
    col = if_else(correlations$pair %in% highlight, highlight_color, my_col),
    linewidth = if_else(correlations$pair %in% highlight, segment_size + 0.3, segment_size)
  ) +
  geom_text(
    aes(x = 0, label = pair),
    col = if_else(correlations$pair %in% highlight, highlight_color, unhighlight_color),
    fontface = if_else(correlations$pair %in% highlight, 'bold', 'plain'),
    hjust = if_else(correlations$correlation > 0, 1, 0),
    nudge_x = if_else(correlations$correlation > 0, -nudge_x, nudge_x),
    size = text_size
  ) +
  scale_y_discrete(breaks = NULL) +
  labs(x = 'Correlation', y = element_blank()) +
  theme_minimal() +
  theme(
    panel.grid = element_line(linewidth = 0.25, linetype = 2, color = grid_color)
  )
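
In a real analysis, you would probably not sample the highlighted pairs at random. Here is a minimal sketch of how the highlight vector could be built for all pairs that involve one variable of interest. The toy tibble reuses the first three pairs from the output above (rounded as printed), and focus_variable is a hypothetical choice for this illustration.

```r
library(dplyr)

# Toy stand-in for the `correlations` tibble (first three pairs from above)
toy_pairs <- tibble::tibble(
  variable1 = c("lot_area", "year_built", "year_built"),
  variable2 = c("lot_frontage", "lot_frontage", "lot_area"),
  correlation = c(0.137, 0.0261, 0.0233)
) |>
  mutate(pair = paste(variable1, variable2, sep = " + "))

# Hypothetical variable we care about
focus_variable <- "year_built"

highlight <- toy_pairs |>
  # Keep a pair if the variable appears on either side of it
  filter(variable1 == focus_variable | variable2 == focus_variable) |>
  pull(pair)

highlight
```

The resulting vector can be used in the `%in% highlight` checks in the same way as the randomly sampled one.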

Conclusion

I like to think that the last plot is a neat way to visualize correlations. What do you think? Feel free to let me know in the comments.

Also, if you have any questions, let me know via mail or in the comments. And don’t forget to stay in touch via my Newsletter, Twitter or my RSS feed. See you next time!
