Hannah Slater, Justin Millar, Will Sheahan, and Hayley Thompson
5 Data Visualization in R with ggplot2
5.1 Plotting with ggplot2
The focus of this chapter will be on data visualizations. There are many different approaches for making plots and other visuals in R. In addition to the base plotting functions that are directly built into R, there are hundreds of freely available packages for making just about any kind of chart or plot possible.
The package we will focus on here is ggplot2, which is included in the tidyverse umbrella package.
## For more information on ggplot2, here are some useful resources:
- ggplot2 Official Documentation: The official documentation provides a comprehensive guide to ggplot2 syntax, usage, and examples. It’s a great place to start for beginners.
- R for data science: This online book by Hadley Wickham and Garrett Grolemund is an excellent resource for learning data science with R. Chapter 3 specifically covers ggplot2, and it’s explained in a beginner-friendly manner.
- R Graphics Cookbook: This cookbook-style resource provides practical ‘recipes’ for creating various types of plots using ggplot2. It’s a useful reference for learners at all levels.
- The R Graph Gallery: A collection of data visualisations made with ggplot - a good source of inspiration with code to recreate each plot.
- Data Visualisation with ggplot2 Cheatsheet: This cheatsheet provides a concise overview of ggplot2 syntax, including various geoms, aesthetics, and other essential functions in a handy PDF. You can also find it by clicking `help > cheatsheets > Data Visualisation with ggplot2` in the RStudio menu bar. I’d advise keeping it open during this session.
First let’s load the tidyverse package, as well as the scrambled dataset containing monthly district-level health facility records from DHIS2.
library(tidyverse)
Let’s see this with our data. The full data set is too large to plot every individual record, so let’s filter the rows to keep only the tested data from health facilities in Eastern Province. We’ll call this eastern_tested_data.
eastern_tested_data <- case_data %>%
  filter(data_type == "Tested", province == "Eastern")
First, use the ggplot() function and put eastern_tested_data in the data argument.
ggplot(data = eastern_tested_data)
Next we define an aesthetic mapping (using the aes() function). This is where we define which columns will be assigned to the x- and y-axes, as well as other data-defined characteristics like size, shape, and color.
ggplot(data = eastern_tested_data, mapping = aes(x = period, y = count))
The final piece is to choose how we want the data to be displayed (lines, points, bars, histograms). These are called “geoms”, and there is a separate geom for each type of graphic:
geom_point() for scatter plots, dot plots, etc.
geom_boxplot() for, well, boxplots!
geom_line() for trend lines, time series, etc.
To add a geom to our plot use the + operator at the end of the line (not the start of a new line!).
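As a minimal sketch of this step, the example below adds geom_point() to a plot. The toy_tested data frame is a hypothetical stand-in for eastern_tested_data with made-up values, so the block can run on its own.

```r
library(ggplot2)

# Hypothetical stand-in for eastern_tested_data (values are made up)
toy_tested <- data.frame(
  period   = rep(seq(as.Date("2020-01-01"), by = "month", length.out = 6), times = 2),
  district = rep(c("Chadiza", "Katete"), each = 6),
  count    = c(120, 135, 150, 110, 95, 140, 200, 180, 210, 190, 170, 220)
)

# The + sits at the END of the first line, so R knows the expression continues
p <- ggplot(data = toy_tested, mapping = aes(x = period, y = count)) +
  geom_point()

p  # printing the object draws the scatter plot
```

If the + were moved to the start of the second line, R would treat the first line as a complete expression and draw an empty plot before failing on the second line.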
These are the general components of any graphic made with ggplot. In this example, there is too much clutter in the plot to discern anything useful. One option is to add transparency, which can be done using the alpha argument. It takes a value between 0 (fully transparent) and 1 (fully opaque).
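A short sketch of transparency, using randomly generated data as a hypothetical stand-in for the crowded plot (the values and the choice of alpha = 0.2 are illustrative assumptions):

```r
library(ggplot2)

# Hypothetical, randomly generated data with many overlapping points
set.seed(1)
toy_tested <- data.frame(
  period = rep(seq(as.Date("2020-01-01"), by = "month", length.out = 12), times = 50),
  count  = rpois(600, lambda = 100)
)

# alpha is a fixed value here, so it goes outside aes()
p_alpha <- ggplot(data = toy_tested, mapping = aes(x = period, y = count)) +
  geom_point(alpha = 0.2)

p_alpha
```

Areas where many points overlap appear darker, which makes dense regions easier to spot.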
The typical workflow for creating a ggplot graphic begins with the steps we just completed: first organize the data that we want to plot, then get the primary components together in ggplot(), then add customization layers to create a final visual. In the next sections we will start adding more layers and aesthetics to go from this basic plot to a publication-quality graphic.
5.3 Using the aes() function
In addition to organizing the axes, there are other useful features that we can define in the aes() function. Specifically, this is where we define features that are determined by aspects of the input data. Let’s see an example by introducing color to the last plot.
First, we can change the color of all the points by directly defining the color argument in geom_point():
There are instances where defining a set color can be useful, but in this example it would be better to use color to represent an aspect of our data. For instance, we could have a different color for each district. This is possible by placing color inside aes() and stating which column we want to use to define it.
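The difference between a fixed color and a mapped color can be sketched as follows. The toy_tested data frame is a hypothetical stand-in for eastern_tested_data, and "blue" is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical stand-in for eastern_tested_data (values are made up)
toy_tested <- data.frame(
  period   = rep(seq(as.Date("2020-01-01"), by = "month", length.out = 4), times = 2),
  district = rep(c("Chadiza", "Katete"), each = 4),
  count    = c(120, 135, 150, 110, 200, 180, 210, 190)
)

# Fixed color: set outside aes(), so every point is drawn in the same color
p_fixed <- ggplot(toy_tested, aes(x = period, y = count)) +
  geom_point(color = "blue")

# Mapped color: set inside aes(), so each district gets its own color and a legend
p_mapped <- ggplot(toy_tested, aes(x = period, y = count)) +
  geom_point(aes(color = district))
```

A useful rule of thumb: values that come from a column go inside aes(), while constants go outside it.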
Play around with the size and col options to make your graph look good!
It is also important to use meaningful axis labels.
ggplot(data = chadiza_conf, mapping = aes(x = period, y = count)) +
  geom_line(size = 2, col = "grey70") +
  aes(ymin = 0) +
  geom_point(col = "deepskyblue2", size = 3) +
  labs(x = "Year", y = "Confirmed malaria cases",
       title = "Confirmed malaria cases in Chadiza (2018 - 2021)")
We can easily add a second line to this plot. Let’s try adding Confirmed_Passive_CHW. First we need to make a new data frame that includes this variable (as well as Confirmed).
We now need to specify that we want to colour the lines by data_type, i.e. we want different colours for the ‘Confirmed’ line and the ‘Confirmed_Passive_CHW’ line. We do this by setting colour equal to data_type inside aes().
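These two steps might look something like the sketch below. It assumes the data has the period, district, data_type, and count columns used elsewhere in this chapter; toy_cases and toy_conf_chw are hypothetical names with made-up values.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for case_data (values are made up)
toy_cases <- data.frame(
  period    = rep(seq(as.Date("2020-01-01"), by = "month", length.out = 4), times = 3),
  district  = "Chadiza",
  data_type = rep(c("Confirmed", "Confirmed_Passive_CHW", "Clinical"), each = 4),
  count     = c(300, 280, 250, 310, 90, 85, 100, 70, 40, 35, 50, 45)
)

# Step 1: keep the two data types we want to compare
toy_conf_chw <- toy_cases %>%
  filter(district == "Chadiza",
         data_type %in% c("Confirmed", "Confirmed_Passive_CHW"))

# Step 2: map colour to data_type so each data type gets its own line
p_lines <- ggplot(toy_conf_chw, aes(x = period, y = count)) +
  geom_line(aes(colour = data_type))
```

Mapping colour also groups the data, which is why ggplot draws one line per data type instead of a single zig-zag line.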
There are loads of ways we can make this look better. Let’s start by setting our own colours, adding proper labels, and making the lines thicker.
ggplot(data = chadiza_conf_chw, mapping = aes(x = period, y = count)) +
  geom_line(aes(color = data_type), size = 1.2) +
  scale_color_manual(values = c("Confirmed" = "dodgerblue3",
                                "Confirmed_Passive_CHW" = "olivedrab3")) +
  labs(x = "Year", y = "Confirmed and passive CHW malaria cases",
       title = "Confirmed and passive CHW malaria cases in Chadiza (2018 - 2021)")
Question 1a: Can you plot health facility tests over time in Katete district?
Question 1b: Can you plot health facility tests and CHW passive tests over time in Katete district?
Question 1c: How can you make your graph look better?!?
5.4.2 Bar plots
Another very common data visualization is the bar plot. To see this, let’s make a new subset of our data for confirmed cases from health facilities in Eastern Province in January 2020. Now might be a good time to load the lubridate library if you haven’t yet.
library(lubridate)
eastern_conf <- case_data %>%
  filter(province == "Eastern", data_type == "Confirmed",
         period == ymd("2020-01-01"))
ggplot(data = eastern_conf, aes(x = district, y = count)) +
  geom_col() +
  labs(x = "District", y = "Confirmed cases",
       title = "Confirmed cases in health facilities", subtitle = "Jan. 2020")
In this case, it may actually be easier to read if we flip the axes. Rather than re-writing our code, we can add the coord_flip() function to do this automatically.
ggplot(data = eastern_conf, aes(x = district, y = count)) +
  geom_col() +
  labs(x = "District", y = "Confirmed cases",
       title = "Confirmed cases in health facilities", subtitle = "Jan. 2020") +
  coord_flip()
By default, the districts will be placed in alphabetical order. Alternatively, we could arrange them by descending or ascending order based on the confirmed case value by using reorder in the aes() function.
ggplot(data = eastern_conf, aes(x = reorder(district, count), y = count)) +
  geom_col() +
  labs(x = "District", y = "Confirmed cases",
       title = "Confirmed cases in health facilities", subtitle = "Jan. 2020") +
  coord_flip()
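To see what reorder() does on its own, here is a small sketch with made-up values. The districts and counts vectors are hypothetical and not taken from the DHIS2 data.

```r
# Hypothetical district names and case counts
districts <- c("Chipata", "Katete", "Chadiza")
counts    <- c(500, 200, 350)

# reorder() returns a factor whose levels are sorted by the second argument
asc  <- reorder(districts, counts)    # ascending by count
desc <- reorder(districts, -counts)   # negate the values for descending order

levels(asc)   # "Katete"  "Chadiza" "Chipata"
levels(desc)  # "Chipata" "Chadiza" "Katete"
```

ggplot places factor levels along the axis in this order, so after coord_flip() the first level ends up at the bottom of the plot.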
Next, we will create a bar plot for visualizing all of the case data, including from community health workers, from Eastern province in Jan. 2020. First, let’s create a new subset of our data.
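One way to build such a subset is sketched below. The toy_cases data frame and its values are hypothetical stand-ins for case_data, and the three data types kept are an assumption based on the categories that appear in the plots that follow.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for case_data (values are made up)
toy_cases <- data.frame(
  province  = c(rep("Eastern", 6), "Southern", "Eastern"),
  district  = c(rep(c("Chadiza", "Katete"), each = 3), "Choma", "Chadiza"),
  data_type = c(rep(c("Confirmed", "Clinical", "Confirmed_Passive_CHW"), times = 2),
                "Confirmed", "Tested"),
  period    = ymd("2020-01-01"),
  count     = c(300, 40, 90, 450, 60, 120, 200, 800)
)

# Keep all case data types (but not tests) for Eastern Province in January 2020
toy_eastern_cases <- toy_cases %>%
  filter(province == "Eastern",
         period == ymd("2020-01-01"),
         data_type %in% c("Confirmed", "Clinical", "Confirmed_Passive_CHW"))
```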
Now we will create a specialized version of a bar plot called a “stacked” bar plot. In this case, we will stack each of the data types, which will be designated with a different color.
ggplot(data = eastern_cases, aes(x = reorder(district, count), y = count,
                                 fill = data_type)) +
  geom_col() +
  labs(x = "District", y = "Confirmed cases",
       title = "Confirmed cases in health facilities", subtitle = "Jan. 2020") +
  coord_flip()
Finally, we can add some customization to complete the plot.
ggplot(data = eastern_cases, aes(x = reorder(district, count), y = count,
                                 fill = data_type)) +
  geom_col() +
  labs(x = "District", y = "Confirmed cases",
       title = "Confirmed cases in health facilities", subtitle = "Jan. 2020") +
  coord_flip() +
  scale_fill_manual(values = c("Confirmed" = "dodgerblue",
                               "Clinical" = "tomato",
                               "Confirmed_Passive_CHW" = "goldenrod2")) +
  theme_classic()
Question 2A: Create a barplot for CHW cases in each district in Southern Province during November 2019.
Question 2B: Create a stacked barplot for confirmed, clinical, and CHW cases in each district in Southern province during November 2019.
Question 2C: Create a barplot for confirmed cases in Chipata district for each month during 2020.
5.5 Exporting plots
To export a plot, we use the ggsave() function. This function allows us to control many characteristics of the output file, such as the file type, resolution, and size.
To use ggsave(), we will first save our plot into a new variable, then define an output file path. Typically you’ll want to create a new folder for outputs in your project folder; you may have done this earlier in the training process (called either plots or output).
Let’s save the last plot we made in the previous section as a PDF.
stacked_bar_plot <- ggplot(data = eastern_cases,
                           aes(x = reorder(district, count), y = count,
                               fill = data_type)) +
  geom_col() +
  labs(x = "District", y = "Confirmed cases",
       title = "Confirmed cases in health facilities", subtitle = "Jan. 2020") +
  coord_flip() +
  scale_fill_manual(values = c("Confirmed" = "dodgerblue",
                               "Clinical" = "tomato",
                               "Confirmed_Passive_CHW" = "goldenrod2")) +
  theme_classic()
ggsave(filename = "plots/eastern-cases-jan-2020.pdf", plot = stacked_bar_plot)
We can also save the exact same plot as a PNG file just by changing the extension in the filename.
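As a sketch of a PNG export with explicit size and resolution: toy_plot is a hypothetical placeholder plot, the dimensions and dpi are arbitrary choices, and the file is written to a temporary folder so the example runs anywhere.

```r
library(ggplot2)

# Hypothetical placeholder plot (values are made up)
toy_plot <- ggplot(data.frame(x = 1:3, y = c(2, 5, 3)), aes(x = x, y = y)) +
  geom_col()

# ggsave() picks the file type from the extension; a temporary folder is used here
out_file <- file.path(tempdir(), "toy-plot.png")
ggsave(filename = out_file, plot = toy_plot,
       width = 8, height = 5, units = "in", dpi = 300)
```

In a real project you would replace out_file with a path inside your plots or output folder.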
We can use the data manipulation skills we covered in previous parts of the workshop to create a summarized data set, which can then be visualized with ggplot. Let’s start by creating a summarized data frame that contains the total number of monthly tests from health facilities in each province.
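A summary like this could be built with group_by() and summarise(), as in the sketch below. The toy_cases data frame is a hypothetical stand-in for case_data with made-up values; the total_tested column name matches the one used in the plotting code in this section.

```r
library(dplyr)

# Hypothetical stand-in for case_data (values are made up)
toy_cases <- data.frame(
  province  = rep(c("Eastern", "Southern"), each = 4),
  district  = c("Chadiza", "Chadiza", "Katete", "Katete",
                "Choma", "Choma", "Monze", "Monze"),
  data_type = "Tested",
  period    = rep(as.Date(c("2020-01-01", "2020-02-01")), times = 4),
  count     = c(100, 120, 200, 180, 150, 160, 90, 110)
)

# Sum the district-level counts within each province and month
toy_tests_by_province <- toy_cases %>%
  filter(data_type == "Tested") %>%
  group_by(province, period) %>%
  summarise(total_tested = sum(count), .groups = "drop")
```

The result has one row per province and month, which is the shape a line plot with one line per province needs.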
Once we have the summarized data set, we can use the technique above to make powerful summary visualizations. For example, let’s create a line plot with this summarized data using the skills we learned in the previous section. We can even apply a custom color palette with the scale_color_viridis_d() function.
ggplot(data = tests_by_province,
       mapping = aes(x = period, y = total_tested, color = province)) +
  geom_line() +
  geom_point() +
  labs(y = "Total tests", x = "Date",
       title = "Total monthly tests in health facilities") +
  scale_color_viridis_d("Province") +
  theme_classic()
This example illustrates how in just a few lines of code we can combine the skills we have developed in the previous chapters to create powerful data visualizations.
Question 3A: Create a summarised data set that contains the cumulative monthly cases (confirmed, clinical, and CHW) for each province, then create a line plot that shows the time-series data where each province is a unique color.
5.7 Facets
Another tool in ggplot is called faceting, which allows us to split one plot into multiple plots based on a categorical field in the dataset.
The facet_wrap() function is used when the facets are based on a single variable; the panels are laid out sequentially, wrapping onto a new row when the current one is full.
ggplot(data = tests_by_province,
       mapping = aes(x = period, y = total_tested, color = province)) +
  facet_wrap(vars(province)) +
  geom_line() +
  geom_point() +
  labs(y = "Total tests", x = "Date",
       title = "Total monthly tests in health facilities") +
  scale_color_viridis_d("Province") +
  theme_classic()
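The layout of the panels can also be adjusted. The sketch below uses the ncol and scales arguments of facet_wrap(); toy_tests is a hypothetical stand-in for tests_by_province with made-up values.

```r
library(ggplot2)

# Hypothetical stand-in for tests_by_province (values are made up)
toy_tests <- data.frame(
  period       = rep(seq(as.Date("2020-01-01"), by = "month", length.out = 4), times = 3),
  province     = rep(c("Eastern", "Southern", "Western"), each = 4),
  total_tested = c(900, 950, 870, 990, 400, 420, 380, 450, 60, 75, 50, 80)
)

# ncol = 1 stacks the panels in a single column;
# scales = "free_y" gives each panel its own y-axis range
p_facets <- ggplot(toy_tests, aes(x = period, y = total_tested)) +
  geom_line() +
  facet_wrap(vars(province), ncol = 1, scales = "free_y")
```

Free scales make trends within small provinces easier to see, but they also make it harder to compare magnitudes across panels, so use them deliberately.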
Question 4: Create a summarized data set that contains the annual totals for confirmed, clinical, and CHW cases in each district in Eastern Province from 2018 to 2020. Then create a stacked bar plot where the different data types are represented with a unique color, and each district has its own facet.
5.8 Final thoughts
Congratulations on completing this session on data visualization with ggplot2! We’ve covered a wide range of techniques for creating effective and informative plots using R. Here’s a summary of what we’ve learned:
- The basic structure of a ggplot visualization.
- How to customize plot aesthetics, including colors, shapes, and themes.
- Different types of geoms for creating various plot types.
- How to create multi-panel plots with facets.
- Techniques for summarizing and visualizing data.
Remember, effective data visualization is not just about making pretty plots—it’s about effectively communicating insights and findings from your data. Always consider your audience and the message you want to convey when designing your plots.
Continue to explore and practice with ggplot2, as there is always more to learn and discover. The more familiar you become with the package, the more versatile and powerful your data visualization skills will become.