library(tidyverse)
library(sf)
library(janitor)
library(lubridate)
6 Joining_Data
7 Joining Data with R
7.1 Loading in Packages and Data
In this chapter we will practice the skills we learned in the previous section for cleaning messy data, and then learn a new technique for combining different datasets that contain matching columns, which in R is called “Joining”.
First, let’s load the two datasets we will be using in this chapter. We will start with loading in and cleaning a new dataset containing 2020 grid3 district population estimates normalized to the World Bank population for Zambia (see url in code chunk below). Once we have cleaned this, we will want to join it to “district-cases-long.csv”, which is included in the github data folder for this training.
- Load in the example data
<- read_csv("https://drive.google.com/uc?export=download&id=1ypf9TfZiZsJqZOLYzQ_1EdcGMXkYgFG4")
dist_pop
<- function(x, offset = 0.5){
scram_num set.seed(10)
round(runif(1, x*offset, x*(1+offset)))
}
<- read_csv("data/district-cases-long.csv") %>%
case_data rowwise() %>%
mutate(count = scram_num(count))
Question 1. What issues are there with the new population dataset? (Hint: Try making a table of the variables)
[1] "...1" "provive" "District" "2020 pop"
<NA>
0
At a glance, we can see that the names of the columns are not conducive to our analysis. We can either reload the dataset with the janitor::clean_names() function piped in, or, as this is a dataset with few columns, we can rename the columns manually using rename or using a shortcut in the select() command. In select, we can use the following formula – select(“new name” = “old name”).
Also, we can see from looking at a table of the district variable, that somebody left a “Test” district in, where the population is NA. Let’s remove that with a filter() command.
<- dist_pop %>%
dist_pop_clean select(c("province" = "provive",
"district" = "District",
"pop_2020" = "2020 pop")) %>%
filter(district != "Test")
There may still be some issues that we haven’t fixed yet (it is usually a good idea to assume that there are some hidden issues with your data), so let’s be on the lookout for those as we continue.
We will also want to take time at the start of our analysis to prepare our datasets so that they are on a similar level of analysis. In this case, we will be joining our case_data dataset with a district-level 2020 population file. Our case_data is already at the district level, but it isn’t aggregated to Annual statistics, which would make our lives easier later on. Let’s aggregate the data to an annual level now, and filter it so it only retains 2020 case data.
<- case_data %>% filter(!is.na(district)) %>%
case_data_2020 mutate(year = lubridate::year(period)) %>%
group_by(year, district, province, data_type) %>%
summarise(count = sum(count, na.rm=TRUE)) %>%
filter(year == 2020)
7.2 Joining Datasets Together
When we want to combine data in R, there are many potential techniques we can use. The most common dplyr-based method is to use a command called left_join(). Conceptually, this means that we will be picking one dataset to be our primary dataset (think of holding it in your left hand) and then joining a new dataset to that by one or more matching “key” variables (think of adding something from your right hand into your left hand, fitting it together with a magnet). This left_join ensures that we will only get observations from the secondary or right-side dataset that have a match in our primary or left-side dataset. See the below image for an in-depth examination of left_join() from https://www.guru99.com/r-dplyr-tutorial.html#3
This may not always be what you want to do in the future, you may wish to include all observations from both datasets, or only those that don’t match. For these applications, there are many other types of joins, and I would encourage you to read up on them in this nice tutorial https://www.garrickadenbuie.com/project/tidyexplain/
Because joining can sometimes work less than perfectly, it is VERY IMPORTANT to check your newly joined dataset, to ensure that things went as planned. Before you join, you should have a rough idea of what you expect the new dataset to look like (# of Variables, # of Observations, etc.), and then make sure to confirm those assumptions after the join goes through.
Question 2. What do you expect the new dataset to look like after joining it with the district populations data?
For now, let’s try to do a left_join, using our case_data as the primary (left-side) dataset, and the dist_pop_clean dataset as our secondary (right-side) dataset.
<- case_data_2020 %>%
case_data_pop left_join(dist_pop_clean)
Joining with `by = join_by(district, province)`
As you can see from the message in your R console, if you do not supply a key variable for R to join by, it will automatically try to identify any set of variables that match between your two datasets. This is helpful, but dangerous. R does not always do a perfect job of matching, and it is much safer to specify what columns to join by in your left_join command, using a “by” statement, as shown below.
<- case_data_2020 %>%
case_data_pop left_join(dist_pop_clean, by = c("district", "province"))
Question 3. Now that we have joined our two datasets, what issues can you find with the newly joined dataset?
table(case_data_pop$pop_2020, useNA = "always")
30704.60156 31593.93555 33673.07031 37006.19922 37020.23047 40941.14063
4 2 5 3 5 3
45870.40625 48874.85938 50516.83203 52814.09375 54163.03906 54581.34375
2 3 5 5 2 5
54907.13672 56695.67188 57503.09766 58679.96094 58700.12109 59849.41797
3 3 3 5 5 5
61512.64453 61824.52344 62841.11328 67026.73438 67056.1875 67375.28906
5 3 3 3 3 5
69251.94531 71401.75 77282.84375 77521.48438 77898.125 79467.95313
5 5 5 5 5 3
81023.49219 81508.78125 82079.375 83644.24219 83905.03906 84133.35156
5 5 5 5 5 3
85814.40625 87761.92188 87921.94531 91112.5625 91721.875 92058.78125
5 3 5 4 4 3
92482.1875 96325.78125 98767.44531 101113.6953 103601.3125 103981.0078
3 5 3 3 5 3
106124.9453 106499.375 107106.2422 107900.7969 108828.8672 110095.6328
3 5 3 5 3 4
114048.4219 115984.1563 118126.1328 118133.6484 118237.6563 118544.1641
5 3 3 5 5 3
118883.8047 123900.9219 132333.6406 133419.8281 135459.875 135854.7813
5 3 5 5 3 3
137886.0313 138663.4375 141698.2969 142577.625 146013.9531 148235.6563
3 3 3 5 3 5
150378.2031 150965.9219 152579.0781 156905.9531 158816.875 161032.0313
5 5 3 5 5 5
162097.4375 162234.1406 163914.8125 168056.4531 168097.3594 169051.4688
5 5 5 5 5 3
170751.8438 174675.875 177167.3906 177633.4531 179343.9531 181282.9688
3 5 5 4 3 5
183040.1875 189030.5469 190355.9219 211248.75 224233.1563 226902.625
5 3 3 3 5 5
236634.4219 237247.6094 238477.2188 241443.6406 241907.3906 244031.2344
5 5 3 5 5 3
247608.5313 269249.125 271302.5938 292503 302964.5313 337449.4688
5 3 3 3 5 3
338920 375986.1563 452313.1563 458736.2188 513965.5 2389815.75
5 3 3 3 3 5
<NA>
10
We only added 1 extra variable and no new observations, which is what we expected! However, it seems there are many NAs in this dataset that weren’t in the dist_pops when we loaded it in. This usually happens with joins where two columns that we think are the same (our “keys”) are not actually exactly the same. There may be one value that appears in our right-key column but not in our left-key column, so now NAs have been introduced because there was nothing to match it with in the left-key column. There are lots of ways to find out exactly where we went wrong, but usually with province and district, it is the case that some spelling error was introduced in one of the datasets, or an alternative spelling was used.
Question 4. Where are the errors coming from in the joined dataset? (Hint: There are at least two discrepancies, one each in our province and district fields respectively)
We can fix these issues using the ifelse command that we learned in the “Cleaning Messy Data” problem set from two weeks ago. Let’s go back into the dist_pops dataset, fix those issues, and try the join again.
#First fix the issues in the district populations dataset
<- dist_pop %>%
dist_pop_clean select(c("province" = "provive",
"district" = "District",
"pop_2020" = "2020 pop")) %>%
filter(district != "Test") %>%
mutate(province = ifelse(district == "Siavonga", "Southern", province),
district = ifelse(district == "Kalaba", "Kalabo", district))
#Now Join the two datasets
<- case_data_2020 %>%
case_data_pop left_join(dist_pop_clean, by = c("district", "province"))
Now that we have redone our join, did we fix the problem? Make sure to check your variables to see if things went as expected, or if there are any other issues that we may have missed.
table(case_data_pop$pop_2020, useNA = "always")
30704.60156 31593.93555 33673.07031 37006.19922 37020.23047 40941.14063
4 2 5 3 5 3
45870.40625 48874.85938 50516.83203 52814.09375 54163.03906 54581.34375
2 3 5 5 2 5
54907.13672 56421.33203 56695.67188 57503.09766 58679.96094 58700.12109
3 5 3 3 5 5
59849.41797 61512.64453 61824.52344 62841.11328 67026.73438 67056.1875
5 5 3 3 3 3
67375.28906 69251.94531 71401.75 77282.84375 77521.48438 77898.125
5 5 5 5 5 5
79467.95313 81023.49219 81508.78125 82079.375 83644.24219 83905.03906
3 5 5 5 5 5
84133.35156 85814.40625 87761.92188 87921.94531 91112.5625 91721.875
3 5 3 5 4 4
92058.78125 92482.1875 96325.78125 98767.44531 101113.6953 103293.1406
3 3 5 3 3 5
103601.3125 103981.0078 106124.9453 106499.375 107106.2422 107900.7969
5 3 3 5 3 5
108828.8672 110095.6328 114048.4219 115984.1563 118126.1328 118133.6484
3 4 5 3 3 5
118237.6563 118544.1641 118883.8047 123900.9219 132333.6406 133419.8281
5 3 5 3 5 5
135459.875 135854.7813 137886.0313 138663.4375 141698.2969 142577.625
3 3 3 3 3 5
146013.9531 148235.6563 150378.2031 150965.9219 152579.0781 156905.9531
3 5 5 5 3 5
158816.875 161032.0313 162097.4375 162234.1406 163914.8125 168056.4531
5 5 5 5 5 5
168097.3594 169051.4688 170751.8438 174675.875 177167.3906 177633.4531
5 3 3 5 5 4
179343.9531 181282.9688 183040.1875 189030.5469 190355.9219 211248.75
3 5 5 3 3 3
224233.1563 226902.625 236634.4219 237247.6094 238477.2188 241443.6406
5 5 5 5 3 5
241907.3906 244031.2344 247608.5313 269249.125 271302.5938 292503
5 3 5 3 3 3
302964.5313 337449.4688 338920 375986.1563 452313.1563 458736.2188
5 3 5 3 3 3
513965.5 2389815.75 <NA>
3 5 0
Now that we have joined and cleaned our two datasets, we can start to use them for analysis!
Question 5. What are the 10 highest malaria-burden districts by confirmed cases per 1000 population for 2020 (Hint: Use mutate() to make a new column that = cases divided by population times 1000, and make sure to use only the correct data elements)? How is this list different than if we were only using the raw case numbers? Bonus points to anyone who shows this information using a plot