Objectives: In this lesson, you’ll learn how to rename() columns in a data.frame, change one or more vectors’ class using mutate() , replace missing values using na_if(), and clean up character and categorical variables using str_detect(), chartr(), and str_to_title().
This brief introduction to data cleaning tackles some of the most common problems encountered when importing data that needs additional formatting, i.e., “messy data.” Here, we’ll review a few tools to get variables into the right format, with consistent coding of missing and character data. In other words, we’ll go over functions to generate “clean data” that can be transformed between wide and long formats, used to create visualizations, and evaluated using summary statistics. Later modules will go into greater detail on how to handle missing data and identify outliers.
To start, we’ll generate a toy messy data set that includes monthly malaria case counts and hospitalizations for five health facilities.
library(tidyverse)malaria_dat <-data.frame(health.facility.id =rep(c("Hnd7FjkO9","1Yud8rt9s", "PLd3j2ncr","LJn8gzrq2", "G5KNp94mL"), each =5),fac_name =rep(c("Dembé", "Diawar", "Gueye-Padalal", "Bassine", "Fété niébé"), each =5),health.facility.type =rep(c("Helath center", "Health center", "Health post", "Health Center","NA"), each =5),date =rep(as.character(seq.Date(from =as.Date("2010-01-01"), to =as.Date("2010-05-01"), by ="month")), 5),cases =c("59","66","NA","55","54","44","NA","65","70","54","49","57","39","34","45",NA,"74","55","NA","73","50","57","70","44",NA),hosp =c("0","0","1","1","0","0","NA","0","0","1","1","NA","1","2","2","3","2","NA","2","1","0","1",NA,"0","NA"))summary(malaria_dat)
health.facility.id fac_name health.facility.type date
Length:25 Length:25 Length:25 Length:25
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
cases hosp
Length:25 Length:25
Class :character Class :character
Mode :character Mode :character
12.1 Rename columns
First, we notice that some column names are longer than we’d like. We rename them using rename(). For this function, the new name we assign to each variable goes on the left-hand side of the equal sign, while the current name goes on the right.
rename() preserves the order of the columns in your data frame, changing only the variable names.
12.2 Change data classes
From the summary above, we can see that the dates, case data, and hospitalization data are all in character format (because we coded them that way). However, we’d like them to be dates and numeric.
First we change date from a character to a Date using the as.Date() function. Before we convert it, we should have a look at the data to verify they are, in fact, dates in a consistent format.
table(malaria_dat$date, useNA ="always") #Review format of dates
malaria_dat <- malaria_dat %>%mutate(date =as.Date(date)) #Use as.Date() to reassign the class to "Date"class(malaria_dat$date) #Confirm the class of malaria_dat$date is now "Date"
[1] "Date"
While dates are a seemingly simple data type, they’re actually comprised of three different pieces of data, year, month, and day, each which can be represented in multiple ways. Formatting dates can be tricky in R and may require getting all dates into the same format, specifying the order of each of the three components, or specifying the preferred output format. An alternative to the as.Date() function for dates in YYYY-MM-DD format is ymd(), also from the lubridate package, which loads with tidyverse.
Next, we’ll change cases and hospitalizations to numeric data. Again, before we do, we’ll review the data in each of the two columns, revealing our missing data is inconsistently represented. Some missing values are represented by the character string, “NA,” which we don’t want. We’ll revisit this after we convert cases and hospitalizations to numeric.
#3 values are character strings that read NA, while two are true NAs, shown as <NA>table(malaria_dat$cases, useNA ="always")
To change our case and hospitalization data to numeric data, we’ll use mutate() with across(), which allows us to change the class of multiple columns at one time. When you use across(), you need to concatenate your column names using c() inside the parentheses.
malaria_dat <- malaria_dat %>%#This function takes the specified columns and converts them to the specified class.mutate(across(c("cases", "hosp"), as.numeric))
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(c("cases", "hosp"), as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#Summary statistics now show for cases and hosp, indicating they're numeric.summary(malaria_dat)
fac_id fac_name fac_type date
Length:25 Length:25 Length:25 Min. :2010-01-01
Class :character Class :character Class :character 1st Qu.:2010-02-01
Mode :character Mode :character Mode :character Median :2010-03-01
Mean :2010-03-02
3rd Qu.:2010-04-01
Max. :2010-05-01
cases hosp
Min. :34.00 Min. :0.00
1st Qu.:48.00 1st Qu.:0.00
Median :55.00 Median :1.00
Mean :55.70 Mean :0.90
3rd Qu.:65.25 3rd Qu.:1.25
Max. :74.00 Max. :3.00
NA's :5 NA's :5
When we change cases and hospitalizations to numeric, we receive a warning message, “NAs introduced by coercion.” This error arises because some values are character strings that read “NA” instead of true missing values. If you attempt to convert these to numeric, they will be coerced to NA because they have no numeric equivalent. Try typing is.numeric("five") in your console to see this in action. You can imagine if some of your case values were stored as text, you’d need to change them to numbers before reclassifying the entire variable as numeric. Otherwise, you’d lose those data points.
12.3 Replacing missing values
While we coerced most of our “NA” character strings to NA during the previous step, we missed a few in the variable fac_type. We’ll set those values to missing using na_if. You’ll note for the next few exercises, we’ll create a new variable for fac_type at each step. In your own work, you may decide to update the original variable instead of creating a new one. However, creating a new variable while you’re taking a first pass at your code will make it easier to see that everything is working as expected.
malaria_dat <- malaria_dat %>%#' This line of code will replace all instances of "NA" with NA (missing) for #' the variable fac_type. Note here "NA" is in quotes indicating that it's a #' character, as opposed to NA (missing data)mutate(fac_type2 =na_if(fac_type, "NA")) #' Compare the missing values in fac_type2 to those in the original. #' dnn will let you label your table so it's easier to interpret.table(malaria_dat$fac_type, malaria_dat$fac_type2, useNA ="always", dnn =c("fac_type", "fac_type2"))
fac_type2
fac_type Health center Health Center Health post Helath center <NA>
Health center 5 0 0 0 0
Health Center 0 5 0 0 0
Health post 0 0 5 0 0
Helath center 0 0 0 5 0
NA 0 0 0 0 5
<NA> 0 0 0 0 0
12.4 Correct string inconsistencies
Sometimes, we’d like to treat a character variable, like facility type, as categorical. However, the text may need to be cleaned first so that we can consolidate our values into appropriate categories, like health centers and health posts. We’ll use str_detect() from the stringr package, which is loaded with tidyverse, to identify health centers with different spellings and change them to a single character string, “Health center.”
#' Review the different spellings of "Health center" to identify those that need #' to be corrected.table(malaria_dat$fac_type2, useNA ="always")
Health center Health Center Health post Helath center <NA>
5 5 5 5 5
malaria_dat <- malaria_dat %>%#' Use str_detect to identify fac_type2 values that are health centers. #' Use if_else to reset them to "Health center." #' Otherwise, keep fac_type2 as is. #' Note: the quotation marks in str_detect are NOT present next to the pipe.mutate(fac_type3 =if_else(str_detect(fac_type2, c("Center|center")), "Health center", fac_type2)) table(malaria_dat$fac_type2, malaria_dat$fac_type3, useNA ="always", dnn =c("fac_type2", "fac_type3"))
fac_type3
fac_type2 Health center Health post <NA>
Health center 5 0 0
Health Center 5 0 0
Health post 0 5 0
Helath center 5 0 0
<NA> 0 0 5
12.5 Remove special characters
You may want to remove special characters from strings while working with your data. Here, we’ll remove the accent marks from the facility names using a base R function, chartr().
# Review the spelling of the facility names.table(malaria_dat$fac_name, useNA ="always")
#' Specify the characters you'd like to replace, the characters you'd like to #' replace them with, and the targeted column. Here we've included multiple #' characters, most of which aren't in the strings but theoretically might have been.malaria_dat <- malaria_dat %>%mutate(new_fac_name =chartr("áéèàôî", "aeeaoi", fac_name)) table(malaria_dat$new_fac_name)
Now, we’ll use str_to_title(), also from the stringr package, to change the facility names to title case. stringr has several useful functions for processing character data that are worth becoming familiar with.
# Review the case of the facility names.table(malaria_dat$new_fac_name, useNA ="always")
Exercise 1. Rename fac_name in malaria_dat to name.
Exercise 2. As discussed in the final section, you may want certain character class variables to be categorical, or factor variables. Change fac_type3 to a factor variable. Now, for practice, change the final three columns to be variables of the character class using mutate() with across().
Exercise 3. In this demo, we started with a data set that had instances of “NA” written as text in several columns. Ideally, we would like to change these to missing without having to use the na_if() on each column. Rerun the first chunk in the demo so that you’ve restored the messy data set. Now, try using mutate() and across() to change “NA” text to missing values for all columns.
Exercise 4. Change fac_type to title case. Next, change it to sentence case. Hint: Type stringr::str_to to see to different case functions in the stringr package.
Exercise 5. Write one line of code to determine if your health facilities are either “Health centers” or “Health posts.”
Exercise 6. Use str_detect() to replace the double spaces in the fac_name column with single spaces.
Exercise 7. When we cleaned fac_name, we only replaced accented letters. Use chartr() to change accented letters to letters without accents in addition to changing the hyphens to a single space.