12  Clean versus Messy Data

Objectives: In this lesson, you’ll learn how to rename() columns in a data.frame, change one or more vectors’ class using mutate() , replace missing values using na_if(), and clean up character and categorical variables using str_detect(), chartr(), and str_to_title().

This brief introduction to data cleaning tackles some of the most common problems encountered when importing data that needs additional formatting, i.e., “messy data.” Here, we’ll review a few tools to get variables into the right format, with consistent coding of missing and character data. In other words, we’ll go over functions to generate “clean data” that can be transformed between wide and long formats, used to create visualizations, and evaluated using summary statistics. Later modules will go into greater detail on how to handle missing data and identify outliers.

To start, we’ll generate a toy messy data set that includes monthly malaria case counts and hospitalizations for five health facilities.

library(tidyverse)

malaria_dat <- data.frame(
  health.facility.id = rep(c("Hnd7FjkO9","1Yud8rt9s", "PLd3j2ncr","LJn8gzrq2", "G5KNp94mL"), 
  each = 5),
  fac_name = rep(c("Dembé", "Diawar", "Gueye-Padalal", "Bassine", "Fété  niébé"), each = 5),
  health.facility.type = rep(c("Helath center", "Health center", "Health post", "Health Center","NA"), each = 5),
  date = rep(as.character(
    seq.Date(from = as.Date("2010-01-01"), 
             to = as.Date("2010-05-01"), 
             by = "month")), 5),
  cases = c("59","66","NA","55","54","44","NA","65","70",
            "54","49","57","39","34","45",NA,"74","55",
            "NA","73","50","57","70","44",NA),
  hosp = c("0","0","1","1","0","0","NA","0","0","1","1",
           "NA","1","2","2","3","2","NA","2","1","0","1",
           NA,"0","NA"))

summary(malaria_dat)
 health.facility.id   fac_name         health.facility.type     date          
 Length:25          Length:25          Length:25            Length:25         
 Class :character   Class :character   Class :character     Class :character  
 Mode  :character   Mode  :character   Mode  :character     Mode  :character  
    cases               hosp          
 Length:25          Length:25         
 Class :character   Class :character  
 Mode  :character   Mode  :character  

12.1 Rename columns

First, we notice that some column names are longer than we’d like. We rename them using rename(). For this function, the new name we assign to each variable goes on the left-hand side of the equal sign, while the current name goes on the right.

names(malaria_dat) #Look at the old column names
[1] "health.facility.id"   "fac_name"             "health.facility.type"
[4] "date"                 "cases"                "hosp"                
malaria_dat <- malaria_dat %>%
  rename(fac_id = health.facility.id,
         fac_type = health.facility.type)

names(malaria_dat) #Look at the new column names
[1] "fac_id"   "fac_name" "fac_type" "date"     "cases"    "hosp"    

rename() preserves the order of the columns in your data frame, changing only the variable names.

12.2 Change data classes

From the summary above, we can see that the dates, case data, and hospitalization data are all in character format (because we coded them that way). However, we’d like them to be dates and numeric.

First we change date from a character to a Date using the as.Date() function. Before we convert it, we should have a look at the data to verify they are, in fact, dates in a consistent format.

table(malaria_dat$date, useNA = "always") #Review format of dates

2010-01-01 2010-02-01 2010-03-01 2010-04-01 2010-05-01       <NA> 
         5          5          5          5          5          0 
malaria_dat <- malaria_dat %>%
  mutate(date = as.Date(date)) #Use as.Date() to reassign the class to "Date"

class(malaria_dat$date) #Confirm the class of malaria_dat$date is now "Date"
[1] "Date"

While dates are a seemingly simple data type, they’re actually comprised of three different pieces of data, year, month, and day, each which can be represented in multiple ways. Formatting dates can be tricky in R and may require getting all dates into the same format, specifying the order of each of the three components, or specifying the preferred output format. An alternative to the as.Date() function for dates in YYYY-MM-DD format is ymd(), also from the lubridate package, which loads with tidyverse.

Next, we’ll change cases and hospitalizations to numeric data. Again, before we do, we’ll review the data in each of the two columns, revealing our missing data is inconsistently represented. Some missing values are represented by the character string, “NA,” which we don’t want. We’ll revisit this after we convert cases and hospitalizations to numeric.

#3 values are character strings that read NA, while two are true NAs, shown as <NA>
table(malaria_dat$cases, useNA = "always") 

  34   39   44   45   49   50   54   55   57   59   65   66   70   73   74   NA 
   1    1    2    1    1    1    2    2    2    1    1    1    2    1    1    3 
<NA> 
   2 
table(malaria_dat$hosp, useNA = "always")

   0    1    2    3   NA <NA> 
   8    7    4    1    4    1 

To change our case and hospitalization data to numeric data, we’ll use mutate() with across(), which allows us to change the class of multiple columns at one time. When you use across(), you need to concatenate your column names using c() inside the parentheses.

malaria_dat <- malaria_dat %>%
  #This function takes the specified columns and converts them to the specified class.
  mutate(across(c("cases", "hosp"), as.numeric)) 
Warning: There were 2 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(c("cases", "hosp"), as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
#Summary statistics now show for cases and hosp, indicating they're numeric.
summary(malaria_dat) 
    fac_id            fac_name           fac_type              date           
 Length:25          Length:25          Length:25          Min.   :2010-01-01  
 Class :character   Class :character   Class :character   1st Qu.:2010-02-01  
 Mode  :character   Mode  :character   Mode  :character   Median :2010-03-01  
                                                          Mean   :2010-03-02  
                                                          3rd Qu.:2010-04-01  
                                                          Max.   :2010-05-01  
                                                                              
     cases            hosp     
 Min.   :34.00   Min.   :0.00  
 1st Qu.:48.00   1st Qu.:0.00  
 Median :55.00   Median :1.00  
 Mean   :55.70   Mean   :0.90  
 3rd Qu.:65.25   3rd Qu.:1.25  
 Max.   :74.00   Max.   :3.00  
 NA's   :5       NA's   :5     

When we change cases and hospitalizations to numeric, we receive a warning message, “NAs introduced by coercion.” This error arises because some values are character strings that read “NA” instead of true missing values. If you attempt to convert these to numeric, they will be coerced to NA because they have no numeric equivalent. Try typing is.numeric("five") in your console to see this in action. You can imagine if some of your case values were stored as text, you’d need to change them to numbers before reclassifying the entire variable as numeric. Otherwise, you’d lose those data points.

12.3 Replacing missing values

While we coerced most of our “NA” character strings to NA during the previous step, we missed a few in the variable fac_type. We’ll set those values to missing using na_if. You’ll note for the next few exercises, we’ll create a new variable for fac_type at each step. In your own work, you may decide to update the original variable instead of creating a new one. However, creating a new variable while you’re taking a first pass at your code will make it easier to see that everything is working as expected.

malaria_dat <- malaria_dat %>%
  #' This line of code will replace all instances of "NA" with NA (missing) for 
  #' the variable fac_type. Note here "NA" is in quotes indicating that it's a 
  #' character, as opposed to NA (missing data)
  mutate(fac_type2 = na_if(fac_type, "NA")) 

#' Compare the missing values in fac_type2 to those in the original. 
#' dnn will let you label your table so it's easier to interpret.
table(malaria_dat$fac_type, malaria_dat$fac_type2, useNA = "always", 
      dnn = c("fac_type", "fac_type2")) 
               fac_type2
fac_type        Health center Health Center Health post Helath center <NA>
  Health center             5             0           0             0    0
  Health Center             0             5           0             0    0
  Health post               0             0           5             0    0
  Helath center             0             0           0             5    0
  NA                        0             0           0             0    5
  <NA>                      0             0           0             0    0

12.4 Correct string inconsistencies

Sometimes, we’d like to treat a character variable, like facility type, as categorical. However, the text may need to be cleaned first so that we can consolidate our values into appropriate categories, like health centers and health posts. We’ll use str_detect() from the stringr package, which is loaded with tidyverse, to identify health centers with different spellings and change them to a single character string, “Health center.”

#' Review the different spellings of "Health center" to identify those that need 
#' to be corrected.
table(malaria_dat$fac_type2, useNA = "always") 

Health center Health Center   Health post Helath center          <NA> 
            5             5             5             5             5 
malaria_dat <- malaria_dat %>%
  #' Use str_detect to identify fac_type2 values that are health centers. 
  #' Use if_else to reset them to "Health center." 
  #' Otherwise, keep fac_type2 as is. 
  #' Note: the quotation marks in str_detect are NOT present next to the pipe.
  mutate(fac_type3 = if_else(str_detect(fac_type2, c("Center|center")), "Health center", fac_type2)) 

table(malaria_dat$fac_type2, malaria_dat$fac_type3, useNA = "always", 
      dnn = c("fac_type2", "fac_type3"))
               fac_type3
fac_type2       Health center Health post <NA>
  Health center             5           0    0
  Health Center             5           0    0
  Health post               0           5    0
  Helath center             5           0    0
  <NA>                      0           0    5

12.5 Remove special characters

You may want to remove special characters from strings while working with your data. Here, we’ll remove the accent marks from the facility names using a base R function, chartr().

# Review the spelling of the facility names.
table(malaria_dat$fac_name, useNA = "always") 

      Bassine         Dembé        Diawar   Fété  niébé Gueye-Padalal 
            5             5             5             5             5 
         <NA> 
            0 
#' Specify the characters you'd like to replace, the characters you'd like to 
#' replace them with, and the targeted column. Here we've included multiple 
#' characters, most of which aren't in the strings but theoretically might have been.
malaria_dat <- malaria_dat %>%
  mutate(new_fac_name = chartr("áéèàôî", "aeeaoi", fac_name)) 

table(malaria_dat$new_fac_name)

      Bassine         Dembe        Diawar   Fete  niebe Gueye-Padalal 
            5             5             5             5             5 

12.6 Change the case

Now, we’ll use str_to_title(), also from the stringr package, to change the facility names to title case. stringr has several useful functions for processing character data that are worth becoming familiar with.

# Review the case of the facility names.
table(malaria_dat$new_fac_name, useNA = "always") 

      Bassine         Dembe        Diawar   Fete  niebe Gueye-Padalal 
            5             5             5             5             5 
         <NA> 
            0 
# Update new_fac_name case to title.
malaria_dat <- malaria_dat %>%
  mutate(new_fac_name = str_to_title(new_fac_name)) 

# Review the change.
table(malaria_dat$new_fac_name, useNA = "always") 

      Bassine         Dembe        Diawar   Fete  Niebe Gueye-Padalal 
            5             5             5             5             5 
         <NA> 
            0 

12.7 Practice

Exercise 1. Rename fac_name in malaria_dat to name.

Exercise 2. As discussed in the final section, you may want certain character class variables to be categorical, or factor variables. Change fac_type3 to a factor variable. Now, for practice, change the final three columns to be variables of the character class using mutate() with across().

Exercise 3. In this demo, we started with a data set that had instances of “NA” written as text in several columns. Ideally, we would like to change these to missing without having to use the na_if() on each column. Rerun the first chunk in the demo so that you’ve restored the messy data set. Now, try using mutate() and across() to change “NA” text to missing values for all columns.

Exercise 4. Change fac_type to title case. Next, change it to sentence case. Hint: Type stringr::str_to to see to different case functions in the stringr package.

Exercise 5. Write one line of code to determine if your health facilities are either “Health centers” or “Health posts.”

Exercise 6. Use str_detect() to replace the double spaces in the fac_name column with single spaces.

Exercise 7. When we cleaned fac_name, we only replaced accented letters. Use chartr() to change accented letters to letters without accents in addition to changing the hyphens to a single space.