5  Spatial data

Types of spatial data

5.1 Introduction

Welcome to the first lesson in our series on analyzing spatial data in R, with a focus on malaria in sub-Saharan Africa. Spatial data, information tied to a specific location, is incredibly valuable for understanding disease patterns, targeting interventions, and evaluating public health programs. Whether you’re mapping malaria incidence, analyzing access to insecticide-treated nets (ITNs), or optimizing resource allocation, understanding spatial data is crucial.

This lesson will cover:

  • Types of Spatial Data: Raster and Vector
  • Geographic Projections: Why they matter and how to handle them.
  • Essential R Packages: sf, terra, and ggplot2. We’ll also introduce malariaAtlas!
  • Practical Examples: Applying these concepts to malaria-related data using the malariaAtlas package.

5.2 Types of Spatial Data

Spatial data comes in two main forms:

  • Raster Data: Think of this as a grid of cells, like a digital photograph. Each cell holds a value representing a specific characteristic at that location. Common examples include:

    • Satellite Imagery: Land cover, vegetation indices (NDVI), temperature, precipitation.
    • Elevation Data: Digital Elevation Models (DEMs).
    • Gridded Climate Data: Temperature, rainfall.
    • Population Data: Gridded population counts or densities are crucial for calculating rates (e.g., malaria cases per 1000 people). Examples include datasets from WorldPop (https://www.worldpop.org/) and the Gridded Population of the World (GPW).
    • Modelled Malaria Incidence Surfaces: These are predictive maps of malaria risk, often generated using statistical models that incorporate environmental, socioeconomic, and demographic factors. The Malaria Atlas Project (https://malariaatlas.org/) is a primary source for these surfaces.

    In the context of malaria, raster data provides valuable insights into ecological suitability, environmental conditions, population distribution, and disease risk. For example, a high-resolution population raster overlaid with a malaria incidence raster can help identify high-risk populations and prioritize interventions.

    Example: Here’s an example of a raster visualizing population density:

    WorldPop Population density map
  • Vector Data: Represents spatial features as points, lines, or polygons.

    • Points: Individual locations, such as health facilities, households surveyed, or locations of mosquito traps.
    • Lines: Roads, rivers, or migration routes. Useful for assessing access to healthcare or movement patterns.
    • Polygons: Areas like administrative boundaries (countries, districts), land parcels, or areas of specific land use (e.g., rice paddies, urban areas). Essential for calculating rates and performing spatial joins.

    For malaria analysis, vector data can represent administrative units for calculating malaria rates, locations of health facilities, or areas receiving specific interventions. Shapefiles are the most common way to store vector data.

    Shapefiles: A Deeper Dive

    Shapefiles, traditionally associated with ESRI (Environmental Systems Research Institute), are a ubiquitous format for storing vector data. However, it’s essential to understand that a “shapefile” isn’t a single file. It’s a collection of files with the same base name but different extensions. At a minimum, an ESRI shapefile requires these files:

    • .shp: The main file that stores the geometry (the coordinates of the points, lines, or polygons).
    • .shx: An index file that speeds up spatial queries.
    • .dbf: A database file (in dBase format) that stores the attribute data associated with each feature.
    • .prj: A projection file that defines the coordinate system used in the shapefile. (Important! Always check this!)

    Sometimes you’ll also see files like .sbn and .sbx (spatial index files) or .xml (metadata).

    When sharing a shapefile, you must share all these associated files together. If you only provide the .shp file, the data is incomplete and will likely not load properly.

    Alternative Vector Formats:

    While shapefiles are common, other vector formats exist, including:

    • GeoJSON (.geojson): A text-based format using JavaScript Object Notation (JSON) to represent geographic features. It’s increasingly popular due to its human-readable format and ease of use in web applications. sf can read and write GeoJSON files.
    • GeoPackage (.gpkg): An open, standards-based format for storing geospatial vector data in a single SQLite database file. It’s becoming a preferred format due to its simplicity and efficiency. sf also supports GeoPackage.

    Example: Here’s an example of a vector polygon dataset: Vector Polygon Data

    Learn More:

5.3 Geographic Projections: Why Coordinates Matter

Imagine trying to flatten an orange peel onto a flat surface. It’s impossible without distortion. The Earth is a sphere (more accurately, a geoid), and representing it on a flat map (in R or anywhere else) requires a projection.

  • Why Projections Matter: Different projections distort different properties (area, shape, distance, direction). Choosing the right projection is crucial for accurate measurements and analyses.

  • Coordinate Reference Systems (CRS): A CRS defines how geographic coordinates (latitude and longitude) are related to a flat map. It includes both the datum (a reference point for the Earth’s shape and position) and the projection itself.

  • Common CRSs:

    • WGS 84 (EPSG:4326): A widely used geographic CRS based on latitude and longitude. Often the default for data downloaded online. Good for displaying global data, but distances and areas are distorted.
    • UTM (Universal Transverse Mercator): Divides the Earth into zones, each with its own projection. Good for measuring distances and areas within a specific zone. Needs to be chosen based on the area of interest. For example, a large area in Eastern Africa might use UTM Zone 37S or 37N (check a UTM zone map!).
    • Africa Albers Equal Area Conic (EPSG:102022): A projected CRS designed to minimize area distortion across the African continent. Useful when comparing areas of different regions.
  • Checking and Transforming Projections in R: You’ll often need to check the CRS of your data and transform it to a more suitable one for your analysis.

  • Matching CRSs: Shapefiles and Coordinate Lists

    A common challenge arises when you have spatial data in different formats, such as a shapefile for administrative boundaries and a CSV file containing the locations of health facilities (latitude and longitude). To accurately map these datasets together, you must ensure they share the same CRS.

    Here’s a typical workflow:

    1. Determine the CRS of the Shapefile: Use st_crs(your_shapefile) (we’ll show you how to load shapefiles in the R examples) to identify the CRS of your shapefile. Note the EPSG code (e.g., 4326, 32637, 102022).

    2. Create an sf Object from the Coordinate List: Assuming your CSV file has columns named longitude and latitude, use st_as_sf() (we’ll show you how to do this in the R examples) to convert it into an sf object. Crucially, specify the CRS when creating the object. You’ll need to load your CSV data into R first and then transform it to a special object.

    3. Transform the Coordinate List (if necessary): If the CRS of the new object does not match the CRS of your administrative boundary shapefile, use st_transform() (again, to be demonstrated in a code example) to transform it to the correct CRS.

    4. Verify the Transformation: After the transformation, always double-check the CRS of the transformed object using st_crs() to ensure it matches the shapefile’s CRS.

    Important Considerations:

    • Assumptions: It’s vital to know the CRS of your coordinate list. Don’t assume it’s WGS 84. If you’re unsure, contact the data provider or consult the data’s documentation. Making incorrect assumptions about the CRS can lead to significant mapping errors.
    • Unit Consistency: When performing spatial analysis (e.g., calculating distances), ensure all your data is in a projected CRS with units in meters or kilometers. Geographic CRSs (like WGS 84) use degrees, which are not suitable for distance calculations.

    Example: Here’s an example showing how coordinates with the wrong coordinate system look when plotted with shapefile data: Coordinate reference system

    Learn More:

5.4 Essential R Packages

Let’s introduce the key R packages we’ll be using.

  • sf (Simple Features): This is the modern standard for working with vector data in R. It efficiently handles shapefiles, geodatabases, and other vector formats. It seamlessly integrates with the tidyverse.

  • terra: A powerful package for working with raster data. It is designed to be a replacement for the older raster package, offering increased speed and efficiency. It is closely integrated with the sf package for vector data.

  • ggplot2: A versatile and widely-used package for creating visually appealing and informative maps and plots. It works beautifully with both sf and terra objects.

  • malariaAtlas: A package that provides access to a wealth of malaria-related data, including survey data, raster layers of malaria incidence and prevalence, and shapefiles of administrative boundaries. It’s a fantastic resource for malaria researchers.

  • tidyverse: A collection of R packages that share an underlying design philosophy, grammar, and data structures. We’ll be using the dplyr package for data manipulation, and the tidyr package for data wrangling.

  • tidyterra: Extends the functionality of ggplot2 to the terra package.

5.5 Practical Examples in R

Now, let’s apply these concepts to a practical example related to malaria in sub-Saharan Africa, using the malariaAtlas package to obtain data.

# Install the packages if you haven't already
# install.packages(c("sf", "terra", "ggplot2", "malariaAtlas", "tidyverse", "tidyterra"))

# Load the packages
library(sf)
Warning: package 'sf' was built under R version 4.3.3
Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
library(terra)
terra 1.7.55
library(malariaAtlas)
Warning: package 'malariaAtlas' was built under R version 4.3.3
Loading required package: ggplot2
Warning: package 'ggplot2' was built under R version 4.3.3
Registered S3 method overwritten by 'malariaAtlas':
  method           from   
  autoplot.default ggplot2
library(tidyverse)
Warning: package 'tidyr' was built under R version 4.3.3
Warning: package 'readr' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract() masks terra::extract()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyterra)
Warning: package 'tidyterra' was built under R version 4.3.3
Registered S3 method overwritten by 'tidyterra':
  method              from        
  autoplot.SpatRaster malariaAtlas

Attaching package: 'tidyterra'

The following object is masked from 'package:stats':

    filter

5.5.1 Getting Administrative Boundaries with malariaAtlas

# Select a country - Let's stick with Ethiopia
cty = "Ethiopia"

# Get the district shapefile (admin level 2)
adm2 <- getShp(cty, admin_level = "admin2")
Loading ISO 19139 XML schemas...
Loading ISO 19115 codelists...
Please Note: Because you did not provide a version, by default the version being used is 202403 (This is the most recent version of admin unit shape data. To see other version options use function listShpVersions)
Start tag expected, '<' not found
Start tag expected, '<' not found
# Print the first few rows of the data frame
head(adm2)
Simple feature collection with 6 features and 16 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 35.1156 ymin: 5.6004 xmax: 39.4539 ymax: 12.3047
Geodetic CRS:  WGS 84
  iso admn_level   name_0     id_0  type_0 name_1      id_1 type_1
1 ETH          2 Ethiopia 10000850 Country Amhara  10313856 Region
2 ETH          2 Ethiopia 10000850 Country Sidama 868130572 Region
3 ETH          2 Ethiopia 10000850 Country   SNNP  10313355 Region
4 ETH          2 Ethiopia 10000850 Country   SNNP  10313355 Region
5 ETH          2 Ethiopia 10000850 Country   SNNP  10313355 Region
6 ETH          2 Ethiopia 10000850 Country Oromia  10316246 Region
            name_2      id_2     type_2 name_3 id_3 type_3
1        Lake Tana 868128071 Water Body     NA   NA     NA
2       Lake Awasa 868128081 Water Body     NA   NA     NA
3            Sheka  10016531       Zone     NA   NA     NA
4             Gamo 868128083       Zone     NA   NA     NA
5            Amaro  10016516       Zone     NA   NA     NA
6 North Shewa (OR)  10016504       Zone     NA   NA     NA
                       source                       geometry country_level
1 Ethiopia CSA and BoFED 2020 MULTIPOLYGON (((37.6035 12....         ETH_2
2 Ethiopia CSA and BoFED 2020 MULTIPOLYGON (((38.482 7.09...         ETH_2
3 Ethiopia CSA and BoFED 2020 MULTIPOLYGON (((35.5207 7.8...         ETH_2
4 Ethiopia CSA and BoFED 2020 MULTIPOLYGON (((37.2744 6.7...         ETH_2
5 Ethiopia CSA and BoFED 2020 MULTIPOLYGON (((38.0884 5.7...         ETH_2
6 Ethiopia CSA and BoFED 2020 MULTIPOLYGON (((38.7529 10....         ETH_2
# Check the CRS
st_crs(adm2)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["geodetic latitude (Lat)",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["geodetic longitude (Lon)",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]
# Plot a simple map
ggplot() +
  geom_sf(data = adm2) +
  ggtitle(paste("Administrative Districts (Level 2) in", cty)) +
  theme_void()

Explanation:

  • getShp(): A function from the malariaAtlas package that downloads the administrative boundaries shapefile for the specified country and admin level. Admin level 2 generally refers to districts or similar sub-national units.
  • head(): Shows the first few rows of the data, allowing us to explore the attributes associated with each district.
  • st_crs(): Confirms the Coordinate Reference System of the shapefile. Notice the EPSG code.
  • The ggplot code creates a basic map of the districts. theme_void() removes unnecessary map elements for a cleaner look.

5.5.2 Obtaining and Mapping Malaria Survey Data

# Get the malaria prevalence survey data
pr <- getPR(country = cty, species = "BOTH")
Please Note: Because you did not provide a version, by default the version being used is 202406 (This is the most recent version of PR data. To see other version options use function listPRPointVersions)
Start tag expected, '<' not found
Start tag expected, '<' not found
Warning: The `sourcedata` argument of `isAvailable_pr()` is deprecated as of
malariaAtlas 1.6.0.
ℹ The argument 'sourcedata' has been deprecated. It will be removed in the next
  version. It has no meaning.
ℹ The deprecated feature was likely used in the malariaAtlas package.
  Please report the issue at
  <https://github.com/malaria-atlas-project/malariaAtlas/issues>.
Please Note: Because you did not provide a version, by default the version being used is 202406 (This is the most recent version of PR data. To see other version options use function listPRPointVersions)
Confirming availability of PR data for: Ethiopia...
PR points are available for Ethiopia.
Attempting to download PR point data for Ethiopia ...
Start tag expected, '<' not found
Start tag expected, '<' not found
Data downloaded for Ethiopia.
# Inspect the first few rows
head(pr)
   dhs_id site_id               site_name latitude longitude rural_urban
7    <NA>    6694             Dole School  5.90141  38.94115     UNKNOWN
9    <NA>   10320                 Langano  7.51600  38.77800     UNKNOWN
12   <NA>    8017          Gongoma School  6.31747  39.83618     UNKNOWN
15   <NA>   12873           Buriya School  7.56736  40.75210     UNKNOWN
18   <NA>   12260                 Chobo 3  7.89960  34.45080     UNKNOWN
21   <NA>   21446 Mekele, Quiha & Aynalem 13.47600  39.50200     UNKNOWN
    country country_id continent_id month_start year_start month_end year_end
7  Ethiopia        ETH       Africa           6       2009         6     2009
9  Ethiopia        ETH       Africa           4       2003         4     2003
12 Ethiopia        ETH       Africa           6       2009         6     2009
15 Ethiopia        ETH       Africa           6       2009         6     2009
18 Ethiopia        ETH       Africa           9       1989        10     1989
21 Ethiopia        ETH       Africa           9       1993        12     1993
   lower_age upper_age examined positive        pr       species     method
7        4.0        15      110        0 0.0000000 P. falciparum Microscopy
9        4.0        14       89        0 0.0000000 P. falciparum Microscopy
12       4.0        15      108        0 0.0000000 P. falciparum Microscopy
15       4.0        15       63        0 0.0000000 P. falciparum Microscopy
18       0.0        99      108       14 0.1296296 P. falciparum Microscopy
21       0.5         5     2080        0 0.0000000 P. falciparum Microscopy
   rdt_type pcr_type malaria_metrics_available location_available
7      <NA>     <NA>                      TRUE               TRUE
9      <NA>     <NA>                      TRUE               TRUE
12     <NA>     <NA>                      TRUE               TRUE
15     <NA>     <NA>                      TRUE               TRUE
18     <NA>     <NA>                      TRUE               TRUE
21     <NA>     <NA>                      TRUE               TRUE
   permissions_info
7              <NA>
9              <NA>
12             <NA>
15             <NA>
18             <NA>
21             <NA>
                                                                                                                                                                                                                                                                                               citation1
7  Ashton, RA, Kefyalew, T, Tesfaye, G, Pullan, RL, Yadeta, D, Reithinger, R, Kolaczinski, JH and Brooker, S (2011).  <b>School-based surveys of malaria in Oromia Regional State, Ethiopia: a rapid survey method for malaria in low transmission settings.</b> <i>Malaria Journal</i>, <b>10</b>(1):25
9                                                               Legesse, M. and Erko, B. (2004).  <b>Prevalence of intestinal parasites among school children in a rural area close to the southeast of Lake Langano, Ethiopia.</b> <i>Ethiopian Journal of Health Development</i>, <b>18</b>(2):116-120
12 Ashton, RA, Kefyalew, T, Tesfaye, G, Pullan, RL, Yadeta, D, Reithinger, R, Kolaczinski, JH and Brooker, S (2011).  <b>School-based surveys of malaria in Oromia Regional State, Ethiopia: a rapid survey method for malaria in low transmission settings.</b> <i>Malaria Journal</i>, <b>10</b>(1):25
15 Ashton, RA, Kefyalew, T, Tesfaye, G, Pullan, RL, Yadeta, D, Reithinger, R, Kolaczinski, JH and Brooker, S (2011).  <b>School-based surveys of malaria in Oromia Regional State, Ethiopia: a rapid survey method for malaria in low transmission settings.</b> <i>Malaria Journal</i>, <b>10</b>(1):25
18                                                                                    Nigatu, W., Abebe, M. and Dejene, A. (1992).  <b><i>Plasmodium vivax</i> and <i>P. falciparum</i> epidemiology in Gambella, south-west Ethiopia.</b> <i>Tropical Medicine and Parasitology</i>, <b>43</b>(3):181-5
21                                                                                         Adish, A.A., Esrey, S.A., Gyorkos, T.W. and Johns, T. (1999).  <b>Risk factors for iron deficiency anaemia in preschool children in northern Ethiopia.</b> <i>Public Health Nutrition</i>, <b>2</b>(3):243-52
   citation2 citation3
7       <NA>      <NA>
9       <NA>      <NA>
12      <NA>      <NA>
15      <NA>      <NA>
18      <NA>      <NA>
21      <NA>      <NA>
# Summarize the number of surveys.
nrow(pr)
[1] 2794
# Convect survey data to sf object and match projection to adm2 shapefile
# Note: missing coordinates must be removed before converting to sf
pr_sf <- pr |>
  filter(!is.na(longitude) & !is.na(latitude) & !is.na(pr)) |>
  st_as_sf(coords = c("longitude", "latitude"), crs = st_crs(adm2))

# Plot data using ggplot
ggplot() +
  geom_sf(data = adm2, fill = NA, color = "black", linewidth = 0.1) +
  geom_sf(data = pr_sf, aes(fill = pr), shape = 21, alpha = 0.7, size = 2) +
  scale_fill_viridis_c(option = "F", direction = -1, name = "Prevalence") +
  labs(title = paste("Malaria Survey Prevalence in", cty),
       subtitle = "Data from malariaAtlas") +
  theme_void()

Explanation:

  • getPR(): Retrieves malaria prevalence survey data for the specified country. The species = "BOTH" argument ensures we get data for all malaria species.
  • head(pr): Displays the first few rows of the pr data frame, showing the structure of the survey data (longitude, latitude, prevalence, etc.).
  • pr |> summarise(N = length(unique(survey_id))): This code uses tidyverse syntax to calculate the number of unique surveys in the data.
  • filter(!is.na(longitude) & !is.na(latitude) & !is.na(pr)): Removes rows with missing longitude, latitude, or prevalence values to avoid errors when creating the spatial object. This is critical!
  • st_as_sf(): Converts the data frame to an sf object, using the longitude and latitude columns as coordinates. The crs = st_crs(adm2) argument ensures the survey data is in the same coordinate system as the administrative boundaries.
  • The ggplot code overlays the survey points on the map of districts, with the color of the points representing the malaria prevalence. shape = 21 creates points with both a fill and an outline, and alpha = 0.7 makes the points slightly transparent to see underlying features.

5.5.3 Mapping Malaria Incidence Raster Data

# Download malaria incidence raster from MAP
inc <- getRaster(dataset_id = "Malaria__202206_Global_Pf_Incidence_Rate",
                 shp = adm2)
Checking if the following Surface-Year combinations are available to download:

    DATASET ID  YEAR 
  - Malaria__202206_Global_Pf_Incidence_Rate:  DEFAULT 
No encoding supplied: defaulting to UTF-8.
<GMLEnvelope>
....|-- lowerCorner: 3.4011 32.9999 "2000-01-01T00:00:00"
....|-- upperCorner: 14.8942 47.9862 "2020-01-01T00:00:00"Start tag expected, '<' not found
# Plot the raster and shapefile using ggplot
ggplot() +
  geom_spatraster(data = inc) +
  geom_sf(data = adm2, fill = NA, color = "black", linewidth = 0.1) +
  scale_fill_viridis_c(
    option = "F", direction = -1,
    name = "Incidence", na.value = "transparent") +
  labs(title = paste("Malaria Incidence in", cty),
       subtitle = "Data from malariaAtlas") +
  theme_void()

Explanation:

  • getRaster(): Downloads a raster dataset from the Malaria Atlas Project (MAP). The dataset_id specifies the raster layer we want (in this case, global Plasmodium falciparum incidence rate). The shp = adm2 argument crops the raster to the extent of the administrative boundaries, which saves memory and processing time. The shp argument also ensures that the raster data have the same CRS as the provided shapefile.
  • geom_spatraster: Plots raster data objects that are of the terra class spatRaster, in ggplot.
  • scale_fill_viridis_c(): Sets the color gradient for the raster data.
  • The ggplot code displays the malaria incidence raster overlaid on the district boundaries. na.value = "transparent" makes the background (areas with no data) transparent, allowing the map underneath to show through.

5.6 Challenge Questions/Exercises

  1. Explore Different Countries: Modify the cty variable to analyze malaria data for a different country in sub-Saharan Africa (e.g., “Nigeria”, “Kenya”, “Uganda”). Run the code to generate maps for that country. What are some key differences in malaria prevalence or incidence compared to Ethiopia?
  2. Different Administrative Levels: Experiment with different admin_level arguments in getShp() (e.g., admin_level = "admin1" for regions/provinces). How does the granularity of the map change?
  3. Filter Survey Data: Filter the pr data frame to include only surveys conducted in a specific year or a specific region. Create a map showing only the filtered survey data.
  4. Change Color Scales: Experiment with different color palettes in scale_fill_viridis_c(). Try other options like "A", "B", "C", etc., and explore reverse the ordering of the colour scale by using direction = 1. Which color scale do you find most effective for visualizing malaria prevalence and incidence?
  5. Save the map: Save the last map as a png file using the ggsave() function.