Before getting started, please follow the installation instructions. This is a private R package, so the installation process differs from that of most other packages. The package also accesses remote data sources, so you will need internet access to use it. Once the package has been installed, you can load it into your R session.

Finding and retrieving available datasets

The primary goal of the PATHtoolsZambia R package is to provide access to Zambia-related datasets that are “clean” and up-to-date. The list_data() function returns a table of the available datasets, giving the reference name and a brief description of each.

list_data()
#>                         name
#> 33                any-travel
#> 30     catchment-annual-long
#> 29 catchment-province-output
#> 13            chw-cases-2013
#> 14            chw-cases-2014
#> 15            chw-cases-2015
#> 16            chw-cases-2016
#> 17            chw-cases-2017
#> 18            chw-cases-2018
#> 19            chw-cases-2019
#> 20            chw-cases-2020
#> 21            chw-cases-2021
#> 22            chw-cases-2022
#> 34            chw-cases-2023
#> 12            chw-masterlist
#> 32                chw-travel
#> 3               district-shp
#> 11          friction-walking
#> 7         grid3-pop-rescaled
#> 9           hf-catchment-pop
#> 2                  hf-georef
#> 5             hf-master-wide
#> 31                 hf-travel
#> 8             hfca-pop-table
#> 23         hfca-voronoi-2016
#> 24         hfca-voronoi-2017
#> 25         hfca-voronoi-2018
#> 26         hfca-voronoi-2019
#> 27         hfca-voronoi-2020
#> 28         hfca-voronoi-2021
#> 35         hfca-voronoi-2022
#> 1              monthly-cases
#> 10         monthly-inpatient
#> 6                monthly-opd
#> 4               province-shp
#>                                                                               description
#> 33                                Walking time (min) to nearest health facility or worker
#> 30       Long-form HFCA population and incidence rates from province-level gravity model.
#> 29                                        Output table from province-level gravity model.
#> 13                                                           NMEC CHW cases data for 2013
#> 14                                                           NMEC CHW cases data for 2014
#> 15                                                           NMEC CHW cases data for 2015
#> 16                                                           NMEC CHW cases data for 2016
#> 17                                                           NMEC CHW cases data for 2017
#> 18                                                           NMEC CHW cases data for 2018
#> 19                                                           NMEC CHW cases data for 2019
#> 20                                                           NMEC CHW cases data for 2020
#> 21                                                           NMEC CHW cases data for 2021
#> 22                                                           NMEC CHW cases data for 2022
#> 34                                                           NMEC CHW cases data for 2023
#> 12                                           CHW name, location, and history information.
#> 32                                            Walking time (min) to nearest health worker
#> 3                                District-level (Admin2) shapefile with population totals
#> 11                                   Walking friction surface for estimating travel time.
#> 7                                       GRID3 population raster, rescaled to 18.4 million
#> 9                                     Annual estimated catchment sizes via gravity model.
#> 2                      Georeferenced facility masterlist (one set of coordinates per UID)
#> 5      Master facility list, organized by DHIS2 UID and retaining all source information.
#> 31                                          Walking time (min) to nearest health facility
#> 8  Health facility catchment populations include catchment model and HMIS 2020 headcount.
#> 23                                                    Voronoi tesselations for 2016 HFCAs
#> 24                                                    Voronoi tesselations for 2017 HFCAs
#> 25                                                    Voronoi tesselations for 2018 HFCAs
#> 26                                                    Voronoi tesselations for 2019 HFCAs
#> 27                                                    Voronoi tesselations for 2020 HFCAs
#> 28                                                    Voronoi tesselations for 2021 HFCAs
#> 35                                                    Voronoi tesselations for 2022 HFCAs
#> 1                                    Monthly malaria cases data (HMIS and NMEC combined).
#> 10                                            Monthly HMIS inpatient data (incl. deaths).
#> 6                                                           Monthly OPD first attendence.
#> 4                          Province-level (Admin1) shapefile with GRID3 population totals
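With 35 entries, it can help to search the table by name. The following is a minimal sketch, assuming list_data() returns a data frame with the name and description columns shown above. So that it runs without the package, it uses a small stand-in table (datasets, a hypothetical name) built from four of the rows printed above.

```r
# Stand-in for list_data(): four rows copied from the printed output above.
# With the package loaded, `datasets <- list_data()` should work the same way,
# assuming it returns a data frame with these two columns.
datasets <- data.frame(
  name = c("chw-cases-2022", "chw-cases-2023", "hf-georef", "hfca-voronoi-2022"),
  description = c(
    "NMEC CHW cases data for 2022",
    "NMEC CHW cases data for 2023",
    "Georeferenced facility masterlist (one set of coordinates per UID)",
    "Voronoi tesselations for 2022 HFCAs"
  ),
  stringsAsFactors = FALSE
)

# Keep only the datasets whose reference name starts with "chw-cases-".
chw_cases <- datasets[grepl("^chw-cases-", datasets$name), ]
print(chw_cases$name)
```

The same pattern works for other families of datasets, for example "^hfca-voronoi-" for the yearly Voronoi tessellations.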

The retrieve() function loads a dataset, identified by its reference name from the name column of list_data(). For example, we can load a list of all of the health facilities in Zambia, sourced from the HMIS, NMEC, and Zambia Ministry of Health online record.

master_facility_list <- retrieve("hf-master-wide")
head(master_facility_list)
#> # A tibble: 6 × 17
#>   org_unit_uid lon_HMIS lat_HMIS province_HMIS district_HMIS name_HMIS lon_NMEC
#>   <chr>           <dbl>    <dbl> <chr>         <chr>         <chr>        <dbl>
#> 1 sy04jreTFc0      28.3    -15.4 Lusaka        Lusaka        NA            28.3
#> 2 VEwpwUzaSZ8      NA       NA   Muchinga      Kanchibiya    NA            NA  
#> 3 bwPt010YjCo      28.4    -12.7 Copperbelt    Mufulira      NA            28.2
#> 4 Me0ZPMA7wvc      28.5    -12.8 Copperbelt    Ndola         NA            NA  
#> 5 IAWEwxGrcHM      28.5    -12.8 Copperbelt    Ndola         NA            NA  
#> 6 jGqu6BUf5hW      28.4    -13.1 Copperbelt    Luanshya      NA            28.4
#> # ℹ 10 more variables: lat_NMEC <dbl>, province_NMEC <chr>,
#> #   district_NMEC <chr>, name_NMEC <chr>, lon_ZMoH <dbl>, lat_ZMoH <dbl>,
#> #   province_ZMoH <chr>, district_ZMoH <chr>, name_ZMoH <chr>, type <chr>
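Several datasets come as one table per year, such as the CHW cases for 2013 to 2023. A possible way to load them together is sketched below. It assumes that retrieve() is exported by the package and accepts one reference name per call, as in the example above. The object names chw_names and chw_tables are illustrative only. The download step is skipped when the package is not installed.

```r
# Build the reference names for the yearly CHW case tables shown by list_data().
chw_names <- paste0("chw-cases-", 2013:2023)

# Only attempt the download when the package is installed (internet access is
# also required). Assumes retrieve() takes a single reference name per call.
if (requireNamespace("PATHtoolsZambia", quietly = TRUE)) {
  chw_tables <- lapply(chw_names, PATHtoolsZambia::retrieve)
  names(chw_tables) <- chw_names
}
```

Naming the list elements by reference name keeps track of which table belongs to which year.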

Another useful dataset is "hf-georef", which lists all of the health facilities that have been georeferenced: each row in the table is a unique facility with a latitude and longitude. These data are useful for constructing maps.

hf_locations <- retrieve("hf-georef")
head(hf_locations)
#> # A tibble: 6 × 10
#>   org_unit_uid   lon    lat province    district name  source type  geo_province
#>   <chr>        <dbl>  <dbl> <chr>       <chr>    <chr> <chr>  <chr> <chr>       
#> 1 A87peYAyqsf   29.1  -8.84 Luapula     Chienge  Kany… NMEC   Heal… Luapula     
#> 2 ANtd2l36nZS   33.5 -10.4  Muchinga    Mafinga  Kaly… ZMoH   Heal… Muchinga    
#> 3 ARNhWzN9QfA   31.9 -14.4  Eastern     Sinda    Mng'… ZMoH   Heal… Eastern     
#> 4 ASnusR9MFtB   22.3 -15.0  Western     Sikongo  Siko… ZMoH   Rura… Western     
#> 5 AVmbFzKj1bY   28   -15    Lusaka      Lusaka   NA    HMIS   Other Central     
#> 6 AaHjJI4XyW2   24.3 -13.6  Northweste… Kabompo  Kama… NMEC   Heal… Northwestern
#> # ℹ 1 more variable: geo_district <chr>
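The fifth row above reports Lusaka as its province but Central as its geo_province. A quick way to flag such rows before mapping is sketched below. It assumes that geo_province is the province derived from the coordinates, which the output above does not state. The hf_sample table is a hypothetical stand-in built from the first five printed rows, keeping only the columns that are shown in full, with coordinates as rounded in the printout.

```r
# Stand-in for retrieve("hf-georef"): first five rows of the printed output,
# restricted to columns that are not truncated in the display.
hf_sample <- data.frame(
  org_unit_uid = c("A87peYAyqsf", "ANtd2l36nZS", "ARNhWzN9QfA",
                   "ASnusR9MFtB", "AVmbFzKj1bY"),
  lon = c(29.1, 33.5, 31.9, 22.3, 28),
  lat = c(-8.84, -10.4, -14.4, -15.0, -15),
  province = c("Luapula", "Muchinga", "Eastern", "Western", "Lusaka"),
  geo_province = c("Luapula", "Muchinga", "Eastern", "Western", "Central"),
  stringsAsFactors = FALSE
)

# Flag facilities whose reported province differs from geo_province.
mismatch <- hf_sample[hf_sample$province != hf_sample$geo_province, ]
print(mismatch[, c("org_unit_uid", "province", "geo_province")])
```

On the full dataset, missing values in either column would need handling first, since comparisons with NA return NA rather than TRUE or FALSE.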

Most of the datasets are relatively small and should download quickly, but larger datasets such as the monthly case records may take longer. Data are typically stored as tables, although some datasets use more complex file types such as shapefiles or rasters.

Quick data checking

We have started to put together quick summaries of the datasets, available through the sanity_check() function. These are useful for checking for errors in the data (which are certainly possible!) and for providing some quick aggregations. If you have suggestions for more useful summaries, or for the package in general, please add your comments here.

Here is an example of the sanity_check() function for the monthly cases dataset.

case_check <- sanity_check("monthly-cases")
#> Filtering case records from 2018-01-01 to 2021-03-01.
#> Warning: There was 1 warning in `dplyr::mutate()`.
#>  In argument: `Total = sum(dplyr::c_across(), na.rm = T)`.
#>  In row 1.
#> Caused by warning:
#> ! Using `c_across()` without supplying `cols` was deprecated in dplyr 1.1.0.
#>  Please supply `cols` instead.
#>  The deprecated feature was likely used in the PATHtoolsZambia package.
#>   Please report the issue at
#>   <https://github.com/PATH-Global-Health/PATHtoolsZambia/issues>.
#> 3146 UIDs.
#> 39 unique periods.
#> Data types: Confirmed, Confirmed_Passive_CHW, Tested, Tested_Passive_CHW, Treated_Confirmed, Treated_Clinical, Clinical
#> Age groups: Between 1-4 y., Over 5 y., Under 5 y., NA, Under 1 y.
#> Average annual cases: 12.77 million
case_check$cases_by_province
#> # A tibble: 11 × 6
#>    reported_province   Total yr_2018 yr_2019 yr_2020 yr_2021
#>    <chr>               <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
#>  1 Central           3888438  888848 1028010 1532583  438997
#>  2 Copperbelt        6547586 1448887 1906933 2546797  644969
#>  3 Eastern           6991698 1706367 1899670 2680579  705082
#>  4 Luapula           5358926 1337374 1560610 1966265  494677
#>  5 Lusaka             671614  139371  158776  283348   90119
#>  6 Muchinga          3612982  785174 1000778 1461448  365582
#>  7 Northern          4366792  984689 1269207 1699522  413374
#>  8 Northwestern      5466213 1214523 1474881 2231151  545658
#>  9 Southern           481756  128878   86256  184240   82382
#> 10 Western           4100488 1126414  669896 1561744  742434
#> 11 NA                  15614      NA    1408   14206      NA
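One check that can be run on this table is that Total equals the sum of the yearly columns, with missing years ignored (the warning above shows Total being computed with sum(..., na.rm = T)). The sketch below does this in base R on four rows copied from the output above. The province_sample table is a hypothetical stand-in for case_check$cases_by_province.

```r
# Stand-in for case_check$cases_by_province: four rows copied from the output.
province_sample <- data.frame(
  reported_province = c("Central", "Lusaka", "Southern", NA),
  Total   = c(3888438, 671614, 481756, 15614),
  yr_2018 = c(888848, 139371, 128878, NA),
  yr_2019 = c(1028010, 158776, 86256, 1408),
  yr_2020 = c(1532583, 283348, 184240, 14206),
  yr_2021 = c(438997, 90119, 82382, NA),
  stringsAsFactors = FALSE
)

# Select the yearly columns by name and sum across them, ignoring missing years.
year_cols <- grep("^yr_", names(province_sample), value = TRUE)
recomputed <- rowSums(province_sample[, year_cols], na.rm = TRUE)

# TRUE for every row whose Total matches the recomputed sum.
totals_match <- recomputed == province_sample$Total
print(all(totals_match))
```

Selecting the columns explicitly by the yr_ prefix also avoids relying on whichever columns happen to be present, which is the behavior the deprecation warning above asks to be made explicit.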