This vignette provides some further detail on the Vector attribute joining section (see https://geocompr.robinlovelace.net/attr.html#vector-attribute-joining ) of the Geocomputation with R book. This vignette requires the following packages to be installed and attached:

library(sf)
library(spData)
library(tidyverse)

We will use an sf object north_america with country codes (iso_a2), names and geometries, as well as a data.frame object wb_north_america containing information about urban population and unemployment for three countries. Note that north_america contains data about Canada, Greenland and the United States but the World Bank dataset (wb_north_america) contains information about Canada, Mexico and the United States:

north_america = world %>%
  filter(subregion == "Northern America") %>%
  dplyr::select(iso_a2, name_long)
north_america
## Simple feature collection with 3 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -171.7911 ymin: 18.91619 xmax: -12.20855 ymax: 83.64513
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##   iso_a2     name_long                           geom
## 1     CA        Canada MULTIPOLYGON (((-122.84 49,...
## 2     US United States MULTIPOLYGON (((-122.84 49,...
## 3     GL     Greenland MULTIPOLYGON (((-46.76379 8...
wb_north_america = worldbank_df %>% 
  filter(name %in% c("Canada", "Mexico", "United States")) %>%
  dplyr::select(name, iso_a2, urban_pop, unemploy = unemployment)
wb_north_america
## # A tibble: 3 x 4
##   name          iso_a2 urban_pop unemploy
##   <chr>         <chr>      <dbl>    <dbl>
## 1 Canada        CA      29014612     6.91
## 2 Mexico        MX      98099040     4.81
## 3 United States US     259460378     6.17

We will use a left join to combine the two datasets. Left joins are the most commonly used operation for adding attributes to spatial data, as they return all observations from the left object (north_america) and the matched observations from the right object (wb_north_america) in new columns. Rows in the left object without matches in the right (Greenland in this case) result in NA values.

To join two objects we need to specify a key. This is a variable (or a set of variables) that uniquely identifies each observation (row). The by argument of dplyr’s join functions lets you identify the key variable. In simple cases, a single, unique variable exist in both objects like the iso_a2 column in our example (you may need to rename columns with identifying information for this to work):

## Simple feature collection with 3 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -171.7911 ymin: 18.91619 xmax: -12.20855 ymax: 83.64513
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##   iso_a2     name_long          name urban_pop unemploy
## 1     CA        Canada        Canada  29014612     6.91
## 2     US United States United States 259460378     6.17
## 3     GL     Greenland          <NA>        NA       NA
##                             geom
## 1 MULTIPOLYGON (((-122.84 49,...
## 2 MULTIPOLYGON (((-122.84 49,...
## 3 MULTIPOLYGON (((-46.76379 8...

This has created a spatial dataset with the new variables added. The utility of this is shown in the figure below, which shows the unemployment rate (a World Bank variable) across the countries of North America.

Figure 1. The unemployment rate (taken from World Bank statistics) in Canada and the United States to illustrate the utility of joining attribute data on to spatial datasets.

Figure 1. The unemployment rate (taken from World Bank statistics) in Canada and the United States to illustrate the utility of joining attribute data on to spatial datasets.

It is also possible to join objects by different variables. Both of the datasets have variables with names of countries, but they are named differently. The north_america has a name_long column and the wb_north_america has a name column. In these cases a named vector, such as c("name_long" = "name"), can specify the connection:

## Simple feature collection with 3 features and 5 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -171.7911 ymin: 18.91619 xmax: -12.20855 ymax: 83.64513
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##   iso_a2.x     name_long iso_a2.y urban_pop unemploy
## 1       CA        Canada       CA  29014612     6.91
## 2       US United States       US 259460378     6.17
## 3       GL     Greenland     <NA>        NA       NA
##                             geom
## 1 MULTIPOLYGON (((-122.84 49,...
## 2 MULTIPOLYGON (((-122.84 49,...
## 3 MULTIPOLYGON (((-46.76379 8...

Note that the result contains two duplicated variables - iso_a2.x and iso_a2.y because both x and y objects have the column iso_a2. This can be solved by specifying all the keys:

## Simple feature collection with 3 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -171.7911 ymin: 18.91619 xmax: -12.20855 ymax: 83.64513
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##   iso_a2     name_long urban_pop unemploy                           geom
## 1     CA        Canada  29014612     6.91 MULTIPOLYGON (((-122.84 49,...
## 2     US United States 259460378     6.17 MULTIPOLYGON (((-122.84 49,...
## 3     GL     Greenland        NA       NA MULTIPOLYGON (((-46.76379 8...

Joins also work when a data frame is the first argument. This keeps the geometry column but drops the sf class, returning a data.frame object.

## # A tibble: 3 x 6
##   name   iso_a2 urban_pop unemploy name_long                           geom
##   <chr>  <chr>      <dbl>    <dbl> <chr>                 <MULTIPOLYGON [°]>
## 1 Canada CA      29014612     6.91 Canada     (((-122.84 49, -122.9742 49.…
## 2 Mexico MX      98099040     4.81 <NA>                               EMPTY
## 3 Unite… US     259460378     6.17 United St… (((-122.84 49, -120 49, -117…
class(left_join4)
## [1] "tbl_df"     "tbl"        "data.frame"

On the other hand, it is also possible to remove the geometry column of left_join4 using base R functions or dplyr. Here, this is this simple because the geometry column is just another data.frame column and no longer the sticky geometry column of an sf object (see also the chapter Geographic data in R):

## # A tibble: 3 x 5
##   name          iso_a2 urban_pop unemploy name_long    
##   <chr>         <chr>      <dbl>    <dbl> <chr>        
## 1 Canada        CA      29014612     6.91 Canada       
## 2 Mexico        MX      98099040     4.81 <NA>         
## 3 United States US     259460378     6.17 United States
class(left_join4_df)
## [1] "tbl_df"     "tbl"        "data.frame"

In contrast to left_join(), inner_join() keeps only observations from the left object (north_america) where there are matching observations in the right object (wb_north_america). All columns from the left and right object are still kept:

## Simple feature collection with 2 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -171.7911 ymin: 18.91619 xmax: -52.6481 ymax: 83.23324
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##   iso_a2     name_long urban_pop unemploy                           geom
## 1     CA        Canada  29014612     6.91 MULTIPOLYGON (((-122.84 49,...
## 2     US United States 259460378     6.17 MULTIPOLYGON (((-122.84 49,...