This vignette provides some further detail on the vector attribute subsetting section (see https://geocompr.robinlovelace.net/attr.html#vector-attribute-subsetting ) and the spatial subsetting section (see https://geocompr.robinlovelace.net/spatial-operations.html#spatial-subsetting ) of the Geocomputation with R book. It goal is to demonstrate some subtleties of spatial subsetting of
sf objects using base R and tidyverse methods.
The starting point is to attach necessary packages:
And subsetting canterbury and
nz_height features in the region:
We can see how to get a list representation for subsetting:
##  "sgbp"
In the above code chunk, an object of class
sgbp is created and then converted into a logical vector
sel_logical. Values of TRUE in
sel_logical represents points from
nz_height that intersects with the
canterbury polygon. In other words, it returns elevation points located in the Canterbury region.
Use the result using base R:
And use the result with the tidyverse:
We can test whether the subsetted objects are identical as follows:
##  TRUE
##  FALSE
What is different about
canterbury_height3? The only difference is that
filter() changed the row names:
##  "5" "6" "7"
##  "1" "2" "3"
If the row names are re-set, the objects become identical:
##  TRUE
This discarding of row names is not something that is specific to spatial data, as illustrated in the code chunk below. dplyr discards row names by design. For further discussion of this decision, and some controversy, see the (closed) issue #366 in the package’s issue tracker.
In the code below a dense matrix is created.
sel is not, as one might imagine, a
vector (although it behaves as one as it only has one column) but a
##  "matrix"
##  "logical"
##  101 1
The dimensions of
sel (returned by the base R command
dim()) show one row per feature in the target object (
nz_height) and a column per feature in the subsetting object (
canterbury). The general pattern here is that
sel[i, j] is
TRUE if the ith feature in the target object intersects with the jth feature in the subsetting object. If there is more than one feature in
y the resulting selection
matrix must be converted into a
vector before it is used for subsetting, e.g. with
rowSums(sel_matrix) > 0. Another solution is to convert the default sparse matrix (
list) output from
st_intersects() to a
logical vector using the function
lengths(). This approach to spatial subsetting, used internally by sf (see the source code of
sf:::`[.sf`), is illustrated in the code chunk below:
The above code chunk results in an object,
heights_co, that represents the high points that intersect with either Canterbury or Otago region (hence the object name
co). It did this in four stages:
nzcontaining “Canter” or “Otago” in their names. This was done using the pattern matching function
grepl()in combination with the
|character, which means ‘or’, resulting in the subsetting object
sgbpobject, a list representing which features of
nz_heightintersect with the regions in
lengths()finds the features in
nz_heightmatching any features in
nz_heights, creating a new object
Another spatial subsetting example will use an object representing the countries of Africa, created using attribute subsetting as follows:1
To further prepare the input data, we will reproject the data to the coordinate reference system (CRS) 32630, its EPSG code (explained in Chapter 6):
We can also use the
[ operator for Spatial subsetting. The difference is that we use another spatial object inside the square brackets instead of an
logical vector. This is a concise and consistent syntax, as shown in the next code chunk. Let’s test it with a hypothetical scenario: we want to subset all countries within 2000 km of the point where the equator (where latitude = 0 degrees) intersects the prime meridian (longitude = 0 degrees), as illustrated in the next figure. The subsetting object is created below. Note that this must have the same CRS as the target object (set with the
The data to be subset, or ‘target layer’, is the
africa object created above, which has a projected CRS (
32630). Subsequently, spatial subsetting can be done with a single, concise command:
Sidenote: If we were using geographic (‘lon/lat’) data the previous command would have emitted a message warning about assuming
planar coordinates. This is because spatial operations (especially distance and area calculations) cannot be assumed to be accurate in a geographic (longitude/latitude) CRS. In this case one could justify the use of a lon/lat CRS: the data is close to the equator where there is least distortion caused by the curvature of the earth. It is good practice to reproject spatial datasets before performing spatial operations on them.
The spatial subsetting clearly worked: only countries intersecting with the giant circle are returned (see the figure below):
Note that countries that just touch the giant circle are selected such as Chad (northeast of the circle). This is because the default subsetting operator is
st_intersects(), which returns any type of spatial relation. Other spatial subsetting operations such as
st_within() are more conservative, as shown in the topological relations section.
## Simple feature collection with 1 feature and 1 field ## geometry type: POLYGON ## dimension: XY ## bbox: xmin: -1166021 ymin: -2e+06 xmax: 2833979 ymax: 2e+06 ## epsg (SRID): 32630 ## proj4string: +proj=utm +zone=30 +datum=WGS84 +units=m +no_defs ## pop geometry ## 1 487896821 POLYGON ((2833979 0, 283123...
buff_agg, is a spatial object with the same geometry as
by (the circular buffer in this case) but with an additional variable,
pop reporting summary statistics for all features in
x that intersect with
by (the total population of the countries that touch the buffer in this case). Plotting the result (with
plot(buff_agg)) shows that the operation does not really make sense: The figure below shows a population of over half a billion people mostly located in a giant circle floating off the west coast of Africa!
The results of the spatial aggregation exercise presented in this figure are unrealistic for three reasons:
st_intersects()is too ‘greedy’). The most extreme example of this is Algeria, the most northerly country selected: the spatial aggregation operation assumes that all 39 million Algerian citizens reside in the tiny southerly tip that is within the circular buffer.
A number of methods can be used to overcome these issues, and generate a more realistic population attributed to the circular buffer illustrated in the above figure. The simplest of these is to convert the country polygons into points representing their geographic centroids before aggregation. This would ensure that any spatially contiguous aggregating object covering the target object (the Earth in this case) would result in the same total: there would be no double counting. The estimated total population residing within the study area would be more realistic if geographic centroids were used. (The centroid of Algeria, for example, is far outside the aggregating buffer.)
Except in cases where the number of target features per aggregating feature is very large, or where the aggregating object is spatially congruent with the target, using centroids can also lead to errors due to boundary effects: imagine a buffer that covers a large area but contains no centroids. These issues can be tackled when aggregating areal target data with areal interpolation.
Recall attribute subsetting can also be done in base R with
africa_wgs = world[world$continent == "Africa", ].↩