Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use exactextractr as single engine for zonal statistics #334

Open
goergen95 opened this issue Aug 17, 2024 · 4 comments
Open

Use exactextractr as single engine for zonal statistics #334

goergen95 opened this issue Aug 17, 2024 · 4 comments
Assignees

Comments

@goergen95
Copy link
Member

Currently, we supply and maintain three different engines for the extraction of zonal statistics. This situation is not ideal, because they differ in performance, the numerical results are not equal and they also differ in the set of summary statistics they support. exactextractr works with a very performant algorithm to calculate exact pixel coverage, allows a large set of summary statistics which are easily weighted by area or custom weight rasters and is already a suggested dependency. I thus suggest to re-implement the engine behavior to use exactextractr as the sole engine and move it to Imports. The engine argument will be deprecated and informative warning messages will be included in the next release, before it will be removed eventually. The work on this already started and is available on the rework-engine branch.

@goergen95 goergen95 self-assigned this Aug 19, 2024
@Jo-Schie
Copy link
Member

Jo-Schie commented Sep 3, 2024

we should discuss this. exactextract is in fact a very fast engine but it is not the fastest in every setting according to the masterthesis of Om back then. The ideal choice of engine was dependent on the input data. IT is true, that exactextract produces more accurete results because it basically crops gridcells whereas terra only accounts for the size of whole grids which is especially bad if you have a low spatial raster resolution. So in theory it is better to use it. However the thing that I most fear is, that exactetract at some point may be discontinued. You can see from the package contributors that it basically relies on one single person. I think as for terra it is less likely to miss support at some point in time because it is developed in a broader setting.

@goergen95
Copy link
Member Author

Happy to discuss pros and cons! In these kind of discussions, from my POV, it is always a good idea try to provide reproducible evidence for one's claims (I try to something similar for the question on the effect of tiling here). Do you have access to those findings you mentioned, what do they tell and, more importantly, can we reproduce and do they reflect our use case scenario?

If we were to re-work package internals to use a single engine, this could easily be exchanged by another implementation at a later point in time, if necessary. My main issue with the current approach is that it seems inefficient (needs to be proved), it is confusing both to users and developers and we produce different numerical results depending on the "engine" used.

To put some of the statements you made about maintenance into perspective:
You are linking the contributions page of the R wrapper of the exactextract C++ library. The library itself has a similar contribution pattern (see here) as terra (see here): both projects are practically developed/maintained by a single maintainer with small contributions by a community of contributors. Also, raster, the successor of terra, had/has a very similar contribution pattern (see here), and yet it was superseded by terra requiring users of that package to either put in the work to transition to terra to benefit from its features or stay with the now legacy code base of raster. All that is to say that I consider the analysis of the contribution pattern as a weak argument in the present case, and especially in the context of estimating future support/development.

@goergen95
Copy link
Member Author

goergen95 commented Sep 3, 2024

To provide some initial evidence (we can always make the examples more complex) here is a reprex where I see consistently faster performance of exactextractr for both high and low resolution rasters:

EDIT history:

  • uses microbenchmark and increases polygons size for more realistic number of cells
  • compares engines calculating multiple statistics
  • compares engines using 10 layers
library(sf)
#> Linking to GEOS 3.12.2, GDAL 3.9.1, PROJ 9.4.1; sf_use_s2() is TRUE
library(terra)
#> terra 1.7.78
library(microbenchmark)
library(exactextractr)

stats <- c("min", "mean", "sum", "max")
nlayer <- 10

r_low_res <- rast(resolution = c(0.25, 0.25))
r_low_res[] <- runif(ncell(r_low_res))
r_low_res <- do.call(c, lapply(1:nlayer, function(x) r_low_res))
names(r_low_res) <- paste0("var_", 1:nlayer)
(r_low_res)
#> class       : SpatRaster 
#> dimensions  : 720, 1440, 10  (nrow, ncol, nlyr)
#> resolution  : 0.25, 0.25  (x, y)
#> extent      : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (CRS84) (OGC:CRS84) 
#> source(s)   : memory
#> names       :        var_1,        var_2,        var_3,        var_4,        var_5,        var_6, ... 
#> min values  : 4.440080e-07, 4.440080e-07, 4.440080e-07, 4.440080e-07, 4.440080e-07, 4.440080e-07, ... 
#> max values  : 9.999999e-01, 9.999999e-01, 9.999999e-01, 9.999999e-01, 9.999999e-01, 9.999999e-01, ...

r_high_res <- rast(resolution = c(0.025, 0.025))
r_high_res[] <- runif(ncell(r_high_res))
r_high_res <- do.call(c, lapply(1:nlayer, function(x) r_high_res))
names(r_high_res) <- paste0("var_", 1:nlayer)
(r_high_res)
#> class       : SpatRaster 
#> dimensions  : 7200, 14400, 10  (nrow, ncol, nlyr)
#> resolution  : 0.025, 0.025  (x, y)
#> extent      : -180, 180, -90, 90  (xmin, xmax, ymin, ymax)
#> coord. ref. : lon/lat WGS 84 (CRS84) (OGC:CRS84) 
#> source(s)   : memory
#> names       :        var_1,        var_2,        var_3,        var_4,        var_5,        var_6, ... 
#> min values  : 1.164153e-09, 1.164153e-09, 1.164153e-09, 1.164153e-09, 1.164153e-09, 1.164153e-09, ... 
#> max values  : 1.000000e+00, 1.000000e+00, 1.000000e+00, 1.000000e+00, 1.000000e+00, 1.000000e+00, ...

pts <- st_sample(st_bbox(r_low_res), 100)
pts <- st_as_sf(pts)
pol <- st_buffer(pts, dist = 100000)

# low res

microbenchmark(
  "zonal" = lapply(stats, function(stat) zonal(r_low_res, vect(pol), fun = stat)),
  "extract" = extract(r_low_res, vect(pol), stats),
  "exact" = exact_extract(r_low_res, pol, stats, progress = F),
  times = 10
)
#> Unit: milliseconds
#>     expr       min        lq     mean   median        uq      max neval cld
#>    zonal 434.20002 436.16781 438.7373 438.3603 441.76197 443.8704    10 a  
#>  extract 108.40040 109.34919 112.8417 109.6444 110.24886 142.0795    10  b 
#>    exact  41.61539  41.89287  55.4624  42.0558  47.85883 162.9634    10   c

microbenchmark(
  "zonal" = lapply(stats, function(stat) zonal(r_high_res, vect(pol), fun = stat)),
  "extract" = extract(r_high_res, vect(pol), stats),
  "exact" = exact_extract(r_high_res, pol, stats, progress = F),
  times = 10
)
#> Unit: milliseconds
#>     expr      min        lq      mean    median        uq       max neval cld
#>    zonal 6739.114 7457.8483 7418.5185 7517.2377 7580.2243 7608.3937    10 a  
#>  extract 1660.335 1680.2645 1803.5384 1817.4242 1883.2219 1935.5932    10  b 
#>    exact  320.236  385.8399  427.8119  439.6597  498.2279  522.5295    10   c

Created on 2024-09-03 with reprex v2.1.0

@goergen95
Copy link
Member Author

@Jo-Schie: I just remembered that our implementation of the GFW-based indicators does not work without exactextractr being available, so there is a strong dependency on that package for one of our main indicators already:

calc_treecover_area <- function(years = 2000:2023,
min_size = 10,
min_cover = 35) {
check_namespace("exactextractr")
check_namespace("landscapemetrics", error = FALSE)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants