Introduction
Species distribution modeling (SDM) has become a standard tool in multiple research areas, including ecology, conservation biology, biogeography, paleobiogeography, and epidemiology. SDM is an area of active theoretical and methodological research. The flexsdm package provides users the ability to manipulate and parameterize models in a variety of ways that meet their unique research needs.
This flexibility enables users to define their own complete or partial modeling procedure specific for their modeling situations (e.g., number of variables, number of records, different algorithms and ensemble methods, algorithms tuning, etc.).
In this vignette, users will learn about the second set of functions in the flexsdm package that fall under the “modeling” umbrella. These functions were designed to construct and validate different types of models and can be grouped into fit_* , tune_* , and esm_* family functions. In addition there is a function to perform ensemble modeling.
The fit_* functions construct and validate models with default hyper-parameter values. The tune_* functions construct and validate models by searching for the best combination of hyper-parameter values, and esm_ functions can be used for constructing and validating Ensemble of Small Models. Finally, the fit_ensemble() function is for fitting and validating ensemble models.
These are the functions for model construction and validation:
fit_* functions family
fit_gam() Fit and validate Generalized Additive Models
fit_gau() Fit and validate Gaussian Process models
fit_gbm() Fit and validate Generalized Boosted Regression models
fit_glm() Fit and validate Generalized Linear Models
fit_max() Fit and validate Maximum Entropy models
fit_net() Fit and validate Neural Networks models
fit_raf() Fit and validate Random Forest models
fit_svm() Fit and validate Support Vector Machine models
tune_* functions family
tune_gbm() Fit and validate Generalized Boosted Regression models with exploration of hyper-parameters
tune_max() Fit and validate Maximum Entropy models with exploration of hyper-parameters
tune_net() Fit and validate Neural Networks models with exploration of hyper-parameters
tune_raf() Fit and validate Random Forest models with exploration of hyper-parameters
tune_svm() Fit and validate Support Vector Machine models with exploration of hyper-parameters
model ensemble
- fit_ensemble() Fit and validate ensemble models with different ensemble methods
esm_* functions family
esm_gam() Fit and validate Generalized Additive Models with Ensemble of Small Model approach
esm_gau() Fit and validate Gaussian Process models Models with Ensemble of Small Model approach
esm_gbm() Fit and validate Generalized Boosted Regression models with Ensemble of Small Model approach
esm_glm() Fit and validate Generalized Linear Models with Ensemble of Small Model approach
esm_max() Fit and validate Maximum Entropy models with Ensemble of Small Model approach
esm_net() Fit and validate Neural Networks models with Ensemble of Small Model approach
esm_svm() Fit and validate Support Vector Machine models with Ensemble of Small Model approach
Installation
First, install the flexsdm package. You can install the released version of flexsdm from github with:
# devtools::install_github('sjevelazco/flexsdm')
require(flexsdm)
#> Loading required package: flexsdm
require(terra)
#> Loading required package: terra
#> terra 1.7.78
#>
#> Attaching package: 'terra'
#> The following object is masked from 'package:knitr':
#>
#> spin
require(dplyr)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:terra':
#>
#> intersect, union
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Project directory setup
Decide where on your computer you would like to store the inputs and outputs of your project (this will be your main directory). Use an existing one or use dir.create() to create your main directory. Then specify whether or not to include folders for projections, calibration areas, algorithms, ensembles, and thresholds. For more details see Vignette 01_pre_modeling
Data, species occurrence and background data
In this tutorial, we will be using species occurrences and environmental data that are available through the flexsdm package. The “abies” example dataset includes a pr_ab column (presence = 1, and absence = 0), location columns (x, y) and other environmental data. You can load the “abies” data into your local R environment by using the code below:
(THIS EXAMPLE LOOKS A LITTLE STRANGE BECAUSE WE ARE ALSO USING BACKGROUND DATA, WHILE THE ABIES DATASET CLEARLY HAS ABSENCES…)
data("abies")
data("backg")
dplyr::glimpse(abies)
#> Rows: 1,400
#> Columns: 13
#> $ id <int> 715, 5680, 7907, 1850, 1702, 10036, 12384, 6513, 9884, 8651, …
#> $ pr_ab <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ x <dbl> -95417.134, 98986.536, 121474.257, -39976.221, 111372.261, -2…
#> $ y <dbl> 314240.13, -159415.18, -99463.44, -17456.11, -91404.05, 39222…
#> $ aet <dbl> 323.1133, 447.5567, 182.2833, 372.3867, 209.4567, 308.3000, 5…
#> $ cwd <dbl> 546.1400, 815.4033, 271.1800, 946.2933, 398.5500, 534.9533, 3…
#> $ tmin <dbl> 1.2433, 9.4267, -4.9500, 8.7767, -4.0333, 4.6600, 4.3800, 4.9…
#> $ ppt_djf <dbl> 62.7257, 129.6406, 150.7003, 116.0236, 164.9327, 166.2220, 48…
#> $ ppt_jja <dbl> 17.7941, 6.4317, 11.2294, 2.7020, 9.2686, 16.5310, 41.2494, 8…
#> $ pH <dbl> 5.773341, 5.600000, 0.000000, 6.411796, 0.000000, 5.700000, 5…
#> $ awc <dbl> 0.10837019, 0.16000000, 0.00000000, 0.09719457, 0.00000000, 0…
#> $ depth <dbl> 152.000000, 201.000000, 0.000000, 59.759930, 0.000000, 112.99…
#> $ landform <fct> 7, 11, 15, 14, 15, 15, 7, 15, 4, 10, 6, 10, 10, 15, 10, 11, 1…
dplyr::glimpse(backg)
#> Rows: 5,000
#> Columns: 13
#> $ pr_ab <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ x <dbl> 160779.16, 36849.16, -240170.84, -152420.84, -193190.84, …
#> $ y <dbl> -449968.33, 24151.67, 90031.67, -143518.33, 24151.67, 223…
#> $ aet <dbl> 280.4567, 259.7800, 400.1767, 367.4833, 397.3667, 385.263…
#> $ cwd <dbl> 1137.2433, 381.5367, 699.6500, 843.4467, 842.3833, 637.35…
#> $ tmin <dbl> 13.5100, -3.1733, 8.6800, 9.0133, 8.9700, 4.9333, 6.2933,…
#> $ ppt_djf <dbl> 71.2741, 171.4537, 285.0893, 72.0309, 125.2467, 226.1534,…
#> $ ppt_jja <dbl> 1.1920, 17.5193, 5.0158, 1.2047, 1.9778, 8.1554, 18.4182,…
#> $ pH <dbl> 0.0000000, 0.2122687, 5.7222223, 7.5350823, 6.1963525, 5.…
#> $ awc <dbl> 0.000000000, 0.003473487, 0.080370426, 0.170000002, 0.131…
#> $ depth <dbl> 0.00000, 201.00000, 50.07409, 154.39426, 122.39575, 56.17…
#> $ percent_clay <dbl> 0.0000000, 0.4438345, 18.4111176, 46.9751244, 37.1873169,…
#> $ landform <fct> 13, 10, 6, 6, 10, 14, 8, 14, 6, 7, 11, 14, 14, 10, 6, 6, …
If you want to replace the abies dataset with your own data, make sure that your dataset contains the environmental conditions related to presence-absence data. We use a pre-modeling family function for a k-fold partition method (to be used for cross-validation). In the partition method the number of folds or replications must be the same for presence-absence and for background points datasets.
abies2 <- part_random(
data = abies,
pr_ab = "pr_ab",
method = c(method = "kfold", folds = 5)
)
dplyr::glimpse(abies2)
#> Rows: 1,400
#> Columns: 14
#> $ id <int> 715, 5680, 7907, 1850, 1702, 10036, 12384, 6513, 9884, 8651, …
#> $ pr_ab <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ x <dbl> -95417.134, 98986.536, 121474.257, -39976.221, 111372.261, -2…
#> $ y <dbl> 314240.13, -159415.18, -99463.44, -17456.11, -91404.05, 39222…
#> $ aet <dbl> 323.1133, 447.5567, 182.2833, 372.3867, 209.4567, 308.3000, 5…
#> $ cwd <dbl> 546.1400, 815.4033, 271.1800, 946.2933, 398.5500, 534.9533, 3…
#> $ tmin <dbl> 1.2433, 9.4267, -4.9500, 8.7767, -4.0333, 4.6600, 4.3800, 4.9…
#> $ ppt_djf <dbl> 62.7257, 129.6406, 150.7003, 116.0236, 164.9327, 166.2220, 48…
#> $ ppt_jja <dbl> 17.7941, 6.4317, 11.2294, 2.7020, 9.2686, 16.5310, 41.2494, 8…
#> $ pH <dbl> 5.773341, 5.600000, 0.000000, 6.411796, 0.000000, 5.700000, 5…
#> $ awc <dbl> 0.10837019, 0.16000000, 0.00000000, 0.09719457, 0.00000000, 0…
#> $ depth <dbl> 152.000000, 201.000000, 0.000000, 59.759930, 0.000000, 112.99…
#> $ landform <fct> 7, 11, 15, 14, 15, 15, 7, 15, 4, 10, 6, 10, 10, 15, 10, 11, 1…
#> $ .part <int> 3, 2, 5, 3, 5, 4, 2, 1, 2, 2, 1, 2, 1, 2, 2, 2, 5, 4, 5, 1, 5…
Now, in the abies2 object we have a new column called “.part” with the 5 k-folds (1, 2, 3, 4, 5), indicating which partition each record (row) is a member of. Next, we have to apply the same partition method and number of folds to the environmental conditions of the background points.
backg2 <- part_random(
data = backg,
pr_ab = "pr_ab",
method = c(method = "kfold", folds = 5)
)
dplyr::glimpse(backg2)
#> Rows: 5,000
#> Columns: 14
#> $ pr_ab <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ x <dbl> 160779.16, 36849.16, -240170.84, -152420.84, -193190.84, …
#> $ y <dbl> -449968.33, 24151.67, 90031.67, -143518.33, 24151.67, 223…
#> $ aet <dbl> 280.4567, 259.7800, 400.1767, 367.4833, 397.3667, 385.263…
#> $ cwd <dbl> 1137.2433, 381.5367, 699.6500, 843.4467, 842.3833, 637.35…
#> $ tmin <dbl> 13.5100, -3.1733, 8.6800, 9.0133, 8.9700, 4.9333, 6.2933,…
#> $ ppt_djf <dbl> 71.2741, 171.4537, 285.0893, 72.0309, 125.2467, 226.1534,…
#> $ ppt_jja <dbl> 1.1920, 17.5193, 5.0158, 1.2047, 1.9778, 8.1554, 18.4182,…
#> $ pH <dbl> 0.0000000, 0.2122687, 5.7222223, 7.5350823, 6.1963525, 5.…
#> $ awc <dbl> 0.000000000, 0.003473487, 0.080370426, 0.170000002, 0.131…
#> $ depth <dbl> 0.00000, 201.00000, 50.07409, 154.39426, 122.39575, 56.17…
#> $ percent_clay <dbl> 0.0000000, 0.4438345, 18.4111176, 46.9751244, 37.1873169,…
#> $ landform <fct> 13, 10, 6, 6, 10, 14, 8, 14, 6, 7, 11, 14, 14, 10, 6, 6, …
#> $ .part <int> 4, 4, 1, 5, 5, 2, 5, 3, 2, 5, 4, 1, 4, 1, 5, 1, 1, 5, 4, …
In backg2 object we have a new column called “.part” with the 5 k-folds (1, 2, 3, 4, 5).
1. Fit and validate models
We fit and validate models: I. a maximum entropy model with default hyper-parameter values (flexsdm::fit_max) and II. a random forest model with exploration of hyper-parameters (flexsdm::tune_raf).
I. Maximum Entropy models with default hyper-parameter values.
max_t1 <- fit_max(
data = abies2,
response = "pr_ab",
predictors = c("aet", "ppt_jja", "pH", "awc", "depth"),
predictors_f = c("landform"),
partition = ".part",
background = backg2,
thr = c("max_sens_spec", "equal_sens_spec", "max_sorensen"),
clamp = TRUE,
classes = "default",
pred_type = "cloglog",
regmult = 1
)
#> Formula used for model fitting:
#> ~aet + ppt_jja + pH + awc + depth + I(aet^2) + I(ppt_jja^2) + I(pH^2) + I(awc^2) + I(depth^2) + hinge(aet) + hinge(ppt_jja) + hinge(pH) + hinge(awc) + hinge(depth) + ppt_jja:aet + pH:aet + awc:aet + depth:aet + pH:ppt_jja + awc:ppt_jja + depth:ppt_jja + awc:pH + depth:pH + depth:awc + categorical(landform) - 1
#> Replica number: 1/1
#> Partition number: 1/5
#> Partition number: 2/5
#> Partition number: 3/5
#> Partition number: 4/5
#> Partition number: 5/5
This function returns a list object with the following elements:
names(max_t1)
#> [1] "model" "predictors" "performance" "data_ens"
model: A “MaxEnt” class object. This object can be used for predicting.
options(max.print = 20)
max_t1$model
#>
#> Call: glmnet::glmnet(x = mm, y = as.factor(p), family = "binomial", weights = weights, lambda = 10^(seq(4, 0, length.out = 200)) * sum(reg)/length(reg) * sum(p)/sum(weights), standardize = F, penalty.factor = reg)
#>
#> Df %Dev Lambda
#> 1 0 0.00 21.3700
#> 2 0 0.00 20.4100
#> 3 0 0.00 19.4800
#> 4 0 0.00 18.6000
#> 5 0 0.00 17.7600
#> 6 0 0.00 16.9600
#> [ reached getOption("max.print") -- omitted 194 rows ]
predictors: A tibble with quantitative (c column names) and qualitative (f column names) variables use for modeling.
max_t1$predictors
#> # A tibble: 1 × 6
#> c1 c2 c3 c4 c5 f
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 aet ppt_jja pH awc depth landform
performance: The performance metric (see sdm_eval). Those metrics that are threshold dependent are calculated based on the threshold specified in the argument. We can see all the selected threshold values.
max_t1$performance
#> # A tibble: 3 × 25
#> model threshold thr_value n_presences n_absences TPR_mean TPR_sd TNR_mean
#> <chr> <chr> <dbl> <int> <int> <dbl> <dbl> <dbl>
#> 1 max equal_sens_sp… 0.573 700 700 0.674 0.0164 0.674
#> 2 max max_sens_spec 0.416 700 700 0.909 0.0260 0.52
#> 3 max max_sorensen 0.335 700 700 0.95 0.0101 0.469
#> # ℹ 17 more variables: TNR_sd <dbl>, SORENSEN_mean <dbl>, SORENSEN_sd <dbl>,
#> # JACCARD_mean <dbl>, JACCARD_sd <dbl>, FPB_mean <dbl>, FPB_sd <dbl>,
#> # OR_mean <dbl>, OR_sd <dbl>, TSS_mean <dbl>, TSS_sd <dbl>, AUC_mean <dbl>,
#> # AUC_sd <dbl>, BOYCE_mean <dbl>, BOYCE_sd <dbl>, IMAE_mean <dbl>,
#> # IMAE_sd <dbl>
Predicted suitability for each test partition (row) based on the best model. This database is used in fit_ensemble.
max_t1$data_ens
#> # A tibble: 1,400 × 5
#> rnames replicates part pr_ab pred
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 8 .part 1 0 0.600
#> 2 11 .part 1 0 0.237
#> 3 13 .part 1 0 0.0483
#> 4 20 .part 1 0 0.115
#> 5 32 .part 1 0 0.716
#> 6 33 .part 1 0 0.0430
#> 7 48 .part 1 0 0.143
#> 8 55 .part 1 0 0.726
#> 9 65 .part 1 0 0.850
#> 10 75 .part 1 0 0.308
#> # ℹ 1,390 more rows
II- Random forest models with exploration of hyper-parameters.
First, we create a data.frame that provides hyper-parameters values to be tested. It Is recommended to generate this data.frame. Hyper-parameter needed for tuning is ‘mtry’. The maximum mtry must be equal to total number of predictors.
tune_grid <-
expand.grid(
mtry = seq(1, 7, 1),
ntree = c(300, 500, 700, 900)
)
We use the same data object abies2, with the same k-fold partition method:
rf_t <-
tune_raf(
data = abies2,
response = "pr_ab",
predictors = c(
"aet", "cwd", "tmin", "ppt_djf",
"ppt_jja", "pH", "awc", "depth"
),
predictors_f = c("landform"),
partition = ".part",
grid = tune_grid,
thr = "max_sens_spec",
metric = "TSS",
)
#> Formula used for model fitting:
#> pr_ab ~ aet + cwd + tmin + ppt_djf + ppt_jja + pH + awc + depth + landform
#> Tuning model...
#> Replica number: 1/1
#> Formula used for model fitting:
#> pr_ab ~ aet + cwd + tmin + ppt_djf + ppt_jja + pH + awc + depth + landform
#> Replica number: 1/1
#> Partition number: 1/5
#> Partition number: 2/5
#> Partition number: 3/5
#> Partition number: 4/5
#> Partition number: 5/5
Let’s see what the output object contains. This function returns a list object with the following elements:
names(rf_t)
#> [1] "model" "predictors" "performance"
#> [4] "hyper_performance" "data_ens"
model: A “randomForest” class object. This object can be used to see the formula details, a basic summary o fthe model, and for predicting.
rf_t$model
#>
#> Call:
#> randomForest(formula = formula1, data = data, mtry = mtry, ntree = 500, importance = TRUE, )
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 3
#>
#> OOB estimate of error rate: 11.29%
#> Confusion matrix:
#> 0 1 class.error
#> 0 605 95 0.1357143
#> 1 63 637 0.0900000
predictors: A tibble with quantitative (c column names) and qualitative (f column names) variables use for modeling.
rf_t$predictors
#> # A tibble: 1 × 9
#> c1 c2 c3 c4 c5 c6 c7 c8 f
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 aet cwd tmin ppt_djf ppt_jja pH awc depth landform
performance: The performance metric (see sdm_eval). Those metrics that are threshold dependent are calculated based on the threshold specified in the argument. We can see all the selected threshold values.
rf_t$performance
#> # A tibble: 1 × 27
#> mtry ntree model threshold thr_value n_presences n_absences TPR_mean TPR_sd
#> <dbl> <dbl> <chr> <chr> <dbl> <int> <int> <dbl> <dbl>
#> 1 3 300 raf max_sens_s… 0.606 700 700 0.911 0.0314
#> # ℹ 18 more variables: TNR_mean <dbl>, TNR_sd <dbl>, SORENSEN_mean <dbl>,
#> # SORENSEN_sd <dbl>, JACCARD_mean <dbl>, JACCARD_sd <dbl>, FPB_mean <dbl>,
#> # FPB_sd <dbl>, OR_mean <dbl>, OR_sd <dbl>, TSS_mean <dbl>, TSS_sd <dbl>,
#> # AUC_mean <dbl>, AUC_sd <dbl>, BOYCE_mean <dbl>, BOYCE_sd <dbl>,
#> # IMAE_mean <dbl>, IMAE_sd <dbl>
Predicted suitability for each test partition (row) based on the best model. This database is used in fit_ensemble.
rf_t$data_ens
#> # A tibble: 1,400 × 5
#> rnames replicates part pr_ab pred
#> <chr> <chr> <chr> <fct> <dbl>
#> 1 8 .part 1 0 0.118
#> 2 11 .part 1 0 0.116
#> 3 13 .part 1 0 0.006
#> 4 20 .part 1 0 0.412
#> 5 32 .part 1 0 0.128
#> 6 33 .part 1 0 0.032
#> 7 48 .part 1 0 0.008
#> 8 55 .part 1 0 0.072
#> 9 65 .part 1 0 0.366
#> 10 75 .part 1 0 0.088
#> # ℹ 1,390 more rows
These model objects can be used in flexsdm::fit_ensemble().
2. Model Ensemble
In this example we fit and validate and ensemble model using the two model objects that were just created.
# Fit and validate ensemble model
an_ensemble <- fit_ensemble(
models = list(max_t1, rf_t),
ens_method = "meansup",
thr = NULL,
thr_model = "max_sens_spec",
metric = "TSS"
)
#> | | | 0% | |======================================================================| 100%
# Outputs
names(an_ensemble)
#> [1] "models" "thr_metric" "predictors" "performance"
an_ensemble$thr_metric
#> [1] "max_sens_spec" "TSS_mean"
an_ensemble$predictors
#> # A tibble: 2 × 9
#> c1 c2 c3 c4 c5 f c6 c7 c8
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 aet ppt_jja pH awc depth landform NA NA NA
#> 2 aet cwd tmin ppt_djf ppt_jja landform pH awc depth
an_ensemble$performance
#> # A tibble: 7 × 25
#> model threshold thr_value n_presences n_absences TPR_mean TPR_sd TNR_mean
#> <chr> <chr> <dbl> <int> <int> <dbl> <dbl> <dbl>
#> 1 meansup equal_sens_… 0.568 700 700 0.879 0.0267 0.879
#> 2 meansup lpt 0.022 700 700 1 0 0.407
#> 3 meansup max_fpb 0.486 700 700 0.923 0.0234 0.859
#> 4 meansup max_jaccard 0.486 700 700 0.923 0.0234 0.859
#> 5 meansup max_sens_sp… 0.486 700 700 0.911 0.0314 0.87
#> 6 meansup max_sorensen 0.486 700 700 0.923 0.0234 0.859
#> 7 meansup sensitivity 0.53 700 700 0.9 0 0.863
#> # ℹ 17 more variables: TNR_sd <dbl>, SORENSEN_mean <dbl>, SORENSEN_sd <dbl>,
#> # JACCARD_mean <dbl>, JACCARD_sd <dbl>, FPB_mean <dbl>, FPB_sd <dbl>,
#> # OR_mean <dbl>, OR_sd <dbl>, TSS_mean <dbl>, TSS_sd <dbl>, AUC_mean <dbl>,
#> # AUC_sd <dbl>, BOYCE_mean <dbl>, BOYCE_sd <dbl>, IMAE_mean <dbl>,
#> # IMAE_sd <dbl>
3. Fit and validate models with Ensemble of Small Model approach
This method consists of creating bivariate models with all pair-wise combinations of predictors and perform an ensemble based on the average of suitability weighted by Somers’ D metric (D = 2 x (AUC -0.5)). ESM is recommended for modeling species with very few occurrences. This function does not allow categorical variables because the use of these types of variables could be problematic when applied to species with few occurrences. For more detail see Breiner et al. (2015, 2018)
data("abies")
library(dplyr)
# Create a smaller subset of occurrences
set.seed(10)
abies2 <- abies %>%
na.omit() %>%
group_by(pr_ab) %>%
dplyr::slice_sample(n = 10) %>%
group_by()
We can use different methods in the flexsdm::part_random function according to our data. See part_random for more details.
# Using k-fold partition method for model cross validation
abies2 <- part_random(
data = abies2,
pr_ab = "pr_ab",
method = c(method = "kfold", folds = 3)
)
abies2
#> # A tibble: 20 × 14
#> id pr_ab x y aet cwd tmin ppt_djf ppt_jja pH awc
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 12040 0 -308909. 384248. 573. 332. 4.84 521. 48.8 5.63 0.108
#> 2 10361 0 -254286. 417158. 260. 469. 2.93 151. 15.1 6.20 0.0950
#> 3 9402 0 -286979. 386206. 587. 376. 6.45 333. 15.7 5.5 0.160
#> 4 9815 0 -291849. 445595. 443. 455. 4.39 332. 19.1 6 0.0700
#> 5 10524 0 -256658. 184438. 355. 568. 5.87 303. 10.6 5.20 0.0800
#> 6 8860 0 121343. -164170. 354. 733. 3.97 182. 9.83 0 0
#> 7 6431 0 107903. -122968. 461. 578. 4.87 161. 7.66 5.90 0.0900
#> 8 11730 0 -333903. 431238. 561. 364. 6.73 387. 25.2 5.80 0.130
#> 9 808 0 -150163. 357180. 339. 564. 2.64 220. 15.3 6.40 0.100
#> 10 11054 0 -293663. 340981. 477. 396. 3.89 332. 26.4 4.60 0.0634
#> 11 2960 1 -49273. 181752. 512. 275. 0.920 319. 17.3 5.92 0.0900
#> 12 3065 1 126907. -198892. 322. 544. 0.700 203. 10.6 5.60 0.110
#> 13 5527 1 116751. -181089. 261. 537. 0.363 178. 7.43 0 0
#> 14 4035 1 -31777. 115940. 394. 440. 2.07 298. 11.2 6.01 0.0769
#> 15 4081 1 -5158. 90159. 301. 502. 0.703 203. 14.6 6.11 0.0633
#> 16 3087 1 102151. -143976. 299. 425. -2.08 205. 13.4 3.88 0.110
#> 17 3495 1 -19586. 89803. 438. 419. 2.13 189. 15.2 6.19 0.0959
#> 18 4441 1 49405. -60502. 362. 582. 2.42 218. 7.84 5.64 0.0786
#> 19 301 1 -132516. 270845. 367. 196. -2.56 422. 26.3 6.70 0.0300
#> 20 3162 1 59905. -53634. 319. 626. 1.99 212. 4.50 4.51 0.0396
#> # ℹ 3 more variables: depth <dbl>, landform <fct>, .part <int>
This function constructs Generalized Additive Models using the Ensembles of Small Models (ESM) approach (Breiner et al., 2015, 2018).
# We set the model without threshold specification and with the kfold created above
esm_gam_t1 <- esm_gam(
data = abies2,
response = "pr_ab",
predictors = c("aet", "cwd", "tmin", "ppt_djf", "ppt_jja", "pH", "awc", "depth"),
partition = ".part",
thr = NULL
)
#>
#> Model has more coefficients than data used for training it. Try to reduce k
This function returns a list object with the following elements:
names(esm_gam_t1)
#> NULL
esm_model: A list with “GAM” class object for each bivariate model. This object can be used for predicting using the ESM approachwith sdm_predict function.
options(max.print = 10) # If you don't want to see printed all the output
esm_gam_t1$esm_model
#> NULL
predictors: A tibble with variables use for modeling.
esm_gam_t1$predictors
#> NULL
performance: Performance metric (see sdm_eval). Those threshold dependent metrics are calculated based on the threshold specified in the argument.
esm_gam_t1$performance
#> NULL
Now, we test the rep_kfold partition method. In this method ‘folds’ refers to the number of partitions for data partitioning and ‘replicate’ refers to the number of replicates. Both assume values >=1.
# Remove the previous k-fold partition
abies2 <- abies2 %>% select(-starts_with("."))
# Test with rep_kfold partition using 3 folds and 5 replicates
set.seed(10)
abies2 <- part_random(
data = abies2,
pr_ab = "pr_ab",
method = c(method = "rep_kfold", folds = 3, replicates = 5)
)
abies2
#> # A tibble: 20 × 18
#> id pr_ab x y aet cwd tmin ppt_djf ppt_jja pH awc
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 12040 0 -308909. 384248. 573. 332. 4.84 521. 48.8 5.63 0.108
#> 2 10361 0 -254286. 417158. 260. 469. 2.93 151. 15.1 6.20 0.0950
#> 3 9402 0 -286979. 386206. 587. 376. 6.45 333. 15.7 5.5 0.160
#> 4 9815 0 -291849. 445595. 443. 455. 4.39 332. 19.1 6 0.0700
#> 5 10524 0 -256658. 184438. 355. 568. 5.87 303. 10.6 5.20 0.0800
#> 6 8860 0 121343. -164170. 354. 733. 3.97 182. 9.83 0 0
#> 7 6431 0 107903. -122968. 461. 578. 4.87 161. 7.66 5.90 0.0900
#> 8 11730 0 -333903. 431238. 561. 364. 6.73 387. 25.2 5.80 0.130
#> 9 808 0 -150163. 357180. 339. 564. 2.64 220. 15.3 6.40 0.100
#> 10 11054 0 -293663. 340981. 477. 396. 3.89 332. 26.4 4.60 0.0634
#> 11 2960 1 -49273. 181752. 512. 275. 0.920 319. 17.3 5.92 0.0900
#> 12 3065 1 126907. -198892. 322. 544. 0.700 203. 10.6 5.60 0.110
#> 13 5527 1 116751. -181089. 261. 537. 0.363 178. 7.43 0 0
#> 14 4035 1 -31777. 115940. 394. 440. 2.07 298. 11.2 6.01 0.0769
#> 15 4081 1 -5158. 90159. 301. 502. 0.703 203. 14.6 6.11 0.0633
#> 16 3087 1 102151. -143976. 299. 425. -2.08 205. 13.4 3.88 0.110
#> 17 3495 1 -19586. 89803. 438. 419. 2.13 189. 15.2 6.19 0.0959
#> 18 4441 1 49405. -60502. 362. 582. 2.42 218. 7.84 5.64 0.0786
#> 19 301 1 -132516. 270845. 367. 196. -2.56 422. 26.3 6.70 0.0300
#> 20 3162 1 59905. -53634. 319. 626. 1.99 212. 4.50 4.51 0.0396
#> # ℹ 7 more variables: depth <dbl>, landform <fct>, .part1 <int>, .part2 <int>,
#> # .part3 <int>, .part4 <int>, .part5 <int>
We use the new rep_kfold partition in the gam model
esm_gam_t2 <- esm_gam(
data = abies2,
response = "pr_ab",
predictors = c("aet", "cwd", "tmin", "ppt_djf", "ppt_jja", "pH", "awc", "depth"),
partition = ".part",
thr = NULL
)
#>
#> Model has more coefficients than data used for training it. Try to reduce k
Test with random bootstrap partitioning. In method ‘replicate’ refers to the number of replicates (assumes a value >=1), ‘proportion’ refers to the proportion of occurrences used for model fitting (assumes a value >0 and <=1). With this method we can configure the proportion of training and testing data according to the species occurrences. In this example, proportion=‘0.7’ indicates that 70% of data will be used for model training, while 30% will be used for model testing. For this method, the function will return .partX columns with “train” or “test” words as the entries.
# Remove the previous k-fold partition
abies2 <- abies2 %>% select(-starts_with("."))
# Test with bootstrap partition using 10 replicates
set.seed(10)
abies2 <- part_random(
data = abies2,
pr_ab = "pr_ab",
method = c(method = "boot", replicates = 10, proportion = 0.7)
)
abies2
#> # A tibble: 20 × 23
#> id pr_ab x y aet cwd tmin ppt_djf ppt_jja pH awc
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 12040 0 -308909. 384248. 573. 332. 4.84 521. 48.8 5.63 0.108
#> 2 10361 0 -254286. 417158. 260. 469. 2.93 151. 15.1 6.20 0.0950
#> 3 9402 0 -286979. 386206. 587. 376. 6.45 333. 15.7 5.5 0.160
#> 4 9815 0 -291849. 445595. 443. 455. 4.39 332. 19.1 6 0.0700
#> 5 10524 0 -256658. 184438. 355. 568. 5.87 303. 10.6 5.20 0.0800
#> 6 8860 0 121343. -164170. 354. 733. 3.97 182. 9.83 0 0
#> 7 6431 0 107903. -122968. 461. 578. 4.87 161. 7.66 5.90 0.0900
#> 8 11730 0 -333903. 431238. 561. 364. 6.73 387. 25.2 5.80 0.130
#> 9 808 0 -150163. 357180. 339. 564. 2.64 220. 15.3 6.40 0.100
#> 10 11054 0 -293663. 340981. 477. 396. 3.89 332. 26.4 4.60 0.0634
#> 11 2960 1 -49273. 181752. 512. 275. 0.920 319. 17.3 5.92 0.0900
#> 12 3065 1 126907. -198892. 322. 544. 0.700 203. 10.6 5.60 0.110
#> 13 5527 1 116751. -181089. 261. 537. 0.363 178. 7.43 0 0
#> 14 4035 1 -31777. 115940. 394. 440. 2.07 298. 11.2 6.01 0.0769
#> 15 4081 1 -5158. 90159. 301. 502. 0.703 203. 14.6 6.11 0.0633
#> 16 3087 1 102151. -143976. 299. 425. -2.08 205. 13.4 3.88 0.110
#> 17 3495 1 -19586. 89803. 438. 419. 2.13 189. 15.2 6.19 0.0959
#> 18 4441 1 49405. -60502. 362. 582. 2.42 218. 7.84 5.64 0.0786
#> 19 301 1 -132516. 270845. 367. 196. -2.56 422. 26.3 6.70 0.0300
#> 20 3162 1 59905. -53634. 319. 626. 1.99 212. 4.50 4.51 0.0396
#> # ℹ 12 more variables: depth <dbl>, landform <fct>, .part1 <chr>, .part2 <chr>,
#> # .part3 <chr>, .part4 <chr>, .part5 <chr>, .part6 <chr>, .part7 <chr>,
#> # .part8 <chr>, .part9 <chr>, .part10 <chr>
Use the new rep_kfold partition in the gam model
esm_gam_t3 <- esm_gam(
data = abies2,
response = "pr_ab",
predictors = c("aet", "cwd", "tmin", "ppt_djf", "ppt_jja", "pH", "awc", "depth"),
partition = ".part",
thr = NULL
)
#>
#> Model has more coefficients than data used for training it. Try to reduce k
#=========#=========#=========#=========#=========#=========#=========#
Vignette still under construction and changes
#=========#=========#=========#=========#=========#=========#=========#