Fit and validate Extreme Gradient Boosting models
Usage
fit_abund_xgb(
data,
response,
predictors,
predictors_f = NULL,
partition,
hold_out_set = NULL,
predict_part = FALSE,
nrounds = 1000,
max_depth = 5,
learning_rate = 0.1,
min_split_loss = 1,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 0.5,
objective = "reg:squarederror",
early_stopping = list(cv_strategy = 10, fm_strategy = "median"),
verbose = TRUE
)
Arguments
- data
tibble or data.frame. Database with response, predictors, and partition values.
- response
character. Column name with species abundance.
- predictors
character. Vector with the column names of quantitative predictor variables (i.e. continuous variables). Usage predictors = c("temp", "precipt", "sand")
- predictors_f
character. Vector with the column names of qualitative predictor variables (i.e. ordinal or nominal variables). Usage predictors_f = c("landform")
- partition
character. Column name with training and validation partition groups.
- hold_out_set
tibble or data.frame. A hold-out dataset used for evaluation and early stopping. This data is never used during the training phase.
- predict_part
logical. Save predicted abundance for testing data. Default = FALSE.
- nrounds
integer. Maximum number of boosting iterations. Default is 1000.
- max_depth
integer. The maximum depth of each tree. Default 5
- learning_rate
numeric. The learning rate of the algorithm. Default 0.1
- min_split_loss
numeric. Minimum loss reduction required to make a further partition on a leaf node of the tree. Default is 1.
- colsample_bytree
numeric. Subsample ratio of columns when constructing each tree. Default is 1.
- min_child_weight
numeric. Minimum sum of instance weight needed in a child. Default is 1.
- subsample
numeric. Subsample ratio of the training instance. Default is 0.5.
- objective
character. The learning task and the corresponding learning objective. Default is "reg:squarederror", regression with squared loss.
- early_stopping
A list containing two elements:
cv_strategy: Numerical. Specifies the number of rounds without improvement before training stops during the cross-validation stage.
fm_strategy: A vector defining the strategy for the final model:
c("hold_out", n): Stops training after n rounds without improvement, using the hold_out_set as the evaluation set.
c("mean"): Trains the final model using the average number of rounds reached across all cross-validation folds.
c("median"): Uses the median number of rounds from the cross-validation stage.
c("max"): Uses the maximum number of rounds reached in any fold.
c("min"): Uses the minimum number of rounds reached in any fold.
- verbose
logical. If FALSE, disables all console messages. Default TRUE.
- hold_out_evaluation
logical. If TRUE, performance metrics will also be calculated for the hold_out_set.
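The cross-validation based fm_strategy options described under early_stopping can be pictured with a small base R sketch. It is illustrative only and is not the package's internal code: the per-fold values in best_rounds are made up, final_rounds() is a hypothetical helper, and rounding the mean and median to whole iterations is an assumption.

```r
# Hypothetical best iteration reached in each of five cross-validation folds
best_rounds <- c(120, 95, 140, 110, 101)

# Hypothetical helper: map an fm_strategy keyword to a single number of rounds.
# Rounding to a whole iteration is an assumption made for this sketch.
final_rounds <- function(best_rounds, fm_strategy = "median") {
  switch(fm_strategy,
    mean = round(mean(best_rounds)),
    median = round(stats::median(best_rounds)),
    max = max(best_rounds),
    min = min(best_rounds),
    stop("unknown fm_strategy: ", fm_strategy)
  )
}

# Compare the number of rounds each strategy would select
strategies <- c("mean", "median", "max", "min")
sapply(strategies, function(s) final_rounds(best_rounds, s))
```

With these made-up folds, "max" selects the most rounds and "min" the fewest, while "mean" and "median" fall in between; "median" is less sensitive to a single fold that stopped unusually early or late.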
Value
A list object with:
model: An "xgb.Booster" class object from the xgboost package. This object can be used for prediction.
predictors: A tibble with quantitative (c column names) and qualitative (f column names) variables used for modeling.
performance: Averaged performance metrics (see adm_eval).
performance_part: Performance metrics for each replica and partition.
predicted_part: Observed and predicted abundance for each test partition.
Examples
if (FALSE) { # \dontrun{
require(terra)
require(dplyr)
# Database with species abundance and x and y coordinates
data("sppabund")
# Extract data for a single species
some_sp <- sppabund %>%
dplyr::filter(species == "Species one") %>%
dplyr::select(-.part2, -.part3)
# Explore response variables
some_sp$ind_ha %>% range()
some_sp$ind_ha %>% hist()
# Here we balance the number of absences
some_sp <-
balance_dataset(some_sp, response = "ind_ha", absence_ratio = 0.2)
# Fit an XGB model
mxgb <- fit_abund_xgb(
data = some_sp,
response = "ind_ha",
predictors = c("bio12", "elevation", "sand"),
predictors_f = NULL,
partition = ".part",
nrounds = 200,
max_depth = 5,
learning_rate = 0.1,
min_split_loss = 1,
colsample_bytree = 0.7,
min_child_weight = 2,
subsample = 0.3,
objective = "reg:squarederror",
predict_part = TRUE
)
mxgb
} # }
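The hold_out_set argument and the c("hold_out", n) strategy could be combined along the following lines. This is a hedged sketch built only from the argument descriptions on this page: the random 20% split, the seed, and the object names es, hold_idx, and mxgb_ho are assumptions, and the call is skipped unless the adm package is installed.

```r
# Early-stopping settings in the documented shape.
# Note that c() coerces the number of rounds to character ("10").
es <- list(cv_strategy = 10, fm_strategy = c("hold_out", 10))

# Only runs when the adm package is available
if (requireNamespace("adm", quietly = TRUE)) {
  library(adm)
  data("sppabund")
  some_sp <- sppabund[sppabund$species == "Species one", ]

  # Assumption: a random 20% of rows is set aside as the hold-out set
  set.seed(42)
  hold_idx <- sample(nrow(some_sp), size = round(0.2 * nrow(some_sp)))

  mxgb_ho <- fit_abund_xgb(
    data = some_sp[-hold_idx, ],
    response = "ind_ha",
    predictors = c("bio12", "elevation", "sand"),
    partition = ".part",
    hold_out_set = some_sp[hold_idx, ],
    early_stopping = es
  )

  # Averaged performance metrics, as listed under Value
  mxgb_ho$performance
}
```

Because the hold-out rows are removed from data before fitting, they are not used for training and serve only as the evaluation set for stopping the final model.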