Skip to contents

Fit and validate Random Forest models with exploration of hyper-parameters that optimize performance

Usage

tune_raf(
  data,
  response,
  predictors,
  predictors_f = NULL,
  fit_formula = NULL,
  partition,
  grid = NULL,
  thr = NULL,
  metric = "TSS",
  n_cores = 1
)

Arguments

data

data.frame. Database with response (0,1) and predictors values.

response

character. Column name with species absence-presence data (0,1).

predictors

character. Vector with the column names of quantitative predictor variables (i.e. continuous variables). Usage predictors = c("aet", "cwd", "tmin")

predictors_f

character. Vector with the column names of qualitative predictor variables (i.e. ordinal or nominal variables type). Usage predictors_f = c("landform")

fit_formula

formula. A formula object with response and predictor variables (e.g. formula(pr_ab ~ aet + ppt_jja + pH + awc + depth + landform)). Note that the variables used here must be consistent with those used in response, predictors, and predictors_f arguments. Default NULL

partition

character. Column name with training and validation partition groups.

grid

data.frame. A data frame object with algorithm hyper-parameters values to be tested. It is recommended to generate this data.frame with the grid() function. Hyper-parameter needed for tuning is 'mtry' and 'ntree'. The maximum mtry cannot exceed the total number of predictors.

thr

character. Threshold used to get binary suitability values (i.e. 0,1), needed for threshold-dependent performance metrics. It is possible to use more than one threshold type. It is necessary to provide a vector for this argument. The following threshold types are available:

  • lpt: The highest threshold at which there is no omission.

  • equal_sens_spec: Threshold at which the sensitivity and specificity are equal.

  • max_sens_spec: Threshold at which the sum of the sensitivity and specificity is the highest (aka threshold that maximizes the TSS).

  • max_jaccard: The threshold at which the Jaccard index is the highest.

  • max_sorensen: The threshold at which the Sorensen index is highest.

  • max_fpb: The threshold at which FPB is highest.

  • sensitivity: Threshold based on a specified sensitivity value. Usage thr = c('sensitivity', sens='0.6') or thr = c('sensitivity'). 'sens' refers to sensitivity value. If it is not specified a sensitivity values, function will use by default 0.9

If using more than one threshold type concatenate them, e.g., thr=c('lpt', 'max_sens_spec', 'max_jaccard'), or thr=c('lpt', 'max_sens_spec', 'sensitivity', sens='0.8'), or thr=c('lpt', 'max_sens_spec', 'sensitivity'). Function will use all thresholds if no threshold is specified.

metric

character. Performance metric used for selecting the best combination of hyper -parameter values. One of the following metrics can be used: SORENSEN, JACCARD, FPB, TSS, KAPPA, AUC, and BOYCE. TSS is used as default.

n_cores

numeric. Number of cores use for parallelization. Default 1

Value

A list object with:

  • model: A "randomForest" class object from randomForest package. This object can be used for predicting.

  • predictors: A tibble with quantitative (c column names) and qualitative (f column names) variables use for modeling.

  • performance: Hyper-parameters values and performance metric (see sdm_eval) for the best hyper-parameters combination.

  • hyper_performance: Performance metric (see sdm_eval) for each combination of the hyper-parameters.

  • data_ens: Predicted suitability for each test partition based on the best model. This database is used in fit_ensemble

See also

Examples

if (FALSE) { # \dontrun{
data(abies)
abies

# Partition the data with the k-fold method

abies2 <- part_random(
  data = abies,
  pr_ab = "pr_ab",
  method = c(method = "kfold", folds = 5)
)

tune_grid <- expand.grid(
  mtry = seq(1, 7, 1),
  ntree = c(400, 600, 800)
)

tune_grid

rf_t <-
  tune_raf(
    data = abies2,
    response = "pr_ab",
    predictors = c(
      "aet", "cwd", "tmin", "ppt_djf",
      "ppt_jja", "pH", "awc", "depth"
    ),
    predictors_f = c("landform"),
    partition = ".part",
    grid = tune_grid,
    thr = "max_sens_spec",
    metric = "TSS",
    n_cores = 1
  )

# Outputs
rf_t$model
rf_t$predictors
rf_t$performance
rf_t$hyper_performance
rf_t$data_ens
} # }