
Collinearity reduction of predictor variables

Usage

correct_colinvar(
  env_layer,
  method,
  proj = NULL,
  restric_to_region = NULL,
  restric_pca_proj = FALSE,
  maxcell = NULL
)

Arguments

env_layer

SpatRaster. An object of class SpatRaster containing the predictors. This function does not accept categorical variables.

method

character. Collinearity reduction method, supplied as a character vector. The following methods are implemented:

  • pearson: Highlights correlated variables according to Pearson correlation. A maximum correlation threshold can be specified; otherwise, a default threshold of 0.7 is used. Usage: method = c('pearson', th = '0.7').

  • vif: Selects variables by Variance Inflation Factor. A threshold can be specified by the user; otherwise, a default threshold of 10 is used. Usage: method = c('vif', th = '10').

  • pca: Performs a Principal Component Analysis and uses the principal components as the new predictors. The selected components account for 95% of the total variation in the system. Usage: method = c('pca').

  • fa: Performs a Factor Analysis and selects a subset of the original predictors. The number of factors is defined by the Broken-Stick criterion, and the variables with the highest correlation to the factors are selected. Usage: method = c('fa').
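The Usage strings above build a named character vector: the first, unnamed element is the method name, and the element named th carries the threshold as a string. The base R sketch below only shows what such a vector looks like. How correct_colinvar parses it internally is not documented here, so the `as.numeric()` conversion and the variable names `method_name` and `threshold` are assumptions made for illustration.

```r
# Build the method argument the way the Usage strings show it
method <- c("pearson", th = "0.7")

# c() stores every element as character, so the threshold is a string
print(method)
print(names(method)) # first element is unnamed, second is named "th"

# Illustrative only (assumption): recover the method name and a numeric threshold
method_name <- unname(method[1])
threshold <- as.numeric(method["th"])
print(method_name)
print(threshold)
```

Because `c()` coerces mixed inputs to character, writing `th = 0.7` without quotes would produce the same kind of vector.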

proj

character. Only used for the pca method. Path to a folder that contains sub-folders for the different projection scenarios. Variable names must match those of the raster used in the env_layer argument. Usage: proj = "C:/User/Desktop/Projections" (see Details for more about the use of this argument).

restric_to_region

SpatVector. Area used to restrict the cells of env_layer when performing collinearity reduction.

restric_pca_proj

logical. If TRUE, the PCA projection is geographically restricted to the SpatVector supplied in restric_to_region. Only used for the pca method. Default: FALSE.

maxcell

numeric. Number of raster cells to be randomly sampled. Sampling can reduce memory usage for large rasters. If NULL, the function uses all raster cells. Default: NULL. Usage: maxcell = 50000.

Value

If 'pearson' method, returns a list with the following elements:

  • cor_table: a matrix object with pairwise correlation values of the environmental variables

  • cor_variables: a list with one element per environmental variable, each containing the pairwise relations that exceeded the correlation threshold for that variable
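To make the two elements above concrete, here is a minimal base R sketch that builds a correlation matrix and a per-variable list of partners above a threshold. It runs on simulated data rather than a SpatRaster. The variable names are hypothetical, and comparing the absolute correlation to the threshold is an assumption, not a statement about how correct_colinvar is implemented.

```r
# Simulated predictors (hypothetical names): temp_max closely tracks temp_mean
set.seed(3)
n <- 200
t1 <- rnorm(n)
env <- data.frame(
  temp_mean = t1,
  temp_max  = t1 + rnorm(n, sd = 0.3),
  precip    = rnorm(n)
)

th <- 0.7

# Pairwise Pearson correlations between all predictors
cor_table <- cor(env, method = "pearson")

# For each variable, the other variables whose |r| exceeds the threshold
# (using the absolute value is an assumption of this sketch)
cor_variables <- lapply(colnames(cor_table), function(v) {
  r <- abs(cor_table[v, ])
  setdiff(names(r)[r > th], v)
})
names(cor_variables) <- colnames(cor_table)

print(round(cor_table, 2))
print(cor_variables)
```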

If 'vif' method, returns a list with the following elements:

  • env_layer: a SpatRaster object with selected environmental variables

  • removed_variables: a character vector with removed environmental variables

  • vif_table: a data frame with VIF values for all environmental variables
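For reference, the Variance Inflation Factor of a variable j is 1 / (1 − R²_j), where R²_j comes from regressing variable j on all the other predictors. The base R sketch below computes it on simulated data. It is not the package's implementation: the column names of the illustrative `vif_table` are hypothetical, and the threshold of 10 is only the documented default.

```r
# Simulated predictors; x3 is nearly a linear combination of x1 and x2
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.1)
x4 <- rnorm(n)
env <- data.frame(x1, x2, x3, x4)

# VIF_j = 1 / (1 - R2_j), with R2_j from regressing variable j on the others
vif <- sapply(names(env), function(v) {
  fit <- lm(reformulate(setdiff(names(env), v), response = v), data = env)
  1 / (1 - summary(fit)$r.squared)
})

# Illustrative table (hypothetical column names)
vif_table <- data.frame(Variables = names(vif), VIF = unname(vif))
print(vif_table)

# Variables whose VIF exceeds the threshold
print(names(vif)[vif > 10])
```

In this simulated case x1, x2, and x3 are mutually dependent and x4 is independent, so only x4 is expected to show a VIF close to 1.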

If 'pca' method, returns a list with the following elements:

  • env_layer: a SpatRaster with the scores of the selected principal components (PCs), which together account for 95% of the total variation of the original environmental variables

  • coefficients: a matrix with the coefficients of the principal components (PCs) for each predictor

  • cumulative_variance: a tibble with the cumulative variance explained by the selected principal components (PCs)
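The 95% rule can be illustrated with `stats::prcomp` on a plain data frame. This is a sketch under stated assumptions, not the package's code: the data are simulated, the predictors are centered and scaled, and the cut-off rule used here (the smallest number of components whose cumulative variance reaches 0.95) is an assumption about how the 95% is applied.

```r
# Simulated predictors driven by two latent gradients plus one noise variable
set.seed(42)
n <- 300
z1 <- rnorm(n)
z2 <- rnorm(n)
env <- data.frame(
  v1 = z1 + rnorm(n, sd = 0.2),
  v2 = z1 + rnorm(n, sd = 0.2),
  v3 = z2 + rnorm(n, sd = 0.2),
  v4 = z2 + rnorm(n, sd = 0.2),
  v5 = rnorm(n)
)

# PCA on centered and scaled predictors
pca <- prcomp(env, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of components reaching 95% of the variance (assumed rule)
n_pc <- which(cum_var >= 0.95)[1]

# Scores of the retained components and their coefficients (loadings)
scores <- pca$x[, seq_len(n_pc), drop = FALSE]
coefficients <- pca$rotation[, seq_len(n_pc), drop = FALSE]

print(round(cum_var, 3))
print(n_pc)
```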

If 'fa' method, returns a list with the following elements:

  • env_layer: a SpatRaster with the variables selected because of their correlation with the factors

  • number_factors: number of factors selected according to the Broken-Stick criterion

  • removed_variables: removed variables

  • uniqueness: uniqueness of each environmental variable according to the factor analysis

  • loadings: environmental variables loadings in each of the chosen factors
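The Broken-Stick criterion compares the share of variance of each component with the share expected if the total variance were split at random: for p variables, the expectation for the k-th component is (1/p) Σ_{i=k}^{p} 1/i. The sketch below applies this textbook rule to the eigenvalues of a correlation matrix from simulated data. Whether correct_colinvar uses exactly these eigenvalues or this counting rule is an assumption.

```r
# Simulated predictors generated from two latent factors
set.seed(7)
n <- 300
f1 <- rnorm(n)
f2 <- rnorm(n)
env <- data.frame(
  v1 = f1 + rnorm(n, sd = 0.3),
  v2 = f1 + rnorm(n, sd = 0.3),
  v3 = f1 + rnorm(n, sd = 0.3),
  v4 = f2 + rnorm(n, sd = 0.3),
  v5 = f2 + rnorm(n, sd = 0.3),
  v6 = f2 + rnorm(n, sd = 0.3)
)

p <- ncol(env)

# Observed share of variance per component (eigenvalues of the correlation matrix)
eig <- eigen(cor(env), symmetric = TRUE, only.values = TRUE)$values
observed <- eig / sum(eig)

# Broken-stick expectation for the k-th component: (1/p) * sum_{i=k}^{p} 1/i
expected <- sapply(seq_len(p), function(k) sum(1 / (k:p)) / p)

# Count the leading components whose observed share exceeds the expectation
exceeds <- observed > expected
n_factors <- if (all(exceeds)) p else which(!exceeds)[1] - 1

print(round(cbind(observed, expected), 3))
print(n_factors)
```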

Details

If environmental variables are available for the current conditions and for other time periods (future or past), it is recommended to perform the PCA with the current environmental conditions and project the PCA to the other time periods. To do so, use the “proj” argument to provide the path to a folder (e.g., projections) that contains sub-folders for the different projection scenarios (e.g., years and emissions). Each sub-folder must contain single-band or multiband rasters with the environmental variables.

For example:

C:/Users/my_pc/projections/
├── MRIESM_2050_ssp126
│ └── var1.tif
│ └── var2.tif
│ └── var3.tif
├── MRIESM_2080_ssp585
│ └── var1.tif
│ └── var2.tif
│ └── var3.tif
├── UKESM_2050_ssp370
│ └── var1.tif
│ └── var2.tif
│ └── var3.tif

If the pca method is run with time projections, the correct_colinvar function will create the Projection_PCA folder (the exact path is included in the object returned by the function) with the same system of sub-folders, each containing a multiband raster with the principal components (pcs.tif):

C:/Users/my_pc/Projection_PCA/
├── MRIESM_2050_ssp126
│ └── pcs.tif # a multiband tif with principal components
├── MRIESM_2080_ssp585
│ └── pcs.tif
├── UKESM_2050_ssp370
│ └── pcs.tif
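Before passing a folder to proj, it can help to check that every scenario sub-folder holds the expected variables. The base R sketch below simulates the input layout shown above in a temporary directory, using empty placeholder files, and reports missing variables per scenario. The folder name `projections_layout_demo` and the helper logic are hypothetical. The check covers only the one-file-per-variable layout and assumes file names match the variable names, so multiband rasters would need their layer names inspected instead.

```r
# Simulate the expected input layout in a temporary directory (hypothetical names)
proj_dir <- file.path(tempdir(), "projections_layout_demo")
scenarios <- c("MRIESM_2050_ssp126", "MRIESM_2080_ssp585", "UKESM_2050_ssp370")
vars <- c("var1", "var2", "var3") # stand-ins for the names of env_layer

for (s in scenarios) {
  dir.create(file.path(proj_dir, s), recursive = TRUE, showWarnings = FALSE)
  # Empty placeholder files; real folders would contain rasters
  file.create(file.path(proj_dir, s, paste0(vars, ".tif")))
}

# For each scenario sub-folder, list the expected variables that are missing
sub_dirs <- list.dirs(proj_dir, recursive = FALSE)
missing_vars <- lapply(sub_dirs, function(d) {
  found <- tools::file_path_sans_ext(list.files(d, pattern = "\\.tif$"))
  setdiff(vars, found)
})
names(missing_vars) <- basename(sub_dirs)

print(missing_vars)
```

An empty character vector for a scenario means that all expected variables were found in that sub-folder.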

Examples

if (FALSE) {
require(terra)
require(dplyr)
require(flexsdm)

somevar <- system.file("external/somevar.tif", package = "flexsdm")
somevar <- terra::rast(somevar)

# Perform pearson collinearity control
var <- correct_colinvar(env_layer = somevar, method = c("pearson", th = "0.7"))
var$cor_table
var$cor_variables

# For all correct_colinvar methods it is possible to take a sample of raster cells to reduce memory usage
var <- correct_colinvar(env_layer = somevar, method = c("pearson", th = "0.7"), maxcell = 10000)
var$cor_table
var$cor_variables

# Perform vif collinearity control
var <- correct_colinvar(env_layer = somevar, method = c("vif", th = "8"))
var$env_layer
var$removed_variables
var$vif_table

# Perform pca collinearity control
var <- correct_colinvar(env_layer = somevar, method = c("pca"))
plot(var$env_layer)
var$env_layer
var$coefficients
var$cumulative_variance


# Perform pca collinearity control with different projections
## The code below creates a set of folders that simulates the directory structure
## in which environmental variables are stored for different scenarios
dir_sc <- file.path(tempdir(), "projections")
dir.create(dir_sc)
dir_sc <- file.path(dir_sc, c('scenario_1', 'scenario_2'))
sapply(dir_sc, dir.create)

somevar <-
  system.file("external/somevar.tif", package = "flexsdm")
somevar <- terra::rast(somevar)

terra::writeRaster(somevar, file.path(dir_sc[1], "somevar.tif"), overwrite=TRUE)
terra::writeRaster(somevar, file.path(dir_sc[2], "somevar.tif"), overwrite=TRUE)

## Perform pca with projections
dir_w_proj <- dirname(dir_sc[1])
dir_w_proj
var <- correct_colinvar(env_layer = somevar, method = "pca", proj = dir_w_proj)
var$env_layer
var$coefficients
var$cumulative_variance
var$proj


# Perform fa collinearity control
var <- correct_colinvar(env_layer = somevar, method = c("fa"))
var$env_layer
var$number_factors
var$removed_variables
var$uniqueness
var$loadings

##%######################################################%##
#                                                          #
####            Other option to perform PCA             ####
####      considering cell restricted to a region       ####
#                                                          #
##%######################################################%##

# Define a calibration area
abies2 <- abies %>%
  dplyr::select(x, y, pr_ab) %>%
  dplyr::filter(pr_ab==1)

plot(somevar[[1]])
points(abies2[-3])
ca <- calib_area(abies2, x = "x", y = "y", method = c("mcp"), crs = crs(somevar))
plot(ca, add = TRUE)

# Full geographical range to perform PCA
pca_fr <- correct_colinvar(env_layer = somevar,
                           method = c("pca"),
                           maxcell = NULL,
                           restric_to_region = NULL,
                           restric_pca_proj = FALSE)

# Perform PCA only with the cells delimited by the polygon used in restric_to_region
pca_rr <- correct_colinvar(env_layer = somevar,
                           method = c("pca"),
                           maxcell = NULL,
                           restric_to_region = ca,
                           restric_pca_proj = FALSE)

# Perform and predict PCA only for the cells delimited by the polygon used in restric_to_region
pca_rrp <- correct_colinvar(env_layer = somevar,
                            method = c("pca"),
                            maxcell = NULL,
                            restric_to_region = ca,
                            restric_pca_proj = TRUE)

plot(pca_fr$env_layer) # PCA with all cells
plot(pca_rr$env_layer) # PCA performed with calibration area cells but predicted for the entire region
plot(pca_rrp$env_layer) # PCA performed and predicted for cells within calibration area (ca)

}