Integration of outlier detection methods in environmental space

This function applies several methods for detecting outliers in species distribution data based on the environmental conditions of occurrences. Some methods need presence and absence data (e.g. Two-class Support Vector Machine and Random Forest), while others use only presences (e.g. Reverse Jackknife, Box-plot, and Random Forest outliers). Outlier detection can be a useful step in occurrence data cleaning (Chapman, 2005; Liu et al., 2018).

Usage

env_outliers(data, x, y, pr_ab, id, env_layer)

Arguments

data: data.frame or tibble with presence (or presence-absence) records and coordinates.
x: character. Column name with longitude data.
y: character. Column name with latitude data.
pr_ab: character. Column name with presence and absence data (i.e. 1 and 0).
id: character. Column name with row id. Each row (record) must have its own unique code.
env_layer: SpatRaster. Raster with environmental variables.
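Because id must identify each row uniquely, it can help to check this before calling the function. The following is a minimal base R sketch; the data frame records, its coordinate values, and the column name idd are made up for illustration.

```r
# Made-up presence-absence records (coordinates are illustrative only)
records <- data.frame(
  x = c(-60.1, -60.4, -61.0),
  y = c(-27.2, -27.5, -26.9),
  pr_ab = c(1, 1, 0)
)

# Build a row id and confirm that no value is repeated
records$idd <- seq_len(nrow(records))
stopifnot(anyDuplicated(records$idd) == 0)
```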

Value

A tibble containing the data supplied in the 'data' argument plus seven additional columns, in which 1 and 0 denote whether or not a presence was detected as an outlier:

.out_bxpt: outliers detected with Box-plot method
.out_jack: outliers detected with Reverse Jackknife method
.out_svm: outliers detected with Support Vector Machine method
.out_rf: outliers detected with Random Forest method
.out_rfout: outliers detected with Random Forest Outliers method
.out_sum: number of times a presence record was detected as an outlier by the previous methods (values between 0 and 6).
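A possible next step is to drop presences flagged by several methods. The sketch below uses a small made-up data frame that mimics part of the column layout described above; the object names and the threshold of two methods are assumptions for illustration, not recommendations from the package.

```r
# Made-up table mimicking part of the env_outliers() output (presences only)
outs <- data.frame(
  idd = 1:5,
  pr_ab = c(1, 1, 1, 1, 1),
  .out_bxpt = c(0, 1, 0, 1, 0),
  .out_jack = c(0, 1, 0, 0, 0),
  .out_sum = c(0, 2, 0, 1, 0)
)

# Hypothetical rule: discard presences flagged by at least two methods
min_methods <- 2
flagged <- outs$pr_ab == 1 & outs$.out_sum >= min_methods
cleaned <- outs[!flagged, ]
cleaned$idd
```

Raising or lowering min_methods trades off how conservative the cleaning is; the appropriate value depends on the data set.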

Details

This function applies outlier detection methods to occurrence data. The Box-plot and Reverse Jackknife methods test each variable individually; if an occurrence behaves as an outlier for at least one variable, it is flagged as an outlier. If only presence data are supplied, the Support Vector Machine and Random Forest methods are not performed. Support Vector Machine and Random Forest are run with default hyper-parameter values. For a species with < 7 occurrences, the function does not perform any method (i.e. the additional columns will contain only 0 values); nonetheless, it still returns a tibble with the additional 0/1 columns. For further information about these methods, see Chapman (2005), Liu et al. (2018), and Velazco et al. (2022).
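As a rough illustration of the per-variable logic described above, the following base R sketch flags a record when its value falls beyond the box-plot whiskers for at least one variable. It is not the code used by env_outliers(); the helper name flag_boxplot_outliers, the whisker coefficient of 1.5, and the toy values are assumptions made for illustration.

```r
# Hypothetical helper (not part of flexsdm): flag rows that are box-plot
# outliers in at least one environmental variable.
flag_boxplot_outliers <- function(env, coef = 1.5) {
  # For each column, TRUE where the value lies beyond the whiskers
  per_var <- vapply(env, function(v) {
    v %in% grDevices::boxplot.stats(v, coef = coef)$out
  }, logical(nrow(env)))
  # A record is an outlier if any variable flags it
  as.integer(rowSums(per_var) > 0)
}

# Toy environmental values for 10 presences (made-up numbers)
env <- data.frame(
  temp = c(20, 21, 19, 22, 20, 21, 20, 19, 22, 45),
  prec = c(100, 110, 105, 95, 5, 102, 98, 107, 103, 101)
)
out_bxpt <- flag_boxplot_outliers(env)
out_bxpt
```

In this toy data the fifth record is extreme only in prec and the tenth only in temp, yet both are flagged, which mirrors the "at least one variable" rule.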

References

Chapman, A. D. (2005). Principles and methods of data cleaning: Primary Species and Species-Occurrence Data. Version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. p72. http://www.gbif.org/document/80528
Liu, C., White, M., & Newell, G. (2018). Detecting outliers in species distribution data. Journal of Biogeography, 45(1), 164-176. https://doi.org/10.1111/jbi.13122
Velazco, S.J.E.; Bedrij, N.A.; Keller, H.A.; Rojas, J.L.; Ribeiro, B.R.; De Marco, P. (2022) Quantifying the role of protected areas for safeguarding the uses of biodiversity. Biological Conservation, xx(xx) xx-xx. https://doi.org/10.1016/j.biocon.2022.109525

Examples

if (FALSE) { # \dontrun{
require(dplyr)
require(terra)
require(ggplot2)

# Environmental variables
somevar <- system.file("external/somevar.tif", package = "flexsdm")
somevar <- terra::rast(somevar)

# Species occurrences
data("spp")
spp
spp1 <- spp %>% dplyr::filter(species == "sp1")

somevar[[1]] %>% plot()
points(spp1 %>% filter(pr_ab == 1) %>% select(x, y), col = "blue", pch = 19)
points(spp1 %>% filter(pr_ab == 0) %>% select(x, y), col = "red", cex = 0.5)

# Add a unique row id
spp1 <- spp1 %>% mutate(idd = seq_len(nrow(spp1)))

# Detect outliers
outs_1 <- env_outliers(
  data = spp1,
  pr_ab = "pr_ab",
  x = "x",
  y = "y",
  id = "idd",
  env_layer = somevar
)

# How many outliers were detected by different methods?
out_pa <- outs_1 %>%
  dplyr::select(starts_with("."), -.out_sum) %>%
  apply(., 2, function(x) sum(x, na.rm = TRUE))
out_pa

# How many outliers were detected by the sum of different methods?
outs_1 %>%
  dplyr::group_by(.out_sum) %>%
  dplyr::count()

# Let's explore where the records highlighted as outliers are located
outs_1 %>%
  dplyr::filter(pr_ab == 1, .out_sum > 0) %>%
  ggplot(aes(x, y)) +
  geom_point(aes(col = factor(.out_sum))) +
  facet_wrap(. ~ factor(.out_sum))

# Detect outliers only with presences
outs_2 <- env_outliers(
  data = spp1 %>% dplyr::filter(pr_ab == 1),
  pr_ab = "pr_ab",
  x = "x",
  y = "y",
  id = "idd",
  env_layer = somevar
)

# How many outliers were detected by different methods?
out_p <- outs_2 %>%
  dplyr::select(starts_with("."), -.out_sum) %>%
  apply(., 2, function(x) sum(x, na.rm = TRUE))

# How many outliers were detected by the sum of different methods?
outs_2 %>%
  dplyr::group_by(.out_sum) %>%
  dplyr::count()

# Let's explore where the records highlighted as outliers are located
outs_2 %>%
  dplyr::filter(pr_ab == 1, .out_sum > 0) %>%
  ggplot(aes(x, y)) +
  geom_point(aes(col = factor(.out_sum))) +
  facet_wrap(. ~ factor(.out_sum))


# Compare the function outputs obtained with presence-absence
# data and with presence-only data.

bind_rows(out_p, out_pa)
# Because only presences were used in the second case, the outlier methods
# based on Random Forest (.out_rf) and Support Vector Machine (.out_svm)
# were not performed.
} # }