Background, Data Access, and Data Structure

Tom Auer, Daniel Fink

2020-03-23

Outline

  1. Background
  2. Data Access
  3. Data Types and Structure
  4. Vignettes
  5. Conversion to Flat Format

Background

The study and conservation of the natural world rely on detailed information about species’ distributions, abundances, environmental associations, and population trends over time. For many taxa, this information is challenging to obtain at relevant geographic scales. The goal of the eBird Status and Trends project is to use data from eBird, the global citizen science bird monitoring project administered by the Cornell Lab of Ornithology, to generate a reliable, standardized source of biodiversity information for the world’s bird populations. To translate the eBird observations into robust data products, we use statistical models to fill spatiotemporal gaps, using local land cover descriptions derived from NASA MODIS and other remote sensing data, while controlling for biases inherent in species observations collected by volunteers.

This data set provides estimates of the year-round distribution, abundances, and environmental associations for the majority of North American bird species in 2018. For each species, distribution and abundance estimates are available for all 52 weeks of the year across a regular grid of locations that cover terrestrial North and South America at a resolution of 2.96 km x 2.96 km. Variation in detectability associated with the search effort is controlled by standardizing the estimates as the expected occupancy rate and count of the species on a search conducted for one hour while traveling 1 km at the optimal time of day for detection of that species, on the given day at the given location by a skilled eBirder. To describe how each species is associated with features of its local environment, estimates of the relative importance of each remotely sensed variable (e.g., land cover, elevation) are available throughout the year at a monthly temporal and regional spatial resolution. Additionally, to assess estimate quality, we provide upper and lower confidence bounds for all abundance estimates and regional-seasonal scale validation metrics for the underlying statistical models. For more information about the data products see the FAQ and summaries. See Fink et al. (2019) for more information about the analysis used to generate these data.

Data Access

eBird Status and Trends data are stored in the cloud on Amazon Web Services and accessible for download using the function ebirdst_download().

library(ebirdst)

# download data
# download a simplified example dataset from aws s3
# example data are for Yellow-bellied Sapsucker in Michigan
# by default the files will be stored in a persistent data directory:
# rappdirs::user_data_dir("ebirdst")
sp_path <- ebirdst_download(species = "example_data")

Throughout this vignette, a simplified example dataset is used consisting of Yellow-bellied Sapsucker in Michigan. For a full list of the species available for download, look at the data frame ebirdst_runs, which is included in this package.

Access from Amazon Web Services

The eBird Status and Trends data are stored in the cloud using the Amazon Web Services (AWS) object storage service, S3. AWS also provides access to flexible cloud computing resources in the form of EC2 instances. Users may want to consider analyzing Status and Trends data using an AWS EC2 instance because data transfer will be extremely fast between S3 and EC2 compared to downloading the data to a local machine. An additional benefit of using EC2 for analyses is access to instances with more powerful computing resources than a desktop or laptop. Working with the Status and Trends data can be extremely memory intensive, due to their high spatial and temporal resolution, so these additional resources can significantly speed up analyses.

For instructions on how to set up RStudio on an AWS EC2 instance, consult Louis Aslett’s detailed guide. After following these instructions, you will be able to log in to an RStudio session in the cloud using your web browser. Within this RStudio session, run install.packages("ebirdst") to install the ebirdst package. You should now be able to work with eBird Status and Trends data in the cloud on AWS.

Data Types and Structure

IMPORTANT. AFTER DOWNLOADING THE RESULTS, DO NOT CHANGE THE FILE STRUCTURE. All functionality in this package relies on the structure inherent in the delivered results. Changing the folder and file structure will cause errors with this package. If you use this package to download and analyze the results, you do not ever need to interact with the files directly, outside of R. If you intend to use the data outside of this package, then this warning does not necessarily apply to you.

Data Types

The data products included in the downloads contain two types of data: a) raster data containing occurrence and abundance estimates at a 2.96 km resolution for each of 52 weeks across North America, and b) non-raster, tabular, text data containing information about modeled relationships between observations and the ecological covariates, in the form of predictor importances (PIs) and predictive performance metrics (PPMs). The raster data will be the most commonly used data, as they provide high resolution, spatially-explicit information about the abundance and occurrence of each species. The non-raster data are an advanced, modeling-oriented product that requires more understanding of the modeling process.

Data Structure

Data are grouped by species, using a unique run name. The structure of the run name is: six_letter_code-ERD2018-EBIRD_SCIENCE-date-uuid. A full list of species and run names can be found in the ebirdst_runs data frame. If you are not using the R package, a CSV equivalent is available at the root of the AWS S3 bucket.
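The run name structure described above can be split into its components with base R. The following is a minimal sketch under stated assumptions: the run name used here is a made-up placeholder rather than a real entry from ebirdst_runs, parse_run_name() is a hypothetical helper that is not part of the ebirdst package, the element names in the returned list are labels chosen for this illustration, and the pattern assumes that none of the first four components contains a hyphen.

```r
# hypothetical helper, not part of ebirdst: split a run name into components
parse_run_name <- function(run_name) {
  # assumes the first four components contain no hyphens
  pattern <- "^([^-]+)-([^-]+)-([^-]+)-([^-]+)-(.+)$"
  m <- regmatches(run_name, regexec(pattern, run_name))[[1]]
  if (length(m) == 0) {
    stop("run name does not match the expected structure: ", run_name)
  }
  # m[1] is the full match; m[2] to m[6] are the captured components
  list(species_code = m[2], erd = m[3], project = m[4],
       date = m[5], uuid = m[6])
}

# placeholder run name for illustration only
parts <- parse_run_name("abcdef-ERD2018-EBIRD_SCIENCE-20200101-0123abcd")
parts$species_code
parts$uuid
```

When working inside R, looking up the run name in ebirdst_runs is likely to be more robust than parsing strings by hand.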

For each individual species, the data are structured in the following way:

/<run_name>/data/<run_name>_test-data.csv
/<run_name>/data/<run_name>_srd_raster_template.tif
/<run_name>/results/preds/test_pred_ave.txt
/<run_name>/results/stixels/pi.txt
/<run_name>/results/stixels/summary.txt
/<run_name>/results/tifs/<run_name>_hr_2018_abundance_median.tif
/<run_name>/results/tifs/<run_name>_hr_2018_abundance_upper.tif
/<run_name>/results/tifs/<run_name>_hr_2018_abundance_lower.tif
/<run_name>/results/tifs/<run_name>_hr_2018_count_median.tif
/<run_name>/results/tifs/<run_name>_hr_2018_occurrence_median.tif

In addition, seasonal abundance rasters are provided for those seasons that passed the review process. Up to four layers are provided, structured as follows:

/<run_name>/results/tifs/<run_name>_hr_2018_abundance_seasonal_<season>.tif

Where <season> is one of postbreeding_migration, prebreeding_migration, nonbreeding, breeding, or year_round.
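For users working with the files outside of this package, the naming scheme above can be assembled programmatically. The sketch below builds the expected GeoTIFF paths with base R; expected_tif_paths() is a hypothetical helper that is not part of ebirdst, and the root directory and run name are placeholders.

```r
# hypothetical helper, not part of ebirdst: list the expected GeoTIFF paths
expected_tif_paths <- function(root, run_name) {
  weekly <- c("abundance_median", "abundance_upper", "abundance_lower",
              "count_median", "occurrence_median")
  seasons <- c("postbreeding_migration", "prebreeding_migration",
               "nonbreeding", "breeding", "year_round")
  layers <- c(weekly, paste0("abundance_seasonal_", seasons))
  files <- paste0(run_name, "_hr_2018_", layers, ".tif")
  paths <- file.path(root, run_name, "results", "tifs", files)
  names(paths) <- layers
  paths
}

# placeholder root directory and run name for illustration only
tifs <- expected_tif_paths("/data/ebirdst",
                           "abcdef-ERD2018-EBIRD_SCIENCE-20200101-0123abcd")
tifs[["abundance_median"]]

# seasonal layers exist only for seasons that passed review,
# so check which files are actually present before reading them
present <- file.exists(tifs)
```

Because not every seasonal layer is delivered for every species, checking with file.exists() before opening a file avoids errors on the missing seasons.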

Raster Data

eBird Status and Trends abundance, count, and occurrence estimates are currently provided in the widely used GeoTIFF raster format. These are easily opened with the raster package in R, as well as with a variety of GIS software tools. Each estimate is stored in a multi-band GeoTIFF file. These “cubes” come with areas of predicted and assumed zeroes, such that any cells that are NA represent areas outside of the area of estimation. All cubes have 52 weeks, even if some weeks are all NA (such as those species that winter entirely outside of the study area). In addition, seasonally averaged abundance layers are provided. The two vignettes that are relevant to the raster data are the intro mapping vignette and the advanced mapping vignette.
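The distinction between zero and NA matters when summarizing a cube. The sketch below illustrates it with a small made-up array in base R rather than a real GeoTIFF; the dimensions and values are invented for illustration, with the third dimension standing in for the 52 weekly bands.

```r
# toy "cube": 2 x 3 cells by 52 weeks, values made up for illustration
cube <- array(0, dim = c(2, 3, 52))
cube[1, 1, ] <- NA          # a cell outside the area of estimation
cube[2, 2, 20:30] <- 1.5    # species present in weeks 20 to 30
cube[2, 3, 22:28] <- 0.4    # species present in weeks 22 to 28

# extract a single week
layer <- cube[, , 26]

n_in_area <- sum(!is.na(layer))            # cells with an estimate
n_zero <- sum(layer == 0, na.rm = TRUE)    # predicted or assumed zeroes
n_present <- sum(layer > 0, na.rm = TRUE)  # cells with non-zero abundance

# mean abundance over the area of estimation only; NA cells are
# excluded rather than counted as zero
mean_abund <- mean(layer, na.rm = TRUE)

# weeks in which every cell is NA
weeks_all_na <- which(apply(cube, 3, function(x) all(is.na(x))))
```

Treating NA cells as zero would pull the mean down by including locations for which no estimate was made.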

The relevant abundance and occurrence estimate GeoTiff files are found under the /<run_name>/results/tifs/ directory and contain the following files.

/<run_name>/results/tifs/<run_name>_hr_2018_abundance_median.tif
/<run_name>/results/tifs/<run_name>_hr_2018_abundance_upper.tif
/<run_name>/results/tifs/<run_name>_hr_2018_abundance_lower.tif
/<run_name>/results/tifs/<run_name>_hr_2018_count_median.tif
/<run_name>/results/tifs/<run_name>_hr_2018_occurrence_median.tif
/<run_name>/results/tifs/<run_name>_hr_2018_abundance_seasonal_<season>.tif

Projection

The GeoTIFFs use the same Sinusoidal projection as NASA MODIS data. As part of each data package, a template raster (/<run_name>/data/<run_name>_srd_raster_template.tif) is provided that contains the spatial extent and resolution for the full Western Hemisphere. Accessing this raster directly is not necessary when using the package, but it can be used elsewhere (e.g., in other GIS software). Note that this projection is ideal for analysis, as it is an equal area projection, but is not ideal for mapping. See the intro mapping vignette for details on using a more suitable projection for mapping.
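To give a sense of what the Sinusoidal projection does to coordinates, the forward transform on a sphere is x = R * lon * cos(lat) and y = R * lat, with angles in radians. The sketch below implements this in base R. It assumes the sphere radius of the standard MODIS Sinusoidal grid (6371007.181 m) and a central meridian of 0 degrees; these values are assumptions here, so check the CRS stored in the GeoTIFFs before relying on them. lonlat_to_sinu() is a hypothetical helper that is not part of ebirdst.

```r
# hypothetical helper, not part of ebirdst: forward Sinusoidal projection
# on a sphere with central meridian 0; radius defaults to the value used
# by the standard MODIS Sinusoidal grid (an assumption for these data)
lonlat_to_sinu <- function(lon, lat, radius = 6371007.181) {
  lon_rad <- lon * pi / 180
  lat_rad <- lat * pi / 180
  cbind(x = radius * lon_rad * cos(lat_rad),
        y = radius * lat_rad)
}

# illustrative coordinates in decimal degrees
xy <- lonlat_to_sinu(lon = -76.5, lat = 42.44)
xy
```

In practice, reprojection is better left to a GIS library, which also handles the inverse transform and resampling of the grid.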

Raster Layer Descriptions

Type        Measure         File Name
occurrence  median          <run_name>_hr_2018_occurrence_median.tif
count       median          <run_name>_hr_2018_count_median.tif
abundance   median          <run_name>_hr_2018_abundance_median.tif
abundance   10th quantile   <run_name>_hr_2018_abundance_lower.tif
abundance   90th quantile   <run_name>_hr_2018_abundance_upper.tif

occurrence_median

This layer represents the expected probability of occurrence of the species, ranging from 0 to 1, on an eBird Traveling Count by a skilled eBirder starting at the optimal time of day with the optimal search duration and distance that maximizes detection of that species in a region.

count_median

This layer represents the expected count of a species, conditional on its occurrence at the given location, on an eBird Traveling Count by a skilled eBirder starting at the optimal time of day with the optimal search duration and distance that maximizes detection of that species in a region.

abundance_median

This layer represents the expected relative abundance, computed as the product of the probability of occurrence and the count conditional on occurrence, of the species on an eBird Traveling Count by a skilled eBirder starting at the optimal time of day with the optimal search duration and distance that maximizes detection of that species in a region.
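The relationship between the three quantities can be shown with a small worked example. The values below are made up for three grid cells and are not taken from any real species.

```r
# toy values for three grid cells, made up for illustration
occurrence <- c(0.20, 0.50, 0.00)  # probability of occurrence
count <- c(3, 4, 10)               # expected count, given occurrence

# relative abundance is the product of the two
abundance <- occurrence * count
abundance
```

Note that where the probability of occurrence is zero, relative abundance is zero regardless of the conditional count. Because the delivered layers are medians, multiplying the occurrence_median and count_median layers cell by cell may not reproduce the abundance_median layer exactly.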

abundance_lower

This layer represents the lower 10th quantile of the expected relative abundance of the species on an eBird Traveling Count by a skilled eBirder starting at the optimal time of day with the optimal search duration and distance that maximizes detection of that species in a region.

abundance_upper

This layer represents the upper 90th quantile of the expected relative abundance of the species on an eBird Traveling Count by a skilled eBirder starting at the optimal time of day with the optimal search duration and distance that maximizes detection of that species in a region.

Non-raster Data

The non-raster, tabular data containing information about modeled relationships between observations and the ecological covariates are best accessed through functionality provided in this package. Once loaded through the package, they can be exported to other tabular formats for use with other software. For details on how to access additional information from the model results about predictor importance and directionality, as well as predictive performance metrics, please refer to the non-raster data vignette. Important note: to access the non-raster data, use the parameter tifs_only = FALSE in the ebirdst_download() function.

Vignettes

Beyond this introduction to the eBird Status and Trends products and data, we have written multiple vignettes to help guide users in using the data and the functionality provided by this package. An intro mapping vignette expands upon the quick start readme and shows the basic mapping moves. The advanced mapping vignette shows how to reproduce the seasonal maps and statistics on the eBird Status and Trends website. Finally, the non-raster data vignette details how to access additional information from the model results about predictor importance and directionality, as well as predictive performance metrics.

Conversion to Flat Format

The raster package has a lot of functionality and the RasterLayer format is useful for spatial analysis and mapping, but some users do not have GIS experience or want the data in a simpler format for their preferred method of analysis. There are multiple ways to get more basic representations of the data.

library(raster)

# load the weekly relative abundance estimates
abunds <- load_raster("abundance", path = sp_path)

# use parse_raster_dates() to get actual date objects for each layer
date_vector <- parse_raster_dates(abunds)

# to convert the data to a simpler geographic format with tabular access,
# reproject a single week into geographic coordinates (decimal degrees)
abund_stack_ll <- projectRaster(abunds[[26]], crs = "+init=epsg:4326", 
                                method = "ngb")

# Convert raster object into a matrix
p <- rasterToPoints(abund_stack_ll)
colnames(p) <- c("longitude", "latitude", "abundance_umean")
head(p)

These results can then be written to CSV.

write.csv(p, file = "yebsap_week26.csv", row.names = FALSE)

References

Fink, D., T. Auer, A. Johnston, V. Ruiz‐Gutierrez, W.M. Hochachka, S. Kelling. 2019. Modeling avian full annual cycle distribution and population trends with citizen science data. Ecological Applications, 00(00):e02056. https://doi.org/10.1002/eap.2056