rgbif now has the ability to clean data retrieved from GBIF based on GBIF issues. These issues are returned in data retrieved from GBIF, e.g., through the occ_search() function. Inspired by magrittr, we've set up a workflow for cleaning data based on the %>% operator. You don't have to use it, but as we show below, it can make the process quite easy.
Note that you can also query based on issues, e.g., occ_search(taxonKey=1, issue='DEPTH_UNLIKELY'). However, we imagine it's more likely that you want to search for occurrences based on a taxonomic name or geographic area, not based on issues, so it makes sense to pull data down, then clean as needed using the workflow below with occ_issues().
Note that occ_issues() only affects the data element in the gbif class that is returned from a call to occ_search(). In a future version we may also remove the associated records from the hierarchy and media elements as they are removed from the data element.
occ_issues() also works with data from occ_download().
Install from CRAN
install.packages("rgbif")
Or install the development version from GitHub
remotes::install_github("ropensci/rgbif")
Load rgbif
library('rgbif')
Get taxon key for Helianthus annuus
(key <- name_suggest(q='Helianthus annuus', rank='species')$key[1])
Then pass to occ_search()
(res <- occ_search(taxonKey=key, limit=100))
The dataset gbifissues can be retrieved using the function gbif_issues(). The dataset's first column, code, is the short code used by default in the results from occ_search(), while the second column, issue, is the full issue name given by GBIF. The third column is a full description of the issue.
head(gbif_issues())
You can query to get certain issues
gbif_issues()[ gbif_issues()$code %in% c('cdround','cudc','gass84','txmathi'), ]
The code cdround represents the GBIF issue COORDINATE_ROUNDED, which means that:
Original coordinate modified by rounding to 5 decimals.
The content for this information comes from http://gbif.github.io/gbif-api/apidocs/org/gbif/api/vocabulary/OccurrenceIssue.html.
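The relationship between short codes and full issue names can be sketched with a small lookup in base R. The table below is a mock two-row stand-in for the output of gbif_issues(), containing only the two pairs named in this document, and code_to_issue() is a hypothetical helper, not an rgbif function.

```r
# Mock stand-in for the table returned by gbif_issues(); only the two
# code/issue pairs mentioned in the text are included, descriptions omitted.
issues_tbl <- data.frame(
  code  = c("cdround", "gass84"),
  issue = c("COORDINATE_ROUNDED", "GEODETIC_DATUM_ASSUMED_WGS84"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: translate short codes into full issue names.
# Codes that are not in the table come back as NA.
code_to_issue <- function(codes, tbl = issues_tbl) {
  tbl$issue[match(codes, tbl$code)]
}

code_to_issue(c("gass84", "cdround", "notacode"))
```

With the real table, the same match() pattern should work by passing gbif_issues() as the tbl argument, assuming its first two columns are named code and issue as described above.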
Now that we know a bit about GBIF issues, you can parse your data based on issues. Using the data generated above, and the %>% operator imported from magrittr, we can get only data with the issue gass84, or GEODETIC_DATUM_ASSUMED_WGS84 (note how the number of records returned goes down to 98 from the initial 100).
res %>%
  occ_issues(gass84)
Note also that we've set up occ_issues() so that you can pass in issue names without having to quote them, thereby speeding up data cleaning.
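To make the idea concrete, here is a minimal base R sketch of what keeping records with a given issue amounts to. It is not how occ_issues() is implemented. The data frame occ is mock data, the helper has_issue() is hypothetical, and the sketch assumes that issue codes are stored as a comma-separated string in a column named issues.

```r
# Mock occurrence data; the `issues` column holds comma-separated issue codes
# (an assumption about the layout, used here only for illustration).
occ <- data.frame(
  name   = c("rec1", "rec2", "rec3"),
  issues = c("cdround,gass84", "gass84", "depunl"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE for each record whose issue string contains `code`.
has_issue <- function(issues, code) {
  vapply(
    strsplit(issues, ",", fixed = TRUE),
    function(codes) code %in% codes,
    logical(1)
  )
}

# Keep only the records flagged with gass84.
kept <- occ[has_issue(occ$issues, "gass84"), ]
kept$name
```

Splitting on the comma before matching avoids false positives that a plain substring search could produce when one code is contained in another.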
Next, we can remove data with certain issues just as easily by using a - sign in front of the variable, like this, removing data with issues depunl and mdatunl.
res %>%
  occ_issues(-depunl, -mdatunl)
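The removal case can be sketched the same way in base R. Again, this is only an illustration on mock data, under the assumptions that issue codes sit in a comma-separated issues column and that a record is dropped when it carries any of the listed codes. It does not reproduce the internals of occ_issues().

```r
# Mock occurrence data with comma-separated issue codes (illustrative only).
occ <- data.frame(
  name   = c("rec1", "rec2", "rec3", "rec4"),
  issues = c("cdround,gass84", "depunl", "gass84,mdatunl", ""),
  stringsAsFactors = FALSE
)

# Codes whose records we want to drop.
drop_codes <- c("depunl", "mdatunl")

# TRUE for each record carrying at least one of the unwanted codes.
has_any <- vapply(
  strsplit(occ$issues, ",", fixed = TRUE),
  function(codes) any(codes %in% drop_codes),
  logical(1)
)

# Keep everything else, including records with no issues at all.
cleaned <- occ[!has_any, ]
cleaned$name
```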
Another thing we can do with occ_issues()
is go from issue codes to full issue names in case you want those in your dataset (here, showing only a few columns to see the data better for this demo):
out <- res %>% occ_issues(mutate = "expand")
head(out$data[,c(1,5)])
Sometimes you may want to have each type of issue as a separate column.
Split out each issue type into a separate column, with number of columns equal to number of issue types
out <- res %>% occ_issues(mutate = "split")
head(out$data[,c(1,5:10)])
Or you can expand each issue type into its full name, and split each issue into a separate column.
out <- res %>% occ_issues(mutate = "split_expand")
head(out$data[,c(1,5:10)])
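For readers who want to see the reshaping idea behind splitting issues into columns, the following base R sketch builds one logical column per issue code from mock data. The column layout and the TRUE/FALSE values are assumptions made for illustration, and the output of occ_issues(mutate = "split") may be encoded differently.

```r
# Mock occurrence data with comma-separated issue codes (illustrative only).
occ <- data.frame(
  name   = c("rec1", "rec2", "rec3"),
  issues = c("cdround,gass84", "gass84", "depunl"),
  stringsAsFactors = FALSE
)

# One character vector of codes per record.
code_list <- strsplit(occ$issues, ",", fixed = TRUE)

# Every distinct code seen in the data, in alphabetical order.
all_codes <- sort(unique(unlist(code_list)))

# For each code, a logical vector marking the records that carry it.
flags <- lapply(all_codes, function(code) {
  vapply(code_list, function(codes) code %in% codes, logical(1))
})
names(flags) <- all_codes

# Wide table: the record name plus one column per issue code.
wide <- cbind(occ["name"], as.data.frame(flags))
wide
```

The number of added columns equals the number of distinct issue codes present, which matches the description of the split layout above.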
We hope this helps users get just the data they want, and nothing more. Let us know if you have feedback on the data cleaning functionality in rgbif at info@ropensci.org or at https://github.com/ropensci/rgbif/issues.