Interface to Species Occurrence Data Sources

A programmatic interface to many species occurrence data sources, including the Global Biodiversity Information Facility ('GBIF'), the USGS's Biodiversity Information Serving Our Nation ('BISON'), 'iNaturalist', the Berkeley Ecoinformatics Engine, 'eBird', 'AntWeb', Integrated Digitized Biocollections ('iDigBio'), 'VertNet', the Ocean Biogeographic Information System ('OBIS'), and the Atlas of Living Australia ('ALA'). Includes functionality for retrieving species occurrence data and for combining those data.



spocc = SPecies OCCurrence data

At rOpenSci, we have been writing R packages to interact with many sources of species occurrence data, including GBIF, VertNet, BISON, iNaturalist, the Berkeley ecoengine, AntWeb, and eBird. Other databases are out there as well, and we can pull those in too. spocc is an R package to query and collect species occurrence data from many sources. The goal is to create a seamless search experience across data sources, with unified outputs across data sources.

spocc currently interfaces with ten major biodiversity repositories:

  1. Global Biodiversity Information Facility (GBIF) (via rgbif): GBIF is a government-funded open data repository with several partner organizations, whose express goal is to provide access to data on Earth's biodiversity. The data are made available by a network of member nodes, coordinating information from various participant organizations and government agencies.

  2. Berkeley Ecoengine (via ecoengine): The ecoengine is an open API built by the Berkeley Initiative for Global Change Biology. The repository provides access to over 3 million specimens from various Berkeley natural history museums. These data span more than a century and provide access to georeferenced specimens, species checklists, photographs, vegetation surveys and resurveys, and a variety of measurements from environmental sensors located at reserves across the University of California's natural reserve system.

  3. iNaturalist: iNaturalist provides access to crowdsourced citizen science data on species observations.

  4. VertNet (via rvertnet): Like GBIF, the ecoengine, and BISON (see below), VertNet provides access to more than 80 million vertebrate records spanning a large number of institutions and museums, primarily covering four major disciplines (mammalogy, herpetology, ornithology, and ichthyology). VertNet has been available as a data source since spocc 0.3.0.

  5. Biodiversity Information Serving Our Nation (BISON) (via rbison): Built by the US Geological Survey's core science analytic team, BISON is a portal that provides access to species occurrence data from several participating institutions.

  6. eBird (via rebird): eBird is a database developed and maintained by the Cornell Lab of Ornithology and the National Audubon Society. It provides real-time access to checklist data, data on bird abundance and distribution, and community reports from birders.

  7. AntWeb (via AntWeb): AntWeb is the world's largest online database of images, specimen records, and natural history information on ants. It is community driven and open to contribution from anyone with specimen records, natural history comments, or images.

  8. iDigBio (via ridigbio): iDigBio facilitates the digitization of biological and paleobiological specimens and their associated data. It houses specimen data and serves them via RESTful web services.

  9. OBIS: OBIS (Ocean Biogeographic Information System) allows users to search marine species datasets from all of the world's oceans.

  10. Atlas of Living Australia: ALA (Atlas of Living Australia) contains information on all the known species in Australia, aggregated from a wide range of data providers: museums, herbaria, community groups, government departments, individuals, and universities. It contains more than 50 million occurrence records.

The inspiration for this comes from users requesting a more seamless experience across data sources, and from our work on a similar package for taxonomy data (taxize).

BEWARE: In cases where you request data from multiple providers, especially when including GBIF, there could be duplicate records since many providers' data eventually ends up with GBIF. See ?spocc_duplicates, after installation, for more.
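One rough way to screen for such duplicates after combining results is to compare a few shared fields. The sketch below is our own assumption, not the procedure documented in ?spocc_duplicates: it uses base R on a small hand-made data frame with name, longitude, latitude, prov, and date columns, and treats records that agree on name, coordinates, and date as possible duplicates regardless of provider. The helper flag_possible_duplicates() is hypothetical and the values are made up.

```r
# Toy data standing in for combined results from several providers.
# The values are invented for illustration only.
dat <- data.frame(
  name      = c("Accipiter striatus", "Accipiter striatus", "Accipiter striatus"),
  longitude = c(-97.12924, -97.12924, -84.74625),
  latitude  = c(32.70085, 32.70085, 40.01773),
  prov      = c("gbif", "inat", "gbif"),
  date      = as.Date(c("2017-01-09", "2017-01-09", "2016-04-25")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: flag rows that repeat an earlier row on the chosen
# fields, ignoring which provider supplied them.
flag_possible_duplicates <- function(x,
                                     fields = c("name", "longitude", "latitude", "date")) {
  duplicated(x[, fields, drop = FALSE])
}

dat$possible_dup <- flag_possible_duplicates(dat)

# Keep only the first occurrence of each flagged group.
deduped <- dat[!dat$possible_dup, ]
```

Exact matching on coordinates is strict: the same record can come back from two providers with different rounding, so you may want to round coordinates before comparing.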

See CONTRIBUTING.md

Installation

Stable version from CRAN

install.packages("spocc", dependencies = TRUE)

Or the development version from GitHub

install.packages("devtools")
devtools::install_github("ropensci/spocc")
library("spocc")

Basic use

Get data from GBIF

(out <- occ(query = 'Accipiter striatus', from = 'gbif', limit = 100))
#> Searched: gbif
#> Occurrences - Found: 617,957, Returned: 100
#> Search type: Scientific
#>   gbif: Accipiter striatus (100)

Just gbif data

out$gbif
#> Species [Accipiter striatus (100)] 
#> First 10 rows of [Accipiter_striatus]
#> 
#> # A tibble: 100 × 63
#>                  name  longitude latitude  prov         issues        key
#>                 <chr>      <dbl>    <dbl> <chr>          <chr>      <int>
#> 1  Accipiter striatus  -97.12924 32.70085  gbif cdround,gass84 1453324136
#> 2  Accipiter striatus  -84.74625 40.01773  gbif cdround,gass84 1453369124
#> 3  Accipiter striatus  -72.58904 43.85320  gbif cdround,gass84 1453335509
#> 4  Accipiter striatus  -96.77096 33.22315  gbif cdround,gass84 1453335637
...

Pass options to each data source

Get fine-grained control over each data source by passing parameters on to the underlying package (rebird in this example).

(out <- occ(query = 'Setophaga caerulescens', from = 'ebird', ebirdopts = list(region = 'US')))
#> Searched: ebird
#> Occurrences - Found: 0, Returned: 199
#> Search type: Scientific
#>   ebird: Setophaga caerulescens (199)

Just ebird data

out$ebird
#> Species [Setophaga caerulescens (199)] 
#> First 10 rows of [Setophaga_caerulescens]
#> 
#> # A tibble: 199 × 12
#>                      name longitude latitude  prov
#>                     <chr>     <dbl>    <dbl> <chr>
#> 1  Setophaga caerulescens -81.74960 24.57340 ebird
#> 2  Setophaga caerulescens -82.50378 27.31897 ebird
#> 3  Setophaga caerulescens -82.81150 27.83556 ebird
#> 4  Setophaga caerulescens -93.94818 29.69841 ebird
...

Many data sources at once

Get data from many sources in a single call

ebirdopts <- list(region = 'US')
gbifopts <- list(country = 'US')
out <- occ(query = 'Setophaga caerulescens', from = c('gbif', 'bison', 'inat', 'ebird'),
  gbifopts = gbifopts, ebirdopts = ebirdopts, limit = 50)
dat <- occ2df(out)
head(dat); tail(dat)
#> # A tibble: 6 × 6
#>                     name   longitude  latitude  prov       date        key
#>                    <chr>       <chr>     <chr> <chr>     <date>      <chr>
#> 1 Setophaga caerulescens -122.673863 45.476817  gbif 2017-01-09 1453379582
#> 2 Setophaga caerulescens  -83.035698 35.431075  gbif 2016-04-25 1453190650
#> 3 Setophaga caerulescens  -83.162943 41.615537  gbif 2016-05-12 1269558094
#> 4 Setophaga caerulescens  -74.405661 40.058324  gbif 2016-05-20 1453340127
#> 5 Setophaga caerulescens  -83.449402 44.252577  gbif 2016-05-14 1291104360
#> 6 Setophaga caerulescens  -83.449517 44.253819  gbif 2016-05-07 1291149600
#> # A tibble: 6 × 6
#>                     name   longitude   latitude  prov       date      key
#>                    <chr>       <chr>      <chr> <chr>     <date>    <chr>
#> 1 Setophaga caerulescens -82.8282344 27.8851167 ebird 2017-04-18 L3547190
#> 2 Setophaga caerulescens   -82.64435  27.532639 ebird 2017-04-18  L189003
#> 3 Setophaga caerulescens -81.7713915 26.1084774 ebird 2017-04-18 L2603780
#> 4 Setophaga caerulescens -84.8486335  29.671734 ebird 2017-04-18  L352112
#> 5 Setophaga caerulescens -80.7833333 33.7833333 ebird 2017-04-18  L109521
#> 6 Setophaga caerulescens -81.9212021  26.746724 ebird 2017-04-17 L3579621
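In the combined output above, longitude and latitude are printed as <chr> columns, so you may want to convert them before doing any numeric work. A minimal sketch in base R on a hand-made data frame (coordinate values copied from the output above); it does not call spocc itself.

```r
# Hand-made stand-in for a combined data frame with character coordinates.
dat <- data.frame(
  name      = "Setophaga caerulescens",
  longitude = c("-122.673863", "-82.64435"),
  latitude  = c("45.476817", "27.532639"),
  prov      = c("gbif", "ebird"),
  stringsAsFactors = FALSE
)

# Convert the character coordinates to numeric. Values that cannot be
# parsed become NA (with a warning) rather than stopping with an error.
dat$longitude <- as.numeric(dat$longitude)
dat$latitude  <- as.numeric(dat$latitude)

# Inspect the resulting column types.
str(dat[, c("longitude", "latitude")])
```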

Clean data

All data cleaning functionality now lives in a separate package, scrubr, which is on CRAN.
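To give a feel for the kind of cleaning involved, the sketch below drops records with missing or impossible coordinates (latitude outside -90 to 90, or longitude outside -180 to 180). It is plain base R written for illustration and is not scrubr's API; drop_impossible_coords() is a hypothetical helper and the data are made up.

```r
# Hypothetical helper: keep only rows whose coordinates are present and
# fall within the valid ranges for latitude and longitude.
drop_impossible_coords <- function(x) {
  ok <- !is.na(x$latitude) & !is.na(x$longitude) &
    abs(x$latitude) <= 90 & abs(x$longitude) <= 180
  x[ok, , drop = FALSE]
}

# Invented records: "b" has an impossible longitude, "c" an impossible
# latitude, and "d" a missing longitude.
dat <- data.frame(
  name      = c("a", "b", "c", "d"),
  longitude = c(-97.1, 200.0, -84.7, NA),
  latitude  = c(32.7, 40.0, -95.0, 10.0),
  stringsAsFactors = FALSE
)

clean <- drop_impossible_coords(dat)
```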

Make maps

All mapping functionality is now in a separate package, mapr (formerly known as spoccutils), to make spocc easier to maintain. mapr is on CRAN.

Meta

  • Please report any issues or bugs.
  • License: MIT
  • Get citation information for spocc in R by running citation(package = 'spocc')
  • Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.


News

spocc 0.7.0

NEW FEATURES

  • Removed the JavaScript code and the V8 package import; spocc now uses the C++-based wicket package instead. You no longer need V8, which should make installation easier on some platforms. (#172)

MINOR IMPROVEMENTS

  • httr replaced with crul for HTTP requests (#174)
  • Moved to using markdown for docs. The only change you should notice is that specifying curl options is slightly different: see curl::curl_options() (#176)
  • All as.*() functions can now pass on curl options to the http client (#177)
  • Bumped minimum versions for a number of dependencies

BUG FIXES

  • Fix to foo_ala(), the internal plugin for occ() that handles ALA queries: changed the query from a full-text query using q=foo bar to q=taxon_name="foo bar". In addition, improved error handling, as the occurrences slot is sometimes returned in results but is empty, whereas before it seemed to always be absent if there were no results (#178)

spocc 0.6.0

NEW FEATURES

  • Added a new data source: Atlas of Living Australia (ALA), under the abbreviation ala (#98)
  • Added a new data source: Ocean Biogeographic Information System (OBIS), under the abbreviation obis (#155)

MINOR IMPROVEMENTS

  • Added note to docs and minor tweak to internal methods to account for max results from iDigBio of 100,000. Now when you request more than 100K, you should get a warning saying as much (#169)

BUG FIXES

  • Made occ2df() more robust to varied inputs - allowing for users that may on purpose or not have a subset of the data source slots normally in the occdat class object (#171)

spocc 0.5.4

MINOR IMPROVEMENTS

  • rvertnet, a dependency dealing with data from VertNet, was failing on certain searches. rvertnet was fixed and a new version is on CRAN now. No changes here other than requiring the new version of rvertnet (#168)
  • Fix to internal INAT parsers to handle JSON data output instead of CSV output. And fix to internal date parsing: INAT changed the date field from datetime to observed_on.
  • Move all is() to inherits(), and namespace all setNames() calls
  • We are now using rgbif::occ_data() instead of rgbif::occ_search()
  • We are now using rvertnet::searchbyterm() instead of rvertnet::vertsearch()

BUG FIXES

  • Fixes to iDigBio internal plugin - we were dropping scientificname if geometry was passed by the user. Fixed now. (#167)
  • Fixed bug in GBIF internal plugin - when more than 1 result given back (e.g., multiple searches were done, resulting in a list of objects) we weren't parsing the output correctly. Fixed now. (#166)

spocc 0.5.0

NEW FEATURES

  • occ() now allows queries that only pass from and one of the data source opts params (e.g., gbifopts) - allows specifying any options passed down to the internal functions used to do data queries without having to use the other params in occ (#163)

MINOR IMPROVEMENTS

  • Now using tibble for representing data.frames (#164)
  • Now using explicit encoding="UTF-8" in httr::content() calls to parse raw data from web requests (#160)
  • Now using ridigbio as it's on CRAN - we were using internal functions prior to this (#154)

BUG FIXES

  • There was a problem in the eBird parser where it wasn't processing results from eBird with no data. A problem with has_coords was also fixed. (#161)

spocc 0.4.5

MINOR IMPROVEMENTS

  • Using data.table::setDF() instead of data.frame() to set a data.table style table to a data.frame
  • Added many more tests to make it less likely errors will occur
  • Added vertnet as an option to occ_options() to get the options for passing to vertopts in occ()

BUG FIXES

  • Fix to print.occdatind(): the last version introduced a bug in this print method. It wasn't fatal, as it only applied to empty slots in the output of a call to occ(), but it was not good nonetheless (#159)

spocc 0.4.4

MINOR IMPROVEMENTS

  • New import data.table for fast list to data.frame

BUG FIXES

  • Fix to ecoengine spatial search - internally we were not making the bounding box correctly - fixed now (#158)

spocc 0.4.0

NEW FEATURES

  • New function as.vertnet() to coerce various inputs (e.g., result from occ(), occ2df(), or a key itself) to occurrence data objects (#142)
  • occ() gains two parameters start and page to facilitate paging through results across data sources, instead of having to page individually for each data source. Some sources use the start parameter, while others use the page parameter. See Paging section in ?occ for details on Paging (#140)

MINOR IMPROVEMENTS

  • Added Code of Conduct

BUG FIXES

  • wkt_vis() now works with WKT polygons with multiple polygons, e.g., spocc::wkt_vis("POLYGON((-125 38.4, -121.8 38.4, -121.8 40.9, -125 40.9, -125 38.4), (-115 22.4, -111.8 22.4, -111.8 30.9, -115 30.9, -115 22.4))") (#147)
  • Fix to print.occdatind() to print more helpful info when a geometry search is used as opposed to a taxonomy based search (#149)
  • Fix to print.occdatind() to not fail when first element not present; proceeds to next slot with data (#143)
  • Fixed problem where occ() failed when multiple geometry elements passed in along with taxonomic names (#146)
  • Fix to occ2df() for combining outputs to not fail when AntWeb doesn't give back dates (#144) (#145) - thanks @timcdlucas
  • Fix to occ2df() to not fail when date field missing (#141)
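The WKT strings passed to wkt_vis() are plain text, so a simple rectangular polygon can be assembled from a bounding box. The sketch below uses only base R; bbox_as_wkt() is a hypothetical helper written for this example, and the argument order (minimum longitude, minimum latitude, maximum longitude, maximum latitude) is an assumption of the sketch.

```r
# Hypothetical helper: build a closed rectangular WKT polygon from a
# bounding box. The ring starts at the lower-left corner, goes
# counter-clockwise, and repeats the first point to close the ring.
bbox_as_wkt <- function(minx, miny, maxx, maxy) {
  stopifnot(minx < maxx, miny < maxy)
  sprintf(
    "POLYGON((%s %s, %s %s, %s %s, %s %s, %s %s))",
    minx, miny,
    maxx, miny,
    maxx, maxy,
    minx, maxy,
    minx, miny
  )
}

wkt <- bbox_as_wkt(-125, 38.4, -121.8, 40.9)
print(wkt)
```

With these inputs the result matches the first ring of the wkt_vis() example in the bug-fix list above.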

spocc 0.3.2

NEW FEATURES

  • Added iDigBio as a new data source in spocc (#136) (#124)

MINOR IMPROVEMENTS

  • Added much more detail on what parameters in child packages are being used inside of the occ() function. Each data source is taken care of in a separate package or set of wrapper functions, and the man file now details what API parameters are being queried (#138)

BUG FIXES

  • Fixed bug where missing latitude/longitude columns caused problems downstream in printing outputs, etc. Now we put in NAs when those columns are missing (#139)
  • Fixed bug in inat data source - Datetime variable changed to datetime
  • Fixed bug in vertnet data source - occurrenceID variable changed to occurrenceid

spocc 0.3.0

NEW FEATURES

  • Mapping functions all gone, and put into a new package spoccutils (https://github.com/ropensci/spoccutils) (#132)
  • occ() gains new parameter has_coords - a global parameter (except for ebird and bison) to return only records with lat/long data. (#128)
  • type (#134) and rank (#133) parameters dropped from occ()
  • When object returned by occ() is printed, we now include a message that total count of records found (not returned) is not completely known if ebird is included, because eBird does not include data on records found on their servers with requests to their API (#111)
  • New functions as.*() (e.g., as.gbif) for most data sources. These functions take in occurrence keys or sets of keys, and retrieve detailed occurrence record data for each key (#112)
  • New data source: VertNet (#110)
  • occ2df() now returns more fields. This function collapses all essential fields that are easy to get in all data sources: name, lat, long, prov, date, key. The key field is the occurrence key for each record, which you can use to keep track of individual records, get more data on the record, etc. (#103) (#108)
  • New function inspect() - takes output from occ() or individual occurrence keys and gets detailed occurrence data.

MINOR IMPROVEMENTS

  • Now importing packages: jsonlite, V8, utils, and methods. No longer importing: ggmap, maptools, rworldmap, sp, rgeos, RColorBrewer, rgdal, and leafletR. Pkgs removed mostly due to splitting off some functionality into spoccutils. related issues: (#131) (#132)
  • Now importing explicitly all non-base R functions that we use: now importing methods, utils (#120)
  • We now attempt to standardize dates across all data sources, and return that in the output of a call to occ2df() (#106)
  • wkt_vis() now only has an option to view a WKT shape in the browser.

BUG FIXES

  • Fixes to being able to pass curl options on to each data source's functions (#107)

spocc 0.2.4

MINOR IMPROVEMENTS

  • Improved documentation for bounding boxes, their expected format, etc. (#96)
  • Remove dependency on the following packages: assertthat, plyr, data.table, and XML (#102)
  • Using package gistr now to post interactive geojson maps on GitHub gists (#100)
  • rgbif now must be v0.7.7 or greater (the latest version on CRAN).
  • Removed the startup message.

BUG FIXES

  • Removed occ2sp(), a duplicate function that was not working correctly. occ_to_sp() is the working version. (#97)
  • Fixed bug where some records returned from GBIF did not have lat/long column headers, and we internally rearranged columns, which caused a complete stop when that happened. Fixed now. (#101)
  • Changed all \donttest to \dontrun in examples as requested by CRAN maintainers (#99)

spocc 0.2.2

NEW FEATURES

  • Added new function occ_names() to search only for taxonomic names. The goal here is to use this function if there is some question about which names you want to use to search for occurrences. (#84). Suggested by @jarioksa
  • New function occ_names_options() to quickly get parameter options to pass to occ_names().
  • New summary() method for the occdat S3 object that is output from occ() (#83)
  • In many places in spocc (README, vignette, occ() documentation file, at package startup), we make it clear that there could be duplicate records returned in certain scenarios. And a new documentation page detailing what to watch out for: ?spocc_duplicates. (#77)

MINOR IMPROVEMENTS

  • All latitude/longitude column headers are now changed to latitude and longitude, whereas they used to vary among latitude, decimalLatitude, Latitude, lat, and decimal_latitude. (#91)
  • Default is 500 now for the limit parameter in occ() (#78)
  • You can now pass in limit to each function's options parameter, and it will work. Each data source can have a different parameter internally from limit, but now internally within spocc, we allow you to use limit so you don't have to know what the data source specific parameter is. (#81)
  • There is now a startup message to give information on the package (#79)
  • occ_options() gains a new parameter, where, to either print to the console or open the man file in the IDE; in command-line R it prints to the console.
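The column-name standardization described in the list above can be sketched in base R as follows. This is an illustration and not spocc's internal code: standardize_coord_names() is a hypothetical helper, the latitude variants are the ones named in the list above, and the longitude variants are our assumption, mirrored from the latitude ones.

```r
# Hypothetical helper: rename any recognised latitude/longitude column
# header to the standard names "latitude" and "longitude".
standardize_coord_names <- function(x) {
  lat_variants <- c("latitude", "decimalLatitude", "Latitude", "lat",
                    "decimal_latitude")
  # Assumed mirror of the latitude variants; not taken from the changelog.
  lon_variants <- c("longitude", "decimalLongitude", "Longitude", "lon",
                    "decimal_longitude")
  names(x)[names(x) %in% lat_variants] <- "latitude"
  names(x)[names(x) %in% lon_variants] <- "longitude"
  x
}

# Two invented data frames with different header conventions.
a <- data.frame(name = "x", decimalLatitude = 32.7, decimalLongitude = -97.1)
b <- data.frame(name = "y", lat = 40.0, lon = -84.7)

# After renaming, the frames share column names and can be row-bound.
combined <- rbind(standardize_coord_names(a), standardize_coord_names(b))
```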

spocc 0.2.0

NEW FEATURES

  • occ() gains new parameter callopts to pass on curl debugging options to httr::GET() (#35)
  • wkt_vis() now by default plots a well known text area (WKT) on an interactive mapbox map in your default browser. New parameter which allows you to choose the interactive map or a static ggplot2 map. (#70)
  • The output for each individual data source in occ() gains a new class. In the previous version of this package, a data.frame was printed. Now the data are assigned the class occdatind (short for occdat individual).
  • occ() now uses a print method for the occdatind class, adopted from dplyr that prints a brief data.frame, with columns wrapped to fit the width of your console, and additional columns not printed given at bottom with their class type. Note that the print behavior for the resulting object of an occ() call remains the same. (#69) (#74)

MINOR IMPROVEMENTS

  • Added whisker as a package import to use in the wkt_vis() function. (#70)
  • Mapping functions now all accept the same input. Previously mapggplot() accepted the output of occ(), of class occdat, while the other two functions for mapping, mapleaflet() and mapgist() accepted a data.frame. Now all three functions accept the output of occ(), an object of class occdat. (#75)
  • The meta slot in each returned object (indexed by object$meta) contains spots for returned and found, to designate number of records returned, and number of records found. (#64)

BUG FIXES

  • Fixed bug in AntWeb output, where there was supposed to be a column titled name. (#71)

spocc 0.1.4

NEW FEATURES

  • Can now do geometry only queries. See examples in occ().
  • In addition, you can pass in sp objects of SpatialPolygons or SpatialPolygonsDataFrame classes.

spocc 0.1.2

NEW FEATURES

  • There were quite a few changes in one of the key packages that spocc depends on: rgbif. A number of input and output parameter names changed. A new version of rgbif was pushed to CRAN. (#56)
  • New function clean_spocc() started (not finished yet) to attempt to clean data. For example, one use case is removing impossible lat/long values (e.g., longitude values with an absolute value greater than 180). Another, not implemented yet, is to remove points that are not in the country or habitat your points are supposed to be in. (#44)
  • New function fixnames() to trim species names with optional input parameters to make data easier to use for mapping.
  • New function wkt_vis() to visualize a WKT (well-known text) area on a map. Uses ggmap to pull down a Google map so that the visualization has some geographic and natural earth context. We'll soon introduce an interactive version of this function that will bring up a small Shiny app to draw a WKT area, then return those coordinates to your R session. (#34)

MINOR IMPROVEMENTS

  • Added a CONTRIBUTING.md file to the github repo to help guide contributions (#61)
  • Packages that require a certain version are forced to be X version or greater. These are rinat (>= 0.1.1), rbison (>= 0.3.2), rgbif (>= 0.6.2), ecoengine (>= 1.3), rebird (>= 0.1.1), AntWeb (>= 0.6.1), and leafletR (>= 0.2-0). This should help avoid problems.
  • General improvement to function documentation.

spocc 0.1.0

  • Initial release to CRAN


0.7.0 by Scott Chamberlain, a year ago


https://github.com/ropensci/spocc


Report a bug at https://github.com/ropensci/spocc/issues


Browse source code at https://github.com/cran/spocc


Authors: Scott Chamberlain [aut, cre], Karthik Ram [ctb], Ted Hart [ctb]




MIT + file LICENSE license


Imports utils, rgbif, rbison, rebird, rvertnet, ridigbio, lubridate, crul, whisker, jsonlite, data.table, tibble, wicket

Suggests roxygen2, testthat, knitr, taxize


Imported by mapr, wallace.

Suggested by ENMeval.

