Interface to the Search 'API' for 'PLoS' Journals

A programmatic interface to the 'SOLR' based search 'API' (<http://api.plos.org/>) provided by the Public Library of Science journals to search their articles. Functions are included for searching for articles, retrieving articles, making plots, doing 'faceted' searches, 'highlight' searches, and viewing results of 'highlighted' searches in a browser.


You can get this package from CRAN, or install it within R by running

install.packages("rplos")

Or install the development version from GitHub

install.packages("devtools")
devtools::install_github("ropensci/rplos")
library("rplos")

rplos is a package for accessing full text articles from the Public Library of Science journals using their API.

You used to need a key to use rplos - you no longer do as of 2015-01-13 (or v0.4.5.999).

rplos tutorial at rOpenSci website here

PLoS API documentation here

Crossref API documentation here, here, and here. Note that we are working on a new package rcrossref (on CRAN) with a much fuller implementation of R functions for all Crossref endpoints.

Search for the term ecology, and return id (DOI) and publication date, limiting to 5 items

searchplos('ecology', 'id,publication_date', limit = 5)
#> $meta
#>   numFound start maxScore
#> 1    33699     0       NA
#> 
#> $data
#>                                                        id
#> 1                            10.1371/journal.pone.0059813
#> 2                            10.1371/journal.pone.0001248
#> 3 10.1371/annotation/69333ae7-757a-4651-831c-f28c5eb02120
#> 4                            10.1371/journal.pone.0080763
#> 5                            10.1371/journal.pone.0155019
#>       publication_date
#> 1 2013-04-24T00:00:00Z
#> 2 2007-11-28T00:00:00Z
#> 3 2013-10-29T00:00:00Z
#> 4 2013-12-10T00:00:00Z
#> 5 2016-05-11T00:00:00Z
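
The result is a list with two elements, meta and data, as printed above. Below is a minimal sketch of post-processing it, assuming publication_date comes back as an ISO 8601 character string as shown. The list is typed in by hand from the output above (three of the five rows) rather than fetched, so the example runs offline.

```r
# Hand-built stand-in for the list returned by searchplos() above.
res <- list(
  meta = data.frame(numFound = 33699, start = 0, maxScore = NA),
  data = data.frame(
    id = c("10.1371/journal.pone.0059813",
           "10.1371/journal.pone.0001248",
           "10.1371/journal.pone.0080763"),
    publication_date = c("2013-04-24T00:00:00Z",
                         "2007-11-28T00:00:00Z",
                         "2013-12-10T00:00:00Z"),
    stringsAsFactors = FALSE
  )
)

# Parse the timestamp strings into Date objects (interpreted as UTC).
res$data$pub_date <- as.Date(
  as.POSIXct(res$data$publication_date,
             format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
)

# Order from newest to oldest.
newest_first <- res$data[order(res$data$pub_date, decreasing = TRUE), ]
print(newest_first[, c("id", "pub_date")])
```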

Get DOIs for full articles in PLoS One

searchplos(q="*:*", fl='id', fq=list('cross_published_journal_key:PLoSONE',
   'doc_type:full'), limit=5)
#> $meta
#>   numFound start maxScore
#> 1   161020     0       NA
#> 
#> $data
#>                             id
#> 1 10.1371/journal.pone.0024229
#> 2 10.1371/journal.pone.0024016
#> 3 10.1371/journal.pone.0094828
#> 4 10.1371/journal.pone.0095065
#> 5 10.1371/journal.pone.0130784

Query to get some PLOS article-level metrics. Notice the difference between the two outputs: the second is sorted by counter_total_all in descending order

out <- searchplos(q="*:*", fl=c('id','counter_total_all','alm_twitterCount'), fq='doc_type:full')
out_sorted <- searchplos(q="*:*", fl=c('id','counter_total_all','alm_twitterCount'),
   fq='doc_type:full', sort='counter_total_all desc')
head(out$data)
#>                             id alm_twitterCount counter_total_all
#> 1 10.1371/journal.pone.0024229                0              1762
#> 2 10.1371/journal.pgen.1004259                0               686
#> 3 10.1371/journal.pone.0024016                0              1774
#> 4 10.1371/journal.pgen.1004316                0              1597
#> 5 10.1371/journal.pone.0094828                0              1024
#> 6 10.1371/journal.pgen.1004352                1              1647
head(out_sorted$data)
#>                                                        id alm_twitterCount
#> 1                            10.1371/journal.pmed.0020124             2480
#> 2 10.1371/annotation/80bd7285-9d2d-403a-8e6f-9c375bf977ca                0
#> 3                            10.1371/journal.pcbi.0030102               53
#> 4                            10.1371/journal.pcbi.1003149              145
#> 5                            10.1371/journal.pmed.0050045              153
#> 6                            10.1371/journal.pone.0069841              860
#>   counter_total_all
#> 1           1707520
#> 2            546738
#> 3            448864
#> 4            409507
#> 5            391646
#> 6            389823

A list of articles about social networks that are popular on a social network

searchplos(q="*:*",fl=c('id','alm_twitterCount'),
   fq=list('doc_type:full','subject:"Social networks"','alm_twitterCount:[100 TO 10000]'),
   sort='counter_total_month desc')
#> $meta
#>   numFound start maxScore
#> 1       43     0       NA
#> 
#> $data
#>                              id alm_twitterCount
#> 1  10.1371/journal.pone.0155885              118
#> 2  10.1371/journal.pmed.1000316              923
#> 3  10.1371/journal.pone.0151588              153
#> 4  10.1371/journal.pone.0069841              860
#> 5  10.1371/journal.pone.0149885              123
#> 6  10.1371/journal.pone.0073791             1667
#> 7  10.1371/journal.pcbi.1003789             1528
#> 8  10.1371/journal.pbio.1001535             1798
#> 9  10.1371/journal.pone.0148405              469
#> 10 10.1371/journal.pone.0156409              121

Show all articles that have these two words less than about 15 words apart

searchplos(q='everything:"sports alcohol"~15', fl='title', fq='doc_type:full', limit=3)
#> $meta
#>   numFound start maxScore
#> 1      100     0       NA
#> 
#> $data
#>                                                                                                                                                       title
#> 1                                                               Alcohol Advertising in Sport and Non-Sport TV in Australia, during Children’s Viewing Times
#> 2                                                   Correction: Alcohol Advertising in Sport and Non-Sport TV in Australia, during Children’s Viewing Times
#> 3 Symptoms of Insomnia and Sleep Duration and Their Association with Incident Strokes: Findings from the Population-Based MONICA/KORA Augsburg Cohort Study

Narrow results to 7 words apart, changing the ~15 to ~7

searchplos(q='everything:"sports alcohol"~7', fl='title', fq='doc_type:full', limit=3)
#> $meta
#>   numFound start maxScore
#> 1       51     0       NA
#> 
#> $data
#>                                                                                                                                                       title
#> 1                                                               Alcohol Advertising in Sport and Non-Sport TV in Australia, during Children’s Viewing Times
#> 2                                                   Correction: Alcohol Advertising in Sport and Non-Sport TV in Australia, during Children’s Viewing Times
#> 3 Symptoms of Insomnia and Sleep Duration and Their Association with Incident Strokes: Findings from the Population-Based MONICA/KORA Augsburg Cohort Study
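
The two proximity queries above differ only in the number after the tilde. As a sketch, a small helper can build such query strings; proximity_query() is a hypothetical name used for illustration and is not part of rplos.

```r
# Hypothetical helper (not part of rplos): build a Solr proximity query string
# of the form field:"term1 term2"~distance.
proximity_query <- function(terms, distance, field = "everything") {
  stopifnot(is.character(terms), length(terms) >= 2,
            is.numeric(distance), length(distance) == 1, distance >= 0)
  sprintf('%s:"%s"~%d', field, paste(terms, collapse = " "), as.integer(distance))
}

q15 <- proximity_query(c("sports", "alcohol"), 15)
q7  <- proximity_query(c("sports", "alcohol"), 7)
print(q15)
print(q7)

# The strings can then be passed as `q`, for example:
# searchplos(q = q7, fl = "title", fq = "doc_type:full", limit = 3)
```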

Remove DOIs for annotations (i.e., corrections) and Viewpoints articles

searchplos(q='*:*', fl=c('id','article_type'),
   fq=list('-article_type:correction','-article_type:viewpoints'), limit=5)
#> $meta
#>   numFound start maxScore
#> 1  1565546     0       NA
#> 
#> $data
#>                                          id     article_type
#> 1        10.1371/journal.pone.0024229/title Research Article
#> 2     10.1371/journal.pone.0024229/abstract Research Article
#> 3   10.1371/journal.pone.0024229/references Research Article
#> 4         10.1371/journal.pone.0024229/body Research Article
#> 5 10.1371/journal.pone.0024229/introduction Research Article
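
The fq filters used in the examples above are plain field:value strings, with a leading minus sign to exclude matches. A minimal sketch of assembling them from named vectors follows; build_fq() is a hypothetical helper for illustration and is not part of rplos.

```r
# Hypothetical helper (not part of rplos): turn named include/exclude vectors
# into a list of fq strings such as 'doc_type:full' or '-article_type:correction'.
build_fq <- function(include = character(), exclude = character()) {
  # Wrap values that contain whitespace in double quotes.
  quote_if_needed <- function(v) ifelse(grepl("\\s", v), paste0('"', v, '"'), v)
  fmt <- function(x, prefix) {
    if (length(x) == 0) return(character())
    paste0(prefix, names(x), ":", quote_if_needed(unname(x)))
  }
  as.list(c(fmt(include, ""), fmt(exclude, "-")))
}

fq <- build_fq(
  include = c(doc_type = "full", subject = "Social networks"),
  exclude = c(article_type = "correction", article_type = "viewpoints")
)
str(fq)

# The list can then be passed as `fq`, for example:
# searchplos(q = "*:*", fl = c("id", "article_type"), fq = fq, limit = 5)
```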

Facet on multiple fields

facetplos(q='alcohol', facet.field=c('journal','subject'), facet.limit=5)
#> $facet_queries
#> NULL
#> 
#> $facet_fields
#> $facet_fields$journal
#>                                 X1      X2
#> 1                         plos one 1319779
#> 2                    plos genetics   51827
#> 3                   plos pathogens   45119
#> 4       plos computational biology   38599
#> 5 plos neglected tropical diseases   37114
#> 
#> $facet_fields$subject
#>                              X1      X2
#> 1     biology and life sciences 1505447
#> 2  medicine and health sciences 1167857
#> 3 research and analysis methods  999924
#> 4                  biochemistry  736199
#> 5                  cell biology  629720
#> 
#> 
#> $facet_dates
#> NULL
#> 
#> $facet_ranges
#> NULL

Range faceting

facetplos(q='*:*', facet.range='counter_total_all',
 facet.range.start=5, facet.range.end=100, facet.range.gap=10)
#> $facet_queries
#> NULL
#> 
#> $facet_fields
#> NULL
#> 
#> $facet_dates
#> NULL
#> 
#> $facet_ranges
#> $facet_ranges$counter_total_all
#>    X1   X2
#> 1   5  395
#> 2  15  826
#> 3  25 1106
#> 4  35 1686
#> 5  45 2134
#> 6  55 1979
#> 7  65 1845
#> 8  75 1784
#> 9  85 1441
#> 10 95 1279
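
In the range facet output above, X1 is the lower bound of each bucket and X2 is the number of matching records. The sketch below turns that table into numeric buckets. It assumes the columns arrive as character strings and that each bucket nominally spans the lower bound plus facet.range.gap; the values are typed in by hand from the output above.

```r
# Hand-typed copy of the facet_ranges output above; columns assumed to be character.
ranges <- data.frame(
  X1 = c("5", "15", "25", "35", "45", "55", "65", "75", "85", "95"),
  X2 = c("395", "826", "1106", "1686", "2134", "1979", "1845", "1784", "1441", "1279"),
  stringsAsFactors = FALSE
)
gap <- 10  # same value as facet.range.gap in the call

buckets <- data.frame(
  from  = as.numeric(ranges$X1),
  count = as.numeric(ranges$X2)
)
buckets$to    <- buckets$from + gap                      # nominal upper bound
buckets$share <- round(buckets$count / sum(buckets$count), 3)

# Bucket with the most records.
busiest <- buckets[which.max(buckets$count), ]
print(buckets)
print(busiest)
```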

Search for and highlight the term alcohol in the abstract field only

(out <- highplos(q='alcohol', hl.fl = 'abstract', rows=3))
#> $`10.1371/journal.pmed.0040151`
#> $`10.1371/journal.pmed.0040151`$abstract
#> [1] "Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting"
#> 
#> 
#> $`10.1371/journal.pone.0027752`
#> $`10.1371/journal.pone.0027752`$abstract
#> [1] "Background: The negative influences of <em>alcohol</em> on TB management with regard to delays in seeking"
#> 
#> 
#> $`10.1371/journal.pmed.0050108`
#> $`10.1371/journal.pmed.0050108`$abstract
#> [1] " study that links retail <em>alcohol</em> sales and violent assaults.\n      "

And you can browse the results in your default browser

highbrow(out)
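
The highlighted fragments mark matches with <em> tags. If plain text is needed instead, the tags can be stripped with base R. This is a minimal sketch on a hand-typed copy of two fragments from the output above; strip_em() is a hypothetical helper, not an rplos function.

```r
# Hand-typed copy of two highlight fragments from the output above.
hl <- list(
  "10.1371/journal.pmed.0040151" = list(
    abstract = "Background: <em>Alcohol</em> consumption causes an estimated 4% of the global disease burden, prompting"
  ),
  "10.1371/journal.pmed.0050108" = list(
    abstract = " study that links retail <em>alcohol</em> sales and violent assaults.\n      "
  )
)

# Hypothetical helper: remove <em> tags and trim surrounding whitespace.
strip_em <- function(x) trimws(gsub("</?em>", "", x))

# One plain-text fragment per DOI, as a named character vector.
plain <- vapply(hl, function(doc) strip_em(doc$abstract), character(1))
print(plain)
```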

Simple function to get full text urls for a DOI

full_text_urls(doi='10.1371/journal.pone.0086169')
#> [1] "http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0086169&representation=XML"
(out <- plos_fulltext(doi='10.1371/journal.pone.0086169'))
#> 1 full-text articles retrieved 
#> Min. Length: 110717 - Max. Length: 110717 
#> DOIs: 10.1371/journal.pone.0086169 ... 
#> 
#> NOTE: extract xml strings like output['<doi>']

Then parse the XML any way you like, here getting the abstract

library("XML")
xpathSApply(xmlParse(out$`10.1371/journal.pone.0086169`), "//abstract", xmlValue)
#> [1] "Mammalian females pay high energetic costs for reproduction, the greatest of which is imposed by lactation. The synthesis of milk requires, in part, the mobilization of bodily reserves to nourish developing young. Numerous hypotheses have been advanced to predict how mothers will differentially invest in sons and daughters, however few studies have addressed sex-biased milk synthesis. Here we leverage the dairy cow model to investigate such phenomena. Using 2.39 million lactation records from 1.49 million dairy cows, we demonstrate that the sex of the fetus influences the capacity of the mammary gland to synthesize milk during lactation. Cows favor daughters, producing significantly more milk for daughters than for sons across lactation. Using a sub-sample of this dataset (N = 113,750 subjects) we further demonstrate that the effects of fetal sex interact dynamically across parities, whereby the sex of the fetus being gestated can enhance or diminish the production of milk during an established lactation. Moreover the sex of the fetus gestated on the first parity has persistent consequences for milk synthesis on the subsequent parity. Specifically, gestation of a daughter on the first parity increases milk production by ∼445 kg over the first two lactations. Our results identify a dramatic and sustained programming of mammary function by offspring in utero. Nutritional and endocrine conditions in utero are known to have pronounced and long-term effects on progeny, but the ways in which the progeny has sustained physiological effects on the dam have received little attention to date."

There is a series of convenience functions for searching within sections of articles.

  • plosauthor()
  • plosabstract()
  • plosfigtabcaps()
  • plostitle()
  • plossubject()

For example:

plossubject(q='marine ecology',  fl = c('id','journal'), limit = 10)
#> $meta
#>   numFound start maxScore
#> 1     3050     0       NA
#> 
#> $data
#>                                                    id  journal
#> 1                        10.1371/journal.pone.0021810 PLoS ONE
#> 2                  10.1371/journal.pone.0021810/title PLoS ONE
#> 3               10.1371/journal.pone.0021810/abstract PLoS ONE
#> 4             10.1371/journal.pone.0021810/references PLoS ONE
#> 5                   10.1371/journal.pone.0021810/body PLoS ONE
#> 6           10.1371/journal.pone.0021810/introduction PLoS ONE
#> 7  10.1371/journal.pone.0021810/materials_and_methods PLoS ONE
#> 8                        10.1371/journal.pone.0092590 PLoS ONE
#> 9           10.1371/journal.pone.0092590/introduction PLoS ONE
#> 10                 10.1371/journal.pone.0092590/title PLoS ONE

However, you can always just do this in searchplos() like searchplos(q = "subject:science"). See also the fq parameter. The above convenience functions are simply wrappers around searchplos, so they take all the same parameters.
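
The plossubject() output above mixes whole-article DOIs with section-level records such as .../title and .../abstract. Filtering on the server with fq = 'doc_type:full', as in earlier examples, avoids this. As an alternative sketch, the section records can be dropped after the fact with a regular expression. The pattern below is an assumption based on the id formats printed above.

```r
# Hand-typed subset of the id column from the output above.
ids <- c(
  "10.1371/journal.pone.0021810",
  "10.1371/journal.pone.0021810/title",
  "10.1371/journal.pone.0021810/abstract",
  "10.1371/journal.pone.0092590",
  "10.1371/journal.pone.0092590/introduction"
)

# Section-level records carry a "/section" suffix after the article DOI.
is_section <- grepl("^10\\.1371/journal\\.[a-z]+\\.[0-9]+/", ids)
full_ids <- ids[!is_section]
print(full_ids)
```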

Search by article views

Search with term marine ecology, by field subject, and limit to 5 results

plosviews(search='marine ecology', byfield='subject', limit=5)
#>                             id counter_total_all
#> 3 10.1371/journal.pone.0080365              2003
#> 1 10.1371/journal.pone.0021810              2327
#> 5 10.1371/journal.pone.0149852              4162
#> 4 10.1371/journal.pone.0031775              4805
#> 2 10.1371/journal.pone.0092590              7183

Visualize word use across articles

plosword(list('monkey','Helianthus','sunflower','protein','whale'), vis = 'TRUE')
#> $table
#>   No_Articles       Term
#> 1       10480     monkey
#> 2         428 Helianthus
#> 3        1171  sunflower
#> 4      117345    protein
#> 5        1370      whale
#> 
#> $plot


This package is part of a richer suite called fulltext, along with several other packages, that provides the ability to search for and retrieve full text of open access scholarly articles. We recommend using fulltext as the primary R interface to rplos unless your needs are limited to this single source.


News

  • URLs to full text XML have been changed - old URLs were working but were going through 2 302 redirects to get there. Updated URLs. (#107)
  • Fixed content-type check for plos_fulltext() function. XML can be either application/xml or text/xml (#108)
  • Added notes to documentation for relevant functions for how to do phrase searching. (#96) (#97) thanks @poldham
  • Removed the random parameter from the citations() function as it's no longer available in the API (#103)
  • Swapped out all uses of dplyr::rbind_all() for dplyr::bind_rows() (#105)
  • full_text_urls() now gives back NA when DOIs for annotations are given, which can be easily removed.
  • Fixed full_text_urls() function to create full text URLs for PLOS Clinical Trials correctly (#104)
  • move ggplot2 from Depends to Imports, and using @importFrom for ggplot2 functions, now all imports are using @importFrom (#99)
  • Fixes for httr::content() to parse manually, and use explicit encoding of UTF-8 (#102)
  • Change solr dependency to require version v0.1.6 or less (#94)
  • More tests added (#94)
  • Fix encoding in parsing of XML data in plos_fulltext() to avoid unicode problems (#93)
  • Now importing non-Base R functions from utils, stats, and methods packages (#90)
  • Fixes for httr v1 that broke rplos when length 0 list passed to query parameter (#89)
  • Added vignettes/figure to .Rbuildignore as requested by CRAN admin (#87)
  • API key no longer required (#86)
  • searchplos() now returns a list of length two, meta and data, and meta is a data.frame of metadata for the search.
  • Switched from CC0 to MIT license.
  • No longer importing libraries RCurl, data.table, googleVis, assertthat, RJSONIO, and stringr (#79) (#82) (#84)
  • Now importing dplyr.
  • Moved jsonlite from Suggests to Imports. Replaces use of RJSONIO. (#80)
  • crossref() now defunct. See package rcrossref https://github.com/ropensci/rcrossref. (#83)
  • highplos() now uses solr::solr_highlight() to do highlight searches.
  • searchplos(), plosabstract(), and other functions that wrap searchplos() now use ... to pass in curl options to httr::GET(). You'll now get an error on using callopts parameter.
  • Added manual file entry for the dataset isocodes.
  • Reworked both plosword() and plot_throughtime() to have far less code, uses httr now instead of RCurl, but to the user, everything should be the same.
  • Made documentation more clear on discrepancy between PLOS website behavior and rplos behavior, and how to make them match, or match more closely (#76)
  • Added package level man file to allow ?rplos to go to help page.
  • Removed some examples from searchplos() that are now not working for some unknown reason. (#81)
  • Previously, when the user set limit=0, we still gave back data. This is fixed: the meta slot is now given back, and the data slot gives an NA (#85)
  • Fixed some broken tests.
  • Errors from the data provider are reported now. At least we attempt to do so when they are given, for example if you specify asc or desc incorrectly with the sort parameter. See the check_response() function for examples.
  • New functions facetplos() and highplos() using the solr R wrapper to the Solr indexing engine. The PLOS API just exposes the Solr endpoints, so we can use the general Solr wrapper package solr to allow more flexible Solr searching.
  • New function highbrow() to visualize highlighting results in a browser.
  • New function plos_fulltext() to get full text xml of PLOS articles. Helper function full_text_urls() constructs the URL's for full text xml.
  • Fixed bug in tests where we forgot to give a key. No key is required per se, but PLOS encourages it so we prevent a call from happening without at least a dummy key.
  • Added function check_response() to check responses from the PLOS API, deals with capturing server error messages, and checking for correct content type, etc.
  • Removed function crossref_r() as we are working on a package for the CrossRef API.
  • Parameter arguments in searchplos(), plosauthor(), plosfigtabcaps(), plossubject(), and plostitle() were changed to match closer the Solr parameter names. terms to q. fields to fl. toquery to fq.
  • Multiple values passed to fields
  • returndf parameter is gone from searchplos(), plosauthor(), plosfigtabcaps(), plossubject(), and plostitle(). You can easily get raw JSON, etc. data using the solr package.
  • Now using httr instead of RCurl in plosviews() function.
  • All search functions (searchplos(), plosabstract(), plosauthor(), plosfigtabcaps(), plossubject(), and plostitle()) gain highlighting argument, setting to TRUE (default=FALSE) returns matching sentence fragments that were matched. NOTE that if highlighting=TRUE the output can be a list of data.frame's if returndf=TRUE, with two named elements 'data' and 'highlighting', or a list of lists if returndf=FALSE.
  • All search functions (searchplos(), plosabstract(), plosauthor(), plosfigtabcaps(), plossubject(), and plostitle()) gain sort argument. You can pass a field to sort by even if you don't return that field in the output, e.g., sort='counter_total_month desc'.
  • A tiny function parsehighlight() added to parse out html code from highlighting output.
  • Some examples in docs didn't work - fixed them.
  • Fixed bug in searchplos() that was causing elements of a return field to cause failure because they were longer than 1 (e.g., authors). Now concatenating elements of length > 1.
  • Fixed bug in searchplos() that was causing elements of length 0 to cause failure. Now removing elements of length 0.
  • Fixed parsehighlight function to return NA if highlighting return of length 0.
  • Fixed broken test for plosauthor(), plosabstract(), and plot_throughtime().
  • Added httr::stop_for_status() calls to a few functions to give informative http status errors when they happen
  • Fixed bug in plot_throughtime() that was throwing errors and preventing fxn from working, thanks to Ben Bolker for the fix.
  • Simplified code in many functions to have cleaner and simpler code.
  • ... parameter in many functions changed to callopts=list(), which passes in curl options to a call to either RCurl::getForm() or httr::GET()
  • Fixed bug in function plosviews() that caused errors in some calls. Now forces full document searches, so that you get views data back for full papers only, not sections of papers. See package alm (https://github.com/ropensci/alm) for more in depth PLOS article-level metrics.
  • All functions for interacting with the PLOS ALM (altmetrics) API have been removed, and are now in a separate package called alm (http://github.com/ropensci/alm).
  • Convenience functions plosabstract, plosauthor, plosfigtabcaps, plossubject, and plostitle, that search specifically within those sections of papers now wrap searchplos, so they should behave the same way.
  • ldfast() fxn added as an attempt to do ldply faster
  • performance improvements in searchplos
  • Dependency on assertthat removed since it's not on CRAN.
  • Fixed namespace conflicts by importing only functions needed from some packages.
  • searchplos() now removes leading, trailing, and internal whitespace from character strings
  • remove alm*() functions so that this package now only wraps the PLoS search API.
  • The almdateupdated function has been deprecated - use almupdated instead.

  • The articlelength function has been deprecated - didn't see the usefulness any longer.

  • In general simplified and prettified code.

  • Changed from using RCurl to httr in many functions, but not all.

  • Added more examples for many functions.

  • Added three internal functions: concat_todf, addmissing, and getkey.

  • Added Karthik Ram as a package author.

  • All url arguments in functions put inside functions as they are not likely to change that often.

  • Fixed crossref function, and added more examples.

  • The alm function (previously almplosallviews) gains many new features: now allows up to 50 DOIs per call; you can specify the source you want to get alm data from as an argument; you can specify the year you want to get alm data from as an argument.

  • Added the plosfields data file to get all the possible fields to use in function calls.

  • almplosallviews changed to alm.

  • almplotallviews changed to almplot.

  • almevents added to specifically search and get detailed events data for a specific source or N sources.

  • crossref_r gets 20 random DOIs from Crossref.org.

  • Added package startup message.

  • journalnamekey function to get the short name keys for each PLoS Journal.

  • ALM functions (any functions starting with alm) received updated arguments/parameters according to the ALM API version 3.0 changes.

  • Added tests.

  • almplosallviews now outputs different output - two data.frames, one total metrics (summed across time), and history (for metrics for each time period specified in the search)

  • crossref function returns R's native bibtype format. See examples in crossref function documentation

  • almpub changed to almdatepub

  • changed help file rplos to help - use help('rplos') in R

  • changed URL from http://ropensci.org/ to https://github.com/ropensci/rplos

  • added sleep argument to plosallviews function to allow pauses between API calls when running plosallviews in a loop - this is an attempt to limit hitting the PLoS API too hard

  • various other fixes to functions

  • more examples added to some functions

  • added function journalnamekey to get short keys for journals to use in searching for specific journals

rplos 0.0-1

  • released to CRAN

0.6.4 by Scott Chamberlain, 10 months ago


https://github.com/ropensci/rplos


Report a bug at https://github.com/ropensci/rplos/issues


Browse source code at https://github.com/cran/rplos


Authors: Scott Chamberlain [aut, cre], Carl Boettiger [aut], Karthik Ram [aut]



Task views: Web Technologies and Services


MIT + file LICENSE license


Imports ggplot2, httr, jsonlite, dplyr, plyr, lubridate, reshape2, whisker, solr

Suggests XML, testthat, knitr, covr

Enhances tm


Imported by fulltext.

