General Purpose R Interface to 'Solr'

Provides a set of functions for querying and parsing data from 'Solr' (<http://lucene.apache.org/solr>) 'endpoints' (local and remote), including search, 'faceting', 'highlighting', 'stats', and 'more like this'. In addition, some functionality is included for creating, deleting, and updating documents in a 'Solr' 'database'.



A general purpose R interface to Solr

Development now targets Solr v7 and greater, which introduced many changes; as a result, many functions here may not work with Solr installations older than v7.

Be aware that some functions currently work only in certain Solr modes; e.g., collection_create() won't work unless you are in SolrCloud mode. You should, however, get an error message stating that.

Note that we recently changed the package name to solrium. A previous version of this package is on CRAN as solr; the next version will be up as solrium.

Package API and ways of using the package

The first thing to look at is SolrClient to instantiate a client connection to your Solr instance. ping and schema are helpful functions to look at after instantiating your client.

There are two ways to use solrium:

  1. Call functions on the SolrClient object
  2. Pass the SolrClient object to functions

For example, if we instantiate a client with conn <- SolrClient$new(), then the first way is conn$search(...) and the second is solr_search(conn, ...). These two ways of using the package hopefully make it friendly to more people: those who prefer an object-oriented approach and those who prefer a functional one.
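A minimal sketch of the two approaches side by side (illustrative; uses the public PLOS Solr endpoint that appears in the Setup section below):

```r
library("solrium")

# instantiate a client (the PLOS search API, as in the Setup section)
conn <- SolrClient$new(host = "api.plos.org", path = "search", port = NULL)

# way 1: call methods on the SolrClient object
conn$search(params = list(q = "*:*", rows = 2, fl = "id"))

# way 2: pass the SolrClient object to standalone functions
solr_search(conn, params = list(q = "*:*", rows = 2, fl = "id"))
```

Both calls send the same request; pick whichever style fits your code.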

Collections

Functions that start with collection work with Solr collections when in SolrCloud mode. Note that these functions won't work when in Solr standard mode.

Cores

Functions that start with core work with Solr cores when in standard Solr mode. Note that these functions won't work when in SolrCloud mode.

Documents

The following functions work with documents in Solr:

  • add
  • delete_by_id
  • delete_by_query
  • update_atomic_json
  • update_atomic_xml
  • update_csv
  • update_json
  • update_xml
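A quick sketch of a typical document workflow (assumptions: a local Solr instance in standard mode with a core named "books"; the file name is illustrative):

```r
library("solrium")

# connect to a local Solr instance (default localhost settings)
conn <- SolrClient$new()

# add documents from an R data.frame
df <- data.frame(id = c(1, 2), title = c("apple", "banana"))
conn$add(df, name = "books")

# or add documents from a JSON file on disk (illustrative file name)
# conn$update_json(files = "mydocs.json", name = "books")

# delete by id, or by query
conn$delete_by_id(name = "books", ids = 1)
conn$delete_by_query(name = "books", query = "title:banana")
```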

Search

Search functions, including solr_parse for parsing results from the different functions appropriately:

  • solr_all
  • solr_facet
  • solr_get
  • solr_group
  • solr_highlight
  • solr_mlt
  • solr_parse
  • solr_search
  • solr_stats

Install

Stable version from CRAN

install.packages("solrium")

Or development version from GitHub

devtools::install_github("ropensci/solrium")
library("solrium")

Setup

Use SolrClient$new() to initialize your connection. These examples use a remote Solr server, but work on any local Solr server.

(cli <- SolrClient$new(host = "api.plos.org", path = "search", port = NULL))
#> <Solr Client>
#>   host: api.plos.org
#>   path: search
#>   port: 
#>   scheme: http
#>   errors: simple
#>   proxy:

You can also set whether you want simple or complete error messages (via errors), whether you want the URLs used in each function call printed (via verbose), and your proxy settings (via proxy) if needed. For example:

SolrClient$new(errors = "complete")

Your settings are printed in the print method for the connection object:

cli
#> <Solr Client>
#>   host: api.plos.org
#>   path: search
#>   port: 
#>   scheme: http
#>   errors: simple
#>   proxy:

For local Solr server setup:

bin/solr start -e cloud -noprompt
bin/post -c gettingstarted example/exampledocs/*.xml

Search

cli$search(params = list(q='*:*', rows=2, fl='id'))
#> # A tibble: 2 x 1
#>                                      id
#>                                   <chr>
#> 1    10.1371/journal.pone.0079536/title
#> 2 10.1371/journal.pone.0079536/abstract

Search grouped data

Most recent publication by journal

cli$group(params = list(q='*:*', group.field='journal', rows=5, group.limit=1,
                        group.sort='publication_date desc',
                        fl='publication_date, score'))
#>                         groupValue numFound start     publication_date
#> 1                         plos one  1572163     0 2017-11-01T00:00:00Z
#> 2 plos neglected tropical diseases    47510     0 2017-11-01T00:00:00Z
#> 3                    plos genetics    59871     0 2017-11-01T00:00:00Z
#> 4                   plos pathogens    53246     0 2017-11-01T00:00:00Z
#> 5                             none    63561     0 2012-10-23T00:00:00Z
#>   score
#> 1     1
#> 2     1
#> 3     1
#> 4     1
#> 5     1

First publication by journal

cli$group(params = list(q = '*:*', group.field = 'journal', group.limit = 1,
                        group.sort = 'publication_date asc',
                        fl = c('publication_date', 'score'),
                        fq = "publication_date:[1900-01-01T00:00:00Z TO *]"))
#>                          groupValue numFound start     publication_date
#> 1                          plos one  1572163     0 2006-12-20T00:00:00Z
#> 2  plos neglected tropical diseases    47510     0 2007-08-30T00:00:00Z
#> 3                    plos pathogens    53246     0 2005-07-22T00:00:00Z
#> 4        plos computational biology    45582     0 2005-06-24T00:00:00Z
#> 5                              none    57532     0 2005-08-23T00:00:00Z
#> 6              plos clinical trials      521     0 2006-04-21T00:00:00Z
#> 7                     plos genetics    59871     0 2005-06-17T00:00:00Z
#> 8                     plos medicine    23519     0 2004-09-07T00:00:00Z
#> 9                      plos medicin        9     0 2012-04-17T00:00:00Z
#> 10                     plos biology    32513     0 2003-08-18T00:00:00Z
#>    score
#> 1      1
#> 2      1
#> 3      1
#> 4      1
#> 5      1
#> 6      1
#> 7      1
#> 8      1
#> 9      1
#> 10     1

Search group query: last 3 publications of 2013

gq <- 'publication_date:[2013-01-01T00:00:00Z TO 2013-12-31T00:00:00Z]'
cli$group(
  params = list(q='*:*', group.query = gq,
                group.limit = 3, group.sort = 'publication_date desc',
                fl = 'publication_date'))
#>   numFound start     publication_date
#> 1   307076     0 2013-12-31T00:00:00Z
#> 2   307076     0 2013-12-31T00:00:00Z
#> 3   307076     0 2013-12-31T00:00:00Z

Search group with format simple

cli$group(params = list(q='*:*', group.field='journal', rows=5,
                        group.limit=3, group.sort='publication_date desc',
                        group.format='simple', fl='journal, publication_date'))
#>   numFound start     publication_date  journal
#> 1  1898495     0 2012-10-23T00:00:00Z     <NA>
#> 2  1898495     0 2012-10-23T00:00:00Z     <NA>
#> 3  1898495     0 2012-10-23T00:00:00Z     <NA>
#> 4  1898495     0 2017-11-01T00:00:00Z PLOS ONE
#> 5  1898495     0 2017-11-01T00:00:00Z PLOS ONE

Facet

cli$facet(params = list(q='*:*', facet.field='journal', facet.query=c('cell', 'bird')))
#> $facet_queries
#> # A tibble: 2 x 2
#>    term  value
#>   <chr>  <int>
#> 1  cell 157652
#> 2  bird  16385
#> 
#> $facet_fields
#> $facet_fields$journal
#> # A tibble: 9 x 2
#>                               term   value
#>                              <chr>   <chr>
#> 1                         plos one 1572163
#> 2                    plos genetics   59871
#> 3                   plos pathogens   53246
#> 4 plos neglected tropical diseases   47510
#> 5       plos computational biology   45582
#> 6                     plos biology   32513
#> 7                    plos medicine   23519
#> 8             plos clinical trials     521
#> 9                     plos medicin       9
#> 
#> 
#> $facet_pivot
#> NULL
#> 
#> $facet_dates
#> NULL
#> 
#> $facet_ranges
#> NULL

Highlight

cli$highlight(params = list(q='alcohol', hl.fl = 'abstract', rows=2))
#> # A tibble: 2 x 2
#>                          names
#>                          <chr>
#> 1 10.1371/journal.pone.0185457
#> 2 10.1371/journal.pone.0071284
#> # ... with 1 more variables: abstract <chr>

Stats

out <- cli$stats(params = list(q='ecology', stats.field=c('counter_total_all','alm_twitterCount'), stats.facet='journal'))
out$data
#>                   min    max count missing       sum sumOfSquares
#> counter_total_all   0 920716 40497       0 219020039 7.604567e+12
#> alm_twitterCount    0   3401 40497       0    281128 7.300081e+07
#>                          mean      stddev
#> counter_total_all 5408.302813 12591.07462
#> alm_twitterCount     6.941946    41.88646

More like this

solr_mlt is a function to return documents similar to the one you search on ("more like this")

out <- cli$mlt(params = list(q='title:"ecology" AND body:"cell"', mlt.fl='title', mlt.mindf=1, mlt.mintf=1, fl='counter_total_all', rows=5))
out$docs
#> # A tibble: 5 x 2
#>                             id counter_total_all
#>                          <chr>             <int>
#> 1 10.1371/journal.pbio.1001805             21824
#> 2 10.1371/journal.pbio.0020440             25424
#> 3 10.1371/journal.pbio.1002559              9746
#> 4 10.1371/journal.pone.0087217             11502
#> 5 10.1371/journal.pbio.1002191             22013
out$mlt
#> $`10.1371/journal.pbio.1001805`
#> # A tibble: 5 x 4
#>   numFound start                           id counter_total_all
#>      <int> <int>                        <chr>             <int>
#> 1     3822     0 10.1371/journal.pone.0098876              3590
#> 2     3822     0 10.1371/journal.pone.0082578              2893
#> 3     3822     0 10.1371/journal.pone.0102159              2028
#> 4     3822     0 10.1371/journal.pcbi.1002652              3819
#> 5     3822     0 10.1371/journal.pcbi.1003408              9920
#> 
#> $`10.1371/journal.pbio.0020440`
#> # A tibble: 5 x 4
#>   numFound start                           id counter_total_all
#>      <int> <int>                        <chr>             <int>
#> 1     1115     0 10.1371/journal.pone.0162651              2828
#> 2     1115     0 10.1371/journal.pone.0003259              3225
#> 3     1115     0 10.1371/journal.pntd.0003377              4267
#> 4     1115     0 10.1371/journal.pone.0101568              4603
#> 5     1115     0 10.1371/journal.pone.0068814              9042
#> 
#> $`10.1371/journal.pbio.1002559`
#> # A tibble: 5 x 4
#>   numFound start                           id counter_total_all
#>      <int> <int>                        <chr>             <int>
#> 1     5482     0 10.1371/journal.pone.0155989              2519
#> 2     5482     0 10.1371/journal.pone.0023086              8442
#> 3     5482     0 10.1371/journal.pone.0155028              1547
#> 4     5482     0 10.1371/journal.pone.0041684             22057
#> 5     5482     0 10.1371/journal.pone.0164330               969
#> 
#> $`10.1371/journal.pone.0087217`
#> # A tibble: 5 x 4
#>   numFound start                           id counter_total_all
#>      <int> <int>                        <chr>             <int>
#> 1     4576     0 10.1371/journal.pone.0175497              1088
#> 2     4576     0 10.1371/journal.pone.0159131              4937
#> 3     4576     0 10.1371/journal.pcbi.0020092             24786
#> 4     4576     0 10.1371/journal.pone.0133941              1336
#> 5     4576     0 10.1371/journal.pone.0131665              1207
#> 
#> $`10.1371/journal.pbio.1002191`
#> # A tibble: 5 x 4
#>   numFound start                           id counter_total_all
#>      <int> <int>                        <chr>             <int>
#> 1    12585     0 10.1371/journal.pbio.1002232              3055
#> 2    12585     0 10.1371/journal.pone.0070448              2203
#> 3    12585     0 10.1371/journal.pone.0131700              2493
#> 4    12585     0 10.1371/journal.pone.0121680              4980
#> 5    12585     0 10.1371/journal.pone.0041534              5701

Parsing

solr_parse is a general purpose parser function with extension methods solr_parse.sr_search, solr_parse.sr_facet, and solr_parse.sr_high, for parsing solr_search, solr_facet, and solr_highlight function output, respectively. solr_parse is used internally within those three functions to do parsing. You can optionally get back raw JSON or XML from solr_search, solr_facet, and solr_highlight by setting the parameter raw=TRUE, then parse after the fact with solr_parse.

For example:

(out <- cli$highlight(params = list(q='alcohol', hl.fl = 'abstract', rows=2),
                      raw=TRUE))
#> [1] "{\"response\":{\"numFound\":25987,\"start\":0,\"maxScore\":4.705177,\"docs\":[{\"id\":\"10.1371/journal.pone.0185457\",\"journal\":\"PLOS ONE\",\"eissn\":\"1932-6203\",\"publication_date\":\"2017-09-28T00:00:00Z\",\"article_type\":\"Research Article\",\"author_display\":[\"Jacqueline Willmore\",\"Terry-Lynne Marko\",\"Darcie Taing\",\"Hugues Sampasa-Kanyinga\"],\"abstract\":[\"Objectives: Alcohol-related morbidity and mortality are significant public health issues. The purpose of this study was to describe the prevalence and trends over time of alcohol consumption and alcohol-related morbidity and mortality; and public attitudes of alcohol use impacts on families and the community in Ottawa, Canada. Methods: Prevalence (2013–2014) and trends (2000–2001 to 2013–2014) of alcohol use were obtained from the Canadian Community Health Survey. Data on paramedic responses (2015), emergency department (ED) visits (2013–2015), hospitalizations (2013–2015) and deaths (2007–2011) were used to quantify the acute and chronic health effects of alcohol in Ottawa. Qualitative data were obtained from the “Have Your Say” alcohol survey, an online survey of public attitudes on alcohol conducted in 2016. Results: In 2013–2014, an estimated 595,300 (83%) Ottawa adults 19 years and older drank alcohol, 42% reported binge drinking in the past year. Heavy drinking increased from 15% in 2000–2001 to 20% in 2013–2014. In 2015, the Ottawa Paramedic Service responded to 2,060 calls directly attributable to alcohol. Between 2013 and 2015, there were an average of 6,100 ED visits and 1,270 hospitalizations per year due to alcohol. Annually, alcohol use results in at least 140 deaths in Ottawa. Men have higher rates of alcohol-attributable paramedic responses, ED visits, hospitalizations and deaths than women, and young adults have higher rates of alcohol-attributable paramedic responses. 
Qualitative data of public attitudes indicate that alcohol misuse has greater repercussions not only on those who drink, but also on the family and community. Conclusions: Results highlight the need for healthy public policy intended to encourage a culture of drinking in moderation in Ottawa to support lower risk alcohol use, particularly among men and young adults. \"],\"title_display\":\"The burden of alcohol-related morbidity and mortality in Ottawa, Canada\",\"score\":4.705177},{\"id\":\"10.1371/journal.pone.0071284\",\"journal\":\"PLoS ONE\",\"eissn\":\"1932-6203\",\"publication_date\":\"2013-08-20T00:00:00Z\",\"article_type\":\"Research Article\",\"author_display\":[\"Petra Suchankova\",\"Pia Steensland\",\"Ida Fredriksson\",\"Jörgen A. Engel\",\"Elisabet Jerlhag\"],\"abstract\":[\"\\nAlcohol dependence is a heterogeneous disorder where several signalling systems play important roles. Recent studies implicate that the gut-brain hormone ghrelin, an orexigenic peptide, is a potential mediator of alcohol related behaviours. Ghrelin increases whereas a ghrelin receptor (GHS-R1A) antagonist decreases alcohol consumption as well as operant self-administration of alcohol in rodents that have consumed alcohol for twelve weeks. In the present study we aimed at investigating the effect of acute and repeated treatment with the GHS-R1A antagonist JMV2959 on alcohol intake in a group of rats following voluntarily alcohol consumption for two, five and eight months. After approximately ten months of voluntary alcohol consumption the expression of the GHS-R1A gene (Ghsr) as well as the degree of methylation of a CpG island found in Ghsr was examined in reward related brain areas. In a separate group of rats, we examined the effect of the JMV2959 on alcohol relapse using the alcohol deprivation paradigm. Acute JMV2959 treatment was found to decrease alcohol intake and the effect was more pronounced after five, compared to two months of alcohol exposure. 
In addition, repeated JMV2959 treatment decreased alcohol intake without inducing tolerance or rebound increase in alcohol intake after the treatment. The GHS-R1A antagonist prevented the alcohol deprivation effect in rats. There was a significant down-regulation of the Ghsr expression in the ventral tegmental area (VTA) in high- compared to low-alcohol consuming rats after approximately ten months of voluntary alcohol consumption. Further analysis revealed a negative correlation between Ghsr expression in the VTA and alcohol intake. No differences in methylation degree were found between high- compared to low-alcohol consuming rats. These findings support previous studies showing that the ghrelin signalling system may constitute a potential target for development of novel treatment strategies for alcohol dependence.\\n\"],\"title_display\":\"Ghrelin Receptor (GHS-R1A) Antagonism Suppresses Both Alcohol Consumption and the Alcohol Deprivation Effect in Rats following Long-Term Voluntary Alcohol Consumption\",\"score\":4.7050986}]},\"highlighting\":{\"10.1371/journal.pone.0185457\":{\"abstract\":[\"Objectives: <em>Alcohol</em>-related morbidity and mortality are significant public health issues\"]},\"10.1371/journal.pone.0071284\":{\"abstract\":[\"\\n<em>Alcohol</em> dependence is a heterogeneous disorder where several signalling systems play important\"]}}}\n"
#> attr(,"class")
#> [1] "sr_high"
#> attr(,"wt")
#> [1] "json"

Then parse

solr_parse(out, 'df')
#> # A tibble: 2 x 2
#>                          names
#>                          <chr>
#> 1 10.1371/journal.pone.0185457
#> 2 10.1371/journal.pone.0071284
#> # ... with 1 more variables: abstract <chr>

Advanced: Function Queries

Function queries allow you to query on actual numeric fields in the Solr database, and do addition, multiplication, etc. on one or many fields to sort results. For example, here we sort on the product of counter_total_all and alm_twitterCount, using the temporary field "_val_":

cli$search(params = list(q='_val_:"product(counter_total_all,alm_twitterCount)"',
  rows=5, fl='id,title', fq='doc_type:full'))
#> # A tibble: 5 x 2
#>                             id
#>                          <chr>
#> 1 10.1371/journal.pmed.0020124
#> 2 10.1371/journal.pone.0141854
#> 3 10.1371/journal.pone.0073791
#> 4 10.1371/journal.pone.0153419
#> 5 10.1371/journal.pone.0115069
#> # ... with 1 more variables: title <chr>

Here, we search for the papers with the most citations

cli$search(params = list(q='_val_:"max(counter_total_all)"',
    rows=5, fl='id,counter_total_all', fq='doc_type:full'))
#> # A tibble: 5 x 2
#>                                                        id
#>                                                     <chr>
#> 1                            10.1371/journal.pmed.0020124
#> 2 10.1371/annotation/80bd7285-9d2d-403a-8e6f-9c375bf977ca
#> 3                            10.1371/journal.pcbi.1003149
#> 4                            10.1371/journal.pone.0141854
#> 5                            10.1371/journal.pcbi.0030102
#> # ... with 1 more variables: counter_total_all <int>

Or with the most tweets

cli$search(params = list(q='_val_:"max(alm_twitterCount)"',
    rows=5, fl='id,alm_twitterCount', fq='doc_type:full'))
#> # A tibble: 5 x 2
#>                             id alm_twitterCount
#>                          <chr>            <int>
#> 1 10.1371/journal.pone.0141854             3401
#> 2 10.1371/journal.pmed.0020124             3207
#> 3 10.1371/journal.pone.0115069             2873
#> 4 10.1371/journal.pmed.1001953             2821
#> 5 10.1371/journal.pone.0061981             2392

Using specific data sources

USGS BISON service

The occurrences service

conn <- SolrClient$new(scheme = "https", host = "bison.usgs.gov", path = "solr/occurrences/select", port = NULL)
conn$search(params = list(q = '*:*', fl = c('decimalLatitude','decimalLongitude','scientificName'), rows = 2))
#> # A tibble: 2 x 3
#>   decimalLongitude         scientificName decimalLatitude
#>              <dbl>                  <chr>           <dbl>
#> 1        -116.5694 Zonotrichia leucophrys        34.05072
#> 2        -116.5694    Tyrannus vociferans        34.05072

The species names service

conn <- SolrClient$new(scheme = "https", host = "bison.usgs.gov", path = "solr/scientificName/select", port = NULL)
conn$search(params = list(q = '*:*'))
#> # A tibble: 10 x 2
#>                scientificName  `_version_`
#>                         <chr>        <dbl>
#>  1 Dictyopteris polypodioides 1.565325e+18
#>  2           Lonicera iberica 1.565325e+18
#>  3            Epuraea ambigua 1.565325e+18
#>  4   Pseudopomala brachyptera 1.565325e+18
#>  5    Didymosphaeria populina 1.565325e+18
#>  6                   Sanoarca 1.565325e+18
#>  7     Celleporina ventricosa 1.565325e+18
#>  8         Trigonurus crotchi 1.565325e+18
#>  9       Ceraticelus laticeps 1.565325e+18
#> 10           Micraster acutus 1.565325e+18

PLOS Search API

Most of the examples above use the PLOS search API... :)

Solr server management

This isn't as complete as the search functions shown above, but we're getting there.

Cores

conn <- SolrClient$new()

Many functions, e.g.:

  • core_create()
  • core_rename()
  • core_status()
  • ...

Create a core

conn$core_create(name = "foo_bar")

Collections

Many functions, e.g.:

  • collection_create()
  • collection_list()
  • collection_addrole()
  • ...

Create a collection

conn$collection_create(name = "hello_world")

Add documents

Add documents; supports adding from files (JSON, XML, or CSV format) and from R objects (data.frame and list types so far):

df <- data.frame(id = c(67, 68), price = c(1000, 500000000))
conn$add(df, name = "books")
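Adding from a list works similarly (a sketch; assumes the same "books" core as above):

```r
# each document is a named list
ss <- list(
  list(id = 69, price = 100),
  list(id = 70, price = 500)
)
conn$add(ss, name = "books")
```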

Delete documents, by id

conn$delete_by_id(name = "books", ids = c(3, 4))

Or by query

conn$delete_by_query(name = "books", query = "manu:bank")

Meta

  • Please report any issues or bugs
  • License: MIT
  • Get citation information for solrium in R doing citation(package = 'solrium')
  • Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.


News

solrium 1.0.0

This is v1, indicating breaking changes from the previous version!

NEW FEATURES

  • The package has been reworked to allow control over which parameters are sent as query parameters and which in the request body. If only query parameters are given, we make a GET request; if any body parameters are given (even alongside query parameters), we make a POST request. This means that all solr_* functions have more or less the same parameters, and you now pass query parameters to params and body parameters to body. This definitely breaks previous code, apologies for that, but the bump in major version is a big indicator of the breakage.
  • As part of the overhaul, moved to using an R6 setup for the Solr connection object. The connection object deals with connection details, and you can call all methods on the object created. Additionally, you can simply pass the connection object to standalone methods. This change means you can create connection objects to >1 Solr instance, so you can use many Solr instances in one R session. (#100)
  • Gains new functions update_atomic_json and update_atomic_xml for doing atomic updates (#97); thanks @yinghaoh
  • solr_search and solr_all gain attributes that include numFound, start, and maxScore (#94)
  • solr_search/solr_all/solr_mlt gain a new feature where we automatically check for and adjust the rows parameter for you if you allow us to. You can toggle this behavior, and you can set a minimum number of rows to be optimized with minOptimizedRows. See (#102) (#104) (#105) for discussion. Thanks @1havran
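A sketch of the params/body split described in the first bullet (connection details are illustrative; any parameter passed via body triggers a POST request):

```r
library("solrium")
conn <- SolrClient$new(host = "api.plos.org", path = "search", port = NULL)

# only query parameters: a GET request
solr_search(conn, params = list(q = "ecology", rows = 2, fl = "id"))

# body parameters given (even alongside params): a POST request
solr_search(conn, params = list(q = "ecology", rows = 2),
            body = list(fl = "id"))
```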

MINOR IMPROVEMENTS

  • Replaced httr with crul. Should only be noticeable with respect to specifying curl options (#98)
  • Added more tests (#56)
  • optimize renamed to solr_optimize (#107)
  • now solr_facet fails better when no facet.* fields given (#103)

BUG FIXES

  • Fixed solr_highlight parsing to data.frame bug (#109)

solrium 0.4.0

MINOR IMPROVEMENTS

  • Change dplyr::rbind_all() (deprecated) to dplyr::bind_rows() (#90)
  • Added additional examples of using pivot facetting to solr_facet() (#91)
  • Fix to solr_group() (#92)
  • Replaced dependency XML with xml2 (#57)
  • Added examples and tests for a few more public Solr instances (#30)
  • Now using tibble to give back compact data.frames
  • Namespaced all base package calls
  • Many changes to internal parsers to use xml2 instead of XML, and improvements

solrium 0.3.0

NEW FEATURES

  • released to CRAN


solrium 1.0.0 by Scott Chamberlain, 4 months ago

  • Homepage: https://github.com/ropensci/solrium
  • Report a bug at https://github.com/ropensci/solrium/issues
  • Browse source code at https://github.com/cran/solrium
  • Authors: Scott Chamberlain [aut, cre]
  • License: MIT + file LICENSE
  • Imports: utils, dplyr, plyr, crul, xml2, jsonlite, tibble, R6
  • Suggests: roxygen2, testthat, knitr
  • Imported by: rdatacite, rdryad, ritis, rplos