Interface to the Geoparser.io API for Identifying and Disambiguating Places Mentioned in Text

A wrapper for the Geoparser.io API version 0.4.0 (see <https://geoparser.io/>), which is a web service that identifies places mentioned in text, disambiguates those places, and returns detailed data about the places found in the text. Basic, limited API access is free, with paid plans to accommodate larger workloads.


geoparser

This package is an interface to the geoparser.io API that identifies places mentioned in text, disambiguates those places, and returns data about the places found in the text.

Installation

To install the package, you will need the devtools package.

library("devtools")
install_github("ropenscilabs/geoparser")

To get an API key, you need to register at https://geoparser.io/pricing.html. With a hobbyist account, you can make up to 1,000 calls a month to the API. Please note that the API is currently in beta and thus totally free! For ease of use, save your API key as an environment variable as described at https://stat545-ubc.github.io/bit003_api-key-env-var.html.

The package will conveniently look for your API key using Sys.getenv("GEOPARSER_KEY") so if your API key is an environment variable called "GEOPARSER_KEY" you don't need to input it manually.
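As a minimal sketch, the key can be set and checked from R before calling the API. Only base R functions are used; the key value is a placeholder, and the `key_is_set` variable is just an illustrative name.

```r
# Set the key for the current session only (placeholder value, not a real key).
# For a permanent setup, a line such as GEOPARSER_KEY=your-key can go in ~/.Renviron.
Sys.setenv(GEOPARSER_KEY = "your-api-key-here")

# Sys.getenv() returns "" when the variable is unset, so nzchar() tells us
# whether a key is available.
key <- Sys.getenv("GEOPARSER_KEY")
key_is_set <- nzchar(key)
if (!key_is_set) {
  message("GEOPARSER_KEY is not set; see the instructions above.")
}
```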

What is geoparsing?

According to Wikipedia, geoparsing is the process of converting free-text descriptions of places (such as "Springfield") into unambiguous geographic identifiers (such as lat-lon coordinates). A geoparser is a tool that helps in this process. Geoparsing goes beyond geocoding in that, rather than analyzing structured location references like mailing addresses and numerical coordinates, geoparsing handles ambiguous place names in unstructured text.

Geoparser.io works best on complete sentences in English. If you have a very short text, such as a partial address like "Auckland New Zealand," you probably want to use a geocoder tool instead of a geoparser. In R, you can use the opencage package for geocoding!

How to use the package

You need to input a text whose size is less than 8KB.

library("geoparser")
output <- geoparser_q("I was born in Vannes and I live in Barcelona")
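The size limit can be checked before sending anything. The sketch below assumes that "8KB" means 8 * 1024 bytes, which the source does not spell out; `fits_size_limit` is a hypothetical helper, not part of the package.

```r
# Hypothetical helper: TRUE for each text that fits under the size limit.
# Assumes "8KB" means 8 * 1024 bytes. nchar(type = "bytes") counts bytes,
# so multi-byte characters are accounted for.
fits_size_limit <- function(texts, limit_bytes = 8 * 1024) {
  nchar(texts, type = "bytes") < limit_bytes
}

texts <- c("I was born in Vannes and I live in Barcelona",
           strrep("a", 9000))
ok <- fits_size_limit(texts)
```

Texts for which `ok` is `FALSE` would need to be split or shortened before being passed to `geoparser_q`.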

The output is a list of 2 data.frames (dplyr::tbl_dfs). The first one is called properties and contains

  • apiVersion, the API version

  • source, the source of the results

  • id, the id of the query

  • text_md5, the MD5 hash of the text that was sent to the API.

``` r
output$properties
```

```
##   apiVersion       source                    id
## *     <fctr>       <fctr>                <fctr>
## 1      0.4.0 geoparser.io 4Wdx9A9sOQ3ecpwd7bY7l
## # ... with 1 more variables: text_md5 <chr>
```

The second data.frame contains the results and is called results:

``` r
knitr::kable(output$results)
```

| country | confidence | name | admin1 | type | geometry.type | longitude | latitude | reference1 | reference2 | text_md5 |
|:---|---:|:---|:---|:---|:---|---:|---:|---:|---:|:---|
| FR | 1 | Vannes | A2 | seat of a second-order administrative division | Point | -2.75000 | 47.66667 | 14 | 20 | 51e05aeb3366e55795a9729dd74ae901 |
| ES | 1 | Barcelona | 56 | seat of a first-order administrative division | Point | 2.15899 | 41.38879 | 35 | 44 | 51e05aeb3366e55795a9729dd74ae901 |

  • country is the ISO-3166 2-letter country code for the country in which this place is located, or NULL for features outside any sovereign territory.

  • confidence is a confidence score produced by the place name disambiguation algorithm. Currently returns a placeholder value; subject to change.

  • name is the best name for the specified location, with a preference for official/short name forms (e.g., "New York" over "NYC," and "California" over "State of California"), which may be different from exactly what appears in the text.

  • admin1 is a code representing the state/province-level administrative division containing this place. (From GeoNames.org: "Most adm1 are FIPS codes. ISO codes are used for US, CH, BE and ME. UK and Greece are using an additional level between country and fips code. The code '00' stands for general features where no specific adm1 code is defined.").

  • type is a text description of the geographic feature type — see <GeoNames.org> for a complete list. Subject to change.

  • geometry.type is the type of the geographical feature, e.g. "Point".

  • longitude is the longitude.

  • latitude is the latitude.

  • reference1 is the start of the place reference (index of its first character). Each reference to this place name found in the input text is on its own line.

  • reference2 is the end of the place reference (index of the first character after it). Each reference to this place name found in the input text is on its own line.

  • text_md5 is the MD5 hash of the text that was sent to the API.
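The offsets in the first example are consistent with reference1 being zero-based and reference2 being exclusive. As a sketch under that reading (and assuming the offsets count characters rather than bytes), the mentions can be cut back out of the input text with base R:

```r
text <- "I was born in Vannes and I live in Barcelona"

# Offsets copied from the results of the first example.
refs <- data.frame(
  name = c("Vannes", "Barcelona"),
  reference1 = c(14, 35),
  reference2 = c(20, 44),
  stringsAsFactors = FALSE
)

# reference1 is zero-based and reference2 is exclusive, while substring() is
# one-based and inclusive, so only the start needs shifting by one.
refs$mention <- substring(text, refs$reference1 + 1, refs$reference2)
```

Comparing `refs$mention` with `refs$name` shows where the API normalized a name that was written differently in the text.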

The function is vectorized, so you can also input a character vector. In that case the MD5 hash of each text can be useful for further analysis, for example to match each result to its input text.

library("geoparser")
output_v <- geoparser_q(text_input = c("I was born in Vannes but I live in Barcelona.",
"France is the most beautiful place in the world.", "No place here."))
knitr::kable(output_v$results)

| country | confidence | name | admin1 | type | geometry.type | longitude | latitude | reference1 | reference2 | text_md5 |
|:---|---:|:---|:---|:---|:---|---:|---:|---:|---:|:---|
| FR | 1 | Vannes | A2 | seat of a second-order administrative division | Point | -2.75000 | 47.66667 | 14 | 20 | 90aba603d6b3f6b916c634f74ebc3a05 |
| ES | 1 | Barcelona | 56 | seat of a first-order administrative division | Point | 2.15899 | 41.38879 | 35 | 44 | 90aba603d6b3f6b916c634f74ebc3a05 |
| FR | 1 | France | 00 | independent political entity | Point | 2.00000 | 46.00000 | 0 | 6 | 33247ffc493ca57619549e512c7b5c59 |

knitr::kable(output_v$properties)

| apiVersion | source | id | text_md5 |
|:---|:---|:---|:---|
| 0.4.0 | geoparser.io | p2OeVDVhrK1Jue0K18ny3 | 90aba603d6b3f6b916c634f74ebc3a05 |
| 0.4.0 | geoparser.io | 6n6jQ5QuBqw0HVb9Wp3OL | 33247ffc493ca57619549e512c7b5c59 |
| 0.4.0 | geoparser.io | yq2eRnRiKbLeTVO3LqaXX | a9b35a32dc022502c943daa55520bfc0 |
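One possible use of text_md5 is counting how many places were found per input text, including texts with no result. This is only a sketch: the two data frames are stand-ins built from the values shown above, and `n_places` is an illustrative column name.

```r
# Stand-ins for output_v$properties and output_v$results (values from above).
properties <- data.frame(
  id = c("p2OeVDVhrK1Jue0K18ny3", "6n6jQ5QuBqw0HVb9Wp3OL",
         "yq2eRnRiKbLeTVO3LqaXX"),
  text_md5 = c("90aba603d6b3f6b916c634f74ebc3a05",
               "33247ffc493ca57619549e512c7b5c59",
               "a9b35a32dc022502c943daa55520bfc0"),
  stringsAsFactors = FALSE
)
results <- data.frame(
  name = c("Vannes", "Barcelona", "France"),
  text_md5 = c("90aba603d6b3f6b916c634f74ebc3a05",
               "90aba603d6b3f6b916c634f74ebc3a05",
               "33247ffc493ca57619549e512c7b5c59"),
  stringsAsFactors = FALSE
)

# Count result rows per text. Using the hashes from properties as factor
# levels keeps texts without any result, with a count of 0.
hash_levels <- factor(results$text_md5, levels = properties$text_md5)
properties$n_places <- as.integer(table(hash_levels))
```

The third text ("No place here.") appears in properties but not in results, so it ends up with a count of 0.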

How does it work?

The API uses the Geonames.org gazetteer data. Geoparser.io uses a variety of named entity recognition tools to extract location names from the raw text input, and then applies a proprietary disambiguation algorithm to resolve location names to specific gazetteer records.

What happens if the same place occurs several times in the text?

If the same place name appears several times in the input text, the results contain one line per occurrence; the lines differ only in the values of reference1 and reference2.

output2 <- geoparser_q("I like Paris and Paris and Paris and yeah it is in France!")
knitr::kable(output2$results)

| country | confidence | name | admin1 | type | geometry.type | longitude | latitude | reference1 | reference2 | text_md5 |
|:---|---:|:---|:---|:---|:---|---:|---:|---:|---:|:---|
| FR | 1 | France | 00 | independent political entity | Point | 2.0000 | 46.00000 | 51 | 57 | 34ac61cd71faef0cc4b336b706a7e545 |
| FR | 1 | Paris | A8 | capital of a political entity | Point | 2.3488 | 48.85341 | 7 | 12 | 34ac61cd71faef0cc4b336b706a7e545 |
| FR | 1 | Paris | A8 | capital of a political entity | Point | 2.3488 | 48.85341 | 17 | 22 | 34ac61cd71faef0cc4b336b706a7e545 |
| FR | 1 | Paris | A8 | capital of a political entity | Point | 2.3488 | 48.85341 | 27 | 32 | 34ac61cd71faef0cc4b336b706a7e545 |
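If only distinct places or mention counts are needed, the repeated lines can be collapsed afterwards. A minimal base R sketch, using a stand-in data frame with a subset of the columns shown above:

```r
# Stand-in for output2$results, restricted to a few columns.
results <- data.frame(
  country = c("FR", "FR", "FR", "FR"),
  name = c("France", "Paris", "Paris", "Paris"),
  reference1 = c(51, 7, 17, 27),
  reference2 = c(57, 12, 22, 32),
  stringsAsFactors = FALSE
)

# Number of mentions of each place name in the text.
mention_counts <- table(results$name)

# One line per distinct place, dropping the per-mention offsets.
distinct_places <- results[!duplicated(results$name), c("country", "name")]
```

Deduplicating on name alone is a simplification; with a vector of input texts it would be safer to deduplicate on name together with text_md5.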

What happens if there are no results for the text?

In this case the results table is empty.

output_nothing <- geoparser_q("No placename can be found.")
output_nothing$results
## # A tibble: 0 x 1
## # ... with 1 variables: text_md5 <chr>
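Because the empty table only has the text_md5 column, code that expects columns such as longitude should check for results first. A small sketch with a stand-in for the empty table; `has_places` is a hypothetical helper, not a function of the package.

```r
# Stand-in for output_nothing$results: zero rows, only a text_md5 column.
empty_results <- data.frame(text_md5 = character(0), stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when at least one place was found.
has_places <- function(results) {
  nrow(results) > 0
}

if (!has_places(empty_results)) {
  message("No place was found in the text.")
}

# Columns such as longitude are absent from an empty result.
has_longitude <- "longitude" %in% names(empty_results)
```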

How well does it work?

The API team has tested the API informally and found that it performs similarly to other existing geoparsing tools. A scientific evaluation is under way. The public Geoparser.io API works best with professionally written, professionally edited news articles, but the API team says that for Enterprise customers it can be tuned for other kinds of input (e.g., social media).

Let's look at this example:

output3 <- geoparser_q("I live in Hyderabad, India. My mother would prefer living in Hyderabad near Islamabad!")
knitr::kable(output3$results)

| country | confidence | name | admin1 | type | geometry.type | longitude | latitude | reference1 | reference2 | text_md5 |
|:---|---:|:---|:---|:---|:---|---:|---:|---:|---:|:---|
| IN | 1 | Hyderabad | 40 | seat of a first-order administrative division | Point | 78.45636 | 17.38405 | 10 | 19 | 645d890dde2bce1092338f0cbc7af011 |
| IN | 1 | Hyderabad | 40 | seat of a first-order administrative division | Point | 78.45636 | 17.38405 | 61 | 70 | 645d890dde2bce1092338f0cbc7af011 |
| IN | 1 | India | 00 | independent political entity | Point | 79.00000 | 22.00000 | 21 | 26 | 645d890dde2bce1092338f0cbc7af011 |
| BD | 1 | Chittagong | 84 | seat of a first-order administrative division | Point | 91.83168 | 22.33840 | 76 | 85 | 645d890dde2bce1092338f0cbc7af011 |

Geoparser.io typically assumes two mentions of the same name appearing so closely together in the same input text refer to the same place. So, because it saw "Hyderabad" (India) in the first sentence, it assumes "Hyderabad" in the second sentence refers to the same city. Also, "Islamabad" is an alternate name for Chittagong, which has a higher population than Islamabad (Pakistan) and is closer to Hyderabad (India).
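The proximity part of this explanation can be checked with a great-circle distance. This sketch does not reproduce the API's proprietary algorithm; it only compares two distances with the haversine formula. The Hyderabad and Chittagong coordinates come from the results above, while the coordinates for Islamabad (Pakistan) are approximate values that are not part of the API output and are an assumption made for illustration.

```r
# Great-circle distance in km between two (longitude, latitude) points,
# using the haversine formula and a mean Earth radius of 6371 km.
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(a))
}

# Hyderabad (India) to Chittagong: both coordinates from the results above.
d_chittagong <- haversine_km(78.45636, 17.38405, 91.83168, 22.33840)

# Hyderabad (India) to Islamabad (Pakistan): the Islamabad coordinates are
# approximate and assumed here, not returned by the API.
d_islamabad <- haversine_km(78.45636, 17.38405, 73.05, 33.68)

chittagong_is_closer <- d_chittagong < d_islamabad
```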

Here is another example with a longer text.

text <- "Aliwagwag is situated in the Eastern Mindanao Biodiversity \
Corridor which contains one of the largest remaining blocks of tropical lowland \
rainforest in the Philippines. It covers an area of 10,491.33 hectares (25,924.6 \
acres) and a buffer zone of 420.6 hectares (1,039 acres) in the hydrologically \
rich mountainous interior of the municipalities of Cateel and Boston in Davao \
Oriental as well as a portion of the municipality of Compostela in Compostela \
Valley. It is also home to the tallest trees in the Philippines, the Philippine \
rosewood, known locally as toog. In the waters of the upper Cateel River, a rare \
species of fish can be found called sawugnun by locals which is harvested as a \
delicacy." 
 
output4 <- geoparser_q(text)
knitr::kable(output4$results)

| country | confidence | name | admin1 | type | geometry.type | longitude | latitude | reference1 | reference2 | text_md5 |
|:---|---:|:---|:---|:---|:---|---:|---:|---:|---:|:---|
| PH | 1 | Philippines | 0 | independent political entity | Point | 122.0000 | 13.00000 | 159 | 170 | d89e347a998b58c6a8e54bc9f9abc073 |
| PH | 1 | Philippines | 0 | independent political entity | Point | 122.0000 | 13.00000 | 513 | 524 | d89e347a998b58c6a8e54bc9f9abc073 |
| PH | 1 | Cateel | 11 | populated place | Point | 126.4533 | 7.79139 | 354 | 360 | d89e347a998b58c6a8e54bc9f9abc073 |
| PH | 1 | Boston | 11 | populated place | Point | 126.3642 | 7.87111 | 365 | 371 | d89e347a998b58c6a8e54bc9f9abc073 |
| PH | 1 | Province of Davao Oriental | 11 | second-order administrative division | Point | 126.3333 | 7.16667 | 375 | 390 | d89e347a998b58c6a8e54bc9f9abc073 |
| PH | 1 | Compostela Valley |  | valley | Point | 125.9586 | 7.60755 | 449 | 467 | d89e347a998b58c6a8e54bc9f9abc073 |
| PH | 1 | Cateel River | 11 | stream | Point | 126.4533 | 7.78750 | 602 | 614 | d89e347a998b58c6a8e54bc9f9abc073 |

What can I do with the results?

You might want to map them using leaflet or ggmap or anything you like. The API website provides suggestions of use for inspiration.
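As one possible starting point, the sketch below puts the places from the first example on an interactive map. It assumes the leaflet package is installed and skips the map otherwise; the `places` data frame is a stand-in for the relevant columns of output$results.

```r
# Stand-in for output$results with the columns needed for mapping.
places <- data.frame(
  name = c("Vannes", "Barcelona"),
  longitude = c(-2.75000, 2.15899),
  latitude = c(47.66667, 41.38879),
  stringsAsFactors = FALSE
)

# Build the map only when leaflet is available.
if (requireNamespace("leaflet", quietly = TRUE)) {
  map <- leaflet::leaflet(places)
  map <- leaflet::addTiles(map)  # default OpenStreetMap tiles
  map <- leaflet::addMarkers(map, lng = ~longitude, lat = ~latitude,
                             popup = ~name)
  # In an interactive session, print(map) displays the map.
}
```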

  • Please report any issues or bugs.
  • License: GPL
  • Get citation information for geoparser in R by running citation(package = 'geoparser')
  • Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

News

geoparser 0.2.0 (2016-06-10)

  • geoparser_q now accepts a character vector.

install.packages("geoparser")

0.1.1 by Maëlle Salmon, 6 months ago


http://github.com/ropensci/geoparser


Report a bug at http://github.com/ropensci/geoparser/issues


Browse source code at https://github.com/cran/geoparser


Authors: Maëlle Salmon [aut, cre], Bob Rudis [ctb] (Bob Rudis reviewed the package for rOpenSci, see https://github.com/ropensci/onboarding/issues/43)


Task views: Web Technologies and Services


GPL (>= 2) license


Imports dplyr, httr, jsonlite, lazyeval, tidyr, utils, purrr, stringr, digest

Suggests testthat, knitr, rmarkdown

