A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker

Provides functions to download and parse 'robots.txt' files. Ultimately the package makes it easy to check if bots (spiders, crawlers, scrapers, ...) are allowed to access specific resources on a domain.


Peter Meißner

Status: Feature complete and part of the rOpenSci network.

Author: Peter Meißner

Contributors: Oliver Keys (code review and improvements), Rich FitzJohn (code review and improvements)

Licence: MIT

Description:

The robotstxt package provides functions to download and parse robots.txt files. Ultimately the package makes it easy to check if bots (spiders, scrapers, ...) are allowed to access specific resources on a domain.

Installation and start - stable version

install.packages("robotstxt")
library(robotstxt)

Installation and start - development version

devtools::install_github("petermeissner/robotstxt")
library(robotstxt)

Robotstxt class documentation

?robotstxt

Usage

library(robotstxt)
 
paths_allowed(
  paths  = c("/api/rest_v1/?doc", "/w/"), 
  domain = "wikipedia.org", 
  bot    = "*"
)
## [1]  TRUE FALSE
paths_allowed(
  paths = c(
    "https://wikipedia.org/api/rest_v1/?doc", 
    "https://wikipedia.org/w/"
  )
)
## [1]  TRUE FALSE

... or use it this way ...

library(robotstxt)
 
rtxt <- robotstxt(domain = "wikipedia.org")
rtxt$check(paths = c("/api/rest_v1/?doc", "/w/"), bot= "*")
## /api/rest_v1/?doc               /w/ 
##              TRUE             FALSE
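The downloading and parsing steps can also be used separately via the package's get_robotstxt() and parse_robotstxt() functions. A minimal sketch parsing a hand-written robots.txt string (the rules are made up for illustration, so no network access is needed):

```r
library(robotstxt)

# a minimal, hand-written robots.txt (illustrative rules only)
txt <- "User-agent: *\nDisallow: /private/\nAllow: /public/\n"

# parse it into its components (user agents, permissions, ...)
parsed <- parse_robotstxt(txt)
parsed$permissions
```

For a real domain, get_robotstxt("wikipedia.org") downloads the file first; the result can then be passed to parse_robotstxt() or used via the robotstxt() constructor shown above.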

More information

vignette

Contribution - AKA The-Think-Twice-Be-Nice-Rule

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms: contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.

We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.

Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.

Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.

This Code of Conduct is adapted from the Contributor Covenant (http://contributor-covenant.org), version 1.0.0, available at http://contributor-covenant.org/version/1/0/0/

News

NEWS robotstxt

  • restructuring of the package
  • rOpenSci onboarding: https://github.com/ropensci/onboarding/issues/25
  • first feature-complete version on CRAN


CRAN package information

Version: 0.4.1 by Peter Meissner

Source code: https://github.com/ropenscilabs/robotstxt (mirror: https://github.com/cran/robotstxt)

Report a bug: https://github.com/ropenscilabs/robotstxt/issues

Authors: Peter Meissner [aut, cre], Oliver Keys [ctb], Rich FitzJohn [ctb]

License: MIT + file LICENSE

Imports: stringr, httr, magrittr

Suggests: knitr, rmarkdown, dplyr, testthat, covr