Wrappers around the 'xml2' and 'httr' packages to make it easy to download, then manipulate, HTML and XML.
library(rvest)
lego_movie <- read_html("")

# Rating: text of the matching node, converted to a number
rating <- lego_movie %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating

# Cast: text of every matching node
cast <- lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
cast
#>  "Will Arnett" "Elizabeth Banks" "Craig Berry"
#>  "Alison Brie" "David Burrows" "Anthony Daniels"
#>  "Charlie Day" "Amanda Farinos" "Keith Ferguson"
#>  "Will Ferrell" "Will Forte" "Dave Franco"
#>  "Morgan Freeman" "Todd Hansen" "Jonah Hill"

# Poster: value of the src attribute of the image
poster <- lego_movie %>%
  html_nodes(".poster img") %>%
  html_attr("src")
poster
#>  ""
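The URL in the example above is not shown, so the snippet cannot be run as printed. The following is a minimal offline sketch of the same pipeline, run against a small inline HTML string instead of a live page. The markup, the actor names, the rating value and the file name `poster.jpg` are all made up for illustration; only the selectors are taken from the example above.

```r
library(rvest)

# Hypothetical stand-in for a movie page. The ids and classes mirror the
# selectors used above, but the content is invented for this sketch.
page <- read_html('
  <html><body>
    <strong><span>7.8</span></strong>
    <div id="titleCast">
      <div class="itemprop"><span>Actor One</span></div>
      <div class="itemprop"><span>Actor Two</span></div>
    </div>
    <div class="poster"><img src="poster.jpg"></div>
  </body></html>')

# Text of the rating node, converted to a number
rating <- page %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()

# Text of every cast node
cast <- page %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()

# Value of the src attribute on the poster image
poster <- page %>%
  html_nodes(".poster img") %>%
  html_attr("src")
```

Because `read_html()` accepts a string containing HTML, a fixture like this can be handy for checking selectors before pointing the same code at a real page.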
The most important functions in rvest are:
Create an html document from a url, a file on disk or a string containing html with read_html().
Select parts of a document using css selectors:
html_nodes(doc, "table td") (or if you're a glutton for punishment, use xpath selectors with
html_nodes(doc, xpath = "//table//td")). If you haven't heard of selectorgadget, make sure to read
vignette("selectorgadget") to learn about it.
Extract components with
html_tag() (the name of the tag),
html_text() (all text inside the tag),
html_attr() (contents of a single attribute) and
html_attrs() (all attributes).
(You can also use rvest with XML files: parse with
xml(), then extract components using
Parse tables into data frames with
Extract, modify and submit forms with
Detect and repair encoding problems with
Navigate around a website as if you're in a browser with
submit_form() and so on. (This is still a work in progress, so I'd love your feedback.)
To see examples of these functions in use, check out the demos.
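As a small offline complement to the demos, here is a sketch of the selection and extraction functions listed above, applied to an inline HTML string. The markup is invented for illustration. Two assumptions are worth flagging: `html_name()` is used for the tag name, on the assumption that an rvest version is installed where `html_tag()` goes by that name, and `html_table()` is assumed to be the table-parsing function meant above.

```r
library(rvest)

# Made-up document with one table and one link
doc <- read_html('
  <html><body>
    <table id="scores">
      <tr><th>name</th><th>score</th></tr>
      <tr><td>a</td><td>1</td></tr>
      <tr><td>b</td><td>2</td></tr>
    </table>
    <a href="/next" class="nav">Next page</a>
  </body></html>')

# The same cells selected two ways: CSS selector and XPath
cells_css   <- html_nodes(doc, "table td")
cells_xpath <- html_nodes(doc, xpath = "//table//td")

# All text inside each selected cell
cell_text <- html_text(cells_css)

# Components of the link node
link       <- html_nodes(doc, "a.nav")
link_name  <- html_name(link)          # tag name (assumed rename of html_tag())
link_href  <- html_attr(link, "href")  # a single attribute
link_attrs <- html_attrs(link)         # all attributes, one entry per node

# Parse the table into a data frame; a node set yields a list of tables
scores <- html_table(html_nodes(doc, "table"))[[1]]
```

The CSS and XPath calls select the same four cells here, which is usually the argument for preferring the shorter CSS form unless a query needs something only XPath can express.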
Install the release version from CRAN:
Or the development version from GitHub:
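The install commands themselves are not shown above. A minimal sketch of what they typically look like follows. The CRAN call is standard R; the development install is left commented out and rests on two assumptions: that the sources are hosted at `tidyverse/rvest` on GitHub, and that the `remotes` package is available.

```r
# Release version from CRAN; skipped when rvest is already installed
if (!requireNamespace("rvest", quietly = TRUE)) {
  install.packages("rvest")
}

# Development version (assumption: repository is tidyverse/rvest and the
# remotes package is installed). Uncomment to use.
# remotes::install_github("tidyverse/rvest")
```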
back() to correctly manage session history.
If you're using xml2 1.0.0,
html_node() will now return a "missing node".
Parse rowspans and colspans effectively by repeating cell values from left to right (for colspan) and from top to bottom (for rowspan) (#111).
Updated a few examples and demos where the website structure has changed.
Made compatible with both xml2 0.1.2 and 1.0.0.
Fix invalid link for SSA example.
<options> that don't have a value attribute (#85).
Remove all remaining uses of
html() in favor of
rvest has been rewritten to take advantage of the new xml2 package. xml2 provides a fresh binding to libxml2, avoiding many of the work-arounds previously needed for the XML package. Now rvest depends on the xml2 package, so all the xml functions are available, and rvest adds a thin wrapper for html.
A number of functions have changed names. The old versions still work, but are deprecated and will be removed in rvest 0.4.0.
html_node() now throws an error if there are no matches, and a warning
if there's more than one match. I think this should make it more likely to
fail clearly when the structure of the page changes.
xml_structure() has been moved to xml2. New
html_structure() (also in
xml2) highlights id and class attributes (#78).
submit_form() now works with forms that use GET (#66).
submit_request() (and hence
submit_form()) is now case-insensitive,
and so will find
<input type=SUBMIT> as well as
submit_request() (and hence
submit_form()) recognizes forms with
<input type="image"> as a valid form submission button per
... on to
httr::GET() so you can more
finely control the request (#48).
Add xml support: parse with
xml(), then work with using
xml_structure(): new function that displays the structure (i.e. tag
and attribute names) of an xml/html object (#10).
follow_link() now accepts css and xpath selectors. (#38, #41, #42)
html() does a better job of dealing with encodings (passing the
problem on to
XML::parseHTML()) instead of trying to do it itself
html_attr() returns default value when input is NULL (#49)
html_node() method for session.
html_nodes() now returns an empty list if no elements are found (#31).
submit_form() converts relative paths to absolute URLs (#52).
It also deals better with 0-length inputs (#29).
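Three of the behaviours recorded in these notes can be checked offline: html_nodes() returning an empty result when nothing matches, html_node() returning a "missing node" under xml2 1.0.0 or later, and colspan cells being repeated from left to right when a table is parsed. The sketch below assumes a reasonably recent rvest and xml2, and that `html_table()` is the function the rowspan/colspan note refers to; the markup and the `.does-not-exist` selector are made up for illustration.

```r
library(rvest)

# Made-up document: one paragraph and a table whose first row spans two columns
doc <- read_html('
  <html><body>
    <p class="intro">Hello</p>
    <table>
      <tr><td colspan="2">wide</td></tr>
      <tr><td>left</td><td>right</td></tr>
    </table>
  </body></html>')

# No element matches: html_nodes() gives back a zero-length result
none <- html_nodes(doc, ".does-not-exist")
n_none <- length(none)

# html_node() with no match: with xml2 >= 1.0.0 this is a "missing node"
# rather than an error, so extracting its text yields NA
missing_text <- html_text(html_node(doc, ".does-not-exist"))

# The colspan cell is repeated across both columns of the first row
tbl <- html_table(html_node(doc, "table"), header = FALSE)
first_row <- c(as.character(tbl[[1]][1]), as.character(tbl[[2]][1]))
```

The missing-node behaviour means a pipeline over many pages can carry NA for absent fields instead of stopping at the first page that lacks one, at the cost of having to check for NA afterwards.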