Work with XML files using a simple, consistent interface. Built on top of the 'libxml2' C library.
You can install xml2 from CRAN,
or you can install the development version from GitHub.
library("xml2")
x <- read_xml("<foo> <bar> text <baz/> </bar> </foo>")
x
xml_name(x)
xml_children(x)
xml_text(x)
xml_find_all(x, ".//baz")
h <- read_html("<html><p>Hi <b>!")
h
xml_name(h)
xml_text(h)
There are three key classes:
xml_node: a single node in a document.
xml_document: the complete document. Acting on a document is usually the same
as acting on the root node of the document.
xml_nodeset: a set of nodes within the document. Operations on
xml_nodesets are vectorised: they apply the operation to each node in the set.
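As a minimal sketch of how these classes show up in practice (the sample document is made up for illustration), the following parses a small string and inspects what comes back. Current xml2 releases report the document class as "xml_document".

```r
library(xml2)

# Hypothetical sample document used only for this illustration.
doc <- read_xml("<foo><bar>one</bar><bar>two</bar></foo>")

# The parsed document; its class vector also includes "xml_node".
class(doc)

# Acting on the document is usually the same as acting on its root node.
root <- xml_root(doc)
xml_name(doc)
xml_name(root)

# xml_find_all() returns an xml_nodeset; xml_text() is applied to each node.
bars <- xml_find_all(doc, ".//bar")
class(bars)
bar_text <- xml_text(bars)
bar_text
```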
xml2 has similar goals to the XML package. The main differences are:
xml2 takes care of memory management for you. It will automatically free the memory used by an XML document as soon as the last reference to it goes away.
xml2 has a very simple class hierarchy, so you don't need to think about exactly what type of object you have; xml2 will just do the right thing.
More convenient handling of namespaces in XPath expressions - see
xml_ns_strip() to get started.
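The sketch below illustrates the namespace handling on a made-up document with a default namespace (the URI is a placeholder). It assumes the usual xml2 behavior in which xml_ns() assigns the prefix d1 to the first default namespace and xml_ns_strip() modifies the document in place.

```r
library(xml2)

# Hypothetical document with a default namespace (placeholder URI).
doc <- read_xml('<root xmlns="http://example.com/ns"><item>a</item><item>b</item></root>')

# An unprefixed XPath does not match elements in the default namespace.
n_plain <- length(xml_find_all(doc, "//item"))

# xml_ns() lists the namespaces; the default one gets the prefix "d1".
ns <- xml_ns(doc)
n_prefixed <- length(xml_find_all(doc, "//d1:item", ns))

# After stripping the default namespace, the unprefixed XPath matches.
xml_ns_strip(doc)
n_stripped <- length(xml_find_all(doc, "//item"))
```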
as_list() on xml_document objects did not properly include the root node in the returned list. The previous behavior can be obtained by using
as_list()[[1L]] in place of as_list().
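A short sketch of the difference, assuming a version of xml2 that includes this change; the sample document is hypothetical.

```r
library(xml2)

doc <- read_xml("<foo><bar>text</bar></foo>")

# New behavior: the root node is the single top-level element of the list.
lst <- as_list(doc)
names(lst)

# Previous behavior: drop the root level to get the children directly.
old_style <- as_list(doc)[[1L]]
names(old_style)
```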
download_xml() and download_html() helper functions to make it easy to
download files (#193).
xml_attr() can now set attributes with no value (#198).
xml_serialize() and xml_unserialize() now create file connections when
given character input (#179).
xml_find_first() no longer de-duplicates results, so the results are always
the same length as the inputs (as documented) (#194).
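To illustrate the length guarantee, here is a small sketch on a made-up document in which one of three a elements has no b child. The missing match is kept as a placeholder, so the result lines up with the input.

```r
library(xml2)

# Hypothetical input: the second <a> has no <b> child.
doc <- read_xml("<r><a><b>1</b></a><a/><a><b>3</b></a></r>")

as_nodes <- xml_find_all(doc, "//a")

# One result per input node; a node without a match yields a missing entry.
first_b <- xml_find_first(as_nodes, "./b")
length(first_b)

# Missing entries come back as NA when text is extracted.
b_text <- xml_text(first_b)
b_text
```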
xml2 can now build using libxml2 2.7.0
Use Rcpp symbol registration and visibility to prevent symbol conflicts on Linux
xml_add_child() now requires fewer resources to insert a node when called
with .where = 0L (@heckendorfc, #175).
Fixed failing examples due to a change in an external resource.
write_xml() and write_html() now accept connections as well as filenames
for output. (#157)
xml_add_child() now takes a
.where argument specifying where to add the
new children. (#138)
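A minimal sketch of the .where argument, using a made-up element layout. In xml_add_child(), .where is the position after which the new child is added, so 0 places it first.

```r
library(xml2)

# Build a small hypothetical document to modify.
root <- xml_new_root("list")

# By default a new child is appended after the existing children.
xml_add_child(root, "b")
xml_add_child(root, "c")

# .where = 0L inserts the new child before all existing children.
xml_add_child(root, "a", .where = 0L)

child_names <- xml_name(xml_children(root))
child_names
```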
as_xml() generic function to convert R objects to XML. The most important
method is for lists; it enables full round-trip support to and from XML.
(#137, #143)
xml_new_root() can be used to create a new document and a root node in one step (#131).
xml_add_parent() inserts a new node between the node and its parent (#129)
xml_validate() to validate a document against an XML schema (#31, @jeroenooms).
xml2_types.h to allow for extension packages such as xslt.
xml_comment() allows you to add comment nodes to a document. (#111)
xml_cdata() allows you to add CDATA nodes to a document. (#128)
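As a sketch of how comment and CDATA nodes can be added, the example below builds a small document and serializes it. The element name and node contents are invented for illustration.

```r
library(xml2)

doc <- xml_new_root("root")

# Add a comment node and a CDATA node as children of the root.
xml_add_child(doc, xml_comment("generated example"))
xml_add_child(doc, xml_cdata("<not parsed/>"))

# Serialize the document to a string to inspect the result.
out <- as.character(doc)
cat(out)
```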
xml_set_name() equivalent to xml_name<-, and
xml_set_attrs() equivalent to
xml_attrs<-. (#109, #130)
write_html() method (#133).
xml_new_document() now explicitly sets the encoding (default UTF-8) (#142)
Document formatting options for
Add missing methods for xml_missing objects. (#134)
Bugfix for xml_length.xml_nodeset that caused it to fail unconditionally. (#140)
is.na() now returns TRUE for
xml_missing objects. (#139)
Trim non-breaking spaces in
xml_text(trim = TRUE) (#151).
Allow setting non-character attributes (values are coerced to characters). (@sjp, #117, #122).
Fixed return value in call to vapply in xml_integer.xml_nodeset. (@ddiez, #146, #147).
Allow docs missing a root element to be created and printed. (@sjp, #126, #121).
xml_add_* methods now return invisibly. (@sjp, #124)
as_list() now preserves element names when attributes exist, and escapes
XML attributes that conflict with special R attributes (@peterfoley, #115).
All C++ functions now use
checked_get() instead of
get() where possible,
so NULL XPtrs properly throw an error rather than crashing. (@jimhester,
xml_integer() and xml_double() functions to make it easy to extract
integer and double text from nodes (@jimhester, #97, #99).
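A brief sketch of extracting numeric text with these helpers; the document and element names are hypothetical.

```r
library(xml2)

doc <- read_xml("<r><n>1</n><n>2</n><x>3.5</x></r>")

# xml_integer() is vectorised over a nodeset and returns an integer vector.
counts <- xml_integer(xml_find_all(doc, "//n"))

# xml_double() converts the text of a single node to a double.
value <- xml_double(xml_find_first(doc, "//x"))
```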
xml2 now supports modification and creation of XML nodes. New functions
and replacement methods for
xml_text() (@jimhester, #9 #76)
xml_ns() now keeps namespace prefixes that point to the same URI
(@jimhester, #35, #95).
read_html() methods added for
(@jimhester, #63, #93)
xml_child() function to make selecting children a little easier
(@jimhester, #23, #94)
xml_find_one() has been deprecated in favor of xml_find_first()
(@jimhester, #58, #92)
xml_read() functions now default to passing the document's namespace
object. Namespace definitions can now be removed as well as added and
xml_ns_strip() added to remove all default namespaces from a document.
(@jimhester, #28, #89)
xml_read() gains an
options argument to control all available parsing
options, including HUGE to turn off limits for parsing very large
documents, and now drops blank text nodes by default, mimicking the default
behavior of the XML package. (@jimhester, #49, #62, #85, #88)
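The following sketch shows the options argument on a made-up input. It assumes the reader in question is read_xml() as exported by current xml2 releases, whose default is options = "NOBLANKS". Passing a different set of options replaces that default.

```r
library(xml2)

# Hypothetical input with whitespace-only text between elements.
txt <- "<r> <a/> </r>"

# Default parsing drops blank text nodes.
no_blanks <- read_xml(txt)

# Passing only "HUGE" replaces the default, so blank text nodes are kept.
with_blanks <- read_xml(txt, options = "HUGE")

# Combine options to keep the default behavior and lift the size limits.
both <- read_xml(txt, options = c("NOBLANKS", "HUGE"))

n_default <- length(xml_contents(no_blanks))
n_kept <- length(xml_contents(with_blanks))
n_both <- length(xml_contents(both))
```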
xml_write() expands the path on filenames, so directories can be specified
with '~/' (@jimhester, #86, #80)
xml_find_one() now returns an 'xml_missing' node object if there are 0
matches (@jimhester, #55, #53, hadley/rvest#82).
xml_find_num(), xml_find_chr() and xml_find_lgl() functions added to
return numeric, character and logical results from XPath expressions. (@jimhester, #55)
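A minimal sketch of XPath expressions that evaluate to a number, a string and a boolean rather than to nodes; the document is invented for illustration.

```r
library(xml2)

doc <- read_xml("<r><n>1</n><n>2</n><x>3.5</x></r>")

# count() evaluates to a number.
n_nodes <- xml_find_num(doc, "count(//n)")

# string() evaluates to a character string.
x_text <- xml_find_chr(doc, "string(//x)")

# boolean() evaluates to TRUE or FALSE; there is no <missing> element here.
has_missing <- xml_find_lgl(doc, "boolean(//missing)")
```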
xml_text() always correctly encode returned value as
Improved configure script - now works again on R-devel on Windows.
Compiles with older versions of libxml2.
Make configure script more cross-platform.
xml_length() to count the number of children (#32).
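A short sketch of xml_length() on a made-up node with two element children and one text child. By default only element children are counted.

```r
library(xml2)

doc <- read_xml("<r><a/><b/>text</r>")

# Counts element children only (the default).
n_elements <- xml_length(doc)

# Counts all child nodes, including the text node.
n_all <- xml_length(doc, only_elements = FALSE)
```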