A collection of high-level, robust, machine- and OS-independent tools for making deeply reproducible and reusable content in R. The three workhorse functions are Cache, prepInputs, and Require. Together they provide nested caching that is robust to environments and to objects containing environments (such as functions), data retrieval and processing, and package handling in continuous workflow environments. In all cases, efforts are made so that the first and subsequent calls of a function return the same result, with subsequent calls vastly faster by way of checksums and digesting. Several features are still under active development, including cloud storage of cached objects, which allows sharing between users.
A set of tools for R that enhance reproducibility beyond package management.
Built on top of archivist, this package aims to provide high-level, robust, machine- and OS-independent tools for making deeply reproducible and reusable content in R. This extends beyond the package management utilities of checkpoint by including tools for caching and accessing GitHub repositories.
Install from CRAN:
install.packages("reproducible")
Install from GitHub:
#install.packages("devtools")
library("devtools")
install_github("PredictiveEcology/reproducible", dependencies = TRUE)
Install the development version from GitHub:
#install.packages("devtools")
library("devtools")
install_github("PredictiveEcology/reproducible", ref = "development", dependencies = TRUE)
Known issues: https://github.com/PredictiveEcology/reproducible/issues
CHECKSUMS.txt should now be ordered consistently across operating systems (note: base::order will not succeed in doing this --> now using
cloudSyncCache has a new argument: cacheIds. Users can now control entries by cacheId, so can delete/upload individual objects by
cloudCache bugfixes for more cases
Removed tibble from Imports as it's no longer being used
Removed the %>% pipe that was long ago deprecated. Users should use %C% if they want a pipe that is Cache-aware. See examples.
options descriptions now in
options("reproducible.cachePath") can take a vector of paths. Similar to how .libPaths() works for libraries, Cache will search first in the first entry in the cacheRepo, then the second etc. until it finds an entry. It will only write to the first entry.
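The lookup order described above can be illustrated with a minimal base-R sketch. It only mimics the search-then-write rule using .rds files; the helpers findCacheEntry and writeCacheEntry are hypothetical names invented for this illustration and are not part of the package, whose cache is database-backed.

```r
# Hypothetical helper: check each cache path in order and return the
# first one that holds the entry (NA if none does).
findCacheEntry <- function(cachePaths, entryFile) {
  for (p in cachePaths) {
    if (file.exists(file.path(p, entryFile))) return(p)
  }
  NA_character_
}

# Hypothetical helper: new entries are always written to the first path only.
writeCacheEntry <- function(cachePaths, entryFile, value) {
  dir.create(cachePaths[1], recursive = TRUE, showWarnings = FALSE)
  saveRDS(value, file.path(cachePaths[1], entryFile))
  invisible(cachePaths[1])
}

paths <- file.path(tempdir(), c("cacheA", "cacheB"))
for (p in paths) dir.create(p, showWarnings = FALSE)

saveRDS(42, file.path(paths[2], "abc.rds"))  # entry exists only in the second path
hit <- findCacheEntry(paths, "abc.rds")      # search falls through to paths[2]
writeCacheEntry(paths, "new.rds", 1:3)       # write lands in paths[1]
```

This mirrors how a read-only shared cache could sit behind a writable personal one.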
options("reproducible.useCache" = "devMode"). The point of this mode is to facilitate using the Cache when functions and datasets are continually in flux, and old Cache entries are likely stale very often. In devMode, the cache mechanism will work as normal if the Cache call is the first time for a function OR if it successfully finds a copy in the cache based on the normal Cache mechanism. It differs from the normal Cache if the Cache call does not find a copy in the cacheRepo, but it does find an entry that matches based on userTags. In this case, it will delete the old entry in the cacheRepo (identified based on matching userTags), then continue with normal Cache. For this to work correctly, userTags must be unique for each function call. This should be used with caution as it is still experimental.
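The devMode decision rule can be sketched with a toy in-memory store. This is an illustration of the logic as described above, not the package's implementation: devModeCache, the environment-based store, and the string "hashes" are all assumptions made for the example.

```r
# Toy in-memory store: each entry holds a value and the userTags it was saved with.
store <- new.env()

# Hypothetical helper implementing the devMode rule on the toy store.
devModeCache <- function(store, hash, userTags, compute) {
  if (exists(hash, envir = store, inherits = FALSE)) {
    return(get(hash, envir = store)$value)  # normal cache hit
  }
  # No exact hit: delete any stale entry that carries the same userTags.
  for (key in ls(store)) {
    if (setequal(get(key, envir = store)$userTags, userTags)) {
      rm(list = key, envir = store)
    }
  }
  # Continue as a normal cache miss: compute and store.
  value <- compute()
  assign(hash, list(value = value, userTags = userTags), envir = store)
  value
}

v1 <- devModeCache(store, "hash1", "myFunctionCall", function() 1)  # first call: stored
v2 <- devModeCache(store, "hash2", "myFunctionCall", function() 2)  # inputs changed: old entry replaced
v3 <- devModeCache(store, "hash2", "myFunctionCall", function() stop("not recomputed"))  # hit
```

The sketch also shows why userTags need to be unique per function call: two different calls sharing the same tags would keep deleting each other's entries.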
options("reproducible.useNewDigestAlgorithm" = FALSE). There is a message of this change on package load.
cloudCache which allows sharing of Cache among collaborators. Currently only works with
assessDataType into single function (#71, @ianmseddy)
cc: new function -- a shortcut for some commonly used options for
.rar archives, on systems with correct binaries to deal with them (#86, @tati-micheletti)
fastdigest::fastdigest as it does not return the identical hash across operating systems
prepInputs on GIS objects that don't use raster::raster to load object were skipping
prepInputs would cause virtually all entries in CHECKSUMS.txt to be deleted. 2 cases where this happened were identified and corrected.
data.table class objects would give an error sometimes due to use of attr(DT). Internally, attributes are now added with data.table::setattr to deal with this.
postProcess now correctly matches extent (#73, @tati-micheletti)
remotes to Imports and removed
New value possible for options(reproducible.useCache = 'overwrite'), which allows use of Cache in cases where the function call has an entry in the cacheRepo: it will purge that entry and add the output of the current call instead.
FALSE), which will be used in prepInputs as possible directory sources (searched recursively or not) for files being downloaded/extracted/prepared. This allows the use of local copies of files in (an)other location(s) instead of downloading them. If the local location does not have the required files, it will proceed to download, so there is little cost in setting this option. If files do exist on the local system, the function will attempt to use a hardlink before making a copy.
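The "hardlink first, copy as fallback" behaviour can be sketched in base R. The helpers linkOrCopy and useLocalCopy, and their argument names, are hypothetical and exist only for this illustration; the sketch assumes the destination file does not exist yet.

```r
# Hypothetical helper: try a hard link first, fall back to a plain copy.
linkOrCopy <- function(from, to) {
  linked <- suppressWarnings(file.link(from, to))
  if (isTRUE(linked)) return(TRUE)
  file.copy(from, to)
}

# Hypothetical helper: look for fileName under the given local directories.
# Returns TRUE if a local copy was reused, FALSE if the caller should download.
useLocalCopy <- function(fileName, inputPaths, destinationPath, recursive = FALSE) {
  candidates <- list.files(inputPaths, recursive = recursive, full.names = TRUE)
  found <- candidates[basename(candidates) == fileName]
  if (length(found) == 0) return(FALSE)
  linkOrCopy(found[1], file.path(destinationPath, fileName))
}

localDir <- file.path(tempdir(), "localData")
destDir <- file.path(tempdir(), "dest")
dir.create(localDir, showWarnings = FALSE)
dir.create(destDir, showWarnings = FALSE)
writeLines("a,b", file.path(localDir, "data.csv"))

reused <- useLocalCopy("data.csv", localDir, destDir)        # local copy found and reused
needsDownload <- !useLocalCopy("other.csv", localDir, destDir)  # not found: download instead
```

A hard link avoids duplicating the bytes on disk, which matters for large geospatial files; the copy fallback covers cases where linking is not possible, such as across file systems.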
dlGoogle() now sets options(httr_oob_default = TRUE) if using RStudio Server.
CHECKSUMS now sorted alphabetically.
Checksums can now have a CHECKSUMS.txt file located in a different place than the
Attempt to select raster resampling method based on raster type if no method supplied (#63, @ianmseddy)
assessDataTypeGDAL, used in postProcess, to identify smallest datatype for large Raster* objects passed to GDAL system call
gdalwarp system call if raster::canProcessInMemory(x,4) = FALSE for faster and memory-safe processing
Raster objects, including factor rasters
extractFromArchive for large (>2GB) zip files. In the
unzip fails for zip files >2GB. This uses a system call if the zip file is too large and fails using
Cache() when deeply nested, due to grep(sys.calls(), ...) that would take long and hang.
preProcess(url = NULL) (#65, @tati-micheletti)
clearCache (#67), especially for large Raster objects that are stored as binary
raster package changes in development version of
.robustDigest now does not include
Cache saving to SQLite database, via options("reproducible.futurePlan"), if the future package is installed. This is
do.call function is Cached, previously, it would be labelled in the database as do.call. Now it attempts to extract the actual function being called by the do.call. Messaging is similarly changed.
reproducible.ask, logical, indicating whether clearCache should ask for deletions when in an interactive session
dlFun, to pass a custom function for downloading (e.g., "raster::getData")
prepInputs will automatically use readRDS if the file is a
prepInputs will return a
fun = "base::load", with a message; can still pass an envir to obtain standard behaviour of
clearCache - new argument
assessDataType, used in postProcess, to identify smallest datatype for Raster* objects, if user does not pass an explicit
git2r update (@stewid, #36).
.prepareRasterBackedFile -- now will append an incremented numeric to a cached copy of a file-backed Raster object, if it already exists. This mirrors the behaviour of the .rda file. Previously, if two Cache events returned the same file name backing a Raster object, even if the content was different, it would allow the same file name. If either cached object was deleted, therefore, it would cause the other one to break as its file-backing would be missing.
spades.XXX and should have been
copyFile did not perform correctly under all cases; now better handling of these cases, often sending to file.copy (slower, but more reliable)
extractFromArchive needed a new Checksum function call under some circumstances
extractFromArchive -- when dealing with nested zips, not all args were passed in recursively (#37, @CeresBarros)
prepInputs -- arguments that were same as Cache were not being correctly passed internally to Cache, and if wrapped in Cache, it was not passed into prepInputs. Fixed.
.prepareFileBackedRaster was failing in some cases (specifically if it was inside a do.call) (#40, @CeresBarros).
Cache was failing under some cases of Cache(do.call, ...). Fixed.
Cache -- when arguments to Cache were the same as the arguments in FUN, Cache would "take" them. Now, they are correctly passed to the
preProcess -- writing to checksums may have produced a warning if CHECKSUMS.txt was not present. Now it does not.
convertRasterPaths to assist with renaming moved files.
prepInputs -- new features
alsoExtract now has more options (
"similar") and defaults to extracting all files in an archive (
postProcess altogether if no
rasterToMatch. Previously, this would invoke Cache even if there was nothing to
copyFile correctly handles directory names containing spaces.
makeMemoisable fixed to handle additional edge cases.
prepInputs to aid in data downloading and preparation problems, solved in a reproducible, Cache-aware way.
postProcess which is a wrapper for sequences of several other new functions (
downloadFile can handle Google Drive and ftp/http(s) files
compareNA does comparisons with NA as a possible value e.g., compareNA(c(1,NA), c(2, NA)) returns FALSE TRUE
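An NA-aware comparison of this kind can be written in a few lines of base R. The function below, compareNASketch, is a hypothetical stand-in written for illustration and is not claimed to be the package's own code; it treats two NA values as equal and an NA paired with a non-NA value as unequal.

```r
# Hypothetical re-implementation of an NA-aware element-wise comparison.
compareNASketch <- function(v1, v2) {
  same <- (v1 == v2) | (is.na(v1) & is.na(v2))  # NA == NA counts as a match
  same[is.na(same)] <- FALSE                    # NA vs. non-NA counts as a mismatch
  same
}

res <- compareNASketch(c(1, NA), c(2, NA))
print(res)  # FALSE TRUE
```

By contrast, plain `c(1, NA) == c(2, NA)` yields `FALSE NA`, which is awkward to use in conditions.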
Cache -- new features:
verbose which can help with debugging
useCache which allows turning caching on and off at a high level (e.g., options("useCache"))
cacheId which allows user to hard code a result from a Cache
Cache function calls, unless explicitly set on the inner functions
userTags added automatically to cache entries so much more powerful searching via
checksums now returns a data.table with the same columns whether write = TRUE or write = FALSE.
showCache now give messages and require user intervention if a request to clearCache would delete large quantities of data
memoise::memoise now used on 3rd run through an identical Cache call, dramatically speeding up in most cases
asPath has a new argument indicating how deep the path should be considered when included in caching (only relevant when quick = TRUE)
New vignette on using Cache
parallel-safe, meaning there is a tryCatch around every attempt at writing to the SQLite database so it can be used safely on multi-threaded machines
bug fixes, unit tests, more imports for packages e.g.,
updates for R 3.6.0 compact storage of sequence vectors
experimental pipes (
%C%) and assign
several performance enhancements
mergeCache: a new function to merge two different Cache repositories
memoise::memoise is now used on loadFromLocalRepo, meaning that the 3rd time Cache() is run on the same arguments (and the 2nd time in a session), the returned Cache will be from a RAM object via memoise. To stop this behaviour and use only disk-based Caching, set options(reproducible.useMemoise = FALSE).
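The two-tier behaviour (disk on the first load in a session, RAM afterwards) can be sketched in base R without the memoise package. The loadEntry function, the ramCache environment, and the useMemoise argument are hypothetical names for this illustration; the argument only echoes the option named above.

```r
# In-session RAM layer sitting in front of the disk cache.
ramCache <- new.env()

# Hypothetical loader: serve from RAM when possible, otherwise read from disk
# and remember the result for the rest of the session.
loadEntry <- function(id, cacheDir, useMemoise = TRUE) {
  if (useMemoise && exists(id, envir = ramCache, inherits = FALSE)) {
    return(list(value = get(id, envir = ramCache), source = "RAM"))
  }
  value <- readRDS(file.path(cacheDir, paste0(id, ".rds")))
  if (useMemoise) assign(id, value, envir = ramCache)
  list(value = value, source = "disk")
}

cacheDir <- file.path(tempdir(), "diskCache")
dir.create(cacheDir, showWarnings = FALSE)
saveRDS(1:10, file.path(cacheDir, "abc123.rds"))  # stands in for an earlier cached result

first <- loadEntry("abc123", cacheDir)                         # read from disk
second <- loadEntry("abc123", cacheDir)                        # served from RAM
diskOnly <- loadEntry("abc123", cacheDir, useMemoise = FALSE)  # RAM layer bypassed
```

The trade-off is memory: every entry served this way stays resident for the session, which is why a switch to disable it is useful for very large objects.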
Cache assign -- %<% can be used instead of normal assign, equivalent to lhs <- Cache(rhs).
new option: reproducible.verbose, set to FALSE by default, but if set to TRUE may help understand caching behaviour, especially for complex, highly nested code.
all options now described in
All Cache arguments other than FUN and ... will now propagate to internal, nested Cache calls, if they are not specified explicitly in each of the inner Cache calls.
Cached pipe operator %C% -- use to begin a pipe sequence, e.g., Cache() %C% ...
sideEffect can now be a path
digestPathContent default changed from FALSE (was for speed) to TRUE (for content accuracy)
searchFull, which shows the full search path, known alternatively as "scope", or "binding environments". It is where R will search for a function when requested by a user.
memoise::memoise for several functions (
available.packages) for speed -- this improves speed at the expense of memory.
require on those 20 packages, but require does not check for dependencies and deal with them if missing: it just errors. This speed should be fast enough for many purposes.
dplyr from Imports
RCurl to Imports
change name of
digestRaster affecting in-memory rasters