A predictable and pipeable framework for performing ETL (extract-transform-load) operations on publicly-accessible medium-sized data set. This package sets up the method structure and implements generic functions. Packages that depend on this package download specific data sets from the Internet, clean them up, and import them into a local or remote relational database management system.
etl
is an R package to facilitate Extract - Transform - Load (ETL) operations for medium data. The end result is generally a populated SQL database, but the user interaction takes place solely within R.
etl
is on CRAN, so you can install it in the usual way, then load it.
install.packages("etl")
library(etl)
Instantiate an etl
object using a string that determines the class of the resulting object, and the package that provides access to that data. The trivial mtcars
database is built into etl
.
cars <- etl("mtcars")
## /tmp/RtmpIMUIel/file68b36216d2b2.sqlite3
class(cars)
## [1] "etl_mtcars" "etl" "src_dbi" "src_sql" "src"
etl
works with a local or remote database to store your data. Every etl
object extends a dplyr::src_dbi
object. If, as in the example above, you do not specify a SQL source, a local RSQLite
database will be created for you. However, you can also specify any source that inherits from dplyr::src_dbi
.
# For PostgreSQLlibrary(RPostgreSQL)db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")# Alternatively, for MySQLlibrary(RMySQL)db <- src_mysql(dbname = "mtcars", user = "r-user", password = "mypass", host = "localhost")cars <- etl("mtcars", db)
At the heart of etl
are three functions: etl_extract()
, etl_transform()
, and etl_load()
.
The first step is to acquire data from an online source.
cars %>%etl_extract()
## Extracting raw data...
This creates a local store of raw data.
These data may need to be transformed from their raw form to files suitable for importing into SQL (usually CSVs).
cars %>%etl_transform()
## Transforming raw data...
Populate the SQL database with the transformed data.
cars %>%etl_load()
## Loading 1 file(s) into the database...
To populate the whole database from scratch, use etl_create
.
cars %>%etl_create()
## Running SQL script at /home/bbaumer/R/x86_64-pc-linux-gnu-library/3.4/etl/sql/init.sqlite
## Extracting raw data...
## Transforming raw data...
## Loading 1 file(s) into the database...
You can also update an existing database without re-initializing, but watch out for primary key collisions.
cars %>%etl_update()
Now that your database is populated, you can work with it as a src
data table just like any other dplyr
source.
cars %>%tbl("mtcars") %>%group_by(cyl) %>%summarise(N = n(), mean_mpg = mean(mpg))
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.9.1 [/tmp/RtmpIMUIel/file68b36216d2b2.sqlite3]
## cyl N mean_mpg
## <int> <int> <dbl>
## 1 4 11 26.66364
## 2 6 7 19.74286
## 3 8 14 15.10000
Suppose you want to create your own ETL package called pkgname
. All you have to do is write a package that requires etl
, and then you have to write two S3 methods:
etl_extract.etl_pkgname()etl_load.etl_pkgname()
Please see the "Extending etl" vignette for more information.
Please see the full manuscript for additional details.
citation("etl")
##
## To cite package 'etl' in publications use:
##
## Ben Baumer (2017). etl: Extract-Transform-Load Framework for
## Medium Data. R package version 0.3.6.9000.
## http://github.com/beanumber/etl
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {etl: Extract-Transform-Load Framework for Medium Data},
## author = {Ben Baumer},
## year = {2017},
## note = {R package version 0.3.6.9000},
## url = {http://github.com/beanumber/etl},
## }
etl_cities
methodsdefault
methods for any packageclobber
option to smart_download
db_type
for easy typing of connection objectssmart_upload
for pushing files to databasedbplyr
to Suggestsdplyr::src_sql
in documentationsrc_mysql_cnf
as shorthand for connecting to MySQLfile.path
uniformly (#7)smart_download
to downloader
for HTTPSis.etl
to main documentation for etl
(30dee378)etl_load.etl_mtcars
by making etl_transform
saferverify_con
messages easier to readnew_filenames
argument to smart_download
etl_init
(#7)get_schema
to find_schema
(1c0a4e3)NEWS.md
file to track changes to the package.