Extract-Transform-Load Framework for Medium Data

A predictable and pipeable framework for performing ETL (extract-transform-load) operations on publicly-accessible medium-sized data set. This package sets up the method structure and implements generic functions. Packages that depend on this package download specific data sets from the Internet, clean them up, and import them into a local or remote relational database management system.


Travis-CI Build Status CRAN_Status_Badge CRAN RStudio mirror downloads

etl is an R package to facilitate Extract - Transform - Load (ETL) operations for medium data. The end result is generally a populated SQL database, but the user interaction takes place solely within R.

etl is on CRAN, so you can install it in the usual way, then load it.

install.packages("etl")
library(etl)

Instantiate an etl object using a string that determines the class of the resulting object, and the package that provides access to that data. The trivial mtcars database is built into etl.

cars <- etl("mtcars")
## /tmp/RtmpIMUIel/file68b36216d2b2.sqlite3
class(cars)
## [1] "etl_mtcars" "etl"        "src_dbi"    "src_sql"    "src"

Connect to a local or remote database

etl works with a local or remote database to store your data. Every etl object extends a dplyr::src_dbi object. If, as in the example above, you do not specify a SQL source, a local RSQLite database will be created for you. However, you can also specify any source that inherits from dplyr::src_dbi.

# For PostgreSQL
library(RPostgreSQL)
db <- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")
 
# Alternatively, for MySQL
library(RMySQL)
db <- src_mysql(dbname = "mtcars", user = "r-user", password = "mypass", host = "localhost")
cars <- etl("mtcars", db)

At the heart of etl are three functions: etl_extract(), etl_transform(), and etl_load().

Extract

The first step is to acquire data from an online source.

cars %>%
  etl_extract()
## Extracting raw data...

This creates a local store of raw data.

Transform

These data may need to be transformed from their raw form to files suitable for importing into SQL (usually CSVs).

cars %>%
  etl_transform()
## Transforming raw data...

Load

Populate the SQL database with the transformed data.

cars %>%
  etl_load()
## Loading 1 file(s) into the database...

Do it all at once

To populate the whole database from scratch, use etl_create.

cars %>%
  etl_create()
## Running SQL script at /home/bbaumer/R/x86_64-pc-linux-gnu-library/3.4/etl/sql/init.sqlite

## Extracting raw data...

## Transforming raw data...

## Loading 1 file(s) into the database...

You can also update an existing database without re-initializing, but watch out for primary key collisions.

cars %>%
  etl_update()

Do Your Analysis

Now that your database is populated, you can work with it as a src data table just like any other dplyr source.

cars %>%
  tbl("mtcars") %>%
  group_by(cyl) %>%
  summarise(N = n(), mean_mpg = mean(mpg))
## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.9.1 [/tmp/RtmpIMUIel/file68b36216d2b2.sqlite3]
##     cyl     N mean_mpg
##   <int> <int>    <dbl>
## 1     4    11 26.66364
## 2     6     7 19.74286
## 3     8    14 15.10000

Create your own ETL packages

Suppose you want to create your own ETL package called pkgname. All you have to do is write a package that requires etl, and then you have to write two S3 methods:

etl_extract.etl_pkgname()
etl_load.etl_pkgname()

Please see the "Extending etl" vignette for more information.

Use other ETL packages

  • macleish Travis-CI Build Status CRAN_Status_Badge : Weather and spatial data from the MacLeish Field Station in Whately, MA.
  • airlines Travis-CI Build Status : On-time flight arrival data from the Bureau of Transportation Statistics
  • citibike Travis-CI Build Status : Municipal bike-sharing system in New York City
  • nyc311 Travis-CI Build Status : Phone calls to New York City's feedback hotline
  • fec Travis-CI Build Status : Campaign contribution data from the Federal Election Commission
  • imdb Travis-CI Build Status : Mirror of the Internet Movie Database

Cite

Please see the full manuscript for additional details.

citation("etl")
## 
## To cite package 'etl' in publications use:
## 
##   Ben Baumer (2017). etl: Extract-Transform-Load Framework for
##   Medium Data. R package version 0.3.6.9000.
##   http://github.com/beanumber/etl
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {etl: Extract-Transform-Load Framework for Medium Data},
##     author = {Ben Baumer},
##     year = {2017},
##     note = {R package version 0.3.6.9000},
##     url = {http://github.com/beanumber/etl},
##   }

News

etl 0.3.7 (2017-09-25)

MINOR IMPROVEMENTS

  • Added etl_cities methods
  • Added default methods for any package
  • Added "Extending ETL" vignette

etl 0.3.6 (2017-07-20)

MINOR IMPROVEMENTS

  • Added clobber option to smart_download
  • Added db_type for easy typing of connection objects
  • Added smart_upload for pushing files to database
  • Added dbplyr to Suggests
  • Fixed broken link to dplyr::src_sql in documentation

etl 0.3.5 (2016-11-28)

MINOR IMPROVEMENTS

  • Fixed CRAN failures on Solaris (thanks Brian Ripley)

etl 0.3.4 (2016-11-07)

MINOR IMPROVEMENTS

  • Added src_mysql_cnf as shorthand for connecting to MySQL
  • Fixed CRAN failures on Solaris (thanks Brian Ripley)
  • Moved to file.path uniformly (#7)
  • Moved smart_download to downloader for HTTPS

etl 0.3.3 (2016-07-27)

MINOR IMPROVEMENTS

  • Moved is.etl to main documentation for etl (30dee378)
  • Fixed typo in DESCRIPTION (4e77fba2)
  • Fixed bug in etl_load.etl_mtcars by making etl_transform safer
  • Made verify_con messages easier to read
  • Added new functions for help with computing dates and matching filenames to dates
  • Added several tests
  • Added new_filenames argument to smart_download
  • Re-implemented etl_init (#7)
  • Renamed get_schema to find_schema (1c0a4e3)

etl 0.3.1 (2016-06-07)

NEW FEATURES

  • released to CRAN
  • Added a NEWS.md file to track changes to the package.

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("etl")

0.3.7 by Ben Baumer, 9 months ago


http://github.com/beanumber/etl


Report a bug at https://github.com/beanumber/etl/issues


Browse source code at https://github.com/cran/etl


Authors: Ben Baumer [aut, cre], Carson Sievert [ctb]


Documentation:   PDF Manual  


CC0 license


Imports DBI, datasets, downloader, lubridate, methods, stringr, readr, rlang, rvest, tibble, utils, xml2

Depends on dplyr

Suggests devtools, dbplyr, knitr, macleish, RSQLite, RPostgreSQL, RMySQL, MonetDBLite, ggplot2, testthat, rmarkdown


Depended on by macleish, nyctaxi.

Suggested by mdsr.


See at CRAN