Easily talk to Google's 'BigQuery' database from R.
The bigrquery package makes it easy to work with data stored in Google BigQuery by allowing you to query BigQuery tables and retrieve metadata about your projects, datasets, tables, and jobs. It provides three levels of abstraction on top of BigQuery:
The low-level API provides thin wrappers over the underlying REST API. All the low-level functions start with bq_, and mostly have the form bq_noun_verb(). This level of abstraction is most appropriate if you're familiar with the REST API and you want to do something not supported in the higher-level APIs.
The DBI interface wraps the low-level API and makes working with BigQuery like working with any other database system. This is the most convenient layer if you want to execute SQL queries in BigQuery or upload smaller amounts (i.e. <100 MB) of data.
The dplyr interface lets you treat BigQuery tables as if they are in-memory data frames. This is the most convenient layer if you don’t want to write SQL, but instead want dbplyr to write it for you.
The current bigrquery release can be installed from CRAN:
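```r
install.packages("bigrquery")
```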
The newest development release can be installed from GitHub:
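```r
# the development version lives in the r-dbi/bigrquery repository
# install.packages("devtools")
devtools::install_github("r-dbi/bigrquery")
```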
```r
library(bigrquery)
billing <- bq_test_project() # replace this with your project ID
sql <- "SELECT year, month, day, weight_pounds FROM `publicdata.samples.natality`"

tb <- bq_project_query(billing, sql)
bq_table_download(tb, max_results = 10)
#> # A tibble: 10 x 4
#>     year month   day weight_pounds
#>    <int> <int> <int>         <dbl>
#>  1  1969     1    20          7.87
#>  2  1969     6    27          8.00
#>  3  1969     2    14          6.62
#>  4  1969     2     1          7.56
#>  5  1969     6     9          7.50
#>  6  1969    10    21          6.31
#>  7  1969     1    14          5.69
#>  8  1969     6     5          7.94
#>  9  1969     5     8          7.94
#> 10  1969     1     3          6.31
```
```r
library(DBI)

con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = billing
)
con
#> <BigQueryConnection>
#>   Dataset: publicdata.samples
#>   Billing: bigrquery-examples

dbListTables(con)
#> [1] "github_nested"   "github_timeline" "gsod"            "natality"
#> [5] "shakespeare"     "trigrams"        "wikipedia"

dbGetQuery(con, sql, n = 10)
#> # A tibble: 10 x 4
#>     year month   day weight_pounds
#>    <int> <int> <int>         <dbl>
#>  1  1969     1    20          7.87
#>  2  1969     6    27          8.00
#>  3  1969     2    14          6.62
#>  4  1969     2     1          7.56
#>  5  1969     6     9          7.50
#>  6  1969    10    21          6.31
#>  7  1969     1    14          5.69
#>  8  1969     6     5          7.94
#>  9  1969     5     8          7.94
#> 10  1969     1     3          6.31
```
```r
library(dplyr)

natality <- tbl(con, "natality")

natality %>%
  select(year, month, day, weight_pounds) %>%
  head(10) %>%
  collect()
#> # A tibble: 10 x 4
#>     year month   day weight_pounds
#>    <int> <int> <int>         <dbl>
#>  1  1969    11    29          6.00
#>  2  1969     2     6          8.94
#>  3  1970     9     4          7.13
#>  4  1970     1    24          7.63
#>  5  1970     6     6          9.00
#>  6  1970    10    30          6.50
#>  7  1971     3    18          5.75
#>  8  1971     8    11          6.19
#>  9  1971     1    23          5.75
#> 10  1969     5    16          6.88
```
When using bigrquery interactively, you'll be prompted to authorize bigrquery in the browser. Your credentials will be cached across sessions in .httr-oauth. For non-interactive usage, you'll need to download a service token JSON file and use set_service_token().

Note that bigrquery requests permission to modify your data; but it will never do so unless you explicitly request it (e.g. by calling bq_table_delete() or bq_table_upload()).
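For example, a minimal sketch of non-interactive authentication (the file path is a placeholder):

```r
# authenticate with a downloaded service account key instead of the browser flow
set_service_token("/path/to/service-token.json")
```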
If you just want to play around with the BigQuery API, it's easiest to start with Google's free sample data. You'll still need to create a project, but if you're just playing around, it's unlikely that you'll go over the free limit (1 TB of queries / 10 GB of storage).
To create a project:
Open https://console.cloud.google.com/ and create a project. Make a note of the “Project ID” in the “Project info” box.
Click on "APIs & Services", then "Dashboard" in the left menu.
Click on "Enable APIs and Services" at the top of the page, then search for "BigQuery API" and "Cloud Storage".
Use your project ID as the billing project whenever you work with free sample data; and as the project when you work with your own data.
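For example, a minimal query against the free sample data ("my-project-id" is a placeholder):

```r
billing <- "my-project-id"  # the Project ID you noted above

bq_project_query(
  billing,
  "SELECT COUNT(*) AS n FROM `publicdata.samples.natality`"
)
```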
bq_table_download() and the DBI::dbConnect() method now have a bigint argument which governs how BigQuery integer columns are imported into R. As before, the default is bigint = "integer". You can set bigint = "integer64" to import BigQuery integer columns as bit64::integer64 columns in R, which allows for values outside the range of an R integer (-2147483647 to 2147483647) (@rasmusab, #94).
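A minimal sketch of the new argument, assuming tb is a bq_table (e.g. the result of bq_project_query()):

```r
# import BigQuery INTEGER columns as bit64::integer64 to avoid overflow
df <- bq_table_download(tb, bigint = "integer64")
```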
bq_table_download() now treats NUMERIC columns the same way as FLOAT columns (@paulsendavidjay, #282).
bq_table_upload() works with POSIXct/POSIXlt variables (#251).
as.character() is now translated to SAFE_CAST(x AS STRING) (#268).
median() now translates to APPROX_QUANTILES(x, 2)[SAFE_ORDINAL(2)] (@valentinumbach, #267).
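A hedged sketch of these translations, reusing the natality tbl from the dplyr example above:

```r
# show_query() prints the BigQuery SQL generated by dbplyr; median() should
# appear as APPROX_QUANTILES(weight_pounds, 2)[SAFE_ORDINAL(2)]
natality %>%
  summarise(med_weight = median(weight_pounds, na.rm = TRUE)) %>%
  show_query()
```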
Jobs now print their ids while running (#252)
bq_job() tracks location, so bigrquery now works painlessly with non-US/EU locations.
bq_perform_upload() will only autodetect a schema if the table does
not already exist.
bq_table_download() correctly computes page ranges if both max_results and start_index are supplied (#248).
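For example (hypothetical values; tb is a bq_table as above):

```r
# download 100 rows starting at row 1001
bq_table_download(tb, max_results = 100, start_index = 1000)
```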
Unparseable date times return NA (#285)
The system for downloading data from BigQuery into R has been rewritten from the ground up to give considerable improvements in performance and flexibility.
The two steps, downloading and parsing, now happen in sequence, rather than interleaved. This means that you'll now see two progress bars: one for downloading JSON from BigQuery and one for parsing that JSON into a data frame.
Downloads now occur in parallel, using up to 6 simultaneous connections by default.
The parsing code has been rewritten in C++. As well as considerably improving performance, this also adds support for nested (record/struct) and repeated (array) columns (#145); these columns yield list-columns in R.
Results are now returned as tibbles, not data frames, because the base print method does not handle list columns well.
I can now download the first million rows of publicdata.samples.natality in about a minute. This data frame is about 170 MB in BigQuery and 140 MB in R; a minute to download this much data seems reasonable to me. The bottleneck for loading BigQuery data is now parsing BigQuery's JSON format. I don't see any obvious way to make this faster as I'm already using the fastest C++ JSON parser, RapidJSON. If this is still too slow for you (i.e. you're downloading GBs of data), see ?bq_table_download for an alternative approach.
dplyr::compute() now works (@realAkhmed, #52).
tbl() now accepts fully (or partially) qualified table names, like
"publicdata.samples.shakespeare" or "samples.shakespeare". This makes it
possible to join tables across datasets (#219).
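For example, reusing the con connection from the DBI example:

```r
# fully and partially qualified table names
shakespeare  <- tbl(con, "publicdata.samples.shakespeare")
shakespeare2 <- tbl(con, "samples.shakespeare")
```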
dbConnect() now defaults to standard SQL, rather than legacy SQL. Use
use_legacy_sql = TRUE if you need the previous behaviour (#147).
dbConnect() now allows
dataset to be omitted; this is natural when you
want to use tables from multiple datasets.
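A sketch of the new connection options ("my-project-id" is a placeholder); dataset is omitted, and use_legacy_sql = TRUE restores the old behaviour:

```r
con2 <- dbConnect(
  bigrquery::bigquery(),
  project = "my-project-id",
  billing = "my-project-id",
  use_legacy_sql = TRUE
)
```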
dbReadTable() now accepts fully (or partially) qualified table names.
dbi_driver() is deprecated; please use bigquery() instead.
The low-level API has been completely overhauled to make it easier to use. The primary motivation was to make bigrquery development more enjoyable for me, but it should also be helpful to you when you need to go outside of the features provided by higher-level DBI and dplyr interfaces. The old API has been soft-deprecated - it will continue to work, but no further development will occur (including bug fixes). It will be formally deprecated in the next version, and then removed in the version after that.
Consistent naming scheme: All API functions now have the form bq_noun_verb(). Constructor functions create S3 objects corresponding to important BigQuery objects (#150). These are paired with as_ coercion functions and used throughout the new API.
Easier local testing: bq_test_project() and bq_test_dataset() make it easier to run bigrquery tests locally. To run the tests yourself, you need to create a BigQuery project, and then follow the instructions in ?bq_test_project.
More efficient data transfer:
The new API makes extensive use of the
fields query parameter, ensuring
that functions only download data that they actually use (#153).
Tighter GCS connection: bq_table_load() loads data from a Google Cloud Storage URI, pairing with bq_table_save(), which saves data to a GCS URI (#155).
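A sketch of the round trip (bucket, project, and table names are placeholders; tb is a bq_table):

```r
gcs_uri <- "gs://my-bucket/natality-*.csv"

# BigQuery -> Google Cloud Storage
bq_table_save(tb, gcs_uri)

# Google Cloud Storage -> BigQuery
bq_table_load(bq_table("my-project-id", "my_dataset", "natality_copy"), gcs_uri)
```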
The dplyr interface can work with literal SQL once more (#218).
Improved SQL translation for several functions (#176, #179, @jarodmeng).
If you have the development version of dbplyr installed, collect() on a BigQuery table will not perform an unneeded query, but will instead download directly from the table (#226).
Request error messages now contain the "reason", which can contain useful information for debugging (#209).
bq_project_query() can now supply query parameters
bq_table_create() can now specify
bq_perform_query() no longer fails with empty results (@byapparov, #206).
dplyr support has been updated to require dplyr 0.7.0 and use dbplyr. This means that you can now more naturally work directly with DBI connections. dplyr now also uses modern BigQuery SQL which supports a broader set of translations. Along the way I've also fixed some SQL generation bugs (#48).
The DBI driver gets a new name: bigquery().
insert_extract_job() makes it possible to extract data and save it in Google Cloud Storage (@realAkhmed, #119).
insert_table() allows you to insert empty tables into a dataset.
All POST requests (inserts, updates, copies and query_exec) now take .... This allows you to add arbitrary additional data to the request body, making it possible to use parts of the BigQuery API that are otherwise not exposed (#149). snake_case argument names are automatically converted to camelCase so you can stick consistently to snake case in your R code.
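A hypothetical sketch: an extra snake_case argument is converted to the API's camelCase field and added to the request body (maximum_billing_tier becomes maximumBillingTier):

```r
df <- query_exec(sql, project = billing, maximum_billing_tier = 2)
```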
Full support for DATE, TIME, and DATETIME types (#128).
All bigrquery requests now have a custom user agent that specifies the versions of bigrquery and httr that are used (#151).
dbConnect() gains new arguments that are passed onto query_exec(). These allow you to control query options at the connection level.
insert_upload_job() now sends data in newline-delimited JSON instead
of csv (#97). This should be considerably faster and avoids character
encoding issues (#45).
POSIXlt columns are now also correctly
coerced to TIMESTAMPS (#98).
query_exec() gains new arguments:
quiet = TRUE will suppress the progress bars if needed.
use_legacy_sql = FALSE allows you to opt out of the legacy SQL system (#124, @backlin).
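A minimal sketch of the new options, assuming sql and billing as defined earlier:

```r
# suppress progress bars and opt out of legacy SQL
df <- query_exec(sql, project = billing, quiet = TRUE, use_legacy_sql = FALSE)
```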
list_tables() (#108) and list_datasets() (#141) are now paginated. By default they retrieve 50 items per page, and will iterate until they get everything.
query_exec() now gives a nicer progress bar, including estimated time remaining (#100).
query_exec() should be considerably faster, because profiling revealed that ~40% of the time was taken by a single line inside a function that helps parse BigQuery's JSON into an R data frame. I replaced the slow R code with a faster C function.
set_oauth2.0_cred() allows users to supply their own Google OAuth application when setting credentials (#130, @jarodmeng).
wait_for() now reports the query's total bytes billed, which is more accurate because it takes into account caching and other factors.
list_tabledata() returns an empty table when max_pages = 0 (#184, @ras44, @byapparov).
set_service_token() allows you to use an OAuth service token instead of interactive authentication.
^ is correctly translated to POW().
Provide full DBI compliant interface (@krlmlr).
Backend now translates IF (@realAkhmed, #53).
Compatible with latest httr.
Computation of the SQL data type that corresponds to a given R object is now more robust against unknown classes. (#95, @krlmlr)
A data frame with full schema information is returned for zero-row results. (#88, @krlmlr)
New exists_table(). (#91, @krlmlr)
insert_upload_job(). (#92, @krlmlr)
bigrquery.quiet. (#89, @krlmlr)
format_table(). (#81, @krlmlr)
New list_tabledata_iter() that allows fetching a table in chunks of varying size. (#77, #87, @krlmlr)
Add support for API keys via the
BIGRQUERY_API_KEY environment variable.
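For example (the key value is a placeholder):

```r
Sys.setenv(BIGRQUERY_API_KEY = "your-api-key")
```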