R Interface to Apache Spark

R interface to Apache Spark, a fast and general engine for big data processing; see <http://spark.apache.org>. This package supports connecting to local and remote Apache Spark clusters, provides a 'dplyr'-compatible back-end, and provides an interface to Spark's built-in machine learning algorithms.


  • Connect to Spark from R. The sparklyr package provides a
    complete dplyr backend.
  • Filter and aggregate Spark datasets then bring them into R for
    analysis and visualization.
  • Use Spark's distributed machine learning library from R.
  • Create extensions that call the full Spark API and provide
    interfaces to Spark packages.

Installation

You can install the sparklyr package from CRAN as follows:

install.packages("sparklyr")

You should also install a local version of Spark for development purposes:

library(sparklyr)
spark_install(version = "2.1.0")

To upgrade to the latest version of sparklyr, run the following command and restart your R session:

devtools::install_github("rstudio/sparklyr")

If you use the RStudio IDE, you should also download the latest preview release of the IDE which includes several enhancements for interacting with Spark (see the RStudio IDE section below for more details).

Connecting to Spark

You can connect to both local instances of Spark and remote Spark clusters. Here we'll connect to a local instance of Spark via the spark_connect function:

library(sparklyr)
sc <- spark_connect(master = "local")

The returned Spark connection (sc) provides a remote dplyr data source to the Spark cluster.

For more information on connecting to remote Spark clusters see the Deployment section of the sparklyr website.
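
For example, a connection to a cluster managed by YARN typically just changes the master and, optionally, a spark_config(). The values below are placeholders, a minimal sketch to adapt to your own deployment:

library(sparklyr)

# connect to a YARN-managed cluster (placeholder master and version;
# adjust the configuration for your environment)
config <- spark_config()
sc <- spark_connect(master = "yarn-client",
                    config = config,
                    version = "2.1.0")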

Using dplyr

We can now use all of the available dplyr verbs against the tables within the cluster.

We'll start by copying some datasets from R into the Spark cluster (note that you may need to install the nycflights13 and Lahman packages in order to execute this code):

install.packages(c("nycflights13", "Lahman"))
library(dplyr)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
src_tbls(sc)

To start with, here's a simple filtering example:

# filter by departure delay and print the first few records
flights_tbl %>% filter(dep_delay == 2)
## # Source:   lazy query [?? x 19]
## # Database: spark_connection
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      542            540         2      923
##  3  2013     1     1      702            700         2     1058
##  4  2013     1     1      715            713         2      911
##  5  2013     1     1      752            750         2     1025
##  6  2013     1     1      917            915         2     1206
##  7  2013     1     1      932            930         2     1219
##  8  2013     1     1     1028           1026         2     1350
##  9  2013     1     1     1042           1040         2     1325
## 10  2013     1     1     1231           1229         2     1523
## # ... with more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dbl>

Introduction to dplyr provides additional dplyr examples you can try. For example, consider the last example from the tutorial which plots data on flight delays:

delay <- flights_tbl %>% 
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect
 
# plot delays
library(ggplot2)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)
## `geom_smooth()` using method = 'gam'

Window Functions

dplyr window functions are also supported, for example:

batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0)
## # Source:     lazy query [?? x 7]
## # Database:   spark_connection
## # Groups:     playerID
## # Ordered by: playerID, yearID, teamID
##     playerID yearID teamID     G    AB     R     H
##        <chr>  <int>  <chr> <int> <int> <int> <int>
##  1 aaronha01   1959    ML1   154   629   116   223
##  2 aaronha01   1963    ML1   161   631   121   201
##  3 abbotji01   1999    MIL    20    21     0     2
##  4 abnersh01   1992    CHA    97   208    21    58
##  5 abnersh01   1990    SDN    91   184    17    45
##  6 acklefr01   1963    CHA     2     5     0     1
##  7 acklefr01   1964    CHA     3     1     0     1
##  8 adamecr01   2016    COL   121   225    25    49
##  9 adamecr01   2015    COL    26    53     4    13
## 10 adamsac01   1943    NY1    70    32     3     4
## # ... with more rows

For additional documentation on using dplyr with Spark see the dplyr section of the sparklyr website.

Using SQL

It's also possible to execute SQL queries directly against tables within a Spark cluster. The spark_connection object implements a DBI interface for Spark, so you can use dbGetQuery to execute SQL and return the result as an R data frame:

library(DBI)
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
iris_preview
##    Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
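
Other DBI generics work against the same connection. As a quick sketch using standard DBI functions:

# list the tables registered with the Spark session
dbListTables(sc)

# run an aggregate query and return the result as an R data frame
dbGetQuery(sc, "SELECT Species, COUNT(*) AS n FROM iris GROUP BY Species")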

Machine Learning

You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within sparklyr. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.

Here's an example where we use ml_linear_regression to fit a linear regression model. We'll use the built-in mtcars dataset, and see if we can predict a car's fuel consumption (mpg) based on its weight (wt), and the number of cylinders the engine contains (cyl). We'll assume in each case that the relationship between mpg and each of our features is linear.

# copy mtcars into spark
mtcars_tbl <- copy_to(sc, mtcars)
 
# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)
 
# fit a linear model to the training dataset
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))
fit
## Call: ml_linear_regression.tbl_spark(., response = "mpg", features = c("wt", "cyl"))  
## 
## Formula: mpg ~ wt + cyl
## 
## Coefficients:
## (Intercept)          wt         cyl 
##   33.499452   -2.818463   -0.923187

For linear regression models produced by Spark, we can use summary() to learn a bit more about the quality of our fit, and the statistical significance of each of our predictors.

summary(fit)
## Call: ml_linear_regression.tbl_spark(., response = "mpg", features = c("wt", "cyl"))  
## 
## Deviance Residuals:
##    Min     1Q Median     3Q    Max 
## -1.752 -1.134 -0.499  1.296  2.282 
## 
## Coefficients:
## (Intercept)          wt         cyl 
##   33.499452   -2.818463   -0.923187 
## 
## R-Squared: 0.8274
## Root Mean Squared Error: 1.422

Spark machine learning supports a wide array of algorithms and feature transformations, and as illustrated above, it's easy to chain these functions together with dplyr pipelines. To learn more, see the machine learning section.
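
For instance, a fitted model can be used to score the held-out partition. This is a minimal sketch assuming ml_predict(), available in recent sparklyr releases (older releases expose the same operation as sdf_predict()):

# score the test partition with the fitted linear model; predicted
# values are returned in a 'prediction' column alongside the data
pred <- ml_predict(fit, partitions$test)

# compare predicted and actual fuel consumption
pred %>% select(mpg, prediction)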

Reading and Writing Data

You can read and write data in CSV, JSON, and Parquet formats. Data can be stored in HDFS, S3, or on the local filesystem of cluster nodes.

temp_csv <- tempfile(fileext = ".csv")
temp_parquet <- tempfile(fileext = ".parquet")
temp_json <- tempfile(fileext = ".json")
 
spark_write_csv(iris_tbl, temp_csv)
iris_csv_tbl <- spark_read_csv(sc, "iris_csv", temp_csv)
 
spark_write_parquet(iris_tbl, temp_parquet)
iris_parquet_tbl <- spark_read_parquet(sc, "iris_parquet", temp_parquet)
 
spark_write_json(iris_tbl, temp_json)
iris_json_tbl <- spark_read_json(sc, "iris_json", temp_json)
 
src_tbls(sc)
## [1] "batting"      "flights"      "iris"         "iris_csv"    
## [5] "iris_json"    "iris_parquet" "mtcars"
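
The same read functions also accept remote storage paths. As a sketch (the bucket and paths below are hypothetical and require the cluster to be configured with the appropriate S3 or HDFS access):

# read a CSV stored in S3 (hypothetical bucket; needs S3 credentials
# and Hadoop S3 support on the cluster)
flights_s3_tbl <- spark_read_csv(sc, "flights_s3", "s3a://my-bucket/flights.csv")

# read a Parquet dataset from HDFS (hypothetical path)
events_tbl <- spark_read_parquet(sc, "events", "hdfs:///data/events.parquet")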

Distributed R

You can execute arbitrary R code across your cluster using spark_apply. For example, we can apply rgamma over iris as follows:

spark_apply(iris_tbl, function(data) {
  data[1:4] + rgamma(1,2)
})
## # Source:   table<sparklyr_tmp_115c74acb6510> [?? x 4]
## # Database: spark_connection
##    Sepal_Length Sepal_Width Petal_Length Petal_Width
##           <dbl>       <dbl>        <dbl>       <dbl>
##  1     5.336757    3.736757     1.636757   0.4367573
##  2     5.136757    3.236757     1.636757   0.4367573
##  3     4.936757    3.436757     1.536757   0.4367573
##  4     4.836757    3.336757     1.736757   0.4367573
##  5     5.236757    3.836757     1.636757   0.4367573
##  6     5.636757    4.136757     1.936757   0.6367573
##  7     4.836757    3.636757     1.636757   0.5367573
##  8     5.236757    3.636757     1.736757   0.4367573
##  9     4.636757    3.136757     1.636757   0.4367573
## 10     5.136757    3.336757     1.736757   0.3367573
## # ... with more rows

You can also group by columns to perform an operation over each group of rows and make use of any package within the closure:

spark_apply(
  iris_tbl,
  function(e) broom::tidy(lm(Petal_Width ~ Petal_Length, e)),
  names = c("term", "estimate", "std.error", "statistic", "p.value"),
  group_by = "Species"
)
## # Source:   table<sparklyr_tmp_115c73965f30> [?? x 6]
## # Database: spark_connection
##      Species         term    estimate  std.error  statistic      p.value
##        <chr>        <chr>       <dbl>      <dbl>      <dbl>        <dbl>
## 1 versicolor  (Intercept) -0.08428835 0.16070140 -0.5245029 6.023428e-01
## 2 versicolor Petal_Length  0.33105360 0.03750041  8.8279995 1.271916e-11
## 3  virginica  (Intercept)  1.13603130 0.37936622  2.9945505 4.336312e-03
## 4  virginica Petal_Length  0.16029696 0.06800119  2.3572668 2.253577e-02
## 5     setosa  (Intercept) -0.04822033 0.12164115 -0.3964146 6.935561e-01
## 6     setosa Petal_Length  0.20124509 0.08263253  2.4354220 1.863892e-02

Extensions

The facilities used internally by sparklyr for its dplyr and machine learning interfaces are available to extension packages. Since Spark is a general purpose cluster computing system there are many potential applications for extensions (e.g. interfaces to custom machine learning pipelines, interfaces to 3rd party Spark packages, etc.).

Here's a simple example that wraps a Spark text file line counting function with an R function:

# write a CSV 
tempfile <- tempfile(fileext = ".csv")
write.csv(nycflights13::flights, tempfile, row.names = FALSE, na = "")
 
# define an R interface to Spark line counting
count_lines <- function(sc, path) {
  spark_context(sc) %>% 
    invoke("textFile", path, 1L) %>% 
      invoke("count")
}
 
# call spark to count the lines of the CSV
count_lines(sc, tempfile)
## [1] 336777

To learn more about creating extensions see the Extensions section of the sparklyr website.

Table Utilities

You can cache a table into memory with:

tbl_cache(sc, "batting")

and unload from memory using:

tbl_uncache(sc, "batting")

Connection Utilities

You can view the Spark web console using the spark_web function:

spark_web(sc)

You can show the log using the spark_log function:

spark_log(sc, n = 10)
## 17/11/09 15:55:18 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 69 (/var/folders/fz/v6wfsg2x1fb1rw4f6r0x4jwm0000gn/T//RtmpyR8oP9/file115c74b94924.csv MapPartitionsRDD[258] at textFile at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
## 17/11/09 15:55:18 INFO TaskSchedulerImpl: Adding task set 69.0 with 1 tasks
## 17/11/09 15:55:18 INFO TaskSetManager: Starting task 0.0 in stage 69.0 (TID 140, localhost, executor driver, partition 0, PROCESS_LOCAL, 4904 bytes)
## 17/11/09 15:55:18 INFO Executor: Running task 0.0 in stage 69.0 (TID 140)
## 17/11/09 15:55:18 INFO HadoopRDD: Input split: file:/var/folders/fz/v6wfsg2x1fb1rw4f6r0x4jwm0000gn/T/RtmpyR8oP9/file115c74b94924.csv:0+33313106
## 17/11/09 15:55:18 INFO Executor: Finished task 0.0 in stage 69.0 (TID 140). 832 bytes result sent to driver
## 17/11/09 15:55:18 INFO TaskSetManager: Finished task 0.0 in stage 69.0 (TID 140) in 126 ms on localhost (executor driver) (1/1)
## 17/11/09 15:55:18 INFO TaskSchedulerImpl: Removed TaskSet 69.0, whose tasks have all completed, from pool 
## 17/11/09 15:55:18 INFO DAGScheduler: ResultStage 69 (count at NativeMethodAccessorImpl.java:0) finished in 0.126 s
## 17/11/09 15:55:18 INFO DAGScheduler: Job 47 finished: count at NativeMethodAccessorImpl.java:0, took 0.131380 s

Finally, we disconnect from Spark:

spark_disconnect(sc)

RStudio IDE

The latest RStudio Preview Release of the RStudio IDE includes integrated support for Spark and the sparklyr package, including tools for:

  • Creating and managing Spark connections
  • Browsing the tables and columns of Spark DataFrames
  • Previewing the first 1,000 rows of Spark DataFrames

Once you've installed the sparklyr package, you should find a new Spark pane within the IDE. This pane includes a New Connection dialog which can be used to make connections to local or remote Spark instances.

Once you've connected to Spark, you'll be able to browse the tables contained within the Spark cluster and preview Spark DataFrames using the standard RStudio data viewer.

You can also connect to Spark through Livy using the New Connection dialog.

The RStudio IDE features for sparklyr are available now as part of the RStudio Preview Release.

Using H2O

rsparkling is a CRAN package from H2O that extends sparklyr to provide an interface into Sparkling Water. For instance, the following example installs, configures and runs h2o.glm:

options(rsparkling.sparklingwater.version = "2.1.14")
 
library(rsparkling)
library(sparklyr)
library(dplyr)
library(h2o)
 
sc <- spark_connect(master = "local", version = "2.1.0")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")
 
mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl, strict_version_check = FALSE)
 
mtcars_glm <- h2o.glm(x = c("wt", "cyl"), 
                      y = "mpg",
                      training_frame = mtcars_h2o,
                      lambda_search = TRUE)
mtcars_glm
## Model Details:
## ==============
## 
## H2ORegressionModel: glm
## Model ID:  GLM_model_R_1510271749678_1 
## GLM Model: summary
##     family     link                              regularization
## 1 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.1013 )
##                                                                lambda_search
## 1 nlambda = 100, lambda.max = 10.132, lambda.min = 0.1013, lambda.1se = -1.0
##   number_of_predictors_total number_of_active_predictors
## 1                          2                           2
##   number_of_iterations                                training_frame
## 1                  100 frame_rdd_29_b907d4915799eac74fb1ea60ad594bbf
## 
## Coefficients: glm coefficients
##       names coefficients standardized_coefficients
## 1 Intercept    38.941654                 20.090625
## 2       cyl    -1.468783                 -2.623132
## 3        wt    -3.034558                 -2.969186
## 
## H2ORegressionMetrics: glm
## ** Reported on training data. **
## 
## MSE:  6.017684
## RMSE:  2.453097
## MAE:  1.940985
## RMSLE:  0.1114801
## Mean Residual Deviance :  6.017684
## R^2 :  0.8289895
## Null Deviance :1126.047
## Null D.o.F. :31
## Residual Deviance :192.5659
## Residual D.o.F. :29
## AIC :156.2425
spark_disconnect(sc)

Connecting through Livy

Livy enables remote connections to Apache Spark clusters. Connecting to Spark clusters through Livy is under experimental development in sparklyr. Please post any feedback or questions as a GitHub issue as needed.

Before connecting to Livy, you will need the connection information for an existing service running Livy. Otherwise, to test Livy in your local environment, you can install it and run it locally as follows:

livy_install()
livy_service_start()

To connect, use the Livy service address as master and method = "livy" in spark_connect. Once the connection completes, use sparklyr as usual; for instance:

sc <- spark_connect(master = "http://localhost:8998", method = "livy")
copy_to(sc, iris)
## # Source:   table<iris> [?? x 5]
## # Database: spark_connection
##    Sepal_Length Sepal_Width Petal_Length Petal_Width Species
##           <dbl>       <dbl>        <dbl>       <dbl>   <chr>
##  1          5.1         3.5          1.4         0.2  setosa
##  2          4.9         3.0          1.4         0.2  setosa
##  3          4.7         3.2          1.3         0.2  setosa
##  4          4.6         3.1          1.5         0.2  setosa
##  5          5.0         3.6          1.4         0.2  setosa
##  6          5.4         3.9          1.7         0.4  setosa
##  7          4.6         3.4          1.4         0.3  setosa
##  8          5.0         3.4          1.5         0.2  setosa
##  9          4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## # ... with more rows
spark_disconnect(sc)

Once you are done using Livy locally, you should stop this service with:

livy_service_stop()

To connect to remote Livy clusters that support basic authentication, connect as follows:

config <- livy_config_auth("<username>", "<password>")
sc <- spark_connect(master = "<address>", method = "livy", config = config)
spark_disconnect(sc)

News

Sparklyr 0.7 (UNRELEASED)

  • Added support for Spark 2.2.1.

  • Switched the copy_to serializer to a Scala implementation; this change can be reverted by setting the sparklyr.copy.serializer option to csv_file.

  • Added support for spark_web() for Livy and Databricks connections when using Spark 2.X.

  • Fixed SIGPIPE error under spark_connect() immediately after a spark_disconnect() operation.

  • spark_web() is more reliable under Spark 2.X by making use of a new API to programmatically find the right address.

  • Added support in dbWriteTable() for temporary = FALSE to allow persisting tables across connections. Changed the default value of temporary to TRUE to match the DBI specification; for compatibility, the default can be reverted to FALSE using the sparklyr.dbwritetable.temp option.

  • ncol() now returns the number of columns instead of NA, and nrow() now returns NA_real_.

  • Added support to collect VectorUDT column types with nested arrays.

  • Fixed issue in which connecting to Livy would fail due to long user names or long passwords.

  • Fixed error in the Spark connection dialog for clusters using a proxy.

  • Improved support for Spark 2.X under Cloudera clusters by prioritizing use of spark2-submit over spark-submit.

  • Livy new connection dialog now prompts for password using rstudioapi::askForPassword().

  • Added schema parameter to spark_read_parquet() that enables reading a subset of the schema to increase performance.

  • Implemented sdf_describe() to easily compute summary statistics for data frames.

  • Fixed spark_apply() so that data frames with dates are retrieved as Date values instead of doubles.

  • Added support to use invoke() with arrays of POSIXlt and POSIXct.

  • Added support for context parameter in spark_apply() to allow callers to pass additional contextual information to the f() closure.

  • Implemented a workaround to support mode = 'append' in spark_write_table().

  • Various ML improvements, including support for pipelines, additional algorithms, hyper-parameter tuning, and better model persistence.

  • Added spark_read_libsvm() for reading libsvm files.

  • Added support for separating struct columns in sdf_separate_column().

  • Fixed collection of short, float and byte to properly return NAs.

  • Added the sparklyr.collect.datechars option to enable collecting DateType and TimestampType as characters for compatibility with previous versions.

  • Fixed collection of DateType and TimestampType columns to return proper Date and POSIXct types rather than character.

Sparklyr 0.6.4

  • Added support for HTTPS for yarn-cluster which is activated by setting yarn.http.policy to HTTPS_ONLY in yarn-site.xml.

  • Added support for sparklyr.yarn.cluster.accepted.timeout under yarn-cluster to allow users to wait for resources on clusters with high waiting times.

  • Fix to spark_apply() when package distribution deadlock triggers in environments where multiple executors run under the same node.

  • Added support in spark_apply() for specifying a list of packages to distribute to each worker node.

  • Added support in yarn-cluster for sparklyr.yarn.cluster.lookup.prefix, sparklyr.yarn.cluster.lookup.username and sparklyr.yarn.cluster.lookup.byname to control the new application lookup behavior.

Sparklyr 0.6.3

  • Enabled support for Java 9 for clusters configured with Hadoop 2.8. Java 9 is blocked on 'master=local' unless 'options(sparklyr.java9 = TRUE)' is set.

  • Fixed issue in spark_connect() where using set.seed() before connecting would cause session IDs to be duplicated and connections to be reused.

  • Fixed issue in spark_connect() blocking the gateway port when the connection to the backend was never started, for instance, when interrupting the R session while connecting.

  • Performance improvement for querying field names from tables, impacting tables and dplyr queries; most noticeable in na.omit with several columns.

  • Fix to spark_apply() when closure returns a data.frame that contains no rows and has one or more columns.

  • Fix to spark_apply() when using tryCatch() within the closure, and increased the callstack printed to logs when an error triggers within the closure.

  • Added support for the SPARKLYR_LOG_FILE environment variable to specify the file used for log output.

  • Fixed regression for union_all() affecting Spark 1.6.X.

  • Added support for na.omit.cache option that when set to FALSE will prevent na.omit from caching results when rows are dropped.

  • Added support in spark_connect() for yarn-cluster with high-availability enabled.

  • Added support for spark_connect() with master="yarn-cluster" to query YARN resource manager API and retrieve the correct container host name.

  • Fixed issue in invoke() calls while using integer arrays that contain NA which can be commonly experienced while using spark_apply().

  • Added topics.description under ml_lda() result.

  • Added support for ft_stop_words_remover() to strip out stop words from tokens.

  • Feature transformers (ft_* functions) now explicitly require input.col and output.col to be specified.

  • Added support for spark_apply_log() to enable logging in worker nodes while using spark_apply().

  • Fix to spark_apply() for a SparkUncaughtExceptionHandler exception while running large jobs that may overlap during a now-unnecessary unregister operation.

  • Fixed a race condition the first time spark_apply() is run when more than one partition runs in a worker and both processes try to unpack the packages bundle at the same time.

  • spark_apply() now adds generic column names when needed and validates f is a function.

  • Improved documentation and error cases for metric argument in ml_classification_eval() and ml_binary_classification_eval().

  • Fix to spark_install() to use the /logs subfolder to store local log4j logs.

  • Fix to spark_apply() when R is used from a worker node, since the worker node already contains the packages but might still be triggering a different R session.

  • Fixed an issue where the connection would close when invoke() attempts to use a class with a method that contains a reference to an undefined class.

  • Implemented all tuning options from Spark ML for ml_random_forest(), ml_gradient_boosted_trees(), and ml_decision_tree().

  • Avoid tasks failing under spark_apply() with multiple concurrent partitions running, while selecting the backend port.

  • Added support for numeric arguments for n in lead() for dplyr.

  • Added unsupported error message to sample_n() and sample_frac() when Spark is not 2.0 or higher.

  • Fixed SIGPIPE error under spark_connect() immediately after a spark_disconnect() operation.

  • Added support for sparklyr.apply.env. under spark_config() to allow spark_apply() to initialize environment variables.

  • Added support for spark_read_text() and spark_write_text() to read from and to plain text files.

  • Added support for RStudio project templates to create an "R Package using sparklyr".

  • Fix compute() to trigger refresh of the connections view.

  • Added a k argument to ml_pca() to enable specification of number of principal components to extract. Also implemented sdf_project() to project datasets using the results of ml_pca() models.

  • Added support for additional livy session creation parameters using the livy_config() function.

Sparklyr 0.6.2

  • Fixed connection_spark_shinyapp() under RStudio 1.1 to avoid an error while listing Spark installation options for the first time.

Sparklyr 0.6.1

  • Fixed an error in spark_apply() that may be triggered when multiple CPUs are used in a single node, due to race conditions while accessing the gateway service and another in the JVMObjectTracker.

  • spark_apply() now supports explicit column types using the columns argument to avoid sampling types.

  • spark_apply() with group_by no longer requires persisting to disk nor memory.

  • Added support for Spark 1.6.3 under spark_install().

  • spark_apply() now logs the current callstack when it fails.

  • Fixed error triggered while processing empty partitions in spark_apply().

  • Fixed slow printing issue caused by print calculating the total row count, which is expensive for some tables.

  • Fixed sparklyr 0.6 issue blocking concurrent sparklyr connections, which required setting config$sparklyr.gateway.remote = FALSE as a workaround.

Sparklyr 0.6.0

Distributed R

  • Added packages parameter to spark_apply() to distribute packages across worker nodes automatically.

  • Added sparklyr.closures.rlang as a spark_config() value to support generic closures provided by the rlang package.

  • Added config options sparklyr.worker.gateway.address and sparklyr.worker.gateway.port to configure gateway used under worker nodes.

  • Added group_by parameter to spark_apply(), to support operations over groups of dataframes.

  • Added spark_apply(), allowing users to use R code to directly manipulate and transform Spark DataFrames.

External Data

  • Added spark_write_source(). This function writes data into a Spark data source which can be loaded through a Spark package.

  • Added spark_write_jdbc(). This function writes from a Spark DataFrame into a JDBC connection.

  • Added columns parameter to spark_read_*() functions to load data with named columns or explicit column types.

  • Added partition_by parameter to spark_write_csv(), spark_write_json(), spark_write_table() and spark_write_parquet().

  • Added spark_read_source(). This function reads data from a Spark data source which can be loaded through a Spark package.

  • Added support for mode = "overwrite" and mode = "append" to spark_write_csv().

  • spark_write_table() now supports saving to default Hive path.

  • Improved performance of spark_read_csv() reading remote data when infer_schema = FALSE.

  • Added spark_read_jdbc(). This function reads from a JDBC connection into a Spark DataFrame.

  • Renamed spark_load_table() and spark_save_table() into spark_read_table() and spark_write_table() for consistency with existing spark_read_*() and spark_write_*() functions.

  • Added support to specify a vector of column names in spark_read_csv() to specify column names without having to set the type of each column.

  • Improved copy_to(), sdf_copy_to() and dbWriteTable() performance under yarn-client mode.

dplyr

  • Support for cumprod() to calculate cumulative products.

  • Support for cor(), cov(), sd() and var() as window functions.

  • Support for Hive built-in operators %like%, %rlike%, and %regexp% for matching regular expressions in filter() and mutate().

  • Support for dplyr (>= 0.6) which among many improvements, increases performance in some queries by making use of a new query optimizer.

  • sample_frac() takes a fraction instead of a percent to match dplyr.

  • Improved performance of sample_n() and sample_frac() through the use of TABLESAMPLE in the generated query.

Databases

  • Added src_databases(). This function lists all the available databases.

  • Added tbl_change_db(). This function changes the current database.

DataFrames

  • Added sdf_len(), sdf_seq() and sdf_along() to help generate numeric sequences as Spark DataFrames.

  • Added spark_set_checkpoint_dir(), spark_get_checkpoint_dir(), and sdf_checkpoint() to enable checkpointing.

  • Added sdf_broadcast() which can be used to hint the query optimizer to perform a broadcast join in cases where a shuffle hash join is planned but not optimal.

  • Added sdf_repartition(), sdf_coalesce(), and sdf_num_partitions() to support repartitioning and getting the number of partitions of Spark DataFrames.

  • Added sdf_bind_rows() and sdf_bind_cols() -- these functions are the sparklyr equivalent of dplyr::bind_rows() and dplyr::bind_cols().

  • Added sdf_separate_column() -- this function allows one to separate components of an array / vector column into separate scalar-valued columns.

  • sdf_with_sequential_id() now supports from parameter to choose the starting value of the id column.

  • Added sdf_pivot(). This function provides a mechanism for constructing pivot tables, using Spark's 'groupBy' + 'pivot' functionality, with a formula interface similar to that of reshape2::dcast().

MLlib

  • Added vocabulary.only to ft_count_vectorizer() to retrieve the vocabulary with ease.

  • GLM type models now support weights.column to specify weights in model fitting. (#217)

  • ml_logistic_regression() now supports multinomial regression, in addition to binomial regression [requires Spark 2.1.0 or greater]. (#748)

  • Implemented residuals() and sdf_residuals() for Spark linear regression and GLM models. The former returns an R vector while the latter returns a tbl_spark of training data with a residuals column added.

  • Added ml_model_data(), used for extracting data associated with Spark ML models.

  • The ml_save() and ml_load() functions gain a meta argument, allowing users to specify where R-level model metadata should be saved independently of the Spark model itself. This should help facilitate the saving and loading of Spark models used in non-local connection scenarios.

  • ml_als_factorization() now supports the implicit matrix factorization and nonnegative least square options.

  • Added ft_count_vectorizer(). This function can be used to transform columns of a Spark DataFrame so that they might be used as input to ml_lda(). This should make it easier to invoke ml_lda() on Spark data sets.

Broom

  • Implemented tidy(), augment(), and glance() from tidyverse/broom for ml_model_generalized_linear_regression and ml_model_linear_regression models.

R Compatibility

  • Implemented cbind.tbl_spark(). This method works by first generating index columns using sdf_with_sequential_id() then performing inner_join(). Note that dplyr _join() functions should still be used for DataFrames with common keys since they are less expensive.

Connections

  • Increased default number of concurrent connections by setting default for spark.port.maxRetries from 16 to 128.

  • Support for gateway connections sparklyr://hostname:port/session and using spark-submit --class sparklyr.Shell sparklyr-2.1-2.11.jar <port> <id> --remote.

  • Added support for sparklyr.gateway.service and sparklyr.gateway.remote to enable/disable the gateway in service and to accept remote connections required for Yarn Cluster mode.

  • Added support for Yarn Cluster mode using master = "yarn-cluster". Either explicitly set config = list(sparklyr.gateway.address = "<driver-name>"), or sparklyr will implicitly read site-config.xml from the YARN_CONF_DIR environment variable.

  • Added spark_context_config() and hive_context_config() to retrieve runtime configurations for the Spark and Hive contexts.

  • Added sparklyr.log.console to redirect logs to the console, useful for troubleshooting spark_connect().

  • Added sparklyr.backend.args as config option to enable passing parameters to the sparklyr backend.

  • Improved logging while establishing connections to sparklyr.

  • Improved spark_connect() performance.

  • Implemented new configuration checks to proactively report connection errors in Windows.

  • While connecting to Spark from Windows, setting the sparklyr.verbose option to TRUE prints detailed configuration steps.

  • Added custom_headers to livy_config() to add custom headers to the REST call to the Livy server.

Compilation

  • Added support for jar_dep in the compilation specification to support additional jars through spark_compile().

  • spark_compile() now prints deprecation warnings.

  • Added download_scalac() to assist downloading all the Scala compilers required to build using compile_package_jars and provided support for using any scalac minor versions while looking for the right compiler.

Backend

  • Improved backend logging by adding type and session id prefix.

Miscellaneous

  • copy_to() and sdf_copy_to() auto-generate a name when an expression can't be transformed into a table name.

  • Implemented type_sum.jobj() (from tibble) to enable better printing of jobj objects embedded in data frames.

  • Added the spark_home_set() function, to help facilitate the setting of the SPARK_HOME environment variable. This should prove useful in teaching environments, when teaching the basics of Spark and sparklyr.

  • Added support for the sparklyr.ui.connections option, which adds additional connection options into the new connections dialog. The rstudio.spark.connections option is now deprecated.

  • Implemented the "New Connection Dialog" as a Shiny application to be able to support newer versions of RStudio that deprecate current connections UI.

Bug Fixes

  • When using spark_connect() in local clusters, it validates that java exists under JAVA_HOME to help troubleshoot systems that have an incorrect JAVA_HOME.

  • Improved the "argument is of length zero" error triggered while retrieving data with no columns to display.

  • Fixed the "Path does not exist" exception referencing HDFS during copy_to() on systems configured with HADOOP_HOME.

  • Fixed session crash after "No status is returned" error by terminating invalid connection and added support to print log trace during this error.

  • compute() now caches data in memory by default. To revert this behavior, set sparklyr.dplyr.compute.nocache to TRUE.

  • spark_connect() with master = "local" and a given version overrides SPARK_HOME to avoid existing installation mismatches.

  • Fixed spark_connect() under Windows issue when newInstance0 is present in the logs.

  • Fixed collecting long type columns when NAs are present (#463).

  • Fixed backend issue that affects systems where localhost does not resolve properly to the loopback address.

  • Fixed issue collecting data frames containing newlines \n.

  • Spark Null objects (objects of class NullType) discovered within numeric vectors are now collected as NAs, rather than lists of NAs.

  • Fixed warning while connecting with Livy and improved the 401 message.

  • Fixed issue in spark_read_parquet() and other read methods in which spark_normalize_path() would not work in some platforms while loading data using custom protocols like s3n:// for Amazon S3.

  • Resolved issue in spark_save() / load_table() to support saving / loading data and added path parameter in spark_load_table() for consistency with other functions.

Sparklyr 0.5.5

  • Implemented support for the connectionViewer interface required in RStudio 1.1, and for spark_connect() with method="databricks".

Sparklyr 0.5.4

  • Implemented support for dplyr 0.6 and Spark 2.1.x.

Sparklyr 0.5.3

  • Implemented support for DBI 0.6.

Sparklyr 0.5.2

  • Fix to spark_connect affecting Windows users and Spark 1.6.x.

  • Fix to Livy connections, which would cause connections to fail while the connection is in the 'waiting' state.

Sparklyr 0.5.0

  • Implemented basic authorization for Livy connections using livy_config_auth().

  • Added support to specify additional spark-submit parameters using the sparklyr.shell.args environment variable.

  • Renamed sdf_load() and sdf_save() to spark_read() and spark_write() for consistency.

  • The functions tbl_cache() and tbl_uncache() can now be used without requiring the dplyr namespace to be loaded.

  • spark_read_csv(..., columns = <...>, header = FALSE) should now work as expected -- previously, sparklyr would still attempt to normalize the column names provided.

  • Support to configure Livy using the livy. prefix in the config.yml file.

  • Implemented experimental support for Livy through: livy_install(), livy_service_start(), livy_service_stop() and spark_connect(method = "livy").

  • The ml routines now accept data as an optional argument, to support calls of the form e.g. ml_linear_regression(y ~ x, data = data). This should be especially helpful in conjunction with dplyr::do().

  • Spark DenseVector and SparseVector objects are now deserialized as R numeric vectors, rather than Spark objects. This should make it easier to work with the output produced by sdf_predict() with Random Forest models, for example.

  • Implemented dim.tbl_spark(). This should ensure that dim(), nrow() and ncol() all produce the expected result with tbl_sparks.

  • Improved Spark 2.0 installation in Windows by creating spark-defaults.conf and configuring spark.sql.warehouse.dir.

  • Embedded Apache Spark package dependencies to avoid requiring internet connectivity while connecting for the first time through spark_connect. The sparklyr.csv.embedded config setting was added to configure a regular expression to match Spark versions where the embedded package is deployed.

  • Increased exception callstack and message length to include full error details when an exception is thrown in Spark.

  • Improved validation of supported Java versions.

  • The spark_read_csv() function now accepts the infer_schema parameter, controlling whether the columns schema should be inferred from the underlying file itself. Disabling this should improve performance when the schema is known beforehand.

  • Added a do_.tbl_spark implementation, allowing for the execution of dplyr::do statements on Spark DataFrames. Currently, the computation is performed in serial across the different groups specified on the Spark DataFrame; in the future we hope to explore a parallel implementation. Note that do_ always returns a tbl_df rather than a tbl_spark, as the objects produced within a do_ query may not necessarily be Spark objects.

  • Improved errors, warnings and fallbacks for unsupported Spark versions.

  • sparklyr now defaults to tar = "internal" in its calls to untar(). This should help resolve issues some Windows users have seen related to an inability to connect to Spark, which ultimately were caused by a lack of permissions on the Spark installation.

  • Resolved an issue where copy_to() and other R => Spark data transfer functions could fail when the last column contained missing / empty values. (#265)

  • Added sdf_persist() as a wrapper to the Spark DataFrame persist() API.

  • Resolved an issue where predict() could produce results in the wrong order for large Spark DataFrames.

  • Implemented support for na.action with the various Spark ML routines. The value of getOption("na.action") is used by default. Users can customize the na.action argument through the ml.options object accepted by all ML routines.

  • On Windows, long paths, and paths containing spaces, are now supported within calls to spark_connect().

  • The lag() window function now accepts numeric values for n. Previously, only integer values were accepted. (#249)

  • Added support to configure Spark environment variables using spark.env.* config.

  • Added support for the Tokenizer and RegexTokenizer feature transformers. These are exported as the ft_tokenizer() and ft_regex_tokenizer() functions.

  • Resolved an issue where attempting to call copy_to() with an R data.frame containing many columns could fail with a Java StackOverflow. (#244)

  • Resolved an issue where attempting to call collect() on a Spark DataFrame containing many columns could produce the wrong result. (#242)

  • Added support to parameterize network timeouts using the sparklyr.backend.timeout, sparklyr.gateway.start.timeout and sparklyr.gateway.connect.timeout config settings.

  • Improved logging while establishing connections to sparklyr.

  • Added sparklyr.gateway.port and sparklyr.gateway.address as config settings.

  • The spark_log() function now accepts the filter parameter. This can be used to filter entries within the Spark log.

  • Increased network timeout for sparklyr.backend.timeout.

  • Moved spark.jars.default setting from options to Spark config.

  • sparklyr now properly respects the Hive metastore directory with the sdf_save_table() and sdf_load_table() APIs for Spark < 2.0.0.

  • Added sdf_quantile() as a means of computing (approximate) quantiles for a column of a Spark DataFrame.

  • Added support for n_distinct(...) within the dplyr interface, based on call to Hive function count(DISTINCT ...). (#220)

Sparklyr 0.4.0

  • First release to CRAN.
