R interface to Apache Spark, a fast and general engine for big data processing, see <http://spark.apache.org>. This package supports connecting to local and remote Apache Spark clusters, provides a 'dplyr' compatible back-end, and provides an interface to Spark's built-in machine learning algorithms.
You can install the sparklyr package from CRAN as follows:
install.packages("sparklyr")
You should also install a local version of Spark for development purposes:
library(sparklyr)
spark_install()
To upgrade to the latest version of sparklyr, run the following command and restart your R session:
install.packages("devtools")
devtools::install_github("rstudio/sparklyr")
If you use the RStudio IDE, you should also download the latest preview release of the IDE which includes several enhancements for interacting with Spark (see the RStudio IDE section below for more details).
You can connect to both local instances of Spark and remote Spark clusters. Here we’ll connect to a local instance of Spark via the spark_connect function:
library(sparklyr)
sc <- spark_connect(master = "local")
The returned Spark connection (sc) provides a remote dplyr data source to the Spark cluster.
For more information on connecting to remote Spark clusters see the Deployment section of the sparklyr website.
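For illustration only (a sketch with placeholder values, not taken from the Deployment documentation), a remote connection to a YARN-managed cluster might look like this:
library(sparklyr)
# hypothetical remote connection; the master and spark_home values depend on your cluster
sc <- spark_connect(master = "yarn-client",
                    spark_home = "/usr/lib/spark",   # placeholder path
                    config = spark_config())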
We can now use all of the available dplyr verbs against the tables within the cluster.
We’ll start by copying some datasets from R into the Spark cluster (note that you may need to install the nycflights13 and Lahman packages in order to execute this code):
install.packages(c("nycflights13", "Lahman"))
library(dplyr)
iris_tbl <- copy_to(sc, iris)
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
batting_tbl <- copy_to(sc, Lahman::Batting, "batting")
src_tbls(sc)
## [1] "batting" "flights" "iris"
To start with, here’s a simple filtering example:
# filter by departure delay and print the first few records
flights_tbl %>% filter(dep_delay == 2)
## # Source: spark<?> [?? x 19]
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 542 540 2 923
## 3 2013 1 1 702 700 2 1058
## 4 2013 1 1 715 713 2 911
## 5 2013 1 1 752 750 2 1025
## 6 2013 1 1 917 915 2 1206
## 7 2013 1 1 932 930 2 1219
## 8 2013 1 1 1028 1026 2 1350
## 9 2013 1 1 1042 1040 2 1325
## 10 2013 1 1 1231 1229 2 1523
## # … with more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Introduction to dplyr provides additional dplyr examples you can try. For example, consider the last example from the tutorial which plots data on flight delays:
delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect()

# plot delays
library(ggplot2)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area(max_size = 2)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
dplyr window functions are also supported, for example:
batting_tbl %>%
  select(playerID, yearID, teamID, G, AB:H) %>%
  arrange(playerID, yearID, teamID) %>%
  group_by(playerID) %>%
  filter(min_rank(desc(H)) <= 2 & H > 0)
## # Source: spark<?> [?? x 7]
## # Groups: playerID
## # Ordered by: playerID, yearID, teamID
## playerID yearID teamID G AB R H
## <chr> <int> <chr> <int> <int> <int> <int>
## 1 aaronha01 1959 ML1 154 629 116 223
## 2 aaronha01 1963 ML1 161 631 121 201
## 3 abadfe01 2012 HOU 37 7 0 1
## 4 abbated01 1905 BSN 153 610 70 170
## 5 abbated01 1904 BSN 154 579 76 148
## 6 abbeych01 1894 WAS 129 523 95 164
## 7 abbeych01 1895 WAS 132 511 102 141
## 8 abbotji01 1999 MIL 20 21 0 2
## 9 abnersh01 1992 CHA 97 208 21 58
## 10 abnersh01 1990 SDN 91 184 17 45
## # … with more rows
For additional documentation on using dplyr with Spark see the dplyr section of the sparklyr website.
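As one more illustrative query (a sketch using the flights table copied above, not taken from the tutorial), grouped aggregations work the same way as with local data frames:
# average arrival delay by destination, computed inside Spark
flights_tbl %>%
  group_by(dest) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_arr_delay))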
It’s also possible to execute SQL queries directly against tables within a Spark cluster. The spark_connection object implements a DBI interface for Spark, so you can use dbGetQuery to execute SQL and return the result as an R data frame:
library(DBI)
iris_preview <- dbGetQuery(sc, "SELECT * FROM iris LIMIT 10")
iris_preview
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
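Other DBI generics work against the same connection; for instance (a minimal sketch using the tables registered above):
dbListTables(sc)
dbGetQuery(sc, "SELECT COUNT(*) AS n FROM flights")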
You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within sparklyr. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows.
Here’s an example where we use ml_linear_regression to fit a linear regression model. We’ll use the built-in mtcars dataset, and see if we can predict a car’s fuel consumption (mpg) based on its weight (wt), and the number of cylinders the engine contains (cyl). We’ll assume in each case that the relationship between mpg and each of our features is linear.
# copy mtcars into spark
mtcars_tbl <- copy_to(sc, mtcars)

# transform our data set, and then partition into 'training', 'test'
partitions <- mtcars_tbl %>%
  filter(hp >= 100) %>%
  mutate(cyl8 = cyl == 8) %>%
  sdf_partition(training = 0.5, test = 0.5, seed = 1099)

# fit a linear model to the training dataset
fit <- partitions$training %>%
  ml_linear_regression(response = "mpg", features = c("wt", "cyl"))

fit
## Formula: mpg ~ wt + cyl
##
## Coefficients:
## (Intercept) wt cyl
## 33.499452 -2.818463 -0.923187
For linear regression models produced by Spark, we can use summary() to learn a bit more about the quality of our fit, and the statistical significance of each of our predictors.
summary(fit)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.752 -1.134 -0.499 1.296 2.282
##
## Coefficients:
## (Intercept) wt cyl
## 33.499452 -2.818463 -0.923187
##
## R-Squared: 0.8274
## Root Mean Squared Error: 1.422
Spark machine learning supports a wide array of algorithms and feature transformations and as illustrated above it’s easy to chain these functions together with dplyr pipelines. To learn more see the machine learning section.
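As a small continuation of the example above (a sketch, assuming the fit and partitions objects created earlier), predictions can be generated with ml_predict():
# score the test partition with the fitted model and inspect a few predictions
pred <- ml_predict(fit, partitions$test)
pred %>% select(mpg, prediction) %>% head(5)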
You can read and write data in CSV, JSON, and Parquet formats. Data can be stored in HDFS, S3, or on the local filesystem of cluster nodes.
temp_csv <- tempfile(fileext = ".csv")
temp_parquet <- tempfile(fileext = ".parquet")
temp_json <- tempfile(fileext = ".json")

spark_write_csv(iris_tbl, temp_csv)
iris_csv_tbl <- spark_read_csv(sc, "iris_csv", temp_csv)

spark_write_parquet(iris_tbl, temp_parquet)
iris_parquet_tbl <- spark_read_parquet(sc, "iris_parquet", temp_parquet)

spark_write_json(iris_tbl, temp_json)
iris_json_tbl <- spark_read_json(sc, "iris_json", temp_json)

src_tbls(sc)
## [1] "batting" "flights" "iris" "iris_csv"
## [5] "iris_json" "iris_parquet" "mtcars"
You can execute arbitrary R code across your cluster using spark_apply. For example, we can apply rgamma over iris as follows:
spark_apply(iris_tbl, function(data) {
  data[1:4] + rgamma(1, 2)
})
## # Source: spark<?> [?? x 4]
## Sepal_Length Sepal_Width Petal_Length Petal_Width
## <dbl> <dbl> <dbl> <dbl>
## 1 6.90 5.30 3.20 2.00
## 2 6.70 4.80 3.20 2.00
## 3 6.50 5.00 3.10 2.00
## 4 6.40 4.90 3.30 2.00
## 5 6.80 5.40 3.20 2.00
## 6 7.20 5.70 3.50 2.20
## 7 6.40 5.20 3.20 2.10
## 8 6.80 5.20 3.30 2.00
## 9 6.20 4.70 3.20 2.00
## 10 6.70 4.90 3.30 1.90
## # … with more rows
You can also group by columns to perform an operation over each group of rows and make use of any package within the closure:
spark_apply(
  iris_tbl,
  function(e) broom::tidy(lm(Petal_Width ~ Petal_Length, e)),
  columns = c("term", "estimate", "std.error", "statistic", "p.value"),
  group_by = "Species"
)
## # Source: spark<?> [?? x 6]
## Species term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 versicolor (Intercept) -0.0843 0.161 -0.525 6.02e- 1
## 2 versicolor Petal_Length 0.331 0.0375 8.83 1.27e-11
## 3 virginica (Intercept) 1.14 0.379 2.99 4.34e- 3
## 4 virginica Petal_Length 0.160 0.0680 2.36 2.25e- 2
## 5 setosa (Intercept) -0.0482 0.122 -0.396 6.94e- 1
## 6 setosa Petal_Length 0.201 0.0826 2.44 1.86e- 2
The facilities used internally by sparklyr for its dplyr and machine learning interfaces are available to extension packages. Since Spark is a general-purpose cluster computing system, there are many potential applications for extensions (e.g., interfaces to custom machine learning pipelines, interfaces to third-party Spark packages, etc.).
Here’s a simple example that wraps a Spark text file line counting function with an R function:
# write a CSV
tempfile <- tempfile(fileext = ".csv")
write.csv(nycflights13::flights, tempfile, row.names = FALSE, na = "")

# define an R interface to Spark line counting
count_lines <- function(sc, path) {
  spark_context(sc) %>%
    invoke("textFile", path, 1L) %>%
    invoke("count")
}

# call spark to count the lines of the CSV
count_lines(sc, tempfile)
## [1] 336777
To learn more about creating extensions see the Extensions section of the sparklyr website.
You can cache a table into memory with:
tbl_cache(sc, "batting")
and unload from memory using:
tbl_uncache(sc, "batting")
You can view the Spark web console using the spark_web function:
spark_web(sc)
You can show the log using the spark_log function:
spark_log(sc, n = 10)
## 19/02/22 14:13:08 INFO ContextCleaner: Cleaned shuffle 18
## 19/02/22 14:13:08 INFO ContextCleaner: Cleaned accumulator 1860
## 19/02/22 14:13:08 INFO ContextCleaner: Cleaned accumulator 1907
## 19/02/22 14:13:08 INFO ContextCleaner: Cleaned accumulator 613
## 19/02/22 14:13:08 INFO ContextCleaner: Cleaned accumulator 1626
## 19/02/22 14:13:08 INFO Executor: Finished task 0.0 in stage 70.0 (TID 94). 875 bytes result sent to driver
## 19/02/22 14:13:08 INFO TaskSetManager: Finished task 0.0 in stage 70.0 (TID 94) in 209 ms on localhost (executor driver) (1/1)
## 19/02/22 14:13:08 INFO TaskSchedulerImpl: Removed TaskSet 70.0, whose tasks have all completed, from pool
## 19/02/22 14:13:08 INFO DAGScheduler: ResultStage 70 (count at NativeMethodAccessorImpl.java:0) finished in 0.215 s
## 19/02/22 14:13:08 INFO DAGScheduler: Job 47 finished: count at NativeMethodAccessorImpl.java:0, took 0.220383 s
Finally, we disconnect from Spark:
spark_disconnect(sc)
## NULL
The latest RStudio Preview Release of the RStudio IDE includes integrated support for Spark and the sparklyr package, including tools for creating and managing Spark connections, browsing the tables in a cluster, and previewing Spark DataFrames.
Once you’ve installed the sparklyr package, you should find a new Spark pane within the IDE. This pane includes a New Connection dialog which can be used to make connections to local or remote Spark instances:
Once you’ve connected to Spark you’ll be able to browse the tables contained within the Spark cluster and preview Spark DataFrames using the standard RStudio data viewer:
You can also connect to Spark through Livy via a new connection dialog:
The RStudio IDE features for sparklyr are available now as part of the RStudio Preview Release.
rsparkling is a CRAN package from H2O that extends sparklyr to provide an interface into Sparkling Water. For instance, the following example installs, configures and runs h2o.glm:
library(rsparkling)
library(sparklyr)
library(dplyr)
library(h2o)

sc <- spark_connect(master = "local", version = "2.3.2")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")
mtcars_h2o <- as_h2o_frame(sc, mtcars_tbl, strict_version_check = FALSE)

mtcars_glm <- h2o.glm(x = c("wt", "cyl"),
                      y = "mpg",
                      training_frame = mtcars_h2o,
                      lambda_search = TRUE)
mtcars_glm
## Model Details:
## ==============
##
## H2ORegressionModel: glm
## Model ID: GLM_model_R_1527265202599_1
## GLM Model: summary
## family link regularization
## 1 gaussian identity Elastic Net (alpha = 0.5, lambda = 0.1013 )
## lambda_search
## 1 nlambda = 100, lambda.max = 10.132, lambda.min = 0.1013, lambda.1se = -1.0
## number_of_predictors_total number_of_active_predictors
## 1 2 2
## number_of_iterations training_frame
## 1 100 frame_rdd_31_ad5c4e88ec97eb8ccedae9475ad34e02
##
## Coefficients: glm coefficients
## names coefficients standardized_coefficients
## 1 Intercept 38.941654 20.090625
## 2 cyl -1.468783 -2.623132
## 3 wt -3.034558 -2.969186
##
## H2ORegressionMetrics: glm
## ** Reported on training data. **
##
## MSE: 6.017684
## RMSE: 2.453097
## MAE: 1.940985
## RMSLE: 0.1114801
## Mean Residual Deviance : 6.017684
## R^2 : 0.8289895
## Null Deviance :1126.047
## Null D.o.F. :31
## Residual Deviance :192.5659
## Residual D.o.F. :29
## AIC :156.2425
spark_disconnect(sc)
Livy enables remote connections to Apache Spark clusters. Connecting to Spark clusters through Livy is under experimental development in sparklyr. Please post any feedback or questions as a GitHub issue as needed.
Before connecting to Livy, you will need the connection information for an existing service running Livy. Otherwise, to test Livy in your local environment, you can install it and run it locally as follows:
livy_install(version = "2.4.0")
livy_service_start()
To connect, use the Livy service address as master and method = "livy" in spark_connect. Once the connection completes, use sparklyr as usual, for instance:
sc <- spark_connect(master = "http://localhost:8998", method = "livy", version = "2.4.0")
copy_to(sc, iris)
## # Source: spark<iris> [?? x 5]
## Sepal_Length Sepal_Width Petal_Length Petal_Width Species
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # … with more rows
spark_disconnect(sc)
## NULL
Once you are done using Livy locally, you should stop this service with:
livy_service_stop()
To connect to remote Livy clusters that support basic authentication, connect as:
config <- livy_config(username = "<username>", password = "<password>")
sc <- spark_connect(master = "<address>", method = "livy", config = config)
spark_disconnect(sc)
arrow package.
The dataset parameter for estimator feature transformers has been deprecated (#1891).
ml_multilayer_perceptron_classifier() gains probabilistic classifier parameters (#1798).
Removed support for all undocumented/deprecated parameters. These are mostly dot case parameters from pre-0.7.
Remove support for deprecated function(pipeline_stage, data)
signature in sdf_predict/transform/fit
functions.
Soft deprecate sdf_predict/transform/fit
functions. Users are advised to use ml_predict/transform/fit
functions instead.
Utilize the ellipsis package to provide warnings when unsupported arguments are specified in ML functions.
Support for sparklyr extensions when using Livy.
Significant performance improvements by using version
in
spark_connect()
which enables using the sparklyr JAR rather than
sources.
Improved memory use in Livy by using string builders and avoiding print backs.
Fix for DBI::sqlInterpolate()
and related methods to properly
quote parameterized queries.
copy_to()
names tables sparklyr_tmp_
instead of sparklyr_
for
consistency with other temp tables and to avoid rendering them under
the connections pane.
copy_to() and collect() are now re-exported since they are commonly used even when using DBI or outside data analysis use cases.
Support for reading path
as the second parameter in spark_read_*()
when no name is specified (e.g. spark_read_csv(sc, "data.csv")
).
Support for batches in sdf_collect()
and dplyr::collect()
to retrieve
data incrementally using a callback function provided through a
callback
parameter. Useful when retrieving larger datasets.
Support for batches in sdf_copy_to()
and dplyr::copy_to()
by passing
a list of callbacks that retrieve data frames. Useful when uploading
larger datasets.
spark_read_source()
now has a path
parameter for specifying file path.
Support for whole
parameter for spark_read_text()
to read an
entire text file without splitting contents by line.
tidy(), augment(), and glance() for ml_lda() and ml_als() models (@samuelmacedo83).
Local connection defaults now to 2GB; a configuration sketch follows below.
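For reference, a hedged sketch of overriding that local memory default through spark_config() (sparklyr.shell.driver-memory is the usual spark-submit passthrough setting, not quoted from this changelog):
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "4G"  # assumed setting name
sc <- spark_connect(master = "local", config = config)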
Support to install and connect based on major Spark versions, for instance: spark_connect(master = "local", version = "2.4").
Support for installing and connecting to Spark 2.4.
New YARN action under RStudio connection pane extension to launch YARN
UI. Configurable through the sparklyr.web.yarn
configuration setting.
Support for property expansion in yarn-site.xml
(@lgongmsft, #1876).
memory parameter in spark_apply() now defaults to FALSE when the name parameter is not specified.
Removed deprecated sdf_mutate().
Remove exported ensure_
functions which were deprecated.
Fixed missing Hive tables not rendering under some Spark distributions (#1823).
Remove dependency on broom.
Fixed re-entrancy job progress issues when running RStudio 1.2.
Tables with periods supported by setting
sparklyr.dplyr.period.splits
to FALSE
.
sdf_len()
, sdf_along()
and sdf_seq()
default to 32 bit integers
but allow support for 64 bits through bits
parameter.
Support for detecting Spark version using spark-submit
.
Improved multiple streaming documentation examples (#1801, #1805, #1806).
Fix issue while printing Spark data frames under tibble
2.0.0 (#1829).
Support for stream_write_console()
to write to console log.
Support for stream_read_socket() to read socket streams.
Fix to spark_read_kafka()
to remove unused path
.
Fix to make spark_config_kubernetes()
work with variable jar
parameters.
Support to install and use Spark 2.4.0.
Improvements and fixes to spark_config_kubernetes()
parameters.
Support for sparklyr.connect.ondisconnect
config setting to
allow cleanup of resources when using kubernetes.
spark_apply()
and spark_apply_bundle()
properly dereference
symlinks when creating package bundle (@awblocker, #1785)
Fix tableName
warning triggered while connecting.
Deprecate sdf_mutate()
(#1754).
Fix requirement to specify SPARK_HOME_VERSION
when version
parameter is set in spark_connect()
.
Cloudera autodetect Spark version improvements.
Fixed default for session
in reactiveSpark()
.
Removed stream_read_jdbc()
and stream_write_jdbc()
since they are
not yet implemented in Spark.
Support for collecting NA values from logical columns (#1729).
Proactively clean JVM objects when R object is deallocated.
Support for Spark 2.3.2.
Fix installation error with older versions of rstudioapi
(#1716).
Fix missing callstack and error case while logging in
spark_apply()
.
Proactively clean JVM objects when R object is deallocated.
tidy(), augment(), and glance() for ml_linear_svc() and ml_pca() models (@samuelmacedo83).
Support for Spark 2.3.2.
Fix installation error with older versions of rstudioapi
(#1716).
Fix missing callstack and error case while logging in
spark_apply()
.
Fix regression in sdf_collect()
failing to collect tables.
Fix new connection RStudio selectors colors when running under OS X Mojave.
Support for launching Livy logs from connection pane.
Removed overwrite
parameter in spark_read_table()
(#1698).
Fix regression preventing using R 3.2 (#1695).
Additional jar search paths under Spark 2.3.1 (#1694)
Terminate streams when Shiny app terminates.
Fix dplyr::collect()
with Spark streams and improve printing.
Fix regression in sparklyr.sanitize.column.names.verbose
setting
which would cause verbose column renames.
Fix to stream_write_kafka()
and stream_write_jdbc()
.
Support for stream_read_*()
and stream_write_*()
to read from and
to Spark structured streams.
Support for dplyr
, sdf_sql()
, spark_apply()
and scoring pipeline
in Spark streams.
Support for reactiveSpark()
to create a shiny
reactive over a Spark
stream.
Support for convenience functions stream_*()
to stop, change triggers,
print, generate test streams, etc.
Support for interrupting long running operations and recover gracefully using the same connection.
Support cancelling Spark jobs by interrupting R session.
Support for monitoring job progress within RStudio, required RStudio 1.2.
Progress reports can be turned off by setting sparklyr.progress
to FALSE
in spark_config()
.
Added config sparklyr.gateway.routing
to avoid routing to ports since
Kubernetes clusters have unique spark masters.
Change backend ports to be chosen deterministically by searching for free ports starting on sparklyr.gateway.port, which defaults to 8880. This allows users to enable port forwarding with kubectl port-forward.
Added support to set config sparklyr.events.aftersubmit
to a function
that is called after spark-submit
which can be used to automatically
configure port forwarding.
spark_submit() to assist submitting non-interactive Spark jobs.
0 being mapped to "1" and vice versa. This means that if the largest numeric label is N, Spark will fit a N+1-class classification model, regardless of how many distinct labels there are in the provided training set (#1591).
ml_logistic_regression() (@shabbybanks, #1596).
lazy val and def attributes have been converted to closures, so they are not evaluated at object instantiation (#1453).
ml_binary_classification_eval(), ml_classification_eval(), ml_multilayer_perceptron(), ml_survival_regression(), ml_als_factorization().
sdf_transform() and ml_transform() families of methods; the former should take a tbl_spark as the first argument while the latter should take a model object as the first argument.
Implemented support for DBI::db_explain() (#1623).
Fixed for timestamp
fields when using copy_to()
(#1312, @yutannihilation).
Added support to read and write ORC files using spark_read_orc()
and
spark_write_orc()
(#1548).
Fixed must share the same src
error for sdf_broadcast()
and other
functions when using Livy connections.
Added support for logging sparklyr
server events and logging sparklyr
invokes as comments in the Livy UI.
Added support to open the Livy UI from the connections viewer while using RStudio.
Improve performance in Livy for long execution queries, fixed
livy.session.command.timeout
and support for
livy.session.command.interval
to control max polling while waiting
for command response (#1538).
Fixed Livy version with MapR distributions.
Removed install
column from livy_available_versions()
.
Added name
parameter to spark_apply()
to optionally name resulting
table.
Fix to spark_apply()
to retain column types when NAs are present (#1665).
spark_apply()
now supports rlang
anonymous functions. For example,
sdf_len(sc, 3) %>% spark_apply(~.x+1)
.
Breaking Change: spark_apply() no longer defaults to the input column names when the columns parameter is not specified.
Support for reading column names from the R data frame
returned by spark_apply()
.
Fix to support retrieving empty data frames in grouped
spark_apply()
operations (#1505).
Added support for sparklyr.apply.packages
to configure default
behavior for spark_apply()
parameters (#1530).
Added support for spark.r.libpaths
to configure package library in
spark_apply()
(#1530).
Default to Spark 2.3.1 for installation and local connections (#1680).
ml_load()
no longer keeps extraneous table views which was cluttering up the RStudio Connections pane (@randomgambit, #1549).
Avoid preparing the Windows environment in non-local connections.
The ensure_* family of functions is deprecated in favor of forge, which doesn't use NSE and provides more informative error messages for debugging (#1514).
Support for sparklyr.invoke.trace
and sparklyr.invoke.trace.callstack
configuration
options to trace all invoke()
calls.
Support to invoke methods with char
types using single character strings (@lawremi, #1395).
Date types to support correct local JVM timezone to UTC ().
ft_binarizer(), ft_bucketizer(), ft_min_max_scaler, ft_max_abs_scaler(), ft_standard_scaler(), ml_kmeans(), ml_pca(), ml_bisecting_kmeans(), ml_gaussian_mixture(), ml_naive_bayes(), ml_decision_tree(), ml_random_forest(), ml_multilayer_perceptron_classifier(), ml_linear_regression(), ml_logistic_regression(), ml_gradient_boosted_trees(), ml_generalized_linear_regression(), ml_cross_validator(), ml_evaluator(), ml_clustering_evaluator(), ml_corr(), ml_chisquare_test() and sdf_pivot() (@samuelmacedo83).
tidy(), augment(), and glance() for ml_aft_survival_regression(), ml_isotonic_regression(), ml_naive_bayes(), ml_logistic_regression(), ml_decision_tree(), ml_random_forest(), ml_gradient_boosted_trees(), ml_bisecting_kmeans(), ml_kmeans() and ml_gaussian_mixture() models (@samuelmacedo83).
Deprecated configuration option sparklyr.dplyr.compute.nocache.
Added spark_config_settings()
to list all sparklyr
configuration settings and
describe them, cleaned all settings and grouped by area while maintaining support
for previous settings.
Static SQL configuration properties are now respected for Spark 2.3, and spark.sql.catalogImplementation
defaults to hive
to maintain Hive support (#1496, #415).
spark_config()
values can now also be specified as options()
.
Support for functions as values in entries to spark_config()
to enable advanced
configuration workflows.
Added support for spark_session_config()
to modify spark session settings.
Added support for sdf_debug_string()
to print execution plan for a Spark DataFrame.
Fixed DESCRIPTION file to include test packages as requested by CRAN.
Support for sparklyr.spark-submit
as config
entry to allow customizing the spark-submit
command.
Changed spark_connect()
to give precedence to the version
parameter over SPARK_HOME_VERSION
and
other automatic version detection mechanisms, improved automatic version detection in Spark 2.X.
Fixed sdf_bind_rows()
with dplyr 0.7.5
and prepend id column instead of appending it to match
behavior.
broom::tidy()
for linear regression and generalized linear regression models now give correct results (#1501).
Support for resource managers using https
in yarn-cluster
mode (#1459).
Fixed regression for connections using Livy and Spark 1.6.X.
mode with databricks.
Added ml_validation_metrics() to extract validation metrics from cross validator and train split validator models.
ml_transform()
now also takes a list of transformers, e.g. the result of ml_stages()
on a PipelineModel
(#1444).
Added collect_sub_models
parameter to ml_cross_validator()
and ml_train_validation_split()
and helper function ml_sub_models()
to allow inspecting models trained for each fold/parameter set (#1362).
Added parallelism
parameter to ml_cross_validator()
and ml_train_validation_split()
to allow tuning in parallel (#1446).
Added support for feature_subset_strategy
parameter in GBT algorithms (#1445).
Added string_order_type
to ft_string_indexer()
to allow control over how strings are indexed (#1443).
Added ft_string_indexer_model()
constructor for the string indexer transformer (#1442).
Added ml_feature_importances() for extracting feature importances from tree-based models (#1436). ml_tree_feature_importance() is maintained as an alias.
Added ml_vocabulary()
to extract vocabulary from count vectorizer model and ml_topics_matrix()
to extract matrix from LDA model.
ml_tree_feature_importance()
now works properly with decision tree classification models (#1401).
Added ml_corr()
for calculating correlation matrices and ml_chisquare_test()
for performing chi-square hypothesis testing (#1247).
ml_save()
outputs message when model is successfully saved (#1348).
ml_
routines no longer capture the calling expression (#1393).
Added support for offset
argument in ml_generalized_linear_regression()
(#1396).
Fixed regression blocking use of response-features syntax in some ml_
functions (#1302).
Added support for Huber loss for linear regression (#1335).
ft_bucketizer()
and ft_quantile_discretizer()
now support
multiple input columns (#1338, #1339).
Added ft_feature_hasher()
(#1336).
Added ml_clustering_evaluator()
(#1333).
ml_default_stop_words()
now returns English stop words by default (#1280).
Support the sdf_predict(ml_transformer, dataset)
signature with a deprecation warning. Also added a deprecation warning to the usage of sdf_predict(ml_model, dataset)
. (#1287)
Fixed regression blocking use of ml_kmeans()
in Spark 1.6.x.
invoke*()
method dispatch now supports Char
and Short
parameters. Also, Long
parameters now allow numeric arguments, but integers are supported for backwards compatibility (#1395).
invoke_static()
now supports calling Scala's package objects (#1384).
spark_connection
and spark_jobj
classes are now exported (#1374).
Added support for profile parameter in spark_apply() that collects a profile to measure performance that can be rendered using the profvis package.
Added support for spark_apply()
under Livy connections.
Fixed file not found error in spark_apply()
while working under low
disk space.
Added support for sparklyr.apply.options.rscript.before
to run a custom
command before launching the R worker role.
Added support for sparklyr.apply.options.vanilla
to be set to FALSE
to avoid using --vanilla
while launching R worker role.
Fixed serialization issues most commonly hit while using spark_apply()
with NAs (#1365, #1366).
Fixed issue with dates or date-times not roundtripping with spark_apply() (#1376).
Fixed data frame provided by spark_apply() to provide characters, not factors (#1313).
Fixed typo in sparklyr.yarn.cluster.hostaddress.timeot
(#1318).
Fixed regression blocking use of livy.session.start.timeout
parameter
in Livy connections.
Added support for Livy 0.4 and Livy 0.5.
Livy now supports Kerberos authentication.
Default to Spark 2.3.0 for installation and local connections (#1449).
yarn-cluster
now supported by connecting with master="yarn"
and
config
entry sparklyr.shell.deploy-mode
set to cluster
(#1404).
sample_frac()
and sample_n()
now work properly in nontrivial queries (#1299)
sdf_copy_to()
no longer gives a spurious warning when user enters a multiline expression for x
(#1386).
spark_available_versions()
was changed to only return available Spark versions, Hadoop versions
can be still retrieved using hadoop = TRUE
.
spark_installed_versions()
was changed to retrieve the full path to the installation folder.
cbind()
and sdf_bind_cols()
don't use NSE internally anymore and no longer output names of mismatched data frames on error (#1363).
Added support for Spark 2.2.1.
Switched copy_to
serializer to use Scala implementation, this change can be
reverted by setting the sparklyr.copy.serializer
option to csv_file
.
Added support for spark_web()
for Livy and Databricks connections when
using Spark 2.X.
Fixed SIGPIPE
error under spark_connect()
immediately after
a spark_disconnect()
operation.
spark_web()
is is more reliable under Spark 2.X by making use of a new API
to programmatically find the right address.
Added support in dbWriteTable()
for temporary = FALSE
to allow persisting
table across connections. Changed default value for temporary
to TRUE
to match
DBI
specification, for compatibility, default value can be reverted back to
FALSE
using the sparklyr.dbwritetable.temp
option.
ncol()
now returns the number of columns instead of NA
, and nrow()
now
returns NA_real_
.
Added support to collect VectorUDT
column types with nested arrays.
Fixed issue in which connecting to Livy would fail due to long user names or long passwords.
Fixed error in the Spark connection dialog for clusters using a proxy.
Improved support for Spark 2.X under Cloudera clusters by prioritizing
use of spark2-submit
over spark-submit
.
Livy new connection dialog now prompts for password using
rstudioapi::askForPassword()
.
Added schema
parameter to spark_read_parquet()
that enables reading
a subset of the schema to increase performance.
Implemented sdf_describe()
to easily compute summary statistics for
data frames.
Fixed data frames with dates in spark_apply()
retrieved as Date
instead
of doubles.
Added support to use invoke()
with arrays of POSIXlt and POSIXct.
Added support for context
parameter in spark_apply()
to allow callers to
pass additional contextual information to the f()
closure.
Implemented workaround to support in spark_write_table()
for
mode = 'append'
.
Various ML improvements, including support for pipelines, additional algorithms, hyper-parameter tuning, and better model persistence.
Added spark_read_libsvm()
for reading libsvm files.
Added support for separating struct columns in sdf_separate_column()
.
Fixed collection of short
, float
and byte
to properly return NAs.
Added sparklyr.collect.datechars option to enable collecting DateType and TimestampTime as characters to support compatibility with previous versions.
Fixed collection of DateType
and TimestampTime
from character
to
proper Date
and POSIXct
types.
Added support for HTTPS for yarn-cluster
which is activated by setting
yarn.http.policy
to HTTPS_ONLY
in yarn-site.xml
.
Added support for sparklyr.yarn.cluster.accepted.timeout
under yarn-cluster
to allow users to wait for resources under cluster with high waiting times.
Fix to spark_apply()
when package distribution deadlock triggers in
environments where multiple executors run under the same node.
Added support in spark_apply()
for specifying a list of packages
to
distribute to each worker node.
Added support in yarn-cluster for sparklyr.yarn.cluster.lookup.prefix, sparklyr.yarn.cluster.lookup.username and sparklyr.yarn.cluster.lookup.byname to control the new application lookup behavior.
Enabled support for Java 9 for clusters configured with Hadoop 2.8. Java 9 blocked on 'master=local' unless 'options(sparklyr.java9 = TRUE)' is set.
Fixed issue in spark_connect()
where using set.seed()
before connection would cause session ids to be duplicates
and connections to be reused.
Fixed issue in spark_connect() blocking the gateway port when the connection to the backend was never started, for instance, when interrupting the R session while connecting.
Performance improvement for querying field names from tables, impacting tables and dplyr queries, most noticeable in na.omit with several columns.
Fix to spark_apply()
when closure returns a data.frame
that contains no rows and has one or more columns.
Fix to spark_apply()
while using tryCatch()
within
closure and increased callstack printed to logs when
error triggers within closure.
Added support for the SPARKLYR_LOG_FILE
environment
variable to specify the file used for log output.
Fixed regression for union_all()
affecting Spark 1.6.X.
Added support for na.omit.cache
option that when set to
FALSE
will prevent na.omit
from caching results when
rows are dropped.
Added support in spark_connect() for yarn-cluster with high-availability enabled.
Added support for spark_connect()
with master="yarn-cluster"
to query YARN resource manager API and retrieve the correct
container host name.
Fixed issue in invoke()
calls while using integer arrays
that contain NA
which can be commonly experienced
while using spark_apply()
.
Added topics.description
under ml_lda()
result.
Added support for ft_stop_words_remover()
to strip out
stop words from tokens.
Feature transformers (ft_*
functions) now explicitly
require input.col
and output.col
to be specified.
Added support for spark_apply_log()
to enable logging in
worker nodes while using spark_apply()
.
Fix to spark_apply() for SparkUncaughtExceptionHandler exception while running over large jobs that may overlap during a now-unnecessary unregister operation.
Fix race condition the first time spark_apply() is run when more than one partition runs in a worker and both processes try to unpack the packages bundle at the same time.
spark_apply()
now adds generic column names when needed and
validates f
is a function
.
Improved documentation and error cases for metric
argument in
ml_classification_eval()
and ml_binary_classification_eval()
.
Fix to spark_install()
to use the /logs
subfolder to store local
log4j
logs.
Fix to spark_apply()
when R is used from a worker node since worker
node already contains packages but still might be triggering different
R session.
Fix connection from closing when invoke()
attempts to use a class
with a method that contains a reference to an undefined class.
Implemented all tuning options from Spark ML for ml_random_forest()
,
ml_gradient_boosted_trees()
, and ml_decision_tree()
.
Avoid tasks failing under spark_apply()
and multiple concurrent
partitions running while selecting backend port.
Added support for numeric arguments for n
in lead()
for dplyr.
Added unsupported error message to sample_n()
and sample_frac()
when Spark is not 2.0 or higher.
Fixed SIGPIPE
error under spark_connect()
immediately after
a spark_disconnect()
operation.
Added support for sparklyr.apply.env. under spark_config() to allow spark_apply() to initialize environment variables.
Added support for spark_read_text()
and spark_write_text()
to
read from and to plain text files.
Added support for RStudio project templates to create an "R Package using sparklyr".
Fix compute()
to trigger refresh of the connections view.
Added a k
argument to ml_pca()
to enable specification of number of
principal components to extract. Also implemented sdf_project()
to project
datasets using the results of ml_pca()
models.
Added support for additional livy session creation parameters using
the livy_config()
function.
Fixed error in spark_apply() that may be triggered when multiple CPUs are used in a single node due to race conditions while accessing the gateway service, and another in the JVMObjectTracker.
spark_apply()
now supports explicit column types using the columns
argument to avoid sampling types.
spark_apply()
with group_by
no longer requires persisting to disk
nor memory.
Added support for Spark 1.6.3 under spark_install()
.
Added support for Spark 1.6.3 under spark_install().
spark_apply()
now logs the current callstack when it fails.
Fixed error triggered while processing empty partitions in spark_apply()
.
Fixed slow printing issue caused by print
calculating the total row count,
which is expensive for some tables.
Fixed sparklyr 0.6 issue blocking concurrent sparklyr connections, which required setting config$sparklyr.gateway.remote = FALSE as a workaround.
Added packages
parameter to spark_apply()
to distribute packages
across worker nodes automatically.
Added sparklyr.closures.rlang
as a spark_config()
value to support
generic closures provided by the rlang
package.
Added config options sparklyr.worker.gateway.address
and
sparklyr.worker.gateway.port
to configure gateway used under
worker nodes.
Added group_by
parameter to spark_apply()
, to support operations
over groups of dataframes.
Added spark_apply()
, allowing users to use R code to directly
manipulate and transform Spark DataFrames.
Added spark_write_source()
. This function writes data into a
Spark data source which can be loaded through an Spark package.
Added spark_write_jdbc()
. This function writes from a Spark DataFrame
into a JDBC connection.
Added columns
parameter to spark_read_*()
functions to load data with
named columns or explicit column types.
Added partition_by
parameter to spark_write_csv()
, spark_write_json()
,
spark_write_table()
and spark_write_parquet()
.
Added spark_read_source()
. This function reads data from a
Spark data source which can be loaded through an Spark package.
Added support for mode = "overwrite"
and mode = "append"
to
spark_write_csv()
.
spark_write_table()
now supports saving to default Hive path.
Improved performance of spark_read_csv()
reading remote data when
infer_schema = FALSE
.
Added spark_read_jdbc()
. This function reads from a JDBC connection
into a Spark DataFrame.
Renamed spark_load_table()
and spark_save_table()
into spark_read_table()
and spark_write_table()
for consistency with existing spark_read_*()
and
spark_write_*()
functions.
Added support to specify a vector of column names in spark_read_csv()
to
specify column names without having to set the type of each column.
Improved copy_to()
, sdf_copy_to()
and dbWriteTable()
performance under
yarn-client
mode.
Support for cumprod()
to calculate cumulative products.
Support for cor()
, cov()
, sd()
and var()
as window functions.
Support for Hive built-in operators %like%, %rlike%, and %regexp% for matching regular expressions in filter() and mutate(), as sketched below.
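A brief sketch of these operators (assuming an iris table as created earlier in this document):
iris_tbl %>% filter(Species %rlike% "^v")    # regular expression match
iris_tbl %>% filter(Species %like% "set%")   # SQL LIKE pattern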
Support for dplyr (>= 0.6) which among many improvements, increases performance in some queries by making use of a new query optimizer.
sample_frac()
takes a fraction instead of a percent to match dplyr.
Improved performance of sample_n()
and sample_frac()
through the use of
TABLESAMPLE
in the generated query.
Added src_databases()
. This function list all the available databases.
Added tbl_change_db()
. This function changes current database.
Added sdf_len()
, sdf_seq()
and sdf_along()
to help generate numeric
sequences as Spark DataFrames.
Added spark_set_checkpoint_dir()
, spark_get_checkpoint_dir()
, and
sdf_checkpoint()
to enable checkpointing.
Added sdf_broadcast()
which can be used to hint the query
optimizer to perform a broadcast join in cases where a shuffle
hash join is planned but not optimal.
Added sdf_repartition()
, sdf_coalesce()
, and sdf_num_partitions()
to support repartitioning and getting the number of partitions of Spark
DataFrames.
Added sdf_bind_rows()
and sdf_bind_cols()
-- these functions
are the sparklyr
equivalent of dplyr::bind_rows()
and
dplyr::bind_cols()
.
Added sdf_separate_column()
-- this function allows one to separate
components of an array / vector column into separate scalar-valued
columns.
sdf_with_sequential_id()
now supports from
parameter to choose the
starting value of the id column.
Added sdf_pivot()
. This function provides a mechanism for constructing
pivot tables, using Spark's 'groupBy' + 'pivot' functionality, with a
formula interface similar to that of reshape2::dcast()
.
Added vocabulary.only
to ft_count_vectorizer()
to retrieve the
vocabulary with ease.
GLM type models now support weights.column
to specify weights in model
fitting. (#217)
ml_logistic_regression()
now supports multinomial regression, in
addition to binomial regression [requires Spark 2.1.0 or greater]. (#748)
Implemented residuals()
and sdf_residuals()
for Spark linear
regression and GLM models. The former returns a R vector while
the latter returns a tbl_spark
of training data with a residuals
column added.
Added ml_model_data()
, used for extracting data associated with
Spark ML models.
The ml_save()
and ml_load()
functions gain a meta
argument, allowing
users to specify where R-level model metadata should be saved independently
of the Spark model itself. This should help facilitate the saving and loading
of Spark models used in non-local connection scenarios.
ml_als_factorization()
now supports the implicit matrix factorization
and nonnegative least square options.
Added ft_count_vectorizer()
. This function can be used to transform
columns of a Spark DataFrame so that they might be used as input to ml_lda()
.
This should make it easier to invoke ml_lda()
on Spark data sets.
tidy(), augment(), and glance() from tidyverse/broom for ml_model_generalized_linear_regression and ml_model_linear_regression models.
cbind.tbl_spark(). This method works by first generating index columns using sdf_with_sequential_id() then performing inner_join(). Note that dplyr _join() functions should still be used for DataFrames with common keys since they are less expensive.
Increased default number of concurrent connections by setting default for spark.port.maxRetries from 16 to 128.
Support for gateway connections sparklyr://hostname:port/session
and using
spark-submit --class sparklyr.Shell sparklyr-2.1-2.11.jar <port> <id> --remote
.
Added support for sparklyr.gateway.service
and sparklyr.gateway.remote
to
enable/disable the gateway in service and to accept remote connections required
for Yarn Cluster mode.
Added support for Yarn Cluster mode using master = "yarn-cluster"
. Either,
explicitly set config = list(sparklyr.gateway.address = "<driver-name>")
or
implicitly sparklyr
will read the site-config.xml
for the YARN_CONF_DIR
environment variable.
Added spark_context_config()
and hive_context_config()
to retrieve
runtime configurations for the Spark and Hive contexts.
Added sparklyr.log.console
to redirect logs to console, useful
to troubleshooting spark_connect
.
Added sparklyr.backend.args
as config option to enable passing
parameters to the sparklyr
backend.
Improved logging while establishing connections to sparklyr
.
Improved spark_connect()
performance.
Implemented new configuration checks to proactively report connection errors in Windows.
While connecting to Spark from Windows, setting the sparklyr.verbose option to TRUE prints detailed configuration steps.
Added custom_headers
to livy_config()
to add custom headers to the REST call
to the Livy server
Added support for jar_dep
in the compilation specification to
support additional jars
through spark_compile()
.
spark_compile()
now prints deprecation warnings.
Added download_scalac()
to assist downloading all the Scala compilers
required to build using compile_package_jars
and provided support for
using any scalac
minor versions while looking for the right compiler.
copy_to()
and sdf_copy_to()
auto generate a name
when an expression
can't be transformed into a table name.
Implemented type_sum.jobj()
(from tibble) to enable better printing of jobj
objects embedded in data frames.
Added the spark_home_set()
function, to help facilitate the setting of the
SPARK_HOME
environment variable. This should prove useful in teaching
environments, when teaching the basics of Spark and sparklyr.
Added support for the sparklyr.ui.connections
option, which adds additional
connection options into the new connections dialog. The
rstudio.spark.connections
option is now deprecated.
Implemented the "New Connection Dialog" as a Shiny application to be able to support newer versions of RStudio that deprecate current connections UI.
When using spark_connect()
in local clusters, it validates that java
exists
under JAVA_HOME
to help troubleshoot systems that have an incorrect JAVA_HOME
.
Improved argument is of length zero
error triggered while retrieving data
with no columns to display.
Fixed Path does not exist
referencing hdfs
exception during copy_to
under
systems configured with HADOOP_HOME
.
Fixed session crash after "No status is returned" error by terminating invalid connection and added support to print log trace during this error.
compute() now caches data in memory by default. To revert this behavior, use sparklyr.dplyr.compute.nocache set to TRUE.
spark_connect()
with master = "local"
and a given version
overrides
SPARK_HOME
to avoid existing installation mismatches.
Fixed spark_connect()
under Windows issue when newInstance0
is present in
the logs.
Fixed collecting long
type columns when NAs are present (#463).
Fixed backend issue that affects systems where localhost
does
not resolve properly to the loopback address.
Fixed issue collecting data frames containing newlines \n
.
Spark Null objects (objects of class NullType) discovered within numeric vectors are now collected as NAs, rather than lists of NAs.
Fixed warning while connecting with livy and improved 401 message.
Fixed issue in spark_read_parquet()
and other read methods in which
spark_normalize_path()
would not work in some platforms while loading
data using custom protocols like s3n://
for Amazon S3.
Resolved issue in spark_save()
/ load_table()
to support saving / loading
data and added path parameter in spark_load_table()
for consistency with
other functions.
connectionViewer interface required in RStudio 1.1 and spark_connect with mode="databricks".
dplyr 0.6 and Spark 2.1.x.
DBI 0.6.
Fix to spark_connect affecting Windows users and Spark 1.6.x.
Fix to Livy connections which would cause connections to fail while connection is on 'waiting' state.
Implemented basic authorization for Livy connections using
livy_config_auth()
.
Added support to specify additional spark-submit
parameters using the
sparklyr.shell.args
environment variable.
Renamed sdf_load()
and sdf_save()
to spark_read()
and spark_write()
for consistency.
The functions tbl_cache()
and tbl_uncache()
can now be using without
requiring the dplyr
namespace to be loaded.
spark_read_csv(..., columns = <...>, header = FALSE)
should now work as
expected -- previously, sparklyr
would still attempt to normalize the
column names provided.
Support to configure Livy using the livy.
prefix in the config.yml
file.
Implemented experimental support for Livy through: livy_install()
,
livy_service_start()
, livy_service_stop()
and
spark_connect(method = "livy")
.
The ml
routines now accept data
as an optional argument, to support
calls of the form e.g. ml_linear_regression(y ~ x, data = data)
. This
should be especially helpful in conjunction with dplyr::do()
.
Spark DenseVector
and SparseVector
objects are now deserialized as
R numeric vectors, rather than Spark objects. This should make it easier
to work with the output produced by sdf_predict()
with Random Forest
models, for example.
Implemented dim.tbl_spark()
. This should ensure that dim()
, nrow()
and ncol()
all produce the expected result with tbl_spark
s.
Improved Spark 2.0 installation in Windows by creating spark-defaults.conf
and configuring spark.sql.warehouse.dir
.
Embedded Apache Spark package dependencies to avoid requiring internet connectivity while connecting for the first time through spark_connect. The sparklyr.csv.embedded config setting was added to configure a regular expression to match Spark versions where the embedded package is deployed.
Increased exception callstack and message length to include full error details when an exception is thrown in Spark.
Improved validation of supported Java versions.
The spark_read_csv()
function now accepts the infer_schema
parameter,
controlling whether the columns schema should be inferred from the underlying
file itself. Disabling this should improve performance when the schema is
known beforehand.
Added a do_.tbl_spark
implementation, allowing for the execution of
dplyr::do
statements on Spark DataFrames. Currently, the computation is
performed in serial across the different groups specified on the Spark
DataFrame; in the future we hope to explore a parallel implementation.
Note that do_
always returns a tbl_df
rather than a tbl_spark
, as
the objects produced within a do_
query may not necessarily be Spark
objects.
Improved errors, warnings and fallbacks for unsupported Spark versions.
sparklyr
now defaults to tar = "internal"
in its calls to untar()
.
This should help resolve issues some Windows users have seen related to
an inability to connect to Spark, which ultimately were caused by a lack
of permissions on the Spark installation.
Resolved an issue where copy_to()
and other R => Spark data transfer
functions could fail when the last column contained missing / empty values.
(#265)
Added sdf_persist()
as a wrapper to the Spark DataFrame persist()
API.
Resolved an issue where predict()
could produce results in the wrong
order for large Spark DataFrames.
Implemented support for na.action
with the various Spark ML routines. The
value of getOption("na.action")
is used by default. Users can customize the
na.action
argument through the ml.options
object accepted by all ML
routines.
On Windows, long paths, and paths containing spaces, are now supported within
calls to spark_connect()
.
The lag()
window function now accepts numeric values for n
. Previously,
only integer values were accepted. (#249)
Added support to configure Spark environment variables using spark.env.* config.
Added support for the Tokenizer
and RegexTokenizer
feature transformers.
These are exported as the ft_tokenizer()
and ft_regex_tokenizer()
functions.
Resolved an issue where attempting to call copy_to()
with an R data.frame
containing many columns could fail with a Java StackOverflow. (#244)
Resolved an issue where attempting to call collect()
on a Spark DataFrame
containing many columns could produce the wrong result. (#242)
Added support to parameterize network timeouts using the
sparklyr.backend.timeout
, sparklyr.gateway.start.timeout
and
sparklyr.gateway.connect.timeout
config settings.
Improved logging while establishing connections to sparklyr
.
Added sparklyr.gateway.port
and sparklyr.gateway.address
as config settings.
The spark_log()
function now accepts the filter
parameter. This can be used
to filter entries within the Spark log.
Increased network timeout for sparklyr.backend.timeout
.
Moved spark.jars.default
setting from options to Spark config.
sparklyr
now properly respects the Hive metastore directory with the
sdf_save_table()
and sdf_load_table()
APIs for Spark < 2.0.0.
Added sdf_quantile()
as a means of computing (approximate) quantiles
for a column of a Spark DataFrame.
Added support for n_distinct(...)
within the dplyr
interface, based on
call to Hive function count(DISTINCT ...)
. (#220)