Implements the 'rquery' piped Codd-style query algebra using 'data.table'. This allows for a high-speed in-memory implementation of Codd-style data manipulation tools.
rqdatatable is an implementation of the rquery piped Codd-style relational algebra hosted on data.table.
rquery allows the expression of complex transformations as a series of relational operators, and rqdatatable implements the operators using data.table.
For example, scoring a logistic regression model (which requires grouping, ordering, and ranking) is organized as follows. For more on this example please see "Let’s Have Some Sympathy For The Part-time R User".
    # attach rqdatatable (this also loads rquery)
    library("rqdatatable")

    # data example
    dL <- build_frame(
       "subjectID", "surveyCategory"     , "assessmentTotal" |
       1          , "withdrawal behavior", 5                 |
       1          , "positive re-framing", 2                 |
       2          , "withdrawal behavior", 3                 |
       2          , "positive re-framing", 4                 )

    scale <- 0.237

    # example rquery pipeline
    rquery_pipeline <- local_td(dL) %.>%
      extend_nse(.,
                 probability :=
                   exp(assessmentTotal * scale)) %.>%
      normalize_cols(.,
                     "probability",
                     partitionby = 'subjectID') %.>%
      pick_top_k(.,
                 k = 1,
                 partitionby = 'subjectID',
                 orderby = c('probability', 'surveyCategory'),
                 reverse = c('probability', 'surveyCategory')) %.>%
      rename_columns(., c('diagnosis' = 'surveyCategory')) %.>%
      select_columns(., c('subjectID',
                          'diagnosis',
                          'probability')) %.>%
      orderby(., cols = 'subjectID')
We can show the expanded form of the query tree.
    table(dL;
      subjectID,
      surveyCategory,
      assessmentTotal) %.>%
     extend(.,
      probability := exp(assessmentTotal * 0.237)) %.>%
     extend(.,
      probability := probability / sum(probability),
      p= subjectID) %.>%
     extend(.,
      row_number := row_number(),
      p= subjectID,
      o= "probability" DESC, "surveyCategory" DESC) %.>%
     select_rows(.,
      row_number <= 1) %.>%
     rename(.,
      c('diagnosis' = 'surveyCategory')) %.>%
     select_columns(.,
      subjectID, diagnosis, probability) %.>%
     orderby(., subjectID)
And execute it using data.table.
    ##    subjectID           diagnosis probability
    ## 1:         1 withdrawal behavior   0.6706221
    ## 2:         2 positive re-framing   0.5589742
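The inspection and execution steps can be sketched in a self-contained form as follows. This is a minimal sketch, not the original example: the table `d` and the shortened pipeline `ops` are hypothetical stand-ins, and it assumes `rqdatatable` is installed and that the `extend_nse()` operator name used in this document is available in the installed `rquery` version. `format()` is used to print the operator tree and `ex_data_table()` to run it, with the source table passed explicitly through the `tables` argument.

```r
library("rqdatatable")  # also loads rquery

# hypothetical stand-in for the survey data (same column names as the example)
d <- data.frame(
  subjectID = c(1, 1, 2, 2),
  surveyCategory = c("withdrawal behavior", "positive re-framing",
                     "withdrawal behavior", "positive re-framing"),
  assessmentTotal = c(5, 2, 3, 4),
  stringsAsFactors = FALSE)

# shortened pipeline: add one derived column, then sort by subject
ops <- local_td(d) %.>%
  extend_nse(., probability := exp(assessmentTotal * 0.237)) %.>%
  orderby(., cols = 'subjectID')

# show the expanded form of the operator tree
cat(format(ops))

# execute the pipeline with data.table, naming the source table explicitly
res <- ex_data_table(ops, tables = list(d = d))
print(res)
```

Passing `tables = list(d = d)` makes the data source explicit; the pipeline itself only records the table name and its columns, so the same `ops` can be run against any table with matching columns.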
One can also apply the pipeline to new tables.
    build_frame(
       "subjectID", "surveyCategory"     , "assessmentTotal" |
       7          , "withdrawal behavior", 5                 |
       7          , "positive re-framing", 20                ) %.>%
      rquery_pipeline
    ##    subjectID           diagnosis probability
    ## 1:         7 positive re-framing   0.9722128
Initial benchmarking of rqdatatable is very favorable (notes here).
rqdatatable has an "immediate mode" which allows direct application of pipeline stages without pre-assembling the pipeline. "Immediate mode" is a convenience for ad-hoc analyses, and has some negative performance impact, so we encourage users to build pipelines for most work. Some notes on the issue can be found here.
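The difference between the two modes can be illustrated with a small sketch. The data frame `d` and the column `z` are hypothetical, and the example assumes that attaching `rqdatatable` lets `rquery` operators such as `extend_nse()` be applied directly to a `data.frame`, which is how immediate mode is described above.

```r
library("rqdatatable")  # also loads rquery

# hypothetical example data
d <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# immediate mode: the operator is applied directly to the data
res_immediate <- d %.>%
  extend_nse(., z := x + y)

# pipeline mode: build the operator tree first, then execute it
ops <- local_td(d) %.>%
  extend_nse(., z := x + y)
res_pipeline <- ex_data_table(ops, tables = list(d = d))

print(res_immediate)
print(res_pipeline)
```

Both forms should compute the same `z` column. The pre-built `ops` object can be reused across many tables, which is one reason to prefer pipelines outside of ad-hoc work.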
rqdatatable is a fairly complete implementation of rquery. The main difference is that some rqdatatable operators, including theta_join(), are implemented by round-tripping through a database handle specified by the rquery.rquery_db_executor option (so they are not very desirable implementations).
rqdatatable please use