Big Data in-Database Analytics with Teradata Aster

A consistent set of tools to perform in-database analytics on Teradata Aster Big Data Discovery Platform. toaster (a.k.a 'to Aster') embraces simple 2-step approach: compute in Aster - visualize and analyze in R. Its `compute` functions use combination of parallel SQL, SQL-MR and SQL-GR executing in Aster database - highly scalable parallel and distributed analytical platform. Then `create` functions visualize results with boxplots, scatterplots, histograms, heatmaps, word clouds, maps, networks, or slope graphs. Advanced options such as faceting, coloring, labeling, and others are supported with most visualizations.


toaster (to Aster) is a set of tools for computing and analyzing data with Teradata Aster Big Data database. It brings the power of Teradata Aster's distributed SQL, MapReduce (SQL-MR), and Graph Engine (SQL-GR) to R on desktop and complements analysis of results with a convenient set of plotting functions.

toaster acheives most tasks in 2 distinct steps:

  • Compute in Aster using Aster's rich, fully scalable set of analyical functions, transparently running in distributed and parallel environement.

  • Deliver and visualize results in R for further exploration and analysis.

toaster performs all big data, processing intensive computations in Aster, making results and visualizations available in R. Summary statistics, aggregates, histograms, heatmaps, and coefficients from linear regression models are among results available in R after processing in Aster. Most results have toaster visualization functions to aid further analysis.

You can install:

  • the latest released version from CRAN with

    install.packages("toaster")
  • the latest development version from github with

    devtools::install_github("toaster", "teradata-aster-field")
  • evaluation version of Aster analytic platform - Aster Express - to run on your PC here and get started with this Tutorial Series.

If you encounter a clear bug, please file a minimal reproducible example on github.

Attribution:

News

NEW FEATURES

  • Both explicit and implicit support for kmeans functions in AAF 6.21. Package will recognize versions based on the function's output or using new argument version. since new output now includes more kmeans statistics computeKmeans will run faster with newer version of AAF (#56).

  • Kmeans clustering can now persist clustered data for both optimized performance and convinience using new argument persist=TRUE (#56).

  • Kmeans clustering now supports initial centers obtained with canopy clustering. Use new functinality computeCanopy to quickly seed initial centroids and run kmeans with canopy object (#61).

MINOR FEATURES

  • Compute functions computeAggregates, computeBarchart, computeSample now allow passing any parameter to sqlQuery via ... syntax to better control performance and data type conversion (#60).

  • computeClusterSample now includes id by default (#60).

  • createClusterPairsPlot added argument include and except to selectively control features to plot (#63).

BUG FIXES

  • createBoxplot with coordFlip=TRUE now labels axises correctly (#58).

  • Updated createClusterPairsPlot to work with the latest version of GGally (#62).

NEW FEATURES

  • Graph function computeGraphClusters performs various types of graph decomposition including connected components and
    modularity (future release) (#33).

  • Graph function computeGraphClustersAsGraphs uses community object produced by computeGraphClusters to create graphs corresponding to its components (#33).

  • Function validateGraph validates and tests for consistency graph tables in Aster (#34).

  • Function computeSample now supports sample stratum based on table column or custom stratum condition (#28).

  • Function getTableCounts is handy when reviewing database tables first time: it reports the number of rows and columns in each table (#50).

  • Function computeCorrelations now supports group columns in Aster (with argument by) (#49).

MINOR FEATURES

  • Graph vertices table is now always read when constructing corresponding network object (#46).

  • Minimal graph must have single vertex and no edges (#42).

  • All plotting functions now support subtitle (#41).

  • Deprecated function theme_empty - use ggplot2::theme_void instead (#53).

BUG FIXES

  • getTableSummary fails when table has one or more temporal column types (#37)

  • getTablesummary fails when numerical data contains 'NaN' or other special values (#39, #51)

  • computeHeatmap argument by now supports multiple columns to group results (#44).

  • Temporary table names allow underscore characters now (#40).

NEW FEATURES

  • New graph functionality expands toaster's reach to Aster graph engine and SQL/GR functions (#32).

    Simply define a metadata 'toagraph' object that describes a graph data in Aster (one object per graph) to compute various graph metrics and distributions:

    • computeGraphHistogram returns degree, clustering, shortest path, and various centrality measure distributions;
    • computeGraphMetric returns top graph vertices for given graph metric including degree, local clustering, and various centrality measures.

    New functions computeGraph, 'showGraph, andcomputeEgoGraph` query Aster graph data to filter, inspect and visualize whole or sub-graphs, including ego graphs (neighborhood graphs).

MINOR FEATURES

  • getTableSummary can skip computing percentiles by setting percentiles value to logical \code{FALSE} (#31).

  • getNullCounts() new argument support percent and data frame with no factors (#29).

  • isTable now supports schema in table name, queries (by returning NA), and expanded result format (#30).

MINOR FEATURES

  • createMap support for:

    • locationName is a vector of the column names containing address, name, etc. suitable to geocode (find latitude and longitude). The columns are used in order of appearance: geocoding tries 1st column's values first, then, for the data points that didn't get resolved, it tries the 2d column's values, and so on.
    • shape's fill using 2d metric with argument metrics. This deprecates parameter metricName (#25);
    • transparency control argument shapeAlpha for the shapes on the map (#24);
    • new shape and stroke arguments shape and shapeStroke to manage appearance of artifacts on the map.
  • all plotting functions gain new guide parameter(s) that control appearance of fill, size and other legend(s) using ggplot2 guide object name or object itself.

BUG FIXES

  • Kmeans cluster and total within sum of squares calculations now work when scale=FALSE. (fixes #23)

  • Kmeans SQL is correct now when id is exactly one of the table columns and default aliasId. (fixes #22)

  • showData now uses same default theme_tufte as the rest of plotting functions (missed it 0.4.1).

  • fixed histograms in showData after upgrading to ggplot2 2.0.0 (missed it 0.4.1).

NEW FEATURES

  • K-means clustering function performs in-database data prep, scaling, clustering, computing standard k-means measures and aggregated metrics on produced clusters. Other functions include the silhouette method of evaluating cluster consistency and validity and variety of visualizations options. Result of k-means function is compatible with stats 'kmeans' object.

  • Utility getNullCounts function returns NULL counts per column in the table.

MINOR FEATURES

  • Upgraded plotting to ggplot2 2.0.0 and fixed test faiures attributed to this upgrade.
  • Changed default theme settings to theme_tufte from ggthemes package for all plotting (create) functions.
  • Updated some examples in the documentation.
  • Added validation of RODBC connection (except checking for live connection to a database).

NEW FEATURES

  • New text analysis functions computeTf and computeTfIdf process corpora in Aster and produce results compatible with package tm, in particular term document matrix.

  • Both computeTf and computeTfIdf rank terms to return top ranked ones. Ranking and number of terms to return are provided by parameters top and rankFunction. Unlimited (all terms) are returned by default with top = NULL.

  • S3 classes nGram and token provide pluggable parsers to extract text tokens to use in the functions 'computeTf' and 'computeTfIdf'.

  • Text functions support stop words in both Aster (installed stopwords file) and R (post-processing of results).

  • Linear regression now is compatible with R standard lm functions returning object of both classes c('toalm', 'lm'). This means methods summary, coefficients, etc. work with the object returned by computeLm. This change is not backward compatible: to obtain result returned in 0.2.5 list contains element old.result.

  • To compute results similar to lm computeLm uses sample (default 1000 rows) to calculate stats like residuals, R-square, etc. in Aster. As before, linear regression coefficients are calculated on full data set with SQL/MR linreg function.

  • getTableSummary is enabled for parallel execution. Simply create and register parallel cluster of your choice with doParallel package and set parameter parallel=TRUE. Performance gains may be up to 50% or better depending on size of the table, number of parallel processes, and number of columns. Run demo("baseball-parallel") for examples.

  • computePercentiles is enabled for parallel execution. Simply create and register parallel cluster of your choice with doParallel package and set parameter parallel=TRUE. Performance gains may be up to 50% or better depending on size of the table, number of parallel processes, and number of columns. Run demo("baseball-parallel") for examples.

  • Added support of temporal Aster data types in getTableSummary and computePercentiles. Temporal types are date, time, timestamp, and interval. in computePercentiles set parameter temporal=TRUE to calculate temporal columns and run it separately from numerical ones.

MINOR FEATURES

  • Added factory functions getDiscretePaletteFactory and getGradientPaletteFactory to dynamically generate palettes with n number of colors.

  • Added utility function isTable that checks if tables exist in Aster database.

  • Parameter formula replaced defunct expr in the function computeLm for consistency with other model-fitting functions.

  • computePercentiles now operates on multiple columns at once.

  • Improved database error handling to be more robust and informative. Error messages now include both ODBC and Aster error message and information (when applicable).

  • Added deprecated warning facility toa_dep similar to ggplot2 gg_dep function.

BUG FIXES

  • Legend position in showData histogram format is completely removed if legendPosition="none".

  • computePercentiles now returns no rows for the column that contains all NULLs. Before it threw error without completing.

  • fixed legend position in plotting functions.

  • Added error when histogram start value is greater than end value in (Issue #33)

DOCUMENTAION

  • Completely reworked demo scripts. Now they contain fully functional examples
    running on baseball and openDallas data sets. The data sets are available from github: https://bitbucket.org/grigory/toaster/downloads

    • baseball demo: https://raw.githubusercontent.com/wiki/teradata-aster-field/toaster/downloads/baseball.zip
    • Dallas demo: https://raw.githubusercontent.com/wiki/teradata-aster-field/toaster/downloads/dallasopendata.zip
  • Baseball Lahman data set now includes 2013 season.

NEW FEATURES

  • computeSample: randomly sample data from the table specifying fraction or size of desired data set.

  • createMap: new visualization function for combining maps with data artifacts from Aster database. Can be used to produce maps of arbitrary scale (with exception of whole world) and type with shapes of size and labels corresponding to data computed in Aster. It uses ggmap and ggplot2 packages and Google API for geocoding data as necessary. It implements smart logic to choose map tiles to place geocoded data appropriately, and it also automatically geocodes data if necessary (Google API restrictions apply).

  • Due to geocoding and map API restrictions createMap supports function caching suggesting function memoise of memoise package. (other functions are fine too). Properly following suggested practices should significantly optimize both peformance and API usage when geocoding or retrieving maps.

  • compute: for executing arbitrary aggregations on Aster tables.

  • computeBarchart: for computing data for barchart visualizations. This is different from computeHistogram as barchart is defined on factors (categorical data) witch doesn't support defining bins like in histograms.

  • computePercentiles: for computing multiple percentiles across one or many subsets of a table in one go. Results are suitable for function createBoxplot (see next).

  • createBoxplot: visualizes boxplots for single column across one or multiple subsets.

  • computeLm: compute linear model coefficients similar to lm function but all performed inside Aster.

ENHANCEMENTS

  • added parameter test to compute- functions (functions that access and manipulate data in Aster) to produce SQL without executing it. Thus, when test=TRUE function returns string containing SQL that would have run in Aster.

  • package depedencies moved from Depends to Imports section of DESCRIPTION file except for RODBC package. Keeping RODBC in Depends because toaster requires access to RODBC connection object and to its function odbcConnect. Other packages are not exposed by toaster functions so accessing them would have been needed only for advanced usage (if any). if you use any function from the packages other than RODBC then those packages should be loaded with library or require or use their namespace.

  • facet parameter now supports both one-value and 2-value vector (if parameter is longer than the rest of values are ignored). Single value defines column name for wrapping facets in 1 or more column lattice. Two values define pair of columns to place facets in 2-dimensional grid for each combination of values found.

  • createHistogram supports trend lines with parameter trendLine=TRUE.

  • computeHeatmap converts dimension and facet columns to factors by default. If undesired set parameter dimAsFactor = FALSE to disable (not recommended with heat maps).

  • computeHeatmap now supports withMelt to melt result using function melt from package reshape2. This option simplifies visualizating with facets.

  • createBubblechart now supports scaling shapes by size (default) or by area. Correspondingly, use shapeSizeRange when scaling by size; and shapeMaxSize when scaling by area.

  • createBubblechart added parameters to control label positioning and formatting. All parameters that position and format label text start with prefix "label" now. Old parameters textSize, textColour, and textVJust renamed to labelSize, labelColour, labelVJust.

  • createPopPyramid support for facets.

  • added utility method to list Aster data types: getNumericTypes, getCharacterTypes, getTemporalTypes.

  • computeAggregates is not an alias anymore and it replaced function compute which is no more.

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("toaster")

0.5.4 by Gregory Kanevsky, a month ago


https://github.com/teradata-aster-field/toaster


Report a bug at https://github.com/teradata-aster-field/toaster/issues


Browse source code at https://github.com/cran/toaster


Authors: Gregory Kanevsky [aut, cre]


Documentation:   PDF Manual  


Task views: High-Performance and Parallel Computing with R


GPL-2 license


Imports plyr, reshape2, ggplot2, ggthemes, scales, RColorBrewer, grid, wordcloud, ggmap, GGally, memoise, slam, foreach, stats, network

Depends on RODBC

Suggests testthat, tm, doParallel, ggnetwork

System requirements: Teradata Aster Database 5.20 (AD5.20) or higher, Teradata Aster Analytical Foundation 5.20 (AA5.20) or higher.


See at CRAN