Optimized prediction based on textual sentiment, accounting for the intrinsic challenge that sentiment can be computed, and pooled across texts and time, in many different ways. See Ardia et al. (2020).
The sentometrics package is an integrated framework for textual sentiment time series aggregation and prediction. It accounts for the intrinsic challenge that, for a given text, sentiment can be computed in many different ways, as well as the large number of possibilities to pool sentiment across texts and time. This additional layer of manipulation does not exist in standard text mining and time series analysis packages. The package therefore integrates the fast qualification of sentiment from texts, the aggregation into different sentiment time series and the optimized prediction based on these measures.
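The workflow just described can be sketched as follows. This is a minimal illustration using function and dataset names from the package's API (`usnews` is a built-in example corpus; argument values are illustrative, and defaults may differ across package versions):

```R
library("sentometrics")

data("usnews", package = "sentometrics")  # built-in example corpus of news articles

# 1. Corpus construction and fast sentiment computation
corpus <- sento_corpus(corpusdf = usnews)
lexicons <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                           list_valence_shifters[["en"]])

# 2. Aggregation into different sentiment time series
ctr <- ctr_agg(howWithin = "proportional", howDocs = "equal_weight",
               howTime = "equal_weight", by = "month", lag = 6)
measures <- sento_measures(corpus, lexicons, ctr)

plot(measures)  # inspect the resulting sentiment measures
```

The resulting measures can then be fed into the package's prediction functionality (see the vignette for a full application).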
See the project page for a brief introduction to the package, the vignette for an extensive one, and the following paper for a real-life macroeconomic forecasting application.
To install the package from CRAN, simply do:

```R
install.packages("sentometrics")
```

The latest development version of sentometrics is available at https://github.com/sborms/sentometrics. To install this version (which may contain bugs!), execute:

```R
devtools::install_github("sborms/sentometrics")
```

Please cite sentometrics in publications. Use `citation("sentometrics")`.
This software package originates from a Google Summer of Code 2017 project.
An overview of the main changes across releases (most recent first):

- Reworked the `peakdocs()` function and added a `peakdates()` function, to properly handle the entire functionality of extracting peaks
- Added the `sentiment_bind()` and `to_sentiment()` functions
- Changes related to the `sentolexicons` object
- Changed the default to `lag = 1` in the `ctr_agg()` function, and set weights to 1 by default for `n = 1` in the `weights_beta()` function
- Removed the `abind` package from Imports
- Removed the `zoo` package from Imports, by replacing the single occurrence of the `zoo::na.locf()` function by the `fill_NAs()` helper function (written in Rcpp)
- Added a `quanteda::docvars()` replacement method for a `sentocorpus` object
- Removed the `"x"` output element from a `sentomodel` object (for large samples, this became too memory consuming)
- Removed the `"howWithin"` output element from a `sentomeasures` object, and simplified a `sentiment` object into a `data.table` directly instead of a list
- Changed the `do.shrinkage.x` argument in the `ctr_model()` function to a vector argument
- Added a `do.lags` argument to the `attributions()` function, to be able to circumvent the most time-consuming part of the computations
- Added a check in the `sento_measures()` function on the uniqueness of the names within and across the lexicons, features and time weighting schemes
- Fixed a bug in the `measures_merge()` function that made full merging not possible
- The `n` argument in the `peakdocs()` function can now also be specified as a quantile
- Set the default of the `nCore` argument in the `compute_sentiment()` and `ctr_agg()` functions to 1
- Defined the output of the `compute_sentiment.sentocorpus()` function as a `sentiment` object, and modified the `aggregate()` function to `aggregate.sentiment()`
- Added the `weights_beta()`, `get_dates()`, `get_dimensions()`, `get_measures()`, and `get_loss_data()` functions
- Renamed `to_global()` to `measures_global()`, `perform_agg()` to `aggregate()`, `almons()` to `weights_almon()`, `exponentials()` to `weights_exponential()`, `setup_lexicons()` to `sento_lexicons()`, `retrieve_attributions()` to `attributions()`, and `plot_attributions()` to `plot.attributions()`
- Removed the `ctr_merge()` function, so that all merge parameters have to be passed on directly to the `measures_merge()` function
- Added `center` and `scale` arguments in the `scale()` function
- Added `dateBefore` and `dateAfter` arguments to the `measures_fill()` function, and dropped the `NA` option of its `fill` argument
- Added the `"beta"` time aggregation option (see the associated `weights_beta()` function)
- Updated the `"attribWeights"` element of the output `sentomeasures` object in the required `measures_xyz()` functions
- Added a new attribution type (`"lags"`) to the `attributions()` function, and corrected some edge cases
- Added a `lambdas` argument to the `ctr_model()` function, directly passed on to the `glmnet::glmnet()` function if used
- Dropped the `do.combine` argument in the `measures_delete()` and `measures_select()` functions to simplify
- Added `covr` to Suggests
- Sped up the `compute_sentiment()` function, by writing part of the code in Rcpp relying on RcppParallel (added to Imports); there are now three approaches to computing sentiment (unigrams, bigrams and clusters)
- Replaced the `dfm` argument in the `compute_sentiment()` and `ctr_agg()` functions by a `tokens` argument, and altered the input and behaviour of the `nCore` argument in these same two functions
- Switched from the `quanteda` package to the `stringi` package for more direct tokenisation
- Trimmed the `list_lexicons` and `list_valence_shifters` built-in word lists by keeping only unigrams, and included the same trimming procedure in the `sento_lexicons()` function
- Added `"t"` to the `list_valence_shifters` built-in word list, and reset values of the `"y"` column from 2 to 1.8 and from 0.5 to 0.2
- Updated the `epu` built-in dataset with the newest available series, up to July 2018
- Further changes to `list_valence_shifters[["en"]]` and the `compute_sentiment()` function
- Added a `print()` generic for a `sentomeasures` object
- Added a `"tf-idf"` option for within-document aggregation in the `ctr_agg()` function
- The `sento_lexicons()` function outputs a `sentolexicons` object, which the `compute_sentiment()` function specifically requires as an input; a `sentolexicons` object also includes a `"["` class-preserving extractor function
- The `attributions()` function outputs an `attributions` object; the `plot_attributions()` function is therefore replaced by the `plot()` generic
- Removed the `perform_MCS()` function, but the output of the `get_loss_data()` function can easily be used as an input to the `MCSprocedure()` function from the `MCS` package (discarded from Imports)
- Moved the `parallel` and `doParallel` packages to Suggests, as only needed (if enacted) in the `sento_model()` function
- Removed `ggthemes` from Imports
- Added the `measures_delete()`, `nmeasures()`, `nobs()`, and `to_sentocorpus()` functions
- Renamed `xyz_measures()` to `measures_xyz()`, and `extract_peakdocs()` to `peakdocs()`
- Removed the `do.normalizeAlm` argument in the `ctr_agg()` function, but kept it in the `almons()` function
- Modified the `almons()` function to be consistent with the Ardia et al. (2017) paper
- Renamed `lexicons` to `list_lexicons`, and `valence` to `list_valence_shifters`
- The `stats` element of a `sentomeasures` object is now also updated in `measures_fill()`
- Changed `"_eng"` to `"_en"` in the `list_lexicons` and `list_valence_shifters` objects, to be in accordance with two-letter ISO language naming
- Changed the `"valence_language"` naming to `"language"` in the `list_valence_shifters` object
- The `compute_sentiment()` function now also accepts a `quanteda` corpus object and a character vector
- The `add_features()` function now also accepts a `quanteda` corpus object
- Added an `nCore` argument to the `compute_sentiment()`, `ctr_agg()`, and `ctr_model()` functions to allow for (more straightforward) parallelized computations, and omitted the `do.parallel` argument in the `ctr_model()` function
- Added a `do.difference` argument to the `ctr_model()` function and expanded the use of the already existing `oos` argument
- Added `ggplot2` and `foreach` to Imports
- A fix in the `to_global()` function
- Set `tolower = FALSE` in the `quanteda::dfm()` constructor in `compute_sentiment()`
- Renamed the `intercept` argument in `ctr_model()` to `do.intercept` for consistency
- Updates to the `sento_corpus()` and `add_features()` functions
- Added the `diff()`, `extract_peakdocs()`, and `subset_measures()` functions
- Dropped the dependence on the `sentimentr` package (valence shifters handled via the `include_valence()` helper function)
- Added the `"proportionalPol"` within-document aggregation option
- Added a `dfm` argument in `ctr_agg()`
- `select_measures()` simplified, but its `toSelect` argument expanded
- `to_global()` changed (see vignette)
- `add_features()`: regex and non-binary (between 0 and 1) feature values allowed
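Two of the changes above can be illustrated together: `sento_lexicons()` returning a `sentolexicons` object, which `compute_sentiment()` requires as its lexicon input, and `compute_sentiment()` accepting a plain character vector. A minimal sketch, assuming the built-in `list_lexicons` and `list_valence_shifters` word lists (exact behaviour may vary by package version):

```R
library("sentometrics")

# sento_lexicons() returns a 'sentolexicons' object, which compute_sentiment()
# requires as its lexicon input; "[" extraction preserves the class
lexicons <- sento_lexicons(list_lexicons[c("LM_en", "GI_en")],
                           list_valence_shifters[["en"]])

# compute_sentiment() also accepts a character vector (besides corpus objects)
txts <- c("Stocks rallied on unexpectedly good earnings.",
          "The company warned of weak and uncertain demand.")
sentiment <- compute_sentiment(txts, lexicons, how = "proportional")
```

The returned `sentiment` object is a `data.table` of document-level scores, which can subsequently be aggregated into time series.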