Bridging the Gap Between Qualitative Data and Quantitative Analysis

Automates many of the tasks associated with quantitative discourse analysis of transcripts, including frequency counts of sentence types, words, sentences, turns of talk, and syllables, along with other assorted analysis tasks. The package provides parsing tools for preparing transcript data. Many functions enable the user to aggregate data by any number of grouping variables, providing analysis and seamless integration with other R packages that undertake higher level analysis and visualization of text. This affords the user a more efficient and targeted analysis. 'qdap' is designed for transcript analysis; however, many functions are applicable to other areas of Text Mining/Natural Language Processing.


qdap


qdap (Quantitative Discourse Analysis Package) is an R package designed to assist in quantitative discourse analysis. The package stands as a bridge between qualitative transcripts of dialogue and statistical analysis & visualization. To download the development version of qdap:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version (The user may want to install the dev version of reports first):

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(
    "trinker/qdapDictionaries",
    "trinker/qdapRegex",
    "trinker/qdapTools",
    "trinker/qdap"
)

You are welcome to:

Note: If you are reporting a bug make sure you have first read the Cleaning Text & Debugging vignette

News


Releases will be numbered with the following semantic versioning format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

  • Breaking backward compatibility bumps the major (and resets the minor and patch)
  • New additions without breaking backward compatibility bumps the minor (and resets the patch)
  • Bug fixes and misc. changes bumps the patch
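The numbering guidelines above can be sketched in a few lines of base R. This is only an illustration of the bump-and-reset rules; `bump_version` is a hypothetical helper and is not part of qdap.

```r
# Hypothetical helper (not a qdap function): bump one field of a
# "major.minor.patch" version string and reset the fields to its right.
bump_version <- function(version, type = c("major", "minor", "patch")) {
    type <- match.arg(type)
    parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
    stopifnot(length(parts) == 3, !anyNA(parts))
    names(parts) <- c("major", "minor", "patch")
    parts[type] <- parts[type] + 1L
    # A major bump resets minor and patch; a minor bump resets patch
    if (type == "major") parts[c("minor", "patch")] <- 0L
    if (type == "minor") parts["patch"] <- 0L
    paste(parts, collapse = ".")
}

bump_version("1.3.2", "major")  # "2.0.0"
bump_version("1.3.2", "minor")  # "1.4.0"
bump_version("1.3.2", "patch")  # "1.3.3"
```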

BUG FIXES

  • check_spelling and other spell checkers threw an error with a custom dictionary that did not have at least one word beginning with all 26 letters of the alphabet. The dictionary automatically uses assume.first.correct=FALSE if this occurs. Reported by @CallumH of StackOverflow: http://stackoverflow.com/q/33516466/1000343 See issue #217 for details.

NEW FEATURES

MINOR FEATURES

IMPROVEMENTS

CHANGES

NEW FEATURES

  • add_s added to add -s, -es, or -ies to word endings.

MINOR FEATURES

IMPROVEMENTS

  • common now returns NULL invisibly with a message rather than an error if no groups meet the parameters. Suggested by @bitanshu via issue #213

  • word_cor's default group.var is no longer NULL but set to use 1:nrow via qdapTools::id(text.var). Thanks to Drew Schmidt for bringing this issue to attention. Documentation and an error for group.var = NULL have been updated to add clarity.

CHANGES

BUG FIXES

  • type_token_ratio was misnamed as type_text_ratio, this has been corrected. The plot for this class also contained a misspelling "type-toke ratio" which has been corrected as well.

NEW FEATURES

  • inspect_text added to allow for pretty printed viewing of text strings and tm Corpuses.

CHANGES

  • The following functions had been previously deprecated and now have been removed: df2tm_corpus, tm2qdap, tm_corpus2wfm, tm_corpus2df, tdm, dtm, and polarity_frame.

BUG FIXES

  • The internal vignette "An Introduction to qdap" produced errors when compiled by build_qdap_vignette. This behavior has been fixed by using static reporting. The root of the behavior is the ability of cm_ functions to grab data from the global environment, which may not be the case in a knitr/rmarkdown generated environment.

  • polarity no longer handled phrases (words + spaces) for polarity.frame. This behavior was caught by @Benasso http://stackoverflow.com/q/27156834/1000343. This bug is a result of the changes made to bag_o_words earlier this year. The bug has been fixed and a unit test put in place to ensure the bug is not reintroduced.

  • Network.formality did not include edge width handling. This has been corrected.

  • word_stats gave an incorrect warning message for missing endmarks: "Some sentences not have standard qdap punctuation endmarks." The "do" has been added: "Some sentences do not have standard qdap punctuation endmarks."

  • pres_debates2012 data set contained missplits in lines: 544, 1054. These have been corrected (GitHub issue #205).

  • pos threw an error if only one word was passed to text.var. Fix: drop = FALSE has been added to data frame indexing. Caught by StackOverflow user G_1991 http://stackoverflow.com/q/29896488/1000343.

  • as.tdm.wfm would error if no grouping variable was supplied. This behavior has been corrected.
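The pos fix listed above relies on drop = FALSE in data frame indexing. A minimal base R sketch of the general behavior follows; the toy data frame is made up for illustration and is not actual pos output.

```r
# Toy one-row data frame (hypothetical data, not output from pos)
df <- data.frame(word = "hello", tag = "UH", stringsAsFactors = FALSE)

# Selecting a single column with `[` drops the result to a plain vector
dropped <- df[, "word"]

# drop = FALSE keeps the data frame structure, even for one column
kept <- df[, "word", drop = FALSE]

is.data.frame(dropped)  # FALSE
is.data.frame(kept)     # TRUE
```

Code that later expects data frame columns or dimensions can fail on the dropped vector, which is the kind of error a single-word input may trigger.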

NEW FEATURES

  • word_length function added to give counts of word length usage by grouping variable. See ?word_length for details.

  • word_position function added to give counts of the position of words within a sentence.

  • sent_detect_nlp added in the sentSplit family to wrap NLP package functionality into a convenient function.

  • lexical_classification provides a means of assessing content vs. functional word usage at the grouping variable and sentence level. The class comes with generic methods for preprocessed, scores (and plots of these methods), Animated, Network, cumulative and Animate.cumulative.

  • Animate.character added as a generic method that allows for the animation of text. This is useful in conjunction with other Animate objects to create complex animations with accompanying text.

  • add_incomplete added to replace sentences with missing endmarks with a | to indicate an incomplete sentence.

  • type_token_ratio added to determine type-token ratio per grouping variable.
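A type-token ratio is the number of distinct words (types) divided by the total number of words (tokens). The base R sketch below computes it per grouping variable. The helper names `ttr` and `ttr_by` and the tokenization rule are assumptions for illustration, not qdap's implementation.

```r
# Hypothetical helper: type-token ratio of a character vector
ttr <- function(text) {
    # Lower-case, then split on anything that is not a letter or apostrophe
    words <- unlist(strsplit(tolower(text), "[^a-z']+"))
    words <- words[nzchar(words)]
    length(unique(words)) / length(words)
}

# Hypothetical helper: ratio per level of a grouping variable
ttr_by <- function(text, group) {
    sapply(split(text, group), ttr)
}

text  <- c("the cat sat", "the cat ran", "a dog barked loudly")
group <- c("A", "A", "B")
ttr_by(text, group)
# Group A has 4 types over 6 tokens; group B has 4 types over 4 tokens
```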

IMPROVEMENTS

  • polarity takes polarity.frame with phrases (words with spaces).

  • The Animate method for the classes: polarity & formality gains the ability to print corresponding animated text for combined use with other Animated methods.

  • multigsub/mgsub get a speed boost through better programming choices. See issue #201 for details. Thank you to @Alexey Ferapontov for his critical post http://stackoverflow.com/q/27367914/1000343 that inspired the changes.

  • formality and pos now have minimal unit tests.

  • trans_context used message to print to the console. This results in truncated output. message has been replaced with cat.

  • strip gets a speed boost (~10x) by using better regex algorithms, consolidating code/function calls, and by creating a generic strip method for different classes. Additionally, multiple white spaces are now condensed to a single white space.

  • scrubber would automatically take a space and a single last character and remove the space. This was to remove spaces before ending punctuation. scrubber used substring rather than a more controlled regular expression. This has been corrected. Report thanks to @Fabrizio Maccallini. See issue #207 for more information.

  • pres_debates2012 picks up a role column to make filtering out the candidates easier. The variable order has also changed to put the dialogue last.

CHANGES

  • The ggplot2 package is no longer in Depends. This means the user will have to manually load the package to use additional ggplot2 features. See GitHub issue #199 for more.

  • pos now treats contractions as 2 words. For example the word count on what's is 2 for what + is. The previous behavior was to strip out the apostrophes. This was undesirable as the sentence "She's cool" would have no verb in the pos output. This change affects pos_by and formality as well.

BUG FIXES

  • bag_o_words did not make use of the bag_o_words2 helper function that has finer grained control of the output. ... were ignored but now are respected.

  • fry threw an error if a group contained < 300 words but had enough text to generate 2 text chunks of 100 words each, caught by S. Enrico P. Indiogine. The bug has been fixed as these groups are dropped and a warning given.

  • phrase_net threw an error caused by dplyr's (0.3) approach to subsetting columns. Previously a vector was returned, now a tbl_df object is returned: https://github.com/hadley/dplyr/issues/587. This was addressed by using explicit df[[index]] rather than df[, index].

NEW FEATURES

  • chunker added to break text, optionally by grouping variables, into equal chunks. The chunk size can be specified by giving number of words to be in each chunk or the number of chunks.
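The chunking idea described above (a fixed number of words per chunk, or a fixed number of chunks) can be sketched in base R. `chunk_words` is a hypothetical helper; it ignores grouping variables and is not qdap's chunker implementation.

```r
# Hypothetical helper: split text into chunks by word count.
# Supply exactly one of n.words (words per chunk) or n.chunks.
chunk_words <- function(text, n.words = NULL, n.chunks = NULL) {
    stopifnot(xor(is.null(n.words), is.null(n.chunks)))
    words <- unlist(strsplit(paste(text, collapse = " "), "\\s+"))
    words <- words[nzchar(words)]
    # Derive the chunk size when the number of chunks was given
    if (is.null(n.words)) n.words <- ceiling(length(words) / n.chunks)
    # Chunk index for each word: 1, 1, ..., 2, 2, ...
    idx <- ceiling(seq_along(words) / n.words)
    unname(vapply(split(words, idx), paste, character(1), collapse = " "))
}

x <- "one two three four five six seven"
chunk_words(x, n.words = 3)
# "one two three" "four five six" "seven"
chunk_words(x, n.chunks = 2)
# "one two three four" "five six seven"
```

The final chunk is shorter whenever the word count is not an exact multiple of the chunk size.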

IMPROVEMENTS

  • all_words gains char.keep and char2space arguments to enable retention of characters and multi word phrases. These features are passed to freq_terms as well. Suggested by stackoverflow's lawyeR (http://stackoverflow.com/a/26162401/1000343).

CHANGES

  • rm_url has been moved into its own canned regex pattern extraction/replacer package named qdapRegex.

  • name2sex now uses the gender package to predict sex. This makes the function slightly slower but much more accurate than previous versions.
    Because of this increased accuracy and dependence on gender, the arguments pred.sex, fuzzy.match, and database are no longer necessary and have been removed.

BUG FIXES

  • syllable_count returned the sentence (recycled) in the words column of the output. This behavior has been fixed. See GitHub issue #188 for details.

  • syn returned antonyms for some words. This was caused by the dictionary: qdapDictionaries::key.syn contained antonyms and elements that were error messages (character). This has been fixed. Reference issue #190. (Jingjing Zou)

  • The pres_debates2012 data set contained three errors in speech attribution. This has been corrected and the turn of talk (tot) as well.

  • word_stats would throw an error if no poly-syllable words existed. This has been corrected (reported by Nicolas Turenne).

NEW FEATURES

  • qdap_df and %&% added to mimic some of the functionality of dplyr's tbl_df and chaining pipe in a more specific, less flexible, qdap oriented way.

  • Text added to view and change the text.var attribute of a data.frame of the class qdap_df.

  • cumulative generic method added to view cumulative scores over time.

  • formality picks up a cumulative method.

  • polarity picks up a cumulative method.

  • end_mark picks up a class (end_mark), plot method, and a cumulative method.

  • syllable_sum, polysyllable_sum, and combo_syllable_sum pick up a class, plot method, and a cumulative method.

  • wfm becomes a generic method currently applied to a text.var that is: character, factor (coerced to character), or wfdf.

  • unbag added as a complement to bag_o_words and friends for undoing string splitting. A convenience wrapper for paste(collapse = " ").

  • as.Corpus.TermDocumentMatrix, as.Corpus.DocumentTermMatrix, and as.Corpus.wfm added to convert a matrix format to a tm::Corpus.

  • exclude becomes a generic method for various classes. Functionality is the same but with improved code readability.

  • check_spelling_interactive, check_spelling, which_misspelled, and correct allow the user to identify potentially misspelled words and optionally suggest replacements.

  • random_data & random_sent added to generate random sentence data sets and vectors.

  • comma_spacer added to ensure strings with commas contain a space after them.

  • check_text added to identify potential problems in text.

  • replace_ordinal added to convert ordinal representations of 1 through 100 to strictly ordinal text (e.g., "1st" becomes "first").

  • A vignette: Cleaning Text & Debugging was added to assist users with cleaning and debugging problems in qdap.

  • pronoun_type, subject_pronoun_type, and object_pronoun_type added to examine usage of subject/object pronouns by grouping variable.
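The unbag entry above describes it as a convenience wrapper for paste(collapse = " "). The round trip it undoes can be shown with base R alone; the sentence is a made-up example and strsplit stands in for the bag_o_words family.

```r
sentence <- "I like green eggs and ham"

# Split the string into a bag of words (strsplit used as a stand-in)
words <- strsplit(sentence, " ", fixed = TRUE)[[1]]

# Undo the split by collapsing with a single space
rebuilt <- paste(words, collapse = " ")

identical(rebuilt, sentence)  # TRUE
```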

MINOR FEATURES

  • dplyr's chaining pipe imported for convenience. See http://www.rdocumentation.org/packages/magrittr/functions/magrittr for details.

IMPROVEMENTS

  • wfm gains a speed-up through generic classes and tm package integration (strip is no longer used in wfm).

  • as.tdm.character and as.dtm.character gain a speed boost with a tm package integration.

  • Added message to as.data.frame.Corpus for missing end-marks suggesting the use of: sent.split = FALSE.

  • as.Corpus family of functions didn't necessarily respect document names and sometimes used numeric sequence instead. The introduction of a reader via tm::readTabular has fixed this.

  • sentSplit now gives warnings for text that may contain anomalies such as: non-ASCII characters, factors, missing punctuation, empty cells, and no alphabetic characters found.

  • read.transcript now gives a warning when reading from a .docx file and the separator (sep) used is still found in the text as this may indicate the data did not split correctly.

  • dispersion_plot now takes a named list of vectors of terms as the argument to match.terms. The vectors are combined as a unified theme named with the names of the list supplied to match.terms.

CHANGES

  • as.data.frame.Corpus's default value for sent.split is now FALSE.

  • The state column in the qdap::DATA2 data-set is now character (previously factor).

BUG FIXES

  • new_project did not copy the .Rprofile over into the new project. This has been fixed. Reference issue #184.

  • sentiment_frame coerced words to factor. stringsAsFactors = FALSE has been added to prevent this.

  • polarity did not work on > 1 grams due to a bug in sentiment_frame converting character to factor (thanks for the find @chewth). See GitHub issue #185 for details.
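The two sentiment_frame fixes above come down to how data.frame treats character columns. A small base R sketch of the difference, using made-up words and scores rather than a real sentiment_frame:

```r
words <- c("good", "very bad")
score <- c(1, -1)

# With stringsAsFactors = TRUE the words column is stored as a factor
as_factor <- data.frame(words = words, score = score, stringsAsFactors = TRUE)

# With stringsAsFactors = FALSE the words stay character
as_char <- data.frame(words = words, score = score, stringsAsFactors = FALSE)

class(as_factor$words)  # "factor"
class(as_char$words)    # "character"
```

String operations such as matching multi-word phrases generally expect character input, so setting the argument explicitly avoids depending on the default.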

NEW FEATURES

  • unique_by added to allow the user to find terms unique to individual elements of a grouping variable.

  • build_qdap_vignette replaces the temporary place holder version of the Introduction to qdap vignette. This function will replace the (1) HTML, (2) source, & (3) R code found in browseVignettes(package = 'qdap').

MINOR FEATURES

  • sub_holder picks up an alpha.type argument that allows the user to specify whether alpha or numeric keys should be used.

  • replace_number picks up a remove argument that removes numbers from text.

IMPROVEMENTS

  • qheat becomes a generic method. This means some of the internal function class checking has been moved to individual methods for those classes.
    Additionally, qheat now works with logical matrices/data.frames.

  • The tm package compatibility functions have been renamed in a more R-ish way and take the form of generic methods for specific classes. For example, df2tm_corpus becomes as.Corpus. Here is a complete list of changes:

    • df2tm_corpus is now as.Corpus
    • tm_corpus2df is now as.data.frame
    • as.wfm is now a generic method
    • tm_corpus2wfm is now as.wfm
    • tm2qdap is now as.wfm
    • tdm is now as.tdm or as.TermDocumentMatrix
    • dtm is now as.dtm or as.DocumentTermMatrix

CHANGES

  • colsplit2df and colpaste2df no longer convert character columns to factor.

  • df2tm_corpus is deprecated. It will be removed in a subsequent version of qdap. Use as.Corpus instead.

  • tm_corpus2df is deprecated. It will be removed in a subsequent version of qdap. Use as.data.frame instead.

  • tm2qdap is deprecated. It will be removed in a subsequent version of qdap. Use as.wfm instead.

  • tm_corpus2wfm is deprecated. It will be removed in a subsequent version of qdap. Use as.wfm instead.

  • tdm is deprecated. It will be removed in a subsequent version of qdap.
    Use as.tdm or as.TermDocumentMatrix instead.

  • dtm is deprecated. It will be removed in a subsequent version of qdap.
    Use as.dtm or as.DocumentTermMatrix instead.

  • The Introduction to qdap .Rmd vignette has been moved to an internal directory. The HTML version is not built by default. This saves CRAN space and time checking the package source. The file has been replaced with a temporary place holder that contains instructions for building the actual vignette. The user may also use the build_qdap_vignette directly.

  • qdap incorporates the changes from the tm package version: 0.6: http://cran.r-project.org/web/packages/tm/news.html Reference issue #187.

The qdapTools package now houses several former qdap functions. While qdapTools is a Dependency and all of these functions will be accessible to the qdap user there is a break in backward compatibility if these functions are included in code. For this reason this release is a major bump of qdap.

BUG FIXES

  • replace_number did not replace single digit numbers. Spotted by Ben Bolker. This behavior has been fixed and unit testing added for this function. See issue #178.

NEW FEATURES

  • sub_holder added; this function holds the place for particular character values, allowing the user to manipulate the vector and then revert the place holders back to the original values.

  • Network method added to make network plots of select qdap objects.

  • qtheme, theme_nightheat, theme_duskheat, theme_norah, theme_cafe, theme_grayscale, theme_badkitchen, and theme_hipster added to style Network plots.

  • polarity picks up a Network method.

  • formality picks up a Network method.

  • qdap officially begins utilizing the testthat package for unit testing, though only a few functions have begun the process, more will be added over time.

MINOR FEATURES

IMPROVEMENTS

CHANGES

  • The qdapTools package now houses the following former qdap functions: hash, %ha%, hash_look, hms2sec, id, lookup, %l%, %l+%, %l*%, repo2github, sec2hms, text2color, url_dl, v_outer, list2df, matrix2df, vect2df, list_df2df, list_vect2df, counts2list,
    vect2list, & mtabulate. These functions will continue to be available to qdap users in interactive mode (qdapTools is a Dependency and thus these functions are loaded into the workspace by default). This will allow this bundle of functions to be used outside of qdap without calling the larger qdap package per the request of Kirill Muller (see issue #165).

  • As scheduled, the dissimilarity function has been removed from the qdap package to avoid conflict with the tm package. Use the Dissimilarity function instead.

MINOR FEATURES

  • polarity picks up a constrain argument that constrains the polarity values to be between -1 and 1.

IMPROVEMENTS

  • polarity's equation now uses primes on the de-amplifiers before they're confined to be >= -1. This avoids confusion in the indicator function that took the de-amplifiers variable and returned the same variable.

  • dist_tab's frequency columns used a capital F in Freq. This was not consistent across all column names and has been changed to lower case.

CHANGES

  • polarity_frame is deprecated and will be removed in a subsequent release. Please use sentiment_frame instead.

BUG FIXES

  • The An Introduction to qdap vignette contained a broken link in the tm Package Compatibility section. This has been fixed. Also the reliance on Rgraphviz from the vignette has been removed. This will eliminate CRAN WARN in CRAN checks (for some OS) but not the note for tm's reliance on Rgraphviz.

  • polarity reported the incorrect number of words for sentences containing commas. This has been fixed (Max Ghenis).

NEW FEATURES

  • formality picks up an Animate method.

  • end_mark_by function added as an aggregated grouping version of end_mark.

MINOR FEATURES

  • raj.act.1POS added. raj.act.1POS is a data set for Romeo and Juliet: Act 1 broken into parts of speech.

IMPROVEMENTS

  • discourse_map picks up a pause argument that enables the user to pause between plots in interactive mode.

CHANGES

BUG FIXES

NEW FEATURES

  • gantt and gantt_wrap (single facet) pick up an Animate method.

  • polarity picks up an Animate method.

  • vertex_apply and edge_apply added to make uniform changes to lists of igraph objects.

MINOR FEATURES

IMPROVEMENTS

  • discourse_map picks up a condense argument that allows the user to condense sequential rows for like grouping variable sub groups.

  • list_df2df names now use a zero padded numeric portion for default names.
    For example c("L1", "L2", "L3", ... "L10"), becomes c("L01", "L02", "L03", ... "L10").
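Zero padded default names like those in the list_df2df entry above can be produced in base R with formatC. This is a sketch of the naming scheme only, not the list_df2df source. One likely benefit (an assumption, as the entry does not state a motive) is that padded names sort in numeric order.

```r
n <- 10

# Pad to the width of the largest index so every name has equal length
width  <- nchar(as.character(n))
padded <- paste0("L", formatC(seq_len(n), width = width, flag = "0"))

padded[c(1, 2, 10)]  # "L01" "L02" "L10"

# Sorting the padded names leaves them in numeric order
identical(sort(padded), padded)
```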

CHANGES

BUG FIXES

  • colpaste2df dropped the column name for a single retained column when keep.orig = FALSE. See GitHub issue #157 for more.

  • multigsub (mgsub) would return NA for replacement of length 1 after the addition of the order.pattern (used to prevent substrings from replacing meta-strings) in version 1.3.2.
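The idea behind order.pattern, replacing longer patterns first so that a substring cannot clobber a longer string that contains it, can be sketched in base R. `replace_all` is a hypothetical helper rather than the multigsub source. It recycles a length 1 replacement before reordering, because reordering a length 1 vector by a longer index would yield NA values.

```r
# Hypothetical helper: sequential fixed-string replacement with
# optional longest-pattern-first ordering.
replace_all <- function(pattern, replacement, x, order.pattern = TRUE) {
    # Recycle a single replacement so reordering cannot produce NA
    replacement <- rep_len(replacement, length(pattern))
    if (order.pattern) {
        ord <- order(nchar(pattern), decreasing = TRUE)
        pattern <- pattern[ord]
        replacement <- replacement[ord]
    }
    for (i in seq_along(pattern)) {
        x <- gsub(pattern[i], replacement[i], x, fixed = TRUE)
    }
    x
}

x <- "the cat and the catalog"
replace_all(c("cat", "catalog"), c("dog", "index"), x, order.pattern = FALSE)
# "the dog and the dogalog"
replace_all(c("cat", "catalog"), c("dog", "index"), x)
# "the dog and the index"
replace_all(c("cat", "catalog"), "X", x)
# "the X and the X"
```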

NEW FEATURES

  • phrase_net function provides functionality similar to the Many Eyes Phrase Net plot.

  • discourse_map function provides a network mapping of the flow of discourse between social actors. Function output is Animate ready as well. See ?discourse_map and http://trinker.github.io/qdap_examples/animation_dialogue for more.

  • Animate function added to convert select qdap outputs to an animated
    sequence. See ?Animate.discourse_map for more.

MINOR FEATURES

  • synonyms_frame (syn_frame) added to allow the user to create a synonym hash for the revamped synonyms function.

  • repo2github function added to send a directory to GitHub upon first commit.

IMPROVEMENTS

  • new_project has an improved directory structure and works with any version of the reports package.

  • synonyms function used the env.syl hash data from qdapDictionaries internally. This approach could cause problems if used within other functions in a package. It also limits the usability of synonyms. The synonyms function picks up a synonym.frame argument that allows the user to specify a synonym hash table. This can be created via the synonyms_frame function (per a request from J. Aravind).

CHANGES

This is a patch release to address the archiving of the lsa package.

BUG FIXES

  • The qdap-tm Package Compatibility Vignette contained errors in the Feinerer I, Hornik K, Meyer D (2008) reference: the pages were listed as 51-54 and the journal was incorrect. The pages have been corrected to 1-54 and the journal has been fixed. Caught by Kurt Hornik.

MINOR FEATURES

  • DocumentTermMatrix and TermDocumentMatrix from the tm package pick up a Filter method.

IMPROVEMENTS

  • multigsub picks up an argument, order.pattern, to prevent substrings from replacing meta-strings.

  • The following data sets were added to qdapDictionaries package: Fry_1000, Leveled_Dolch, Dolch

CHANGES

  • The package lsa has been removed from the Suggests field in the DESCRIPTION file, examples, and vignettes.

A version bump necessary for Re-Submission to CRAN.

CHANGES

  • new_project was reconfigured with the old code that does not require the newest version of the reports package.

BUG FIXES

  • read.transcript could leave a QDAP_PLACE_HOLDER behind if a colon was found in the person column. This behavior has been fixed.

  • word_cor's plotting method threw an error if a word did not have any words above the r threshold. This behavior has been corrected.

  • Filter overwrote a base R function; this has been fixed per Joshua Ulrich.

  • scores.polarity's print method would return an error if columns were not indexed yet were rounded. For instance, the following threw an error:

    scores(with(sentSplit(DATA, 4), polarity(state, person)))[, 1:4]

    This behavior has been fixed.

NEW FEATURES

  • qdap adds an HTML vignette to better explain the intended work flow and function use for the package. Use browseVignettes(package = "qdap") to open.

  • qdap adds a PDF vignette to describe the compatibility and navigation between qdap and the tm packages. Use browseVignettes(package = "qdap") to open.

MINOR FEATURES

IMPROVEMENTS

  • apply_as_df picks up stopwords and filter arguments that allow the user to remove stopwords and min/max length words.

  • plot.word_cor picks up the argument ncol that allows the user to specify the number of columns used. This uses ggplot2's facet_wrap rather than facet_grid (which is the default if ncol = NULL).

  • name2sex relied upon having qdapDictionaries loaded. This could be an issue if the function were used internally. The user now supplies a dictionary of names and probabilities.

  • df2tm_corpus gains a demographics.vars argument that allows the user to add demographic information to the resulting corpus DMetaDat.

  • tm_corpus2df gains the ability to convert DMetaDat into demographic data.frame columns.

CHANGES

BUG FIXES

NEW FEATURES

  • Filter added to give the ability to provide a range of character lengths to filter from a wfm object.

  • scores generic method added to view scores from select qdap objects.

  • counts generic method added to view counts from select qdap objects.

  • proportions generic method added to view proportions from select qdap objects.

  • preprocessed generic method added to view preprocessed data from select qdap objects.

  • apply_as_df added to allow the user to apply qdap functions to a Corpus directly.

MINOR FEATURES

  • tm_corpus2wfm added to quickly convert from a tm package Corpus to a qdap wfm object.

  • as.wfm added as a means to attempt to coerce a matrix to a wfm object.

  • %l+% added as a counterpart to %l% that assumes missing = NULL.

  • %bs% added as quick counterpart to boolean_search for indexing.

IMPROVEMENTS

  • df2tm_corpus now sets metaData information for ID and creator (based on Sys.info()["user"]).

  • matrix2df now accepts a simple_triplet_matrix object as well.

  • word_cor output that was a list (not a correlation matrix) did not have a plot method. The plot method for word_cor now handles both matrices and the list of correlations.

  • rm_row picks up the contains argument that allows the user to search for a term anywhere within the string, not just at the beginning, and remove the matching rows.

  • read.transcript now handles multiple character spaces as an argument to sep when text argument is used.

CHANGES

  • dissimilarity has been renamed to Dissimilarity to prevent tm package conflicts. The old version has been deprecated and will be removed in the next version (minor or major) push to CRAN.

A version bump necessary for Re-Submission to CRAN.

CHANGES

  • Downgraded the version requirement for the reports package to reports (>= 0.1.2) in order to upload to CRAN. reports (>= 0.2.0) is not yet available on CRAN.

The word lists and dictionaries in qdap have been moved to qdapDictionaries. Additionally, many functions have been renamed with underscores instead of the former period separators. These changes break backward compatibility. Thus this is a major release (ver. 1.0.0).

It is general practice to deprecate functions within a package before removal; however, the number of necessary changes, in light of qdap being relatively new to CRAN, made these changes sensible at this point.

BUG FIXES

  • qheat's argument by.column = FALSE resulted in an error. This behavior has been fixed.

  • question_type did not work because of changes to lookup that did not accept a two column matrix for key.match. See GitHub issue #127 for more.

  • combo_syllable.sum threw an error if the text.var contained a cell with an all non-character ([a-z]) string. This behavior has been fixed.

  • todo function created by new_project would not report completed tasks if report.completed = TRUE.

  • termco and termco.d threw an error if more than one consecutive regex special character was passed to match.list or match.string. See GitHub issue #128 for more.

  • trans.cloud threw an error if a single list with a named vector was passed to target.words. This behavior has been fixed.

  • sentSplit now returns the "tot" column when text.place = "original".

  • all_words output dataframe FREQ column class has been changed from factor to numeric. Additionally, the WORDS column prints using left.just but retains traditional character properties (print class added). all_words also picks up apostrophe.remove and ldots (for strip) arguments.

  • gantt_plot did not handle fill.vars, particularly if the fill was nested within the grouping.vars. This behavior has been fixed with corresponding examples added.

  • url_dl - Downloaded an empty file when not using a Dropbox key. This behavior has been fixed.

  • The cm_code. family of functions had a bug in the output due to cm_long2dummy and cm_dummy2long's handling of stretching spans. This has been corrected.

  • cm_code.exclude did not output the correct excluded spans. This behavior has been corrected.

  • The use of comment to convey object characteristics has been replaced with the use of class.

  • question_type did not include question words ending in 'd as part of the category. For instance "How'd you like it?" was not classified as a how question.

  • beg2char would not include the char if include = TRUE and noc = 1.

  • cm_range2long returned NAs for vectors containing multiple single values.
    See GitHub issue #144 for more.

  • termco family of functions did not handle NA values. This has been fixed. (Matt Williamson) See GitHub issue #147 for details.

  • pos threw an error for vectors of length 1. This has been fixed (Kurt Hornik). See GitHub issue #150 for details.

  • formality threw an error for vectors of length 1. This has been fixed. (Kurt Hornik) See GitHub issue #151 for details.

NEW FEATURES

  • The cm_xxx2long family of functions (cm_df2long, cm_range2long and cm_time2long) now have a generic wrapper, cm_2long, to generate the long formats.

  • hash_look (and %ha%) a counterpart to hash added to allow quick access to a hash table. Intended for use within functions or multiple uses of the same hash table, whereas lookup is intended for a single external (non function) use which is more convenient though could be slower.

  • boolean_search, a Boolean term search function, added to allow for indexed searches of Boolean terms.

  • trans_context is a printing function designed to grab the context (n rows before and after) of an event (an index from a vector of indices). The function prints the indices around the episode from a transcript to the console or a .csv, .xlsx, .txt, or .doc file.

  • colpaste2df is a wrapper for paste2 that pastes dataframe columns together and outputs a dataframe.

  • colcomb2class quickly combines columns for number of qdap classes including output from: termco, question_type, pos_by, and character_table.

  • lview a function to unclass a list output that has a special print method that returns only a portion of the output. lview re-classes to "list".

  • word_cor added to find words within grouping variables that are associated based on correlation.

  • tm2qdap a function to convert "TermDocumentMatrix" and "DocumentTermMatrix" to a wfm added to allow easier integration with the tm package.

  • apply_as_tm a function to allow functions intended to be used on the tm package's TermDocumentMatrix to be applied to a wfm object.

  • tm_corpus2df and df2tm_corpus added to convert a tm package corpus to a dataframe for use in qdap or vice versa.

  • tdm and dtm are now truly compatible with the tm package. tdm and dtm produce outputs of the class "TermDocumentMatrix" and "DocumentTermMatrix" respectively. This change (coupled with the renaming of stopwords to rm_stopwords) should make the two packages logical companions and further extend the qdap package to integrate with the many packages that already handle "TermDocumentMatrix" and "DocumentTermMatrix".

  • cm_distance now uses resampling of data from the null model to generate pvalues for the mean code distances. Useful for determining if an association (small distance) between codes is likely to happen if the null is true.

  • dispersion_plot added to enable viewing of word dispersion through discourse.

  • word_proximity added to complement dispersion_plot and word_cor functions. word_proximity gives the average distance between words in the unit of sentences.

MINOR FEATURES

  • url_dl now takes quoted string urls supplied to ... (no url argument is supplied)

  • condense is a function that condenses dataframe columns that are a list of vectors to a single vector of strings. This outputs a dataframe with condensed columns that can be written to csv/xlsx.

  • mcsv_w now uses condense to attempt to attempt to condense columns that are lists of vectors to a single vector of strings. This adds flexibility to mcsv_w with more data sets. mcsv_w now writes lists of dataframes to multiple csvs (e.g., the output from termco or polarity). mcsv_w picks up a dataframes argument, an optional character vector supplied in lieu of \ldots that grabs the dataframes from an environment (default id the Global environment).

  • ngrams now has an ellipsis (...) argument that passes further arguments to strip.

  • dtm added to complement tdm, allowing for easier integration with other R packages that utilize tdm/dtm.

  • dir_map picks up a use.path argument that allows the user to specify a more flexible path to the created pre-formed read.transcript scripts based on something like file.path(getwd(), ). This means portability of code on different machines.

  • polarity_frame a function to make a hash environment lookup for use with the polarity function.

  • DATA.SPLIT a sentSplit version of the DATA data-set has been added to qdap.

  • gantt_plot accepts NULL for grouping.var and treats "all" rows as a single grouping variable.

  • replace_number now handles 10^47 digits compared to 10^14 previously.

  • The new_project function gains a github argument that optionally sends the repo to a public GitHub account upon creation.

  • qheat, polarity.plot and formality.plot pick up the argument plot which optionally suppresses the plotting. This is useful if the user is operating in knitr, sweave, etc. and wishes to alter/add onto the plot.

  • lookup now takes missing = NULL. This results in the original values in terms corresponding to the missing elements being retained.

  • cm_time.temp picks up a grouping.var argument that works similarly to cm_range.temp's grouping.var. cm_time.temp also takes hour values for start and end as in end = "01:22:03".

  • gantt_rep picks up a generic plot method.

  • Functions in the cm_code.xxx and cm_xxx2long pick up a generic plot method that utilizes gantt_wrap to plot a Gantt plot of the span data.

  • Functions in the cm_code.xxx and cm_xxx2long pick up a generic summary method. This summary method has its own plot method that utilizes qheat to plot a heatmap of the summary statistics. The generic print method (print.sum_cmspans) is useful for output intended for publication.

  • qheat picks up a facet.vars argument that allows a character vector of length 1 or 2 to facet by.

  • question_type gives the indices of questions via $inds.

  • colsplit2df now splits multiple columns to match the capabilities of colpaste2df.

  • sentSplit now handles repeated measures and picks up a turn of talk plot method.

  • tot_plot now handles repeated measures and grouping.var to be nested within the turn of talk.

  • wfm now uses mtabulate and is ~10x faster.

  • plot.polarity gains arguments for optional error bars using the standard error of the mean polarity.

  • exclude now works with wfm and the tm package's DocumentTermMatrix and TermDocumentMatrix classes.

  • rm_url removes/replaces URLs in a string(s).

  • matrix2df added (under list2df) to convert rownames of matrix to a dataframe column.
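The lookup entry above (missing = NULL keeps the original term when no key matches) describes a common recoding pattern. The following base R sketch shows that behavior only; lookup_keep_missing is a hypothetical helper and does not reproduce qdap's lookup function or its arguments.

```r
# Hypothetical helper: recode `terms` using a named character vector `key`,
# leaving any term without a match unchanged.
lookup_keep_missing <- function(terms, key) {
  matched <- unname(key[terms])          # NA where a term has no key entry
  ifelse(is.na(matched), terms, matched) # fall back to the original term
}

key    <- c(a = "apple", b = "banana")
result <- lookup_keep_missing(c("a", "z", "b"), key)
result  # "apple" "z" "banana"
```

Retaining unmatched values is convenient when only a subset of terms needs recoding and the rest should pass through untouched.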

CHANGES

  • The dictionaries and word lists for qdap have been moved to their own package, qdapDictionaries. This will allow easier access to these resources beyond the qdap package as well as reducing the overall size of the qdap package.
    Because this is a major change that may break the code of some users, the major release number has been upped to 1. The following name changes have occurred:

    • increase.amplification.words -> became -> amplification.words

    • The deamplification.words wordlist and env.pol dictionary were added as well.

  • qdap gains an HTML package vignette to better explain the intended work flow and function use for the package. This is not currently a part of the build but can be accessed via:

    http://htmlpreview.github.io/?https://github.com/trinker/qdap/blob/master/vignettes/qdap_vignette.html

    Note that the vignette may include development version functions not yet available in the current CRAN version

  • polarity utilizes a new, unbounded algorithm based on weighting to determine polarity.

  • gantt_wrap no longer accepts unquoted strings to the plot.var argument.

  • cm_df.temp loses the logical csv argument. file.name has been replaced with file to fit conventional R naming schemes.

  • The plotting feature of gantt has been removed and a plot method has been added. The user can plot the output from gantt in base or ggplot2 graphics.

  • cm_time2long loses the argument start.end to ensure that the cmspans class produced would operate as expected.

  • Most exported functions utilizing a period separator have been replaced with underscore named versions.

  • wf_combine renamed wfm_combine to be consistent.

  • question_type algorithm improvements including implied do/does/did handling.

  • list2df and mtabulate now exported.

  • stopwords has been renamed to rm_stopwords (shorthand: rm_stop) to better describe the action the function performs and to avoid conflicts with the tm package.

  • replace_number's num.paste becomes logical rather than character input. This makes use easier as the user doesn't need to remember arguments.

Patch release. This version deals with the changes in the openNLP package that affect qdap. Next major release scheduled after slidify package is pushed to CRAN.

BUG FIXES

  • new_project placed a report in the CORRESPONDENCE directory rather than CONTACT_INFO

  • strip would not allow the characters "/" and "-" to be passed to char.keep. This has been fixed. (Jens Engelmann)

  • beg2char would only grab the first character of a string after n - 1 occurrences of the character. For example: beg2char(c("abc-edw-www", "nmn-ggg", "rer-qqq-fdf"), "-", 2) resulted in "abc-e" "nmn-g" "rer-q" rather than "abc-edw" "nmn-ggg" "rer-qqq"
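The corrected behavior in the last fix above (keep everything before the n-th occurrence of a character) can be sketched in base R. This is an illustration only, not qdap's code; beg_to_nth_char is a hypothetical name.

```r
# Hypothetical helper: return the part of each string that precedes the
# n-th occurrence of the literal character `char`. Strings with fewer than
# n occurrences are returned whole.
beg_to_nth_char <- function(x, char, n = 1) {
  pieces <- strsplit(x, char, fixed = TRUE)
  vapply(pieces, function(p) {
    paste(p[seq_len(min(n, length(p)))], collapse = char)
  }, character(1))
}

beg_to_nth_char(c("abc-edw-www", "nmn-ggg", "rer-qqq-fdf"), "-", 2)
# "abc-edw" "nmn-ggg" "rer-qqq"
```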

NEW FEATURES

  • names2sex a function for predicting gender from name.

  • Added NAMES and NAMES_SEX data-sets, based on 1990 U.S. census data.

  • tdm added as an equivalent to TermDocumentMatrix from the tm package. This allows for portability across text analysis packages.

MINOR FEATURES

  • mgsub now gets a trim argument that optionally removes trailing and leading white space.

  • lookup now takes a list of named vectors for the key.match argument.

CHANGES

  • new_project directory can now be transferred without breaking paths (i.e., file.path(getwd(), "DIR/file.ext") is used rather than the full file path).

BUG FIXES

  • genXtract labels returned the word "right" rather than the right edge string. See http://stackoverflow.com/a/15423439/1000343 for an example of the old behavior. This behavior has been fixed.

  • gradient_cloud's min.freq was locked at 1. This has been fixed. (Manuel Fdez-Moya)

  • termco would produce an error if single length named vectors were passed to match.list and no multi-length vectors were supplied. Also an error was thrown if an unnamed multi-length vector was passed to match.list. This behavior has been fixed.

NEW FEATURES

  • tot_plot a visualizing function that uses a bar graph to visualize patterns in sentence length and grouping variables by turn of talk.

  • beg2char and char2end functions to grab text from beginning of string to a character or from a character to the end of a string.

  • ngrams function to calculate ngrams by grouping variable.
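As a rough picture of what "ngrams by grouping variable" means, the base R sketch below builds bigrams within each row of text and then pools them per speaker. It is not qdap's ngrams function; the data frame, the make_ngrams helper, and the decision not to form n-grams across rows are assumptions for illustration.

```r
# Hypothetical helper: all consecutive n-word sequences in a word vector.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) {
    paste(words[i:(i + n - 1)], collapse = " ")
  }, character(1))
}

# Toy transcript (hypothetical data).
dat <- data.frame(
  person = c("sam", "greg", "sam"),
  text   = c("i like it", "no way", "it is good"),
  stringsAsFactors = FALSE
)

# Bigrams per row, then pooled by the grouping variable.
row_ngrams <- lapply(strsplit(tolower(dat$text), "\\s+"), make_ngrams, n = 2)
bigrams    <- lapply(split(row_ngrams, dat$person), unlist)

bigrams$sam   # "i like" "like it" "it is" "is good"
bigrams$greg  # "no way"
```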

MINOR FEATURES

  • genX and bracketX gain an extra argument space.fix to remove extra spaces left over from bracket removal.

  • Updated the out-of-date Dropbox url download in url_dl. url_dl now also takes the Dropbox key.

CHANGES

  • qdap is now compiled for mac users (as openNLP now passes CRAN checks with no Errors on Mac).

BUG FIXES

  • word_associate colors the word cloud appropriately and deals with the error caused by a grouping variable not containing any words from 1 or more of the vectors of a list supplied to match.string.

  • trans.cloud produced an error when expand.target was TRUE. This error has been eliminated.

  • termco would eliminate > 1 columns matching an identical search.term found in a second vector of match.list. termco now counts repeated terms multiple times.

  • cm_df.transcript did not give the correct speaker labels (fixed).

NEW FEATURES

  • gradient_cloud: Binary gradient Word Cloud - A new plotting function that plots and colors words for a binary variable based on which group of the binary variable uses the term more frequently.

  • new_project: A project template generating function designed to increase efficiency and standardize work flow. The project comes with a .Rproj file for easy use with RStudio as well as a .Rprofile that handles loading and sourcing of packages, data and project functions. This function uses the reports package to generate an extensive reports folder.

MINOR FEATURES

  • stemmer, stem2df and stem.words now explicitly have the argument char.keep set to "~~" to enable retaining special characters formerly stripped away.

  • hms2sec: A function to convert from h:m:s format to seconds.

  • mcsv_w now takes a list of data.frames.

  • cm_range.temp now takes the arguments text.var and grouping.var that will automatically output these (grouping.var) columns as range coded indices.

  • wfm gets a speed boost as the code has been re-written to be faster.

  • read.transcript now reads .txt files as well as text similar to read.table.
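The hms2sec entry above describes converting h:m:s strings to seconds. A minimal base R sketch of that conversion follows; hms_to_seconds is a hypothetical helper, not the qdap function, and it assumes colon-separated numeric fields.

```r
# Hypothetical helper: convert "h:m:s" (or "m:s") strings to seconds.
hms_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    # The rightmost field is seconds, the next is minutes, then hours.
    sum(p * 60^(rev(seq_along(p)) - 1))
  }, numeric(1))
}

hms_to_seconds(c("01:22:03", "00:00:45"))  # 4923 45
```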

CHANGES

  • sec2hms is the new name for convert

  • folder and delete have been moved to the reports package which is imported by qdap. Previously folder would not generate a directory with the time/date stamp if no directory name was given; this has been fixed, though the function now resides in the reports package.

  • The first installation of the qdap package

  • Package designed to bridge the gap between qualitative data and quantitative analysis

install.packages("qdap")

2.2.8 by Tyler Rinker, 16 days ago


http://trinker.github.com/qdap/


Report a bug at http://github.com/trinker/qdap/issues


Browse source code at https://github.com/cran/qdap


Authors: Bryan Goodrich [ctb], Dason Kurkiewicz [ctb], Tyler Rinker [aut, cre]


Documentation:   PDF Manual  


Task views: Natural Language Processing


GPL-2 license


Imports chron, dplyr, gdata, gender, ggplot2, grid, gridExtra, igraph, methods, NLP, openNLP, parallel, plotrix, RCurl, reports, reshape2, scales, stringdist, tidyr, tm, tools, venneuler, wordcloud, xlsx, XML

Depends on qdapDictionaries, qdapRegex, qdapTools, RColorBrewer

Suggests koRpus, knitr, lda, proxy, stringi, SnowballC, testthat


Imported by NLPutils, ie2misc, specmine.

Depended on by ANLP.

Suggested by iemiscdata.

