tidytext: Text mining using dplyr, ggplot2, and other tidy tools

Authors: Julia Silge, David Robinson
License: MIT


Using tidy data principles can make many text mining tasks easier, more effective, and consistent with tools already in wide use. Much of the infrastructure needed for text mining with tidy data frames already exists in packages like dplyr, broom, tidyr and ggplot2. In this package, we provide functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages. Check out our book to learn more about text mining using tidy data principles.

You can install this package from CRAN:

install.packages("tidytext")

Or you can install the development version from GitHub with devtools:

library(devtools)
install_github("juliasilge/tidytext")

Tidy text mining example: the unnest_tokens function

The novels of Jane Austen can be so tidy! Let's use the text of Jane Austen's 6 completed, published novels from the janeaustenr package and bring them into a tidy format. janeaustenr provides the text in a one-row-per-line format:

library(janeaustenr)
library(dplyr)
 
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup()
 
original_books
#> # A tibble: 73,422 x 3
#>    text                  book                linenumber
#>    <chr>                 <fctr>                   <int>
#>  1 SENSE AND SENSIBILITY Sense & Sensibility          1
#>  2 ""                    Sense & Sensibility          2
#>  3 by Jane Austen        Sense & Sensibility          3
#>  4 ""                    Sense & Sensibility          4
#>  5 (1811)                Sense & Sensibility          5
#>  6 ""                    Sense & Sensibility          6
#>  7 ""                    Sense & Sensibility          7
#>  8 ""                    Sense & Sensibility          8
#>  9 ""                    Sense & Sensibility          9
#> 10 CHAPTER 1             Sense & Sensibility         10
#> # ... with 73,412 more rows

To work with this as a tidy dataset, we need to restructure it into a one-token-per-row format. The unnest_tokens function does exactly that, converting a data frame with a text column to one token per row:

library(tidytext)
tidy_books <- original_books %>%
  unnest_tokens(word, text)
 
tidy_books
#> # A tibble: 725,055 x 3
#>    book                linenumber word       
#>    <fctr>                   <int> <chr>      
#>  1 Sense & Sensibility          1 sense      
#>  2 Sense & Sensibility          1 and        
#>  3 Sense & Sensibility          1 sensibility
#>  4 Sense & Sensibility          3 by         
#>  5 Sense & Sensibility          3 jane       
#>  6 Sense & Sensibility          3 austen     
#>  7 Sense & Sensibility          5 1811       
#>  8 Sense & Sensibility         10 chapter    
#>  9 Sense & Sensibility         10 1          
#> 10 Sense & Sensibility         13 the        
#> # ... with 725,045 more rows

This function uses the tokenizers package to separate each line into words. The default tokenizing is for words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.
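
For instance, tokenizing into bigrams instead of single words would look like this (a brief sketch; the result isn't used below, but token = "ngrams" and n are existing arguments of unnest_tokens):

# tokenize into two-word sequences (bigrams) rather than single words
original_books %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)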

Now that the data is in one-word-per-row format, we can manipulate it with tidy tools like dplyr. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join.

data("stop_words")
tidy_books <- tidy_books %>%
  anti_join(stop_words)
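
The stop_words dataset bundles several stop word lexicons in one tidy data frame; counting by its lexicon column shows how many words each contributes:

# stop_words has one row per word, tagged with its source lexicon
stop_words %>%
  count(lexicon)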

We can also use count to find the most common words in all the books as a whole.

tidy_books %>%
  count(word, sort = TRUE) 
#> # A tibble: 13,914 x 2
#>    word       n
#>    <chr>  <int>
#>  1 miss    1855
#>  2 time    1337
#>  3 fanny    862
#>  4 dear     822
#>  5 lady     817
#>  6 sir      806
#>  7 day      797
#>  8 emma     787
#>  9 sister   727
#> 10 house    699
#> # ... with 13,904 more rows
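
Because everything is a plain data frame, these counts pipe directly into ggplot2. A minimal sketch (the n > 600 cutoff is arbitrary, chosen only to keep the plot readable):

library(ggplot2)

# plot the words that appear more than 600 times, ordered by frequency
tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip()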

Sentiment analysis can be done as an inner join. Three sentiment lexicons are available via the get_sentiments() function. Let's examine how sentiment changes during each novel: we find a sentiment score for each word using the Bing lexicon, then count the number of positive and negative words in defined sections (here, 80-line chunks) of each novel.

library(tidyr)
get_sentiments("bing")
#> # A tibble: 6,788 x 2
#>    word        sentiment
#>    <chr>       <chr>    
#>  1 2-faced     negative 
#>  2 2-faces     negative 
#>  3 a+          positive 
#>  4 abnormal    negative 
#>  5 abolish     negative 
#>  6 abominable  negative 
#>  7 abominably  negative 
#>  8 abominate   negative 
#>  9 abomination negative 
#> 10 abort       negative 
#> # ... with 6,778 more rows
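
The other lexicons are structured differently; the NRC lexicon, for instance, tags words with emotion categories (such as joy or fear) in addition to positive and negative. A quick way to see the categories (a small aside; a word can appear under several categories):

get_sentiments("nrc") %>%
  count(sentiment)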
 
janeaustensentiment <- tidy_books %>%
  inner_join(get_sentiments("bing"), by = "word") %>% 
  count(book, index = linenumber %/% 80, sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)
 
janeaustensentiment
#> # A tibble: 920 x 5
#>    book                index negative positive sentiment
#>    <fctr>              <dbl>    <dbl>    <dbl>     <dbl>
#>  1 Sense & Sensibility  0        16.0     26.0     10.0 
#>  2 Sense & Sensibility  1.00     19.0     44.0     25.0 
#>  3 Sense & Sensibility  2.00     12.0     23.0     11.0 
#>  4 Sense & Sensibility  3.00     15.0     22.0      7.00
#>  5 Sense & Sensibility  4.00     16.0     29.0     13.0 
#>  6 Sense & Sensibility  5.00     16.0     39.0     23.0 
#>  7 Sense & Sensibility  6.00     24.0     37.0     13.0 
#>  8 Sense & Sensibility  7.00     22.0     39.0     17.0 
#>  9 Sense & Sensibility  8.00     30.0     35.0      5.00
#> 10 Sense & Sensibility  9.00     14.0     18.0      4.00
#> # ... with 910 more rows

Now we can plot these sentiment scores across the plot trajectory of each novel.

library(ggplot2)
 
ggplot(janeaustensentiment, aes(index, sentiment, fill = book)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

(Plot: positive minus negative sentiment across each novel's trajectory, one panel per book.)

For more examples of text mining using tidy data frames, see the tidytext vignette.

Tidying document term matrices

Many existing text mining datasets are in the form of a DocumentTermMatrix class (from the tm package). For example, consider the corpus of 2246 Associated Press articles included in the topicmodels package.

library(tm)
data("AssociatedPress", package = "topicmodels")
AssociatedPress
#> <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
#> Non-/sparse entries: 302031/23220327
#> Sparsity           : 99%
#> Maximal term length: 18
#> Weighting          : term frequency (tf)

If we want to analyze this with tidy tools, we first need to transform it into a one-row-per-document-per-term data frame with the tidy function. (For more on the tidy verb, see the broom package.)

tidy(AssociatedPress)
#> # A tibble: 302,031 x 3
#>    document term       count
#>       <int> <chr>      <dbl>
#>  1        1 adding      1.00
#>  2        1 adult       2.00
#>  3        1 ago         1.00
#>  4        1 alcohol     1.00
#>  5        1 allegedly   1.00
#>  6        1 allen       1.00
#>  7        1 apparently  2.00
#>  8        1 appeared    1.00
#>  9        1 arrested    1.00
#> 10        1 assault     1.00
#> # ... with 302,021 more rows
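
The conversion also works in reverse: tidytext provides cast_dtm(), cast_dfm(), and cast_sparse() to turn a tidy table back into a matrix. For example, a round trip back to a DocumentTermMatrix:

# rebuild the document-term matrix from the tidy triplet form
tidy(AssociatedPress) %>%
  cast_dtm(document, term, count)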

We could find the most negative documents:

ap_sentiments <- tidy(AssociatedPress) %>%
  inner_join(get_sentiments("bing"), by = c(term = "word")) %>%
  count(document, sentiment, wt = count) %>%
  ungroup() %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  arrange(sentiment)

Or we can join the Austen and AP datasets and compare the frequencies of each word:

comparison <- tidy(AssociatedPress) %>%
  count(word = term) %>%
  rename(AP = n) %>%
  inner_join(count(tidy_books, word)) %>%
  rename(Austen = n) %>%
  mutate(AP = AP / sum(AP),
         Austen = Austen / sum(Austen))
 
comparison
#> # A tibble: 4,437 x 3
#>    word              AP     Austen
#>    <chr>          <dbl>      <dbl>
#>  1 abandoned  0.000210  0.00000709
#>  2 abide      0.0000360 0.0000284 
#>  3 abilities  0.0000360 0.000206  
#>  4 ability    0.000294  0.0000213 
#>  5 abroad     0.000240  0.000255  
#>  6 abrupt     0.0000360 0.0000355 
#>  7 absence    0.0000959 0.000787  
#>  8 absent     0.0000539 0.000355  
#>  9 absolute   0.0000659 0.000184  
#> 10 absolutely 0.000210  0.000674  
#> # ... with 4,427 more rows
 
library(scales)
ggplot(comparison, aes(AP, Austen)) +
  geom_point(alpha = 0.5) +
  geom_text(aes(label = word), check_overlap = TRUE,
            vjust = 1, hjust = 1) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red")

(Plot: word frequencies in the AP articles versus the Austen novels on log-log scales, with words labeled and a red y = x reference line.)

For more examples of working with objects from other text mining packages using tidy data principles, see the vignette on converting to and from document term matrices.

Community Guidelines

This project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms. Feedback, bug reports (and fixes!), and feature requests are welcome; file issues or seek support on the GitHub issues page.

News

tidytext 0.1.6

  • unnest_tokens can now unnest a data frame with a list column (which formerly threw the error "unnest_tokens expects all columns of input to be atomic vectors (not lists)"). The unnested result repeats the objects within each list. (This is still not possible when collapse = TRUE, in which case tokens can span multiple lines.)
  • Add get_tidy_stopwords() to obtain stopword lexicons in multiple languages in a tidy format.
  • Add a dataset nma_words of negators, modals, and adverbs that affect sentiment analysis (#55).
  • Updated various vignettes/docs/tests so package can build on R-oldrel.

tidytext 0.1.5

  • Change how NA values are handled in unnest_tokens so they no longer cause other columns to become NA (#82).
  • Update tidiers and casters to align with quanteda v1.0 (#87).
  • Handle input/output object classes (such as data.table) consistently (#88).

tidytext 0.1.4

  • Fix tidier for quanteda dictionary for correct class (#71).
  • Add a pkgdown site.
  • Convert NSE from underscored function to tidyeval (unnest_tokens, bind_tf_idf, all sparse casters) (#67, #74).
  • Added tidiers for topic models from the stm package (#51).

tidytext 0.1.3

  • get_sentiments now works regardless of whether tidytext has been loaded or not (#50).
  • unnest_tokens now supports data.table objects (#37).
  • Fixed to_lower parameter in unnest_tokens to work properly for all tokenizing options.
  • Updated tidy.corpus, glance.corpus, tests, and vignette for changes to the quanteda API.
  • Removed the deprecated pair_count function, which is now in the in-development widyr package.
  • Added tidiers for LDA models from the mallet package.
  • Added the Loughran and McDonald dictionary of sentiment words specific to financial reports.
  • unnest_tokens preserves custom attributes of data frames and data.tables.

tidytext 0.1.2

  • Updated DESCRIPTION to require purrr >= 0.1.1.
  • Fixed cast_sparse, cast_dtm, and other sparse casters to ignore groups in the input (#19).
  • Changed unnest_tokens so that it no longer uses tidyr's unnest, but rather a custom version that removes some overhead. In some experiments, this sped up unnest_tokens on large inputs by about 40%. This also moves tidyr from Imports to Suggests for now.
  • unnest_tokens now checks that there are no list columns in the input, and raises an error if present (since those cannot be unnested).
  • Added a format argument to unnest_tokens so that it can process html, xml, latex or man pages using the hunspell package, though only when token = "words".
  • Added a get_sentiments function that takes the name of a lexicon ("nrc", "bing", or "afinn") and returns just that sentiment data frame (#25).

tidytext 0.1.1

  • Added documentation for n-grams, skip n-grams, and regex.
  • Added codecov and appveyor.
  • Added tidiers for LDA objects from topicmodels and a vignette on topic modeling.
  • Added a function to calculate the tf-idf of a tidy text dataset and a tf-idf vignette.
  • Fixed a bug when tidying by line/sentence/paragraph/regex when there are multiple non-text columns.
  • Fixed a bug when unnesting using n-grams and skip n-grams (the entire text was not being collapsed).
  • Added the ability to pass a (custom tokenizing) function to token. Also added a collapse argument that makes the choice of whether to combine lines before tokenizing explicit.
  • Changed tidy.dictionary to return a tbl_df rather than a data.frame.
  • Updated cast_sparse to work with dplyr 0.5.0.
  • Deprecated the pair_count function, which has been moved to pairwise_count in the widyr package. It will be removed entirely in a future version.

tidytext 0.1.0

  • Initial release for text mining using tidy tools


tidytext 0.1.9 by Julia Silge

Website: http://github.com/juliasilge/tidytext

Report a bug at http://github.com/juliasilge/tidytext/issues

Browse source code at https://github.com/cran/tidytext


Authors: Gabriela De Queiroz [ctb], Emil Hvitfeldt [ctb], Os Keyes [ctb] (<https://orcid.org/0000-0001-5196-609X>), Kanishka Misra [ctb], David Robinson [aut], Julia Silge [aut, cre] (<https://orcid.org/0000-0002-3671-836X>)


Task views: Natural Language Processing


License: MIT + file LICENSE


Imports: rlang, dplyr, stringr, hunspell, broom, Matrix, tokenizers, janeaustenr, purrr, methods, stopwords

Suggests: readr, tidyr, XML, tm, quanteda, knitr, rmarkdown, ggplot2, reshape2, wordcloud, topicmodels, NLP, scales, gutenbergr, testthat, mallet, stm, data.table

Imported by: available, crsra, statquotes, widyr

Suggested by: fivethirtyeight, funrar, gutenbergr, kdtools, polmineR, quanteda, rzeit2

