Tools for Stemming and Lemmatizing Text

Tools that stem and lemmatize text. Stemming removes affixes (most often suffixes) from a word. Lemmatization groups the inflected forms of a word under a single base form.


textstem

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

textstem is a tool-set for stemming and lemmatizing words. Stemming is a process that removes affixes. Lemmatization is the process of grouping inflected forms together as a single base form.

Functions

The main functions, task categories, and descriptions are summarized in the table below:

Function                Task         Description
stem_words              stemming     Stem words
stem_strings            stemming     Stem strings
lemmatize_words         lemmatizing  Lemmatize words
lemmatize_strings       lemmatizing  Lemmatize strings
make_lemma_dictionary   lemmatizing  Generate a dictionary of lemmas for a text

Installation

To download the development version of textstem:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textstem")

Examples

The following examples demonstrate some of the functionality of textstem.

Load the Tools/Data

if (!require("pacman")) install.packages("pacman")
pacman::p_load(textstem, dplyr)

data(presidential_debates_2012)

Stemming Versus Lemmatizing

Before moving into the meat, these two examples highlight the difference between stemming and lemmatizing.

dw <- c('driver', 'drive', 'drove', 'driven', 'drives', 'driving')

stem_words(dw)

## [1] "driver" "drive"  "drove"  "driven" "drive"  "drive"

lemmatize_words(dw)

## [1] "driver" "drive"  "drive"  "drive"  "drive"  "drive"

"Be" Stemming vs. Lemmatizing

bw <- c('are', 'am', 'being', 'been', 'be')

stem_words(bw)

## [1] "ar"   "am"   "be"   "been" "be"

lemmatize_words(bw)

## [1] "be" "be" "be" "be" "be"
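
The two results above are easier to compare side by side. The following is a minimal sketch, assuming textstem is installed; the `comparison` data frame is a name introduced here for illustration and is not part of the package.

```r
library(textstem)

# Words from the "be" example above
bw <- c('are', 'am', 'being', 'been', 'be')

# One row per word, with its stem and its lemma in adjacent columns
comparison <- data.frame(
    word  = bw,
    stem  = stem_words(bw),
    lemma = lemmatize_words(bw),
    stringsAsFactors = FALSE
)

comparison
```

Each row pairs a word with both reductions, so cases where the stemmer leaves an irregular form untouched (such as "been") stand out against the lemma column.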

Stemming

Stemming is the act of removing inflections from a word; the result is not necessarily "identical to the morphological root of the word" (Wikipedia). Below I show stemming of several short strings.

y <- c(
    'the dirtier dog has eaten the pies',
    'that shameful pooch is tricky and sneaky',
    "He opened and then reopened the food bag",
    'There are skies of blue and red roses too!',
    NA,
    "The doggies, well they aren't joyfully running.",
    "The daddies are coming over...",
    "This is 34.546 above"
)
stem_strings(y)

## [1] "the dirtier dog ha eaten the pi"          
## [2] "that shame pooch i tricki and sneaki"     
## [3] "He open and then reopen the food bag"     
## [4] "There ar ski of blue and red rose too!"   
## [5] NA                                         
## [6] "The doggi, well thei aren't joyfulli run."
## [7] "The daddi ar come over..."                
## [8] "Thi i 34.546 abov"

Lemmatizing

Default Lemma Dictionary

Lemmatizing is the "grouping together the inflected forms of a word so they can be analysed as a single item" (Wikipedia). In the example below I reduce the strings to their lemma forms. lemmatize_strings uses a lookup dictionary; the default is Mechura's (2016) English lemmatization list, available from the lexicon package. The make_lemma_dictionary function provides two additional engines for generating a lemma lookup table for use in lemmatize_strings.

y <- c(
    'the dirtier dog has eaten the pies',
    'that shameful pooch is tricky and sneaky',
    "He opened and then reopened the food bag",
    'There are skies of blue and red roses too!',
    NA,
    "The doggies, well they aren't joyfully running.",
    "The daddies are coming over...",
    "This is 34.546 above"
)
lemmatize_strings(y)

## [1] "the dirty dog have eat the pie"           
## [2] "that shameful pooch be tricky and sneaky" 
## [3] "He open and then reopen the food bag"     
## [4] "There be sky of blue and red rose too!"   
## [5] NA                                         
## [6] "The doggy, good they aren't joyfully run."
## [7] "The daddy be come over..."                
## [8] "This be 34.546 above"
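
The lookup-dictionary idea can be illustrated in base R without any packages. This is only a toy sketch of the general technique, not textstem's implementation: `toy_dictionary` and `toy_lemmatize` are hypothetical names, the table holds four hand-picked entries, and tokens are split on single spaces only (no punctuation or NA handling).

```r
# Toy lookup table: inflected token -> lemma (hypothetical entries)
toy_dictionary <- c(has = "have", eaten = "eat", pies = "pie", dirtier = "dirty")

# Replace each token found in the table; leave unknown tokens unchanged
toy_lemmatize <- function(x, dictionary) {
    tokens <- strsplit(x, " ", fixed = TRUE)[[1]]
    hits <- match(tolower(tokens), names(dictionary))
    found <- !is.na(hits)
    tokens[found] <- unname(dictionary[hits[found]])
    paste(tokens, collapse = " ")
}

toy_lemmatize("the dirtier dog has eaten the pies", toy_dictionary)

## [1] "the dirty dog have eat the pie"
```

Because the approach is a plain table lookup, quality depends entirely on the coverage of the dictionary, which is why the choice of dictionary matters in the sections that follow.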

Hunspell Lemma Dictionary

This lemmatization uses the hunspell package to generate lemmas.

lemma_dictionary_hs <- make_lemma_dictionary(y, engine = 'hunspell')
lemmatize_strings(y, dictionary = lemma_dictionary_hs)

## [1] "the dirty dog has eat the pie"              
## [2] "that shameful pooch is tricky and sneaky"   
## [3] "He open and then opened the food bag"       
## [4] "There are sky of blue and red rose too!"    
## [5] NA                                           
## [6] "The doggy, well they aren't joyful running."
## [7] "The daddy are come over..."                 
## [8] "This is 34.546 above"

koRpus Lemma Dictionary

This lemmatization uses the koRpus package and the TreeTagger program to generate lemmas. You'll have to get TreeTagger set up, preferably in your machine's root directory.

lemma_dictionary_tt <- make_lemma_dictionary(y, engine = 'treetagger')
lemmatize_strings(y, dictionary = lemma_dictionary_tt)

## [1] "the dirty dog have eat the pie"           
## [2] "that shameful pooch be tricky and sneaky" 
## [3] "He open and then reopen the food bag"     
## [4] "There be sky of blue and red rose too!"   
## [5] NA                                         
## [6] "The doggy, well they aren't joyfully run."
## [7] "The daddy be come over..."                
## [8] "This be 34.546 above"
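
Since the dictionary argument is a lookup table, a hand-built table can presumably be supplied as well. The sketch below assumes that a two-column data frame (lower-case token first, lemma second) is accepted in the same way as the tables returned by make_lemma_dictionary; `custom_dictionary` and its entries are hypothetical and chosen only for illustration.

```r
library(textstem)

# Hypothetical custom lookup table: the first column is the lower-case token,
# the second column is the lemma to substitute (assumed layout)
custom_dictionary <- data.frame(
    token = c("doggies", "pooch"),
    lemma = c("dog", "dog"),
    stringsAsFactors = FALSE
)

result <- lemmatize_strings(
    "The doggies chased the pooch",
    dictionary = custom_dictionary
)

result
```

A table like this could be useful for domain vocabulary that none of the three engines covers; only the tokens listed in it would be replaced.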

Lemmatization Speed

It's pretty fast too. Observe:

tic <- Sys.time()

presidential_debates_2012$dialogue %>%
    lemmatize_strings() %>%
    head()

## [1] "We'll talk about specifically about health care in a moment."                            
## [2] "But what do you support the voucher system, Governor?"                                   
## [3] "What I support be no change for current retiree and near retiree to Medicare."           
## [4] "And the president support take dollar seven hundred sixteen billion out of that program."
## [5] "And what about the voucher?"                                                             
## [6] "So that's that's numb one."

(toc <- Sys.time() - tic)

## Time difference of 0.8978779 secs

That's 2,912 rows of text, or 42,708 words, in 0.9 seconds.
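
The same measurement can be taken with base R's system.time(), which wraps a single expression instead of taking two Sys.time() readings. This is a sketch, assuming textstem and its presidential_debates_2012 data are available; `timing` and `lemmas` are names introduced here, and the elapsed time will vary by machine.

```r
library(textstem)

data(presidential_debates_2012)

# system.time() reports user, system, and elapsed seconds for one expression
timing <- system.time(
    lemmas <- lemmatize_strings(presidential_debates_2012$dialogue)
)

# Wall-clock seconds for lemmatizing every row of dialogue
timing["elapsed"]
```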

Combine With Other Text Tools

This example shows how stemming/lemmatizing might be complemented by other text tools such as replace_contraction from the textclean package.

library(textclean)

'aren\'t' %>% 
    lemmatize_strings()

## [1] "aren't"

'aren\'t' %>% 
    textclean::replace_contraction() %>%
    lemmatize_strings()

## [1] "be not"

News

Versioning

Releases will be numbered with the following semantic versioning format:

major.minor.patch

And constructed with the following guidelines:

  • Breaking backward compatibility bumps the major (and resets the minor and patch)
  • New additions without breaking backward compatibility bump the minor (and reset the patch)
  • Bug fixes and misc changes bump the patch
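
The three rules above can be expressed as a small base R helper. This is an illustrative sketch only: `bump_version` is a hypothetical function, not part of textstem, and it assumes a version string with exactly three numeric components.

```r
# Hypothetical helper: bump one component of a "major.minor.patch" string
bump_version <- function(version, level = c("patch", "minor", "major")) {
    level <- match.arg(level)
    parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
    names(parts) <- c("major", "minor", "patch")
    if (level == "major") {
        # Breaking change: bump major, reset minor and patch
        parts <- c(parts[["major"]] + 1L, 0L, 0L)
    } else if (level == "minor") {
        # Backward-compatible addition: bump minor, reset patch
        parts <- c(parts[["major"]], parts[["minor"]] + 1L, 0L)
    } else {
        # Bug fix or misc change: bump patch only
        parts[["patch"]] <- parts[["patch"]] + 1L
    }
    paste(parts, collapse = ".")
}

bump_version("0.1.2", "patch")

## [1] "0.1.3"

bump_version("0.1.2", "minor")

## [1] "0.2.0"

bump_version("0.1.2", "major")

## [1] "1.0.0"
```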

textstem 0.1.0 - 0.1.2

BUG FIXES

  • lemmatize_strings and stem_strings would split numbers with decimals rather than treating each as a single token. This issue has been corrected (see issue #3).

textstem 0.0.1

This package is a collection of tools that stem and lemmatize text. Stemming is a process that removes endings such as suffixes. Lemmatization is the process of grouping inflected forms together as a single base form.

Install the released version from CRAN:

install.packages("textstem")

0.1.4 by Tyler Rinker, 2 months ago


http://github.com/trinker/textstem


Report a bug at http://github.com/trinker/textstem/issues


Browse source code at https://github.com/cran/textstem


Authors: Tyler Rinker [aut, cre]



GPL-2 license


Imports dplyr, hunspell, koRpus, lexicon, quanteda, SnowballC, stats, stringi, textclean, textshape, utils

Depends on koRpus.lang.en

Suggests testthat

