A Small Collection of Syllable Counting Functions

Tools for counting syllables and polysyllables. The tools rely primarily on a 'data.table' hash table lookup, resulting in fast syllable counting.


syllable

Project Status: Active - The project has reached a stable, usable state and is being actively developed.

syllable is a small collection of tools for counting syllables and polysyllables. The tools rely primarily on data.table hash table lookups, resulting in fast syllable counting.
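
As a rough illustration of the lookup idea, the sketch below counts syllables with a tiny hash table built from a base R environment. It is not the package's implementation (which relies on data.table); the dictionary values, the name count_string_sketch, and the choice to return NA for unknown words are all assumptions made for this example.

```r
# Toy dictionary: hypothetical values for illustration only
toy_counts <- c(i = 1L, like = 1L, chicken = 2L, sandwiches = 3L)

# A hashed environment gives constant-time lookup by word
syllable_dict <- list2env(as.list(toy_counts), hash = TRUE)

count_string_sketch <- function(x) {
    # Lowercase, drop punctuation, and split on whitespace
    words <- strsplit(gsub("[^a-z ]", "", tolower(x)), "\\s+")[[1]]
    words <- words[nzchar(words)]
    vapply(words, function(w) {
        if (exists(w, envir = syllable_dict, inherits = FALSE)) {
            get(w, envir = syllable_dict, inherits = FALSE)
        } else {
            NA_integer_  # assumption: unknown words are left uncounted
        }
    }, integer(1), USE.NAMES = FALSE)
}

count_string_sketch("I like chicken sandwiches.")
```

With this toy dictionary the call returns 1 1 2 3, one count per word.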

Main Functions

The main functions follow the format of action_object.

Actions

The following table outlines the actions. The Example Output column corresponds to this string: "I like chicken sandwiches."

| Action | Description | Returns | Example Output |
|--------|-------------|---------|----------------|
| count | One integer per word | A vector per string | 1, 1, 2, 3 |
| sum | Sum of syllable counts | An integer per string | 7 |
| tally* | Sum of syllable attributes | An integer per string | polysyllable tallies = 1 |

* Adding _mono, _di, _poly, _short (monosyllabic + disyllabic), or _both (short & polysyllabic) to tally lets the user specify which syllable attribute is tallied.
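
To make the tally suffixes concrete, here is a small base R sketch that derives each tally from the per-word counts of the example string. It assumes that polysyllabic means three or more syllables, which is consistent with the example output in the table but is not stated explicitly; the variable names are hypothetical and the package functions are not called.

```r
# Per-word syllable counts for "I like chicken sandwiches."
counts <- c(1L, 1L, 2L, 3L)

mono <- sum(counts == 1L)   # one-syllable words
di   <- sum(counts == 2L)   # two-syllable words
poly <- sum(counts >= 3L)   # assumption: poly = three or more syllables

tallies <- c(mono = mono, di = di, poly = poly, short = mono + di)
tallies

# The _both variant reports the short and poly tallies together
tallies[c("short", "poly")]
```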

Objects

The following table outlines the objects acted upon:

| Object | Description | Example |
|--------|-------------|---------|
| string | A character string | "I like chicken sandwiches." |
| vector* | A vector of character strings | c("I like it.", "Look out!") |

* Adding _by to vector lets the user aggregate by one or more vectors of grouping variables.

Putting It Together

The function count_vector returns one integer vector of per-word syllable counts for each string. Because its input is a vector of strings, the result is a list of integer vectors.

count_vector(c("I like it.", "Look out!"))

## $`1`
## [1] 1 1 1
## 
## $`2`
## [1] 1 1

Each of the main functions is optimized to do its task efficiently. One could sum the output of count_vector(x) string by string, for example with sapply(count_vector(x), sum), and get the same results as sum_vector(x), but it would be less efficient.
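
A brief sketch of the manual route, using a hard-coded list shaped like the count_vector output shown above rather than calling the package. Note that base R's sum() does not accept a list directly, so the per-string totals have to be computed element by element.

```r
# Stand-in for count_vector(c("I like it.", "Look out!")), copied from the output above
counts <- list(`1` = c(1L, 1L, 1L), `2` = c(1L, 1L))

# One total per string
per_string <- vapply(counts, sum, integer(1))
per_string
```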

The available syllable functions that follow the format of action_object are:

count_string tally_both_string tally_mono_string tally_short_string
count_vector tally_both_vector tally_mono_vector tally_short_vector
count_vector_by tally_both_vector_by tally_mono_vector_by tally_short_vector_by
sum_string tally_di_string tally_poly_string
sum_vector tally_di_vector tally_poly_vector
sum_vector_by tally_di_vector_by tally_poly_vector_by
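
The naming scheme can be reproduced mechanically. The following base R sketch, offered only as an illustration of the action_object convention, pastes every action onto every object and yields the same 21 names listed above; it builds strings and does not call the package.

```r
# Actions: count, sum, and the five tally variants
actions <- c("count", "sum",
    paste0("tally_", c("mono", "di", "poly", "short", "both")))

# Objects: a single string, a vector, or a vector aggregated by group
objects <- c("string", "vector", "vector_by")

# Every action_object combination
function_names <- sort(as.vector(outer(actions, objects, paste, sep = "_")))
function_names
```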

Available Variable Functions

Installation

To download the development version of syllable:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(
    'trinker/lexicon',
    'trinker/textclean',
    'trinker/textshape',
    'trinker/syllable'
)

Contact

You are welcome to report a bug at http://github.com/trinker/syllable/issues.

Examples

The following examples demonstrate the functionality of a select sample of syllable functions.

Count Syllables In a String

Counts the number of syllables for each word in a string.

count_string("I like chicken and eggs for breakfast")

## [1] 1 1 2 1 1 1 2

Count Syllables In a Vector of Strings

sents <- c("I like chicken.", "I want eggs benedict for breakfast.")
count_vector(sents)

## $`1`
## [1] 1 1 2
## 
## $`2`
## [1] 1 1 1 3 1 2

# Label each count with its lowercased, punctuation-stripped word
Map(function(x, y) setNames(x, y),
   count_vector(sents),
   strsplit(gsub("[^a-z ]", "", tolower(sents)), "\\s+")
)

## $`1`
##       i    like chicken 
##       1       1       2 
## 
## $`2`
##         i      want      eggs  benedict       for breakfast 
##         1         1         1         3         1         2

Sum the Syllables In a Vector of Strings by Grouping Variable(s)

dat <- data.frame(
   text = c("I like chicken.", "I want eggs benedict for breakfast.", "Really?"),
   group = c("A", "B", "A")
)
sum_vector_by(dat$text, dat$group)

##    group n.words count
## 1:     A       4     7
## 2:     B       6     9
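
What the grouped sum does can be sketched in base R. The example below is an illustration under assumptions, not the package's data.table code: the per-word counts for the first two strings come from the earlier count_vector output, and the count of 3 for "Really?" is inferred from the group A total above.

```r
# Per-word syllable counts for each row of dat$text
counts <- list(
    c(1L, 1L, 2L),               # "I like chicken."
    c(1L, 1L, 1L, 3L, 1L, 2L),   # "I want eggs benedict for breakfast."
    3L                           # "Really?" (inferred from the group total)
)
group <- c("A", "B", "A")

# Aggregate words and syllables within each group
n_words <- tapply(lengths(counts), group, sum)
count   <- tapply(vapply(counts, sum, integer(1)), group, sum)

data.frame(
    group   = names(count),
    n.words = as.vector(n_words),
    count   = as.vector(count)
)
```

The resulting rows (A: 4 words, 7 syllables; B: 6 words, 9 syllables) agree with the sum_vector_by output.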

Tally the Short/Poly-Syllabic Words by Group(s)

dat <- data.frame(
   text = c("I like excellent chicken.", "I want eggs benedict now.", "Really?"),
   group = c("A", "B", "A")
)
tally_both_vector_by(dat$text, dat$group)

##    group n.words short poly
## 1:     A       5     3    2
## 2:     B       5     4    1

with(presidential_debates_2012, tally_both_vector_by(dialogue, person))

##       person n.words short poly
## 1:     OBAMA   18319 16286 2033
## 2:    ROMNEY   19924 17858 2066
## 3:   CROWLEY    1672  1525  147
## 4:    LEHRER     765   674   91
## 5:  QUESTION     583   486   97
## 6: SCHIEFFER    1445  1289  156

Readability Word Statistics by Grouping Variable(s)

with(presidential_debates_2012, readability_word_stats_by(dialogue, list(person, time)))

##        person   time n.sents n.words n.chars n.sylls n.shorts n.polys
##  1:     OBAMA time 1     179    3599   16002    5221     3221     378
##  2:     OBAMA time 2     494    7477   32459   10654     6696     781
##  3:     OBAMA time 3     405    7243   32288   10675     6369     874
##  4:    ROMNEY time 1     279    4085   17984    5875     3646     439
##  5:    ROMNEY time 2     560    7536   32504   10720     6788     748
##  6:    ROMNEY time 3     569    8303   35824   11883     7424     879
##  7:   CROWLEY time 2     165    1672    6904    2308     1525     147
##  8:    LEHRER time 1      87     765    3256    1087      674      91
##  9:  QUESTION time 2      40     583    2765     930      486      97
## 10: SCHIEFFER time 3     133    1445    6234    2058     1289     156
##     n.complexes
##  1:         378
##  2:         781
##  3:         873
##  4:         439
##  5:         746
##  6:         878
##  7:         147
##  8:          91
##  9:          97
## 10:         156
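
Word statistics like these are the usual inputs to readability formulas. As a hedged illustration only, the sketch below plugs the first row of the output above (OBAMA, time 1) into the standard Flesch-Kincaid grade level formula. The function name flesch_kincaid_sketch is hypothetical and is not one of the functions documented here.

```r
# Flesch-Kincaid grade level:
# 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
flesch_kincaid_sketch <- function(n_sents, n_words, n_sylls) {
    0.39 * (n_words / n_sents) + 11.8 * (n_sylls / n_words) - 15.59
}

# Values copied from row 1 of the readability_word_stats_by output
fk <- flesch_kincaid_sketch(n_sents = 179, n_words = 3599, n_sylls = 5221)
round(fk, 1)
```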

Visualize Poly Syllable Distributions

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, scales)

# Proportion of polysyllabic words per element of dialogue; keeps rows with
# more than four tallied words and flags those above 30% polysyllabic
tally_both_vector(presidential_debates_2012$dialogue) %>%
    mutate(Duration = seq_along(poly)) %>%
    rowwise() %>%
    filter((short + poly) > 4) %>%
    mutate(
        short = short/(short+poly),
        poly = 1 - short,
        size = poly > .3
    ) %>%
    ggplot(aes(Duration, poly)) +
        geom_text(aes(label = Duration, size = size, color = size)) +
        coord_flip() +
        scale_size_manual(values = c(1.5, 2.5), guide = "none") +
        scale_color_manual(values = c("grey75", "black"), guide = "none") +
        scale_x_reverse() +
        scale_y_continuous(labels = scales::percent) +
        ylab("Poly-syllabic") +
        xlab("Duration (sentences)") +
        theme_bw()

Visualize Poly Syllable Distributions by Group

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, tidyr, scales)

# Share of short vs. polysyllabic words for each person-time group,
# ordered by the polysyllabic share
with(presidential_debates_2012, tally_both_vector_by(dialogue, list(person, time))) %>%
    mutate(
        person_time = paste(person, time, sep = "-"),
        short = short/(short+poly),
        poly = 1 - short
    ) %>%
    arrange(poly) %>%
    mutate(person_time = factor(person_time, levels = person_time)) %>%
    gather(type, prop, c(short, poly)) %>%
    ggplot(aes(person_time, weight = prop, fill = type)) +
        geom_bar() +
        coord_flip() +
        scale_y_continuous(labels = scales::percent) +
        scale_fill_discrete(name="Syllable\nType") +
        xlab("Person & Time") +
        ylab("Usage") +
        theme_bw()

News

Versioning

Releases will be numbered with the following semantic versioning format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

  • Breaking backward compatibility bumps the major (and resets the minor and patch)
  • New additions without breaking backward compatibility bump the minor (and reset the patch)
  • Bug fixes and miscellaneous changes bump the patch
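
The bump-and-reset rules above can be sketched as a small base R helper. This is an illustration of the guidelines only; bump_version is a hypothetical name and not part of syllable.

```r
bump_version <- function(version, type = c("major", "minor", "patch")) {
    type <- match.arg(type)
    # Split "major.minor.patch" into three integers
    parts <- as.integer(strsplit(version, ".", fixed = TRUE)[[1]])
    stopifnot(length(parts) == 3L)
    idx <- match(type, c("major", "minor", "patch"))
    parts[idx] <- parts[idx] + 1L
    # Reset every component to the right of the one that was bumped
    if (idx < 3L) parts[(idx + 1L):3L] <- 0L
    paste(parts, collapse = ".")
}

bump_version("0.1.3", "patch")  # bug fix
bump_version("0.1.3", "minor")  # backward-compatible addition
bump_version("0.1.3", "major")  # breaking change
```

Starting from 0.1.3, the three calls give 0.1.4, 0.2.0, and 1.0.0 respectively.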

syllable 0.1.3

CHANGES

  • as.tibble removed from all function arguments. This was a nice interactive feature that made programming very difficult to reason about. Having an environment-dependent output would result in no adoption of the syllable package as a dependency. The problem was so egregious, and the package young enough, that removal without deprecation was warranted.

syllable 0.1.0 - 0.1.2

NEW FEATURES

  • Users can now globally select a tibble output rather than a data.table output for all functions that output a data.table. This can be set globally via set_output. If the user does not set the output type, syllable tries to infer it based on whether or not the user has dplyr loaded. If dplyr is loaded, tibble is the default output.

  • set_output and tibble_output imported from textshape to globally set the output type (tibble or data.table) and to check/infer the desired output type.

IMPROVEMENTS

  • readability_word_stats and readability_word_stats_by used stringi's sentence detection. This was not accurate, as seen at http://stackoverflow.com/q/31865511/1000343. The package now utilizes NLP/openNLP to detect the number of sentences. This comes at the cost of speed.

  • readability_word_stats now removes -es & -ed suffixes for calculating n.complexes.

CHANGES

  • NLP and openNLP dependencies replaced with textshape and textclean to improve sentence detection speed.

  • textcleanLite dependency replaced with textclean because the hunspell dependency in textclean is no longer explicitly imported. This allows the package to be used within trickier environments such as Microsoft Azure.

syllable 0.0.1

This package is a small collection of tools for counting syllables and polysyllables. The tools rely primarily on a 'data.table' hash table lookup, resulting in fast syllable counting.

install.packages("syllable")

0.1.3 by Tyler Rinker, 2 years ago


http://github.com/trinker/syllable


Report a bug at http://github.com/trinker/syllable/issues


Browse source code at https://github.com/cran/syllable


Authors: Tyler Rinker [aut, cre]



GPL-2 license


Imports data.table, stringi, stats, textclean, textshape, utils

Suggests testthat

