Provides an interface to perform cluster analysis on a corpus of
text. Interfaces to Quanteda to assemble text corpuses easily. Deviationalizes
text vectors prior to clustering using technique described by Sherin (Sherin,
B. . A computational study of commonsense science: An exploration in the
automated analysis of clinical interview data. Journal of the Learning Sciences,
22(4), 600-638. Chicago.
The goal of clustRcompaR is to make it easy to cluster (or group) a series of documents (texts of any length), and to interpret these groups and to describe their frequency across factors, such as between different groups or over time.
You can install the development version of clustRcompaR from GitHub with:
You can install the stable release on CRAN with:
This is a basic example using the built-in inaugural addressess dataset.
First, we use
cluster() to cluster the documents into three clusters. We include a new variable,
year_before_1900, which we will later use to compare frequencies across clusters. Then we use
extract_terms() to view the terms and term frequencies in the two clusters.
library(clustRcompaR)library(dplyr)#>#> Attaching package: 'dplyr'#> The following objects are masked from 'package:stats':#>#> filter, lag#> The following objects are masked from 'package:base':#>#> intersect, setdiff, setequal, uniond <- inaugural_addressesd <- mutate(d, century = ifelse(Year < 1800, "17th",ifelse(Year >= 1800 & Year < 1900, "18th",ifelse(Year >= 1900 & Year < 2000, "19th", "20th"))))three_clusters <- cluster(d, century, n_clusters = 3)#> Document-feature matrix of: 58 documents, 2,820 features (79.6% sparse).extract_terms(three_clusters)#> Cluster.1.Terms Cluster.1.Term.Frequencies Cluster.2.Terms#> 1 in 34.200000 in#> 2 my 13.866667 their#> 3 their 12.333333 govern#> 4 will 11.200000 will#> 5 govern 9.533333 has#> 6 peopl 7.200000 it#> 7 it 7.133333 state#> 8 nation 7.000000 been#> 9 has 6.733333 peopl#> 10 countri 6.533333 nation#> Cluster.2.Term.Frequencies Cluster.3.Terms Cluster.3.Term.Frequencies#> 1 77.52941 in 36.692308#> 2 22.88235 will 16.076923#> 3 21.41176 nation 12.500000#> 4 20.29412 us 12.038462#> 5 20.00000 world 9.807692#> 6 19.41176 peopl 9.307692#> 7 18.23529 can 7.769231#> 8 17.82353 must 7.730769#> 9 16.05882 america 7.423077#> 10 14.41176 no 7.192308
Second, we use the
compare() function to compare the frequency of clusters across a factor, in this case, the century. We can then use the
compare_test() (which uses a Chi-Square test) function.
three_clusters_comparison <- compare(three_clusters, "century")compare_plot(three_clusters_comparison)
NEWS.mdfile to track changes to the package.
compare()function so that which clusters and which profiles do not need to be manually entered
README.rmdfile with examples from the built-in inaugural addresses dataset from the quanteda package