Modern Text Mining Framework for R

Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarity computation. The package provides a source-agnostic streaming API, which allows researchers to analyze collections of documents larger than available RAM. All core functions are parallelized to benefit from multicore machines.


You've just discovered text2vec!

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP).

The goals we aimed to achieve in developing text2vec:

  • Concise - expose as few functions as possible
  • Consistent - expose unified interfaces; no need to learn a new interface for each task
  • Flexible - allow complex tasks to be solved easily
  • Fast - maximize efficiency per single thread; transparently scale to multiple threads on multicore machines
  • Memory efficient - use streams and iterators; avoid keeping data in RAM where possible

To learn how to use this package, see text2vec.org and the package vignettes. See also the text2vec articles on my blog.

Features

The core functionality currently includes:

  1. Fast text vectorization on arbitrary n-grams, using vocabulary or feature hashing.
  2. GloVe word embeddings.
  3. Topic modeling with:
  • Latent Dirichlet Allocation
  • Latent Semantic Analysis
  4. Similarities/distances between two matrices
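To show how these pieces fit together, here is a minimal sketch of the vocabulary-based vectorization pipeline, using the movie_review dataset that ships with text2vec. Function names follow the 0.4.0 API; exact signatures may differ between versions.

```r
library(text2vec)
data("movie_review")

# 1. create an iterator over tokens
it <- itoken(movie_review$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = movie_review$id)

# 2. build and prune the vocabulary (unigrams and bigrams)
vocab <- create_vocabulary(it, ngram = c(1L, 2L))
vocab <- prune_vocabulary(vocab, term_count_min = 5)

# 3. turn the vocabulary into a vectorizer and build the DTM;
#    iterators are "immutable" since 0.4, so `it` can be reused
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)
dim(dtm)  # documents x terms sparse Matrix
```

The same vectorizer can later be applied to new documents, which keeps train and test matrices column-compatible.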

Performance

The author of the package is a little bit obsessed with efficiency.

This package is efficient because it is carefully written in C++, which also means that text2vec is memory friendly. Some parts, such as training GloVe word embeddings, are fully parallelized using the excellent RcppParallel package. This means that word embeddings are computed in parallel on OS X, Linux, Windows, and Solaris (x86) without any additional tuning or tricks. Other embarrassingly parallel tasks, such as vectorization, can use any parallel backend which supports the foreach package, so they can achieve near-linear scalability with the number of available cores. Finally, the streaming API means that users do not have to load all the data into RAM.
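As a sketch of the foreach-based parallelism described above: register any foreach-compatible backend (doParallel is assumed here), split the input into one itoken iterator per worker, and pass the list of iterators to the construction functions. The chunking scheme below is illustrative; how you split documents is up to you.

```r
library(text2vec)
library(doParallel)

# register a parallel backend that foreach can use
registerDoParallel(cores = 4)

data("movie_review")

# split documents into 4 chunks, one iterator per chunk
chunks <- split(movie_review$review,
                rep(1:4, length.out = nrow(movie_review)))
it_list <- lapply(chunks, itoken,
                  preprocessor = tolower,
                  tokenizer = word_tokenizer)

# vocabulary and DTM construction accept a list of iterators
# and distribute work across the registered workers
vocab <- create_vocabulary(it_list)
dtm <- create_dtm(it_list, vocab_vectorizer(vocab))
```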

Contributing

The package has an issue tracker on GitHub where I'm filing feature requests and notes for future work. Any ideas are appreciated.

Contributors are welcome.

License

GPL (>= 2)

News

text2vec 0.4.0

2016-10-03. See the 0.4 milestone tags.

  1. Now under the GPL (>= 2) license
  2. "immutable" iterators - no need to reinitialize them
  3. unified models interface
  4. New models: LSA, LDA, GloVe with L1 regularization
  5. Fast similarity and distances calculation: Cosine, Jaccard, Relaxed Word Mover's Distance, Euclidean
  6. Better handling of UTF-8 strings, thanks to @qinwf
  7. iterators and models rely on R6 package
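For illustration, the new similarity API can be used roughly as follows. This is a sketch against the 0.4.0 signatures of sim2(); the two DTMs must be built with the same vectorizer so their columns line up. The toy documents are made up for the example.

```r
library(text2vec)

# hypothetical toy corpora
d1 <- c("the cat sat on the mat", "dogs chase cats")
d2 <- c("a cat on a mat", "the stock market fell")

# build one shared vocabulary so both DTMs have identical columns
it_all <- itoken(c(d1, d2), tolower, word_tokenizer)
v <- vocab_vectorizer(create_vocabulary(it_all))

dtm1 <- create_dtm(itoken(d1, tolower, word_tokenizer), v)
dtm2 <- create_dtm(itoken(d2, tolower, word_tokenizer), v)

# pairwise cosine similarity between rows of dtm1 and dtm2
sim2(dtm1, dtm2, method = "cosine", norm = "l2")  # 2 x 2 matrix
```

Jaccard similarity works the same way with method = "jaccard"; dist2() provides the corresponding distances.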

text2vec 0.3.0

  1. 2016-01-13 fix for #46, thanks to @buhrmann for reporting
  2. 2016-01-16 vocabulary format changed.
    • doc_proportions are no longer kept; see #52.
    • added stop_words argument to prune_vocabulary; its signature was also changed.
  3. 2016-01-17 fix for #51. If the iterator over tokens returns a named list, these names will be:
    • stored as attr(corpus, 'ids')
    • used as rownames in the dtm
    • used as names for the dtm list in lda_c format
  4. 2016-02-02 high-level functions for corpus and vocabulary construction:
    • construction of a vocabulary from a list of itoken iterators.
    • construction of a dtm from a list of itoken iterators.
  5. 2016-02-10 renamed transformers
    • all transformers now start with transform_* - more intuitive, and simpler to use with autocompletion
  6. 2016-03-29 (accumulated since 2016-02-10)
    • rename vocabulary to create_vocabulary.
    • new functions create_dtm, create_tcm.
    • All core functions can now benefit from multicore machines (users have to register a parallel backend themselves)
    • Fix for progress bars: they now reach 100%, and ticks are incremented after computation
    • ids argument to itoken, which simplifies assignment of ids to rows of the DTM
    • create_vocabulary now can handle stopwords
    • see all updates here
  7. 2016-03-30 more robust split_into() util.

text2vec 0.2.0 (2016-01-10)

First CRAN release of text2vec.

  • Fast text vectorization with stable streaming API on arbitrary n-grams.
    • Functions for vocabulary extraction and management
    • Hash vectorizer (based on digest murmurhash3)
    • Vocabulary vectorizer
  • GloVe algorithm word embeddings.
    • Fast term co-occurrence matrix factorization via parallel asynchronous AdaGrad.
  • All core functions written in C++.
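The GloVe pipeline mentioned above can be sketched as follows. This follows the 0.4-era API as an assumption (in particular, specifying skip_grams_window on the vectorizer and the GlobalVectors class); signatures have changed in later releases.

```r
library(text2vec)
data("movie_review")

it <- itoken(movie_review$review, tolower, word_tokenizer)
vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)

# term co-occurrence matrix with a symmetric 5-word window;
# where the window is specified may differ by version
vectorizer <- vocab_vectorizer(vocab,
                               grow_dtm = FALSE,
                               skip_grams_window = 5L)
tcm <- create_tcm(it, vectorizer)

# factorize the TCM with parallel asynchronous AdaGrad
glove <- GlobalVectors$new(word_vectors_size = 50,
                           vocabulary = vocab, x_max = 10)
glove$fit(tcm, n_iter = 10)
word_vectors <- glove$get_word_vectors()
```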

Reference manual


install.packages("text2vec")

0.4.0 by Dmitriy Selivanov, 8 months ago


http://text2vec.org


Report a bug at https://github.com/dselivanov/text2vec/issues


Browse source code at https://github.com/cran/text2vec


Authors: Dmitriy Selivanov [aut, cre], Lincoln Mullen [ctb]


Documentation: PDF Manual


Task views: Natural Language Processing


GPL (>= 2) | file LICENSE license


Imports Matrix, Rcpp, RcppParallel, digest, foreach, data.table, magrittr, irlba, R6

Depends on methods

Suggests stringr, testthat, covr, knitr, rmarkdown, glmnet, parallel

Linking to Rcpp, RcppParallel, digest

System requirements: GNU make, C++11


Imported by textmineR.

