Toolkit for Corpus Analysis

Library for corpus analysis using the Corpus Workbench as an efficient back end for indexing and querying large corpora. The package offers functionality to flexibly create partitions and to carry out basic statistical operations (count, co-occurrences etc.). The original full text of documents can be reconstructed and inspected at any time. Beyond that, the package is intended to serve as an interface to packages implementing advanced statistical procedures. Respective data structures (document term matrices, term co-occurrence matrices etc.) can be created based on the indexed corpora.



Purpose: The focus of the package 'polmineR' is the interactive analysis of corpora using R. Core objectives for the development of the package are performance, usability, and a modular design.

Aims: Key aims for developing the package are:

  • To keep the original text accessible. A seamless integration of qualitative and quantitative steps in corpus analysis supports validation, based on inspecting the text behind the numbers.

  • To provide a library for standard tasks. As an open source platform, the package is meant to make text mining more productive by avoiding the prohibitive cost of reimplementing basics, or of writing many lines of code to perform a basic task.

  • To create a package that makes the creation and analysis of subcorpora ('partitions') easy. A particular strength of the package is to support contrastive/comparative research.

  • To offer performance for users with an ordinary infrastructure. The package picks up the idea of a three-tier software design. Corpus data are managed and indexed by using the Open Corpus Workbench (CWB). The CWB is particularly efficient for storing large corpora and offers a powerful language for querying corpora, the Corpus Query Processor (CQP).

  • To support sharing consolidated and documented data, following the ideas of reproducible research.

Background: The polmineR package was developed specifically to make full use of the XML annotation structure of the corpora created in the PolMine project (see polmine.sowi.uni-due.de). The core PolMine corpora are corpora of plenary protocols. In these corpora, speakers, parties etc. are structurally annotated. The package is meant to help users exploit this rich annotation structure.

Core Functions

  • partition: Set up a partition (i.e. subcorpus);
  • count: Count features
  • dispersion: Analyse the dispersion of a query across one or two dimensions (absolute and relative frequencies);
  • cooccurrences: Analyse the context of a query (including some statistics);
  • features: Compare partitions to identify features / keywords (using statistical tests such as chi square);
  • kwic: Inspect concordances of a query (keyword in context).

Installation

Windows (32 bit)

At this stage, an easy way to install polmineR is available only for 32bit R. Usually, an R installation will include both 32bit and 64bit R. So if you want to keep things simple, make sure that you work with the 32bit version. If you work with RStudio (highly recommended), the menu Tools > Global Options will open a dialogue where you can choose 32bit R.
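
To confirm which build a session is actually running, a quick check along the following lines may help. This is a minimal sketch in base R; the helper name r_bits is made up for illustration and is not part of polmineR.

```r
# Report whether the running R session is a 32bit or a 64bit build.
# .Machine$sizeof.pointer is 4 on 32bit builds and 8 on 64bit builds.
r_bits <- function() {
  8L * .Machine$sizeof.pointer
}

message("R architecture: ", R.version$arch, " (", r_bits(), "bit)")
if (r_bits() != 32L) {
  message("This is not a 32bit session; the instructions in this section assume 32bit R.")
}
```

The same information is shown in the startup message of R, so the check is mostly useful inside scripts.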

Before installing polmineR, the package 'rcqp' needs to be installed. In turn, rcqp requires plyr, which should be installed first.

install.packages("plyr")

To avoid compiling C code in a package, packages with compiled binaries are very handy. Windows binaries for the rcqp package are not available at CRAN, but can be installed from a repository of packages maintained on the server of the PolMine project:

install.packages("rcqp", repos = "http://polmine.sowi.uni-due.de/packages", type = "win.binary")

To explain: Compiling the C code in the rcqp package on a Windows machine is not yet possible. The package we offer uses a cross-compilation of these C libraries, i.e. binaries that have been prepared for Windows on a MacOS/Linux machine.

Before proceeding to install polmineR, we install dependencies that are not installed automatically.

install.packages(pkgs = c("htmltools", "htmlwidgets", "magrittr", "iterators", "NLP"))
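
If some of these packages are already present, reinstalling them is unnecessary. The following sketch, using base R only, lists the packages that are still missing; the helper name missing_pkgs is hypothetical and introduced here for illustration.

```r
# Helper packages named in the instructions above.
pkgs <- c("htmltools", "htmlwidgets", "magrittr", "iterators", "NLP")

# Return the subset of 'pkgs' that is not yet installed.
# 'installed' defaults to the packages found in the current library paths.
missing_pkgs <- function(pkgs, installed = rownames(utils::installed.packages())) {
  setdiff(pkgs, installed)
}

to_install <- missing_pkgs(pkgs)
if (length(to_install) > 0L) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # Uncomment the next line to install only the missing packages:
  # install.packages(pkgs = to_install)
} else {
  message("All helper packages are already installed.")
}
```

The install.packages call is left commented out so that running the sketch does not change the library by accident.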

The latest stable version of polmineR can now be installed from CRAN. Other packages that polmineR depends on, directly or indirectly, will usually be installed automatically.

install.packages("polmineR")

The development version of the package, which may include the most recent updates and features, can be installed from GitHub. The easiest way to do this is to use a mechanism offered by the package devtools.

install.packages("devtools")
devtools::install_github("PolMine/polmineR", ref = "dev")

The installation may throw warnings. There are three warnings you can ignore at this stage:

  • "WARNING: this package has a configure script / It probably needs manual configuration".
  • The environment variable CORPUS_REGISTRY is not defined.
  • package 'rcqp' is not installed for 'arch = x64'.

The configure script is for Linux/MacOS installation; its sole purpose is to pass tests for uploading the package to CRAN. As mentioned, Windows binaries are not yet available for 64bit R, so the warning about 'arch = x64' can be ignored. The environment variable CORPUS_REGISTRY can be set as follows in R:

Sys.setenv(CORPUS_REGISTRY = "C:/PATH/TO/YOUR/REGISTRY")

To set the environment variable CORPUS_REGISTRY permanently, define it in the file '.Renviron' or 'Renviron.site'. The help page for the startup process (?Startup) explains where R looks for these files.
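
As an illustration of the permanent setting, the sketch below writes a CORPUS_REGISTRY entry to an environment file. It uses base R only. The function name set_registry_permanently and the registry path "C:/cwb/registry" are assumptions made for this example; the demonstration writes to a temporary file so that no real configuration is touched.

```r
# Add (or replace) the CORPUS_REGISTRY entry in an .Renviron-style file.
# 'renviron' defaults to the user-level file in the home directory.
set_registry_permanently <- function(registry,
                                     renviron = path.expand("~/.Renviron")) {
  entry <- sprintf("CORPUS_REGISTRY=%s", registry)
  old <- if (file.exists(renviron)) readLines(renviron, warn = FALSE) else character(0)
  # Drop any previous CORPUS_REGISTRY entry so that the variable is defined once.
  kept <- old[!grepl("^[[:space:]]*CORPUS_REGISTRY[[:space:]]*=", old)]
  writeLines(c(kept, entry), renviron)
  invisible(renviron)
}

# Demonstration on a temporary file (hypothetical registry path).
demo_file <- tempfile(fileext = ".Renviron")
set_registry_permanently("C:/cwb/registry", renviron = demo_file)
print(readLines(demo_file))
```

R reads the file at startup, so a new session is needed before Sys.getenv("CORPUS_REGISTRY") reflects the change.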

Two important notes concerning problems with the CORPUS_REGISTRY environment variable that may cause serious headaches:

  • The path cannot be processed if it contains any whitespace. Whitespace may occur in the user name ("C:/Users/Donald Duck/Documents"), for instance. We do not yet know of any workaround to make rcqp/CWB process whitespace. The recommendation is to keep the registry and the indexed_corpora in a directory whose path is free of whitespace (a directory such as "C:/cwb").

  • If you keep data on a different volume than your system files, your R packages etc. (e.g. volume 'C:' for system files, and 'D:' for data and user files), make sure that the working directory (set with setwd()) is a directory on the volume that holds the directory defined via CORPUS_REGISTRY. CWB/rcqp will assume that the CORPUS_REGISTRY directory is on the same volume as the current working directory (which can be identified by calling getwd()).
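
Both pitfalls can be checked before loading the package. The following is a minimal sketch in base R; the helpers drive_of and check_registry are hypothetical names introduced here, and the volume check only looks at Windows-style drive letters.

```r
# Extract the drive letter ("C", "D", ...) of a Windows-style path, or NA if absent.
drive_of <- function(path) {
  if (grepl("^[A-Za-z]:", path)) toupper(substr(path, 1, 1)) else NA_character_
}

# Return a character vector describing problems with the registry path.
# An empty vector means that none of the known pitfalls was detected.
check_registry <- function(registry = Sys.getenv("CORPUS_REGISTRY"), wd = getwd()) {
  problems <- character(0)
  if (!nzchar(registry)) {
    problems <- c(problems, "CORPUS_REGISTRY is not set")
  }
  if (grepl("[[:space:]]", registry)) {
    problems <- c(problems, "registry path contains whitespace")
  }
  reg_drive <- drive_of(registry)
  wd_drive <- drive_of(wd)
  if (!is.na(reg_drive) && !is.na(wd_drive) && reg_drive != wd_drive) {
    problems <- c(problems, "registry and working directory are on different volumes")
  }
  problems
}

# Example calls with the paths mentioned above.
check_registry("C:/Users/Donald Duck/Documents", wd = "C:/Users")
check_registry("D:/cwb/registry", wd = "C:/Users")
```

Called without arguments, check_registry() inspects the current value of CORPUS_REGISTRY and the current working directory.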

Finally: polmineR is optimized for working with RStudio. If you work with 32bit R, you may have to check in the settings of RStudio that it will call 32bit R. To be sure, check the startup message.

If everything works, check whether polmineR can be loaded.

library(polmineR)
corpus() # to see the corpora available on your system

Windows (64 bit / x64)

At this stage, 64 bit support is still experimental. Apart from an installation of 64 bit R, you will need to install Rtools, available here. Rtools is a collection of tools necessary to build and compile R packages on a Windows machine.

To interface to a core C library of the Corpus Workbench (CWB), you will need an installation of a 64 bit AND a 32 bit version of the CWB.

The "official" 32 bit version of the CWB is available here. Installation instructions are available at the CWB Website. The 32 bit version should be installed in the directory "C:Files", with admin rights.

The 64 bit version, prepared by Andreas Blaette, is available here. Install this 64 bit CWB version to "C:Files (x86)". In the unzipped downloaded zip file, you will find a bat file that will do the installation. Take care that you run the file with administrator rights. Without these rights, no files will be copied.

The interface to the Corpus Workbench is the package polmineR.Rcpp, available at GitHub. If you use git, you can clone that repository; otherwise, you can download a zip file.

The downloaded zip file needs to be unzipped again. Then, in the directory with the 'polmineR.Rcpp'-directory, run:

R CMD build polmineR.Rcpp
R CMD INSTALL polmineR.Rcpp_0.1.0.tar.gz

If you read closely what is going on during the compilation, you will see a few warnings that libraries are not found. If creating the package is not aborted, nothing is wrong. R CMD build will look for the 64 bit files in the directory with the 32 bit dlls first and discover that they do not work for 64 bit, only then will it move to the correct location.

Once polmineR.Rcpp is installed, proceed with the instructions for installing polmineR in a 32 bit context. Future binary releases of the polmineR.Rcpp package may make things easier. In any case, this is proof of concept that polmineR will work on a 64 bit Windows machine too.

Finally, you need to make sure that polmineR will interface to CWB indexed corpora using polmineR.Rcpp, and not with rcqp (the default). To set the interface accordingly:

setCorpusWorkbenchInterface("Rcpp")

To test whether corpora are available:

corpus()

MacOS

The following instructions for Mac users assume that R is installed on your system. Binaries are available from the Homepage of the R Project. An installation of RStudio is highly recommended. The Open Source License version of RStudio Desktop is what you need.

Installing 'polmineR'

The latest release of polmineR can be installed from CRAN using the usual install.packages-function.

install.packages("polmineR")

The development version of polmineR can be installed using devtools:

install.packages("devtools") # unless devtools is already installed
devtools::install_github("PolMine/polmineR", ref = "dev")

Installing 'rcqp'

The default interface of the polmineR package to access CWB indexed corpora is the package 'rcqp'. Accessing corpora will not work before you have installed the interface.

Installing precompiled binary of rcqp from the PolMine server

The easiest way to get rcqp for Mac is to install a precompiled binary that is available at the PolMine server:

install.packages(
  "rcqp",
  repos = "http://polmine.sowi.uni-due.de/packages",
  type = "mac.binary"
)

Building rcqp from source

If you want to get rcqp from CRAN and/or if you want to compile the C code yourself, the procedure is as follows.

First, you will need an installation of Xcode, which you can get via the Mac App Store. You will also need the Command Line Tools for Xcode, which can be installed from a terminal with:

xcode-select --install

To compile the C code in the rcqp package, there are system requirements that need to be fulfilled. Using a package manager such as Homebrew or Macports makes things considerably easier.

Option 1: Using Homebrew

We recommend using Homebrew. To install Homebrew, follow the instructions on the Homebrew Homepage. The following commands will install the C libraries the rcqp package relies on:

brew -v install pkg-config
brew -v install glib --universal
brew -v install pcre --universal
brew -v install readline

Option 2: Using Macports

If you prefer using Macports, get it from https://www.macports.org/. After installing Macports, it is necessary to restart the computer. Next, an update of Macports is necessary.

sudo port -v selfupdate

Now we can install the libraries rcqp will require. Again, from the terminal.

sudo port install glib2
sudo port install pkgconfig
sudo port install pcre

Install dependencies and rcqp

Once the system requirements are there, the next steps can be done from R. Before installing rcqp, and then polmineR, we install a few packages. In the R console:

install.packages(pkgs = c("RUnit", "devtools", "plyr", "tm"))

Now rcqp can be installed, and then polmineR:

install.packages("rcqp")
install.packages("polmineR")

If you would like to work with the development version, it can be installed from GitHub.

devtools::install_github("PolMine/polmineR", ref = "dev")

Linux

The pcre, glib and pkg-config libraries can be installed using apt-get.

sudo apt-get install libglib2.0-dev
sudo apt-get install libssl-dev
sudo apt-get install libcurl4-openssl-dev

The system requirements will now be fulfilled. From R, install dependencies for rcqp/polmineR first, and then rcqp and polmineR.

install.packages(pkgs = c("RUnit", "devtools", "plyr", "tm"))
install.packages("rcqp")
install.packages("polmineR")

News

v0.7.5

  • class 'Regions' renamed to class 'regions' as a matter of consistency
  • data type of slot cpos of class 'regions' is a matrix now
  • rework and improved documentation for decode- and encode-methods
  • new functions copy.corpus and rename.corpus
  • as.DocumentTermMatrix-method checks for strucs with value -1
  • improved as.speeches-method: reordering of speeches, default values
  • blapply-method: verbose output will be suppressed if progress is TRUE

v0.7.4

  • applying stoplists and positivelists working again for context-method
  • matches-method to learn about matches for CQP queries replacing frequencies-method
  • Rework of enrich-method, including documentation.
  • param 'neighbor' dropped from kwic,context-method; params positivelist and negativelist offer equivalent functionality
  • highlight-method for (newly exported) kwic-method (for validation purposes)
  • performance improvement for partitionBundle,character-method
  • a new Labels class and label method for generating test data
  • bug removed for partitionBundle,character-class, and performance improved
  • Improved explanation of the installation procedure for Mac in the package vignette
  • for context-method: param sAttribute working again to check boundaries of match regions
  • sample-method for objects of class kwic and context
  • kwic, cpos, and context method will accept queries of length > 1
  • use-function and resetRegistry-function reworked
  • more explicit startup message to get info about version, registry and interface
  • encoding issues solved for size-method, hits-method and dispersion-method
  • use-function will now work for users working with polmineR.Rcpp as interface

v0.7.3

  • new installed.corpora() convenience function to list all data packages with corpora
  • view-method and show-method for cooccurrences-objects now successfully redirect output to RStudio viewer
  • data.table-style indexing of objects inheriting from textstat-class
  • for windows compatibility, as.corpusEnc/as.nativeEnc for encoding conversion
  • performance gain for size-method by using polmineR.Rcpp
  • dissect-method dropped (replaced by size)
  • improved documentation of size-method
  • labels for cooccurrences-output
  • cooccurrencesBundle-class and cooccurrence-method for bundle restored
  • as.data.table for cooccurrencesBundle-class
  • count-method for whole corpus for pAttribute > 1
  • functionality of meta-method merged into sAttributes-method (meta-method dropped)
  • speed improvements for generating html output for reading
  • previously unexported highlight-method now exported, and more robust than before (using xml2)
  • progress bars for multicore operations now generated by pbapply package
  • starting to use testthat for unit testing

v0.7.2

  • updated documentation of partition-method.
  • documentation of hits-method improved
  • use-method: default value for pkg is NULL (return to default registry), function more robust
  • Rework for parsing the registry
  • rework of templates, are part of options now (see ?setTemplate, ?getTemplate)
  • experimental use of polmineR.Rcpp-package for fast counts for whole corpus
  • new convenience function install.corpus to install CWB corpus wrapped into R data package
  • adjustments to make package compatible with polmineR.shiny
  • cpos-method to get hits more robust if there are no matches for string
  • hits-method removes NAs
  • compare-method renamed to features-method
  • warnings caused by startup on windows removed

v0.7.1

  • size-method now allows for a param 'sAttribute'
  • hits-method reworked, allows for named query vectors
  • first version that can be installed on windows

v0.7.0

  • rcqp package moved to suggests, to facilitate installation

v0.6.3

  • more generic implementation of as.markdown-method to prepare use of templates
  • LICENSE file updated
  • getTokenStream,character-method: new default behavior for params left and right
  • use of templates for as.markdown-method
  • Regions and TokenStream class (not for frontend use, so far)
  • getTermFrequencies-method merged into count-method
  • Corpus class introduced
  • decode- and encode-methods introduced

v0.6.2

  • refactoring of context-method to prepare more consistent usage
  • progress bar for context-method (using blapply)
  • progress bar for partitionBundle (using blapply)
  • more coherent naming of parameters in partitionBundle-method
  • partitionBundle,character-method debugged and more robust
  • usage of blapply in as.speeches-method
  • hits-method: parameter cqp defaults to FALSE, size defaults to FALSE
  • new parameter cqp for dispersion-method
  • aggregation for dispersion-method when length(sAttribute) == 1
  • bugfix for ngrams-method, sample code for the method

v0.6.1

  • configure file removed to avoid unwanted bugs

v0.6.0

  • this is the first version that passes all CRAN tests and that is available via CRAN
  • the 'rcqp' package remains the interface to the CWB, but usage of rcqp functions is wrapped into a new CQI.rcqp (R6) class. CQI.perl and CQI.cqpserver are introduced as alternative interfaces to prepare portability to Windows systems
  • code in the vignette and method examples will be executed conditionally, if rcqp and the polmineR.sampleCorpus are available
  • the polmineR.sampleCorpus package is available in a drat repo at www.github.com/PolMine
  • a series of bug fixes

v0.5.6

  • slot tf renamed to stat, class is data.table now
  • keyness_method moved to data.tables

v0.5.3

  • renamed collocations to cooccurrences, seems more appropriate

v0.5.2

  • multicore for term frequency counts (param for partition)

v0.5.0

  • renamed xxxCluster to xxxBundle, bundle-superclass introduced
  • slot label/labels renamed to name/names
  • name/names-method instead of label/labels

v0.4.57

  • debugging of tf,partition-method: There was an error, if all hits for a query in a corpus were obtained outside of the partition
  • tf,context-method to provide quick access to number of query results
  • view-method as a wrapper for View
  • multicore for tf-method if cqp=TRUE

v0.4.56

  • pAttributes-method for character, and partition objects for easy access to inspect available p-attributes
  • (hidden) helper functions .parseRegistry and .parseInfoFile
  • automatic detection of corpus type, if .info file is available
  • def parameter of partition-method may be NULL, and anchor element will be read from .info-file

v0.4.49

  • cooccurrences method for partition objects

v0.46

  • multicore for keyness,partitionCluster-method

v0.45

  • bubblegraph-method removed, turned into an independent package (available at github.com/ablaette/bubblegraph)
  • tag-method using the treetag function from the koRpus package

v0.41

  • DataTables-functionality moved to DataTablesR-package which is imported

v0.40

  • browse-method for textstat objects: use DataTables.js

v0.39

  • reshape collocations-class-objects with trim(object, reshape=TRUE)
  • collocationsReshaped-class
  • keynessCollocations-class

v0.37

  • turned partition in a character-method
  • verbose=FALSE really implemented for

v0.36

  • context,collocations-method
  • call-slot in most objects
  • as.TermDocumentMatrix,collocations-method finalized

v0.35

  • textstat-class introduced to serve as superclass for keyness- and context- classes
  • chisquare-, ll-, pmi-methods for statistics
  • collocations-method introduced
  • multicore for partition-constructor (parallel preparation of tf-lists)
  • parallelisation of as.sparseMatrix,collocations-method
  • introduction of keyness,collocations-method

v0.4.34

  • plot-method for partitionCluster
  • rm.blank-functionality extended

v0.4.33

  • code re-ordered

v0.4.32

  • meta,character-method to learn about a corpus without first generating a partition
  • rudimentary barplot-method for partitionCluster
  • functionality to remove empty rows in DocumentTermMatrix upon construction
  • tf/idf-weighting included in as.TermDocumentMatrix
  • summary-method for keynessCluster-class
  • [[- method for keynessCluster-class

v0.4.31

  • some changes to context method:
    • whether to use multicore can be stated explicitly
    • stopwords renamed to stoplist, and a positivelist is introduced
  • individual documentation for partitionCluster enrich-method
  • NULL object returned for partition call if s-attribute/value-combination not available
  • call to dispersion (2dim): metadata are set up if not available
  • as.data.frame-method for context and keyness class to access statistics table more easily

v0.4.30

  • minor bug fixes

v0.4.29

  • controls() function for setting drillingControls
  • mail concordances
  • rework of kwic as a S4 method for partition and context class
  • speeches method
  • tf for partitionCluster improved

v0.4.28

  • reorganization of the code in files so that shift to S4 methods is reflected
  • documentation for trim method
  • documentation for enrich method
  • addPos method integrated into enrich method [to be checked]
  • addPos is kept as a method, but not exported into namespace
  • set up of missing metadata for dispersion
  • warning if labels are missing in tf method for partitionCluster
  • Encoding of partition labels adjusted to encoding of console
  • adjust encoding for input to partitionCluster
  • speeches method is drafted at end of partition.R [final development, debugging]
  • context method: no explicit statement of posFilter required [to be checked]
  • methods for adjusting crosstab objects fused into trim,crosstab-method [BUG, needs to be checked]

v0.4.27

  • context and contextCluster functions turned into methods

v0.4.26

  • extended export functionality: mail statistics
  • parameters for partition and partitionCluster call simplified (tf, meta)
  • sAttributes method for character vectors to get sAttributes of corpus

v0.4.25

  • html method to inspect partitions

v0.4.24

  • tf method for partition and partitionCluster
  • summary for partitionCluster
  • selective setup of metadata for speeding up things

v0.4.23

  • stopwords option introduced in context function for filtering and brute disambiguation
  • cqpQuery class thrown out again - it does not improve usability
  • partitionCluster and contextCluster are now S4 classes, with some more methods
  • adjust function is now trim method for contextCluster objects

v0.4.22

  • zoom function added to specify partitions
  • partitionCluster faster as it relies on zoom function
  • context function will work on character vectors and cqpQuery object
  • some modifications of backend functions

v0.4.21

  • technical update: automatic generation of NAMESPACE file with roxygen

v0.4.20

  • new cqpQuery-class introduced
  • distribution-function renamed as dispersion
  • trim function for sorting tables

v0.4.19

  • documentation streamlined, package fully roxygenized

v0.4.18

  • using options (not exported list drillingControls)
  • getting Encoding of corpus from registry

v0.4.17

  • method 'addPos' added for partition object, and keyness objects

v0.4.16

  • inclusion of sample Data

v0.4.15

  • reduction of dependencies for publication of the package on CRAN

v0.4.14

  • adapting partition and distribution functions to cope with nested xml

v0.4.13

  • wordCloud visualization
  • usability of concordances improved

v0.4.12

  • partitionMerge changed so that it will use full functionality of partition-function
  • helper functions for distribution on two dimensions improved, tremendous gain in speed
  • started to fill in sample code into vignette

v0.4.12

  • new function for frequency counts at partition setup

v0.4.11

  • inclusion of shiny apps
  • function 'distribution' as wrapper for functions to inspect distribution

v0.4.10

  • improved usability of the package, started to use lowerCamelCase

v0.4.9

  • partition can now be called without explicitly stating a label
  • partition does not require sAttributes to be set, function .sattributes2cpos streamlined
  • no labels in partition.cluster
  • in context, the pos.filter can also be set as an exclusion

v0.4.8

  • started to make functions more usable by shortening function names, 'partition.init' is now 'partition'
  • xterm highlighting for collocates
  • context can be called with parameters given by drillingControls

v0.4.7

  • xtermStyle used for kwic output on console
  • partition object can be indexed

v0.4.6

  • partition.init expanded to allow for the generation of partitions with specified start and end dates
  • wordscore analysis adapted so that it can be really used for performing wordscore analysis

v0.4.5

  • export functionality to tm introduced with as.TermDocumentMatrix.partitioncluster

v0.4.4

  • combine.collocates improved (new columns for plotting)

v0.4.3

  • bug-fix for partition.merge

v0.4.1

  • partition.init can be used without setting up frequency lists and metadata. This may be useful, if a quick partition.init is desired and term frequencies and/or metadata are not needed.
  • partition.init can handle sattribute-lists with length == 1. partition.init will still not work, if no sattributes are given.
  • query.crosstab has been renamed to crosstab
  • crosstab will accept special characters (transforming them to .)
