Toolkit for Corpus Analysis

Library for corpus analysis using the Corpus Workbench as an efficient back end for indexing and querying large corpora. The package offers functionality to flexibly create partitions and to carry out basic statistical operations (count, co-occurrences etc.). The original full text of documents can be reconstructed and inspected at any time. Beyond that, the package is intended to serve as an interface to packages implementing advanced statistical procedures. The respective data structures (document-term matrices, term co-occurrence matrices etc.) can be created based on the indexed corpora.

The purpose of the package 'polmineR' is to facilitate the interactive analysis of corpora using R. Core objectives for the development of the package are performance, usability, and a modular design.

There are many tools already for text mining. Why yet another one? Important incentives for developing the package were:

  • to create a package that makes the creation and analysis of subcorpora (called 'partitions' here) as easy as possible. A particular strength of the package should be to support contrastive/comparative research.
  • to keep the original text accessible. polmineR is based on the conviction that statistical analysis alone may be blind and deaf.
  • to provide an open source platform that will make text mining more productive, avoiding prohibitive costs of any kind. Well, some familiarity with R is still necessary.

polmineR relies on the Open Corpus Workbench (CWB) as a back end and uses the rcqp package as an interface. The CWB is particularly efficient for storing large corpora and offers a powerful query language for corpora through the Corpus Query Processor (CQP). The architecture may be overengineered if you work with smaller corpora. It is meant to make working with larger corpora efficient, both locally and on a server.

The polmineR-package was specifically developed to make full use of the XML annotation structure of the corpora created in the PolMine project. The core PolMine corpora are corpora of plenary protocols. In these corpora, speakers, parties etc. are structurally annotated. The polmineR-package is meant to help make full use of this rich annotation structure.

  • partition: Set up a partition (i.e. subcorpus);
  • context: Analyse the context of a query (including some statistics);
  • dispersion: Analyse the dispersion of a query across one or two dimensions (absolute and relative frequencies);
  • compare: Compare two partitions to identify specific vocabulary (using a chi-square test);
  • count: Count features.
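
To make the idea behind compare more concrete, here is a minimal, language-agnostic sketch in Python of how a chi-square test can flag vocabulary that is specific to one partition. It is not the polmineR implementation: the function names chi_square_keyness and rank_by_keyness are hypothetical, the input is assumed to be plain term-frequency dictionaries, and no continuity correction is applied.

```python
def chi_square_keyness(count_a, size_a, count_b, size_b):
    """Pearson chi-square statistic for a 2x2 table.

    Rows: partition of interest (A) and reference partition (B).
    Columns: occurrences of the term and all other tokens.
    Hypothetical helper, not part of polmineR.
    """
    observed = [
        [count_a, size_a - count_a],
        [count_b, size_b - count_b],
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [count_a + count_b, total - count_a - count_b]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of term and partition.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def rank_by_keyness(counts_a, counts_b):
    """Rank terms that are relatively more frequent in A than in B.

    counts_a and counts_b map terms to absolute frequencies
    (an assumed input format for this sketch).
    """
    size_a = sum(counts_a.values())
    size_b = sum(counts_b.values())
    scores = {}
    for term, count_a in counts_a.items():
        count_b = counts_b.get(term, 0)
        # Keep only terms overrepresented in A; the statistic itself
        # does not say in which direction the difference points.
        if count_a / size_a > count_b / size_b:
            scores[term] = chi_square_keyness(count_a, size_a, count_b, size_b)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling rank_by_keyness with the term counts of two partitions returns the terms overrepresented in the first one, ordered by the strength of the deviation from the reference partition.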

There are quite a few further functions, some of which are experimental. The publication of the polmineR-package on CRAN is planned as soon as the portability of the package is ensured. Most recent developments will be available here on GitHub.

In theory, it should be easy to install the package with the devtools mechanism. It has been checked on a preliminary basis that the package is portable, but feedback is most welcome. The tricky part of the installation will usually be the rcqp package. See the package vignette for some advice.

Feedback is most welcome! I want this to be a useful package, not just for me. Please do get in touch: Andreas Blaette, University of Duisburg-Essen.



  • refactoring of context-method to prepare more consistent usage
  • progress bar for context-method (using blapply)
  • progress bar for partitionBundle (using blapply)
  • more coherent naming of parameters in partitionBundle-method
  • partitionBundle,character-method debugged and more robust
  • usage of blapply in as.speeches-method
  • hits-method: parameter cqp defaults to FALSE, size defaults to FALSE
  • new parameter cqp for dispersion-method
  • aggregation for dispersion-method when length(sAttribute) == 1
  • bugfix for ngrams-method, sample code for the method


  • configure file removed to avoid unwanted bugs


  • this is the first version that passes all CRAN tests and that is available via CRAN
  • the 'rcqp' package remains the interface to the CWB, but usage of rcqp functions is wrapped into a new CQI.rcqp (R6) class. CQI.perl and CQI.cqpserver are introduced as alternative interfaces to prepare portability to Windows systems
  • code in the vignette and method examples will be executed conditionally, if rcqp and the polmineR.sampleCorpus are available
  • the polmineR.sampleCorpus package is available in a drat repo at
  • a series of bug fixes


  • slot tf renamed to stat, class is data.table now
  • keyness_method moved to data.tables


  • renamed collocations to cooccurrences, seems more appropriate


  • multicore for term frequency counts (param for partition)


  • renamed xxxCluster to xxxBundle, bundle-superclass introduced
  • slot label/labels renamed to name/names
  • name/names-method instead of label/labels


  • debugging of tf,partition-method: There was an error, if all hits for a query in a corpus were obtained outside of the partition
  • tf,context-method to provide quick access to number of query results
  • view-method as a wrapper for View
  • multicore for tf-method if cqp=TRUE


  • pAttributes-method for character, and partition objects for easy access to inspect available p-attributes
  • (hidden) helper functions .parseRegistry and .parseInfoFile
  • automatic detection of corpus type, if .info file is available
  • def parameter of partition-method may be NULL, and anchor element will be read from .info-file


  • cooccurrences method for partition objects


  • multicore for keyness,partitionCluster-method


  • bubblegraph-method removed, turned into an independent package (available at
  • tag-method using the treetag function from the koRpus package


  • DataTables-functionality moved to DataTablesR-package which is imported


  • browse-method for textstat objects: use DataTables.js


  • reshape collocations-class-objects with trim(object, reshape=TRUE)
  • collocationsReshaped-class
  • keynessCollocations-class


  • turned partition into a character-method
  • verbose=FALSE really implemented for


  • context,collocations-method
  • call-slot in most objects
  • as.TermDocumentMatrix,collocations-method finalized


  • textstat-class introduced to serve as superclass for keyness and context classes
  • chisquare-, ll-, pmi-methods for statistics
  • collocations-method introduced
  • multicore for partition-constructor (parallel preparation of tf-lists)
  • parallelisation of as.sparseMatrix,collocations-method
  • introduction of keyness,collocations-method


  • plot-method for partitionCluster
  • rm.blank-functionality extended


  • code re-ordered


  • meta,character-method to learn about a corpus without first generating a partition
  • rudimentary barplot-method for partitionCluster
  • functionality to remove empty rows in DocumentTermMatrix upon construction
  • tf/idf-weighting included in as.TermDocumentMatrix
  • summary-method for keynessCluster-class
  • [[- method for keynessCluster-class


  • some changes to context method:
    • whether to use multicore can be stated explicitly
    • stopwords renamed to stoplist, and a positivelist is introduced
  • individual documentation for partitionCluster enrich-method
  • NULL object returned for partition call if s-attribute/value-combination not available
  • call to dispersion (2dim): metadata are set up if not available
  • for context and keyness class to access statistics table more easily


  • minor bug fixes


  • controls() function for setting drillingControls
  • mail concordances
  • rework of kwic as an S4 method for partition and context class
  • speeches method
  • tf for partitionCluster improved


  • reorganization of the code in files so that shift to S4 methods is reflected
  • documentation for trim method
  • documentation for enrich method
  • addPos method integrated into enrich method [to be checked]
  • addPos is kept as a method, but not exported into namespace
  • set up of missing metadata for dispersion
  • warning if labels are missing in tf method for partitionCluster
  • Encoding of partition labels adjusted to encoding of console
  • adjust encoding for input to partitionCluster
  • speeches method is drafted at end of partition.R [final development, debugging]
  • context method: no explicit statement of posFilter required [to be checked]
  • methods for adjusting crosstab objects fused into trim,crosstab-method [BUG, needs to be checked]


  • context and contextCluster functions turned into methods


  • extended export functionality: mail statistics
  • parameters for partition and partitionCluster call simplified (tf, meta)
  • sAttributes method for character vectors to get sAttributes of corpus


  • html method to inspect partitions


  • tf method for partition and partitionCluster
  • summary for partitionCluster
  • selective setup of metadata for speeding up things


  • stopwords option introduced in context function for filtering and brute disambiguation
  • cqpQuery class thrown out again - it does not improve usability
  • partitionCluster and contextCluster are now S4 classes, with some more methods
  • adjust function is now trim method for contextCluster objects


  • zoom function added to specify partitions
  • partitionCluster faster as it relies on zoom function
  • context function will work on character vectors and cqpQuery object
  • some modifications of backend functions


  • technical update: automatic generation of NAMESPACE file with roxygen


  • new cqpQuery-class introduced
  • distribution-function renamed as dispersion
  • trim function for sorting tables


  • documentation streamlined, package fully roxygenized


  • using options (not exported list drillingControls)
  • getting Encoding of corpus from registry


  • method 'addPos' added for partition object, and keyness objects


  • inclusion of sample Data


  • reduction of dependencies for publication of the package on CRAN


  • adapting partition and distribution functions to cope with nested xml


  • wordCloud visualization
  • usability of concordances improved


  • partitionMerge changed so that it will use full functionality of partition-function
  • helper functions for distribution on two dimensions improved, tremendous gain in speed
  • started to fill in sample code into vignette


  • new function for frequency counts at partition setup


  • inclusion of shiny apps
  • function 'distribution' as wrapper for functions to inspect distribution


  • improved usability of the package, started to use lowerCamelCase


  • partition can now be called without explicitly stating a label
  • partition does not require sAttributes to be set, function .sattributes2cpos streamlined
  • no labels in partition.cluster
  • in context, the pos.filter can also be set as an exclusion


  • started to make functions more usable by shortening function names, 'partition.init' is now 'partition'
  • xterm highlighting for collocates
  • context can be called with parameters given by drillingControls


  • xtermStyle used for kwic output on console
  • partition object can be indexed


  • partition.init expanded to allow for the generation of partitions with specified start and end dates
  • wordscore analysis adapted so that it can actually be used in practice


  • export functionality to tm introduced with as.TermDocumentMatrix.partitioncluster


  • combine.collocates improved (new columns for plotting)


  • bug-fix for partition.merge


  • partition.init can be used without setting up frequency lists and metadata. This may be useful, if a quick partition.init is desired and term frequencies and/or metadata are not needed.
  • partition.init can handle sattribute-lists with length == 1. partition.init will still not work, if no sattributes are given.
  • query.crosstab has been renamed to crosstab
  • crosstab will accept special characters (transforming them to .)
