Toolkit for Corpus Analysis

Library for corpus analysis using the Corpus Workbench as an efficient back end for indexing and querying large corpora. The package offers functionality to flexibly create partitions and to carry out basic statistical operations (count, co-occurrences etc.). The original full text of documents can be reconstructed and inspected at any time. Beyond that, the package is intended to serve as an interface to packages implementing advanced statistical procedures. Respective data structures (document term matrices, term co-occurrence matrices etc.) can be created based on the indexed corpora.


The purpose of the package 'polmineR' is to facilitate the interactive analysis of corpora using R. Core objectives for the development of the package are performance, usability, and a modular design.

Many tools for text mining already exist. Why yet another one? Important incentives for developing the package were:

  • to create a package that makes the creation and analysis of subcorpora (called 'partitions' here) as easy as possible. A particular strength of the package should be to support contrastive/comparative research.
  • to keep the original text accessible. polmineR is based on the conviction that statistical analysis alone may be blind and deaf.
  • to provide an open source platform that will make text mining more productive, avoiding prohibitive costs of any kind. Well, some familiarity with R is still necessary.

polmineR relies on the Open Corpus Workbench (CWB) as a backend and uses the rcqp package as an interface. The CWB is particularly efficient for storing large corpora and offers a powerful language for querying corpora, the Corpus Query Processor (CQP). The architecture may be overengineered if you work with smaller corpora. It is meant to make working with larger corpora efficient, both locally and on a server.

The polmineR-package was specifically developed for the corpora created in the PolMine project (see polmine.sowi.uni-due.de). The core PolMine corpora are corpora of plenary protocols. In these corpora, speakers, parties etc. are structurally annotated. The polmineR-package is meant to help make full use of this rich XML annotation structure.

The core functions of the package are:

  • partition: Set up a partition (i.e. subcorpus);
  • context: Analyse the context of a query (including some statistics);
  • dispersion: Analyse the dispersion of a query across one or two dimensions (absolute and relative frequencies);
  • compare: Compare two partitions to identify specific vocabulary (using a chi-square test);
  • count: Count features.
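
To illustrate the idea behind comparing two partitions with a chi-square test, here is a minimal sketch in base R. It is not the package's implementation and does not call any polmineR function: the term counts, the partition sizes and the helper chisq_term are invented for illustration only.

```r
# Toy term counts for two hypothetical partitions (invented numbers).
count_a <- c(freedom = 120, tax = 40, climate = 15)
count_b <- c(freedom = 300, tax = 310, climate = 20)
size_a <- 10000  # assumed total number of tokens in partition A
size_b <- 50000  # assumed total number of tokens in partition B

# Chi-square statistic for a single term, based on a 2x2 contingency table:
# rows = partition A / partition B, columns = the term / all other tokens.
# chisq_term is a hypothetical helper, not part of any package.
chisq_term <- function(a, b, n_a, n_b) {
  observed <- matrix(c(a, n_a - a, b, n_b - b), nrow = 2, byrow = TRUE)
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  sum((observed - expected)^2 / expected)
}

# One statistic per term; names are taken from count_a.
scores <- mapply(chisq_term, count_a, count_b,
                 MoreArgs = list(n_a = size_a, n_b = size_b))

# Terms with the largest statistic differ most between the two partitions.
print(sort(scores, decreasing = TRUE))
```

A high score only says that a term is distributed unevenly. Whether it is over- or underrepresented in the first partition still has to be read from the relative frequencies.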

There are quite a few further functions, some of which are experimental. The publication of the polmineR-package on CRAN is planned as soon as the portability of the package is ensured. Most recent developments will be available here on GitHub.

In theory, it should be easy to install the package using devtools. Preliminary checks indicate that the package is portable, but feedback is most welcome. The tricky part of the installation will usually be the rcqp package. See the package vignette for some advice.

Feedback is most welcome! I want this to be a useful package, not just for me. Please do get in touch: Andreas Blaette, University of Duisburg-Essen (andreas.blaette@uni-due.de).

News

v0.6.2

  • refactoring of context-method to prepare more consistent usage
  • progress bar for context-method (using blapply)
  • progress bar for partitionBundle (using blapply)
  • more coherent naming of parameters in partitionBundle-method
  • partitionBundle,character-method debugged and more robust
  • usage of blapply in as.speeches-method
  • hits-method: parameter cqp defaults to FALSE, size defaults to FALSE
  • new parameter cqp for dispersion-method
  • aggregation for dispersion-method when length(sAttribute) == 1
  • bugfix for ngrams-method, sample code for the method

v0.6.1

  • configure file removed to avoid unwanted bugs

v0.6.0

  • this is the first version that passes all CRAN tests and that is available via CRAN
  • the 'rcqp' package remains the interface to the CWB, but usage of rcqp functions is wrapped into a new CQI.rcqp (R6) class. CQI.perl and CQI.cqpserver are introduced as alternative interfaces to prepare portability to Windows systems
  • code in the vignette and method examples will be executed conditionally, if rcqp and the polmineR.sampleCorpus are available
  • the polmineR.sampleCorpus package is available in a drat repo at www.github.com/PolMine
  • a series of bug fixes

v0.5.6

  • slot tf renamed to stat, class is data.table now
  • keyness_method moved to data.tables

v0.5.3

  • renamed collocations to cooccurrences, seems more appropriate

v0.5.2

  • multicore for term frequency counts (param for partition)

v0.5.0

  • renamed xxxCluster to xxxBundle, bundle-superclass introduced
  • slot label/labels renamed to name/names
  • name/names-method instead of label/labels

v0.4.57

  • debugging of tf,partition-method: There was an error, if all hits for a query in a corpus were obtained outside of the partition
  • tf,context-method to provide quick access to number of query results
  • view-method as a wrapper for View
  • multicore for tf-method if cqp=TRUE

v0.4.56

  • pAttributes-method for character, and partition objects for easy access to inspect available p-attributes
  • (hidden) helper functions .parseRegistry and .parseInfoFile
  • automatic detection of corpus type, if .info file is available
  • def parameter of partition-method may be NULL, and anchor element will be read from .info-file

v0.4.49

  • cooccurrences method for partition objects

v0.46

  • multicore for keyness,partitionCluster-method

v0.45

  • bubblegraph-method removed, turned into an independent package (available at github.com/ablaette/bubblegraph)
  • tag-method using the treetag function from the koRpus package

v0.41

  • DataTables-functionality moved to DataTablesR-package which is imported

v0.40

  • browse-method for textstat objects: use DataTables.js

v0.39

  • reshape collocations-class-objects with trim(object, reshape=TRUE)
  • collocationsReshaped-class
  • keynessCollocations-class

v0.37

  • turned partition into a character-method
  • verbose=FALSE really implemented for

v0.36

  • context,collocations-method
  • call-slot in most objects
  • as.TermDocumentMatrix,collocations-method finalized

v0.35

  • textstat-class introduced to serve as superclass for keyness- and context- classes
  • chisquare-, ll-, pmi-methods for statistics
  • collocations-method introduced
  • multicore for partition-constructor (parallel preparation of tf-lists)
  • parallelisation of as.sparseMatrix,collocations-method
  • introduction of keyness,collocations-method

v0.4.34

  • plot-method for partitionCluster
  • rm.blank-functionality extended

v0.4.33

  • code re-ordered

v0.4.32

  • meta,character-method to learn about a corpus without first generating a partition
  • rudimentary barplot-method for partitionCluster
  • functionality to remove empty rows in DocumentTermMatrix upon construction
  • tf/idf-weighting included in as.TermDocumentMatrix
  • summary-method for keynessCluster-class
  • [[- method for keynessCluster-class

v0.4.31

  • some changes to context method:
    • whether to use multicore can be stated explicitly
    • stopwords renamed to stoplist, and a positivelist is introduced
  • individual documentation for partitionCluster enrich-method
  • NULL object returned for partition call if s-attribute/value-combination not available
  • call to dispersion (2dim): metadata are set up if not available
  • as.data.frame-method for context and keyness class to access statistics table more easily

v0.4.30

  • minor bug fixes

v0.4.29

  • controls() function for setting drillingControls
  • mail concordances
  • rework of kwic as an S4 method for partition and context class
  • speeches method
  • tf for partitionCluster improved

v0.4.28

  • reorganization of the code in files so that shift to S4 methods is reflected
  • documentation for trim method
  • documentation for enrich method
  • addPos method integrated into enrich method [to be checked]
  • addPos is kept as a method, but not exported into namespace
  • set up of missing metadata for dispersion
  • warning if labels are missing in tf method for partitionCluster
  • Encoding of partition labels adjusted to encoding of console
  • adjust encoding for input to partitionCluster
  • speeches method is drafted at end of partition.R [final development, debugging]
  • context method: no explicit statement of posFilter required [to be checked]
  • methods for adjusting crosstab objects fused into trim,crosstab-method [BUG, needs to be checked]

v0.4.27

  • context and contextCluster functions turned into methods

v0.4.26

  • extended export functionality: mail statistics
  • parameters for partition and partitionCluster call simplified (tf, meta)
  • sAttributes method for character vectors to get sAttributes of corpus

v0.4.25

  • html method to inspect partitions

v0.4.24

  • tf method for partition and partitionCluster
  • summary for partitionCluster
  • selective setup of metadata for speeding up things

v0.4.23

  • stopwords option introduced in context function for filtering and brute disambiguation
  • cqpQuery class thrown out again - it does not improve usability
  • partitionCluster and contextCluster are now S4 classes, with some more methods
  • adjust function is now trim method for contextCluster objects

v0.4.22

  • zoom function added to specify partitions
  • partitionCluster faster as it relies on zoom function
  • context function will work on character vectors and cqpQuery object
  • some modifications of backend functions

v0.4.21

  • technical update: automatic generation of NAMESPACE file with roxygen

v0.4.20

  • new cqpQuery-class introduced
  • distribution-function renamed as dispersion
  • trim function for sorting tables

v0.4.19

  • documentation streamlined, package fully roxygenized

v0.4.18

  • using options (not exported list drillingControls)
  • getting Encoding of corpus from registry

v0.4.17

  • method 'addPos' added for partition object, and keyness objects

v0.4.16

  • inclusion of sample Data

v0.4.15

  • reduction of dependencies for publication of the package on CRAN

v0.4.14

  • adapting partition and distribution functions to cope with nested xml

v0.4.13

  • wordCloud visualization
  • usability of concordances improved

v0.4.12

  • partitionMerge changed so that it will use full functionality of partition-function
  • helper functions for distribution on two dimensions improved, tremendous gain in speed
  • started to fill in sample code into vignette

v0.4.12

  • new function for frequency counts at partition setup

v0.4.11

  • inclusion of shiny apps
  • function 'distribution' as wrapper for functions to inspect distribution

v0.4.10

  • improved usability of the package, started to use lowerCamelCase

v0.4.9

  • partition can now be called without explicitly stating a label
  • partition does not require sAttributes to be set, function .sattributes2cpos streamlined
  • no labels in partition.cluster
  • in context, the pos.filter can also be set as an exclusion

v0.4.8

  • started to make functions more usable by shortening function names, 'partition.init' is now 'partition'
  • xterm highlighting for collocates
  • context can be called with parameters given by drillingControls

v0.4.7

  • xtermStyle used for kwic output on console
  • partition object can be indexed

v0.4.6

  • partition.init expanded to allow for the generation of partitions with specified start and end dates
  • wordscore analysis adapted so that it can actually be used in practice

v0.4.5

  • export functionality to tm introduced with as.TermDocumentMatrix.partitioncluster

v0.4.4

  • combine.collocates improved (new columns for plotting)

v0.4.3

  • bug-fix for partition.merge

v0.4.1

  • partition.init can be used without setting up frequency lists and metadata. This may be useful if a quick partition.init is desired and term frequencies and/or metadata are not needed.
  • partition.init can handle sattribute-lists with length == 1. partition.init will still not work, if no sattributes are given.
  • query.crosstab has been renamed to crosstab
  • crosstab will accept special characters (transforming them to .)
