Toolkit for Corpus Analysis

Library for corpus analysis using the Corpus Workbench as an efficient back end for indexing and querying large corpora. The package offers functionality to flexibly create partitions and to carry out basic statistical operations (count, co-occurrences etc.). The original full text of documents can be reconstructed and inspected at any time. Beyond that, the package is intended to serve as an interface to packages implementing advanced statistical procedures. Respective data structures (document term matrices, term co-occurrence matrices etc.) can be created based on the indexed corpora.



Purpose: The focus of the package ‘polmineR’ is the interactive analysis of corpora using R. Core objectives for the development of the package are performance, usability, and a modular design.

Aims: Key aims for developing the package are:

  • To keep the original text accessible. A seamless integration of qualitative and quantitative steps in corpus analysis supports validation, based on inspecting the text behind the numbers.

  • To provide a library with standard tasks. It is an open source platform that will make text mining more productive, avoiding prohibitive costs to reimplement basics, or to run many lines of code to perform a basic task.

  • To create a package that makes the creation and analysis of subcorpora (‘partitions’) easy. A particular strength of the package is to support contrastive/comparative research.

  • To offer performance for users with a standard infrastructure. The package picks up the idea of a three-tier software design. Corpus data are managed and indexed by using the Open Corpus Workbench (CWB). The CWB is particularly efficient for storing large corpora and offers a powerful language for querying corpora, the Corpus Query Processor (CQP).

  • To support sharing consolidated and documented data, following the ideas of reproducible research.

Background: The polmineR-package was specifically developed to make full use of the XML annotation structure of the corpora created in the PolMine project (see polmine.sowi.uni-due.de). The core PolMine corpora are corpora of plenary protocols. In these corpora, speakers, parties etc. are structurally annotated. The polmineR-package is meant to help make full use of the rich annotation structure.

Core Functions

Upon loading polmineR, a message will report the version of the package and the location of a so-called ‘registry’-directory.

library(polmineR)
#> polmineR v0.7.10.9001
#> session registry:  /private/var/folders/r6/1k6mxnbj5077980k11xvr0q40000gn/T/Rtmpe923aa/polmineR_registry

The session registry directory is populated with files that describe the corpora that are present and accessible on the user’s system.
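
The session registry can be inspected from R. The following is a minimal sketch, assuming polmineR (0.7.9 or later) is installed; the registry() function is taken from the News entry for version 0.7.9, and the expectation of one registry file per corpus is an assumption based on CWB conventions.

```r
library(polmineR)

# Path of the temporary session registry directory
session_registry <- registry()
session_registry

# Registry files found there (assumption: one file per available corpus)
list.files(session_registry)
```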

Install and use packaged corpora

Indexed corpora wrapped into R data packages can be installed from a (private) package repository.

install.packages("GermaParl", repos = "http://polmine.sowi.uni-due.de/packages")
install.packages("europarl.en", repos = "http://polmine.sowi.uni-due.de/packages")

Calling the use()-function will activate a corpus included in a data package. The registry files describing the corpora in a package are added to the session registry directory.

use("europarl.en") # activate the corpus in the europarl-en package
#> ... activating corpus: europarl-en

An advantage of keeping corpora in data packages is the versioning and documentation mechanisms that are the hallmark of packages. Of course, polmineR will work with the library of CWB indexed corpora stored on your machine. The corpora described in the registry directory defined by the environment variable CORPUS_REGISTRY will be added to the session registry directory when loading polmineR.
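
As a sketch of how the environment variable might be set from within R before polmineR is loaded: the directory used here is a hypothetical placeholder created in the temporary directory, so replace it with the registry directory on your machine.

```r
# Hypothetical registry location, for illustration only
registry_dir <- file.path(tempdir(), "cwb", "registry")
dir.create(registry_dir, recursive = TRUE, showWarnings = FALSE)

# Set the variable before loading polmineR so it is seen at startup
Sys.setenv(CORPUS_REGISTRY = registry_dir)
Sys.getenv("CORPUS_REGISTRY")

# library(polmineR)  # corpora described in registry_dir would now be picked up
```

For a permanent setting, the variable would usually go into a startup file such as .Renviron rather than being set in each session.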

partition (and partition_bundle)

All methods can be applied to a whole corpus, as well as to partitions (i.e. subcorpora). Use the metadata of a corpus (so-called s-attributes) to define a subcorpus.

ep2006 <- partition("EUROPARL-EN", text_year = "2006")
#> ... get encoding: latin1
#> ... get cpos and strucs
size(ep2006)
#> [1] 3100529
barroso <- partition("EUROPARL-EN", speaker_name = "Barroso", regex = TRUE)
#> ... get encoding: latin1
#> ... get cpos and strucs
size(barroso)
#> [1] 98142

Partitions can be bundled into partition_bundle objects, and most methods can be applied to a whole corpus, a partition, or a partition_bundle object alike. Consult the package vignette to learn more.
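
A brief sketch of bundling, under two assumptions: that the REUTERS sample corpus shipped with polmineR has been activated, and that it has an s-attribute named id. Adapt the corpus and attribute names to your own data.

```r
library(polmineR)
use("polmineR") # activate the sample corpora included in the package

# One partition per value of the s-attribute 'id' (assumed attribute name)
articles <- partition_bundle(
  "REUTERS", s_attribute = "id",
  progress = FALSE, verbose = FALSE
)
length(articles)

# A method such as count() is then applied to each partition in the bundle
oil <- count(articles, query = "oil", progress = FALSE)
head(oil)
```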

count (using CQP syntax)

Counting occurrences of a feature in a corpus, a partition or in the partitions of a partition_bundle is a basic operation. By offering access to the query syntax of the Corpus Query Processor (CQP), the polmineR package exposes a query syntax that goes far beyond regular expressions. See the CQP documentation to learn more.

count("EUROPARL-EN", "France")
#>     query count         freq
#> 1: France  5517 0.0001399122
count("EUROPARL-EN", c("France", "Germany", "Britain", "Spain", "Italy", "Denmark", "Poland"))
#>      query count         freq
#> 1:  France  5517 1.399122e-04
#> 2: Germany  4196 1.064114e-04
#> 3: Britain  1708 4.331523e-05
#> 4:   Spain  3378 8.566676e-05
#> 5:   Italy  3209 8.138089e-05
#> 6: Denmark  1615 4.095673e-05
#> 7:  Poland  1820 4.615557e-05
count("EUROPARL-EN", '"[pP]opulism"')
#>            query count         freq
#> 1: "[pP]opulism"   107 2.713542e-06
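
The freq column appears to be the count divided by the total number of tokens in the corpus. This is an inference from the output above rather than a statement taken from the package documentation. A quick check in base R, using values copied from that output:

```r
# Values copied from the count() output above
counts <- c(France = 5517, Germany = 4196, Britain = 1708)
freqs  <- c(France = 1.399122e-04, Germany = 1.064114e-04, Britain = 4.331523e-05)

# If freq = count / corpus size, every ratio should imply about the same size
implied_size <- counts / freqs
round(implied_size)
```

If the implied sizes agree across queries, the relative frequencies are comparable between queries on the same corpus, but not between corpora or partitions of different size without looking at the absolute counts as well.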

dispersion (across one or two dimensions)

The dispersion method analyses how the matches for a query, or a set of queries, are distributed across one or two dimensions, i.e. s-attributes (absolute and relative frequencies). The CQP syntax can be used.

populism <- dispersion("EUROPARL-EN", "populism", s_attribute = "text_year", progress = FALSE)
popRegex <- dispersion("EUROPARL-EN", '"[pP]opulism"', s_attribute = "text_year", cqp = TRUE, progress = FALSE)
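
The calls above use one dimension. For two dimensions, a vector of two s-attributes can be supplied. The following sketch assumes that the GERMAPARLMINI sample corpus shipped with polmineR is available and has the s-attributes date and party; both names are assumptions to be checked against your installation.

```r
library(polmineR)
use("polmineR") # activate the sample corpora included in the package

# One dimension: matches for the query per date
by_date <- dispersion(
  "GERMAPARLMINI", query = "Integration",
  s_attribute = "date", progress = FALSE
)

# Two dimensions: matches cross-tabulated by date and party
by_date_party <- dispersion(
  "GERMAPARLMINI", query = "Integration",
  s_attribute = c("date", "party"), progress = FALSE
)
by_date_party
```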

cooccurrences (to analyse collocations)

The cooccurrences method is used to analyse the context of a query (including some statistics).

islam <- cooccurrences("EUROPARL-EN", query = 'Islam', left = 10, right = 10)
islam <- subset(islam, rank_ll <= 100)
dotplot(islam)
islam

http://polmine.sowi.uni-due.de/gallery/cooccurrences.png

features (keyword extraction)

Compare partitions to identify features / keywords (using statistical tests such as chi square).

ep2002 <- partition("EUROPARL-EN", text_year = "2002", p_attribute = "word")
epPre911 <- partition("EUROPARL-EN", text_year = 1997:2001, p_attribute = "word")
y <- features(ep2002, epPre911, included = FALSE)

kwic (also known as concordances)

So what happens in the context of a word, or a CQP query? To attain valid research results, reading will often be necessary. The kwic method helps here: it uses the conveniences of DataTables, with output shown in the Viewer pane of RStudio.

kwic("EUROPARL-EN", "Islam", meta = c("text_date", "speaker_name"))

http://polmine.sowi.uni-due.de/gallery/kwic.png

read (the full text)

Corpus analysis involves moving from text to numbers, and back again. Use the read method to inspect the full text of a partition (a speech given by Chancellor Angela Merkel in this case).

use("GermaParl")
merkel <- partition("GERMAPARL", speaker = "Angela Merkel", date = "2013-09-03")
read(merkel)

http://polmine.sowi.uni-due.de/gallery/read.png

as.TermDocumentMatrix (for text mining purposes)

Many advanced methods in text mining require term document matrices as input. Based on the metadata of a corpus, these data structures can be obtained in a fast and flexible manner, for performing topic modelling, machine learning etc.

use("europarl.en")
speakers <- partition_bundle(
  "EUROPARL-EN", s_attribute = "speaker_id",
  progress = FALSE, verbose = FALSE
)
speakers_count <- count(speakers, p_attribute = "word", progress = TRUE)
tdm <- as.TermDocumentMatrix(speakers_count, col = "count")
dim(tdm)

Installation

Windows

The following instructions assume that you have installed R. If not, install it from CRAN. An installation of RStudio is highly recommended.

The CRAN release of polmineR can be installed using install.packages(). All dependencies will be installed, too.

install.packages("polmineR")

To install the most recent development version that is hosted in a GitHub repository, use the installation mechanism offered by the devtools package.

install.packages("devtools")
devtools::install_github("PolMine/polmineR", ref = "dev")

Check the installation by loading polmineR and activating the corpora included in the package.

library(polmineR)
use("polmineR") # activate the corpora included in the package
corpus()

MacOS

The following instructions for Mac users assume that R is installed on your system. Binaries are available from the Homepage of the R Project. An installation of RStudio is highly recommended. Get the Open Source License version of RStudio Desktop.

At this stage, the RcppCWB dependency is not available as a pre-compiled binary and needs to be compiled. A set of system requirements needs to be fulfilled to do this.

First, you will need an installation of Xcode, which you can get via the Mac App Store. You will also need the Command Line Tools for Xcode. They can be installed from a terminal with:

xcode-select --install

Please make sure that you agree to the license.

Second, an installation of XQuartz is required. It can be obtained from www.xquartz.org.

Third, to fulfill the system requirements of the RcppCWB package, the Glib and pcre libraries need to be installed. Using a package manager makes things considerably easier. We recommend using ‘Homebrew’. To install Homebrew, follow the instructions on the Homebrew Homepage. The following commands then need to be executed from a terminal window. They will install the C libraries that the RcppCWB package relies on:

brew -v install pkg-config
brew -v install glib --universal
brew -v install pcre --universal
brew -v install readline

The latest release of polmineR can be installed from CRAN using the usual install.packages-function.

install.packages("polmineR")

The development version of polmineR can be installed using devtools:

install.packages("devtools") # unless devtools is already installed
devtools::install_github("PolMine/polmineR", ref = "dev")
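
The condition "unless devtools is already installed" can also be written out explicitly. A small sketch using base R's requireNamespace(); the helper name install_if_missing is hypothetical and not part of polmineR or devtools.

```r
# Hypothetical helper: install a package only if it cannot be loaded
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
    return(invisible(TRUE))  # an installation was attempted
  }
  invisible(FALSE)           # the package was already available
}

# A base package is always present, so nothing is installed here
install_if_missing("stats")

# install_if_missing("devtools")  # then call devtools::install_github() as above
```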

Check whether everything works by loading polmineR, and activating the demo corpora included in the package.

library(polmineR)
use("polmineR")
corpus()

Linux (Ubuntu)

If you have not yet installed R on your Ubuntu machine, there are good instructions at ubuntuuser. To install base R, enter the following in a terminal:

sudo apt-get install r-base r-recommended

Make sure that you have installed the latest version of R. The following commands will add the R repository to the package sources and run an update. The second line assumes that you are using Ubuntu 16.04.

sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com E084DAB9
sudo add-apt-repository 'deb http://ftp5.gwdg.de/pub/misc/cran/bin/linux/ubuntu xenial/'
sudo apt-get update
sudo apt-get upgrade

It is highly recommended to install RStudio, a powerful IDE for R. Output of polmineR methods is generally optimized to be displayed using RStudio facilities. If you are working on a remote server, running RStudio Server may be an interesting option to consider.

The RcppCWB package, the interface used by polmineR to query CWB corpora, requires the pcre, glib and pkg-config libraries. They can be installed as follows. In addition, libxml2 is installed, a dependency of the R package xml2 that is used for manipulating HTML output.

sudo apt-get install libglib2.0-dev libssl-dev libcurl4-openssl-dev
sudo apt-get install libxml2-dev
sudo apt-get install libprotobuf-dev

The system requirements will now be fulfilled. From R, install RcppCWB first, and then polmineR.

install.packages("RcppCWB")
install.packages("polmineR")

Use devtools to install the development version of polmineR from GitHub.

install.packages("devtools")
devtools::install_github("PolMine/polmineR", ref = "dev")

You may want to activate the packaged corpora to run the examples in the vignette and the man pages.

library(polmineR)
use("polmineR")
corpus()

To have access to all package functions and to run all package tests, the installation of further system requirements and packages is required. The xlsx dependency requires that rJava is installed and configured for R. That is done on the shell:

sudo apt-get install openjdk-8-jre
sudo R CMD javareconf

To run package tests including (re-)building the manual and vignettes, a working installation of LaTeX is required, too. Be aware that this may be a time-consuming operation.

sudo apt-get install texlive-full texlive-xetex 

Now install the remaining packages from within R.

install.packages(pkgs = c("rJava", "xlsx", "tidytext"))

News

polmineR 0.7.10

  • The use-function is limited now to activating the corpus in data packages. Having introduced the session registry, switching registry directories is not needed any more.
  • The partition_bundle-method for context-objects has been reworked entirely (and is working again); a new partition-method for context-objects has been introduced. Both steps are intended for workflows for dictionary-based sentiment analysis.
  • A new summary-method for partition-class objects is introduced. If the object has been weighed, the list that is returned will include a report on weights.
  • The partition_bundle-class, rather than inheriting from bundle-class directly, will now inherit from the count_bundle-class
  • The weigh-method is now implemented for the classes count and count_bundle. Via inheritance, it will also be available for partition and partition_bundle. Useful for dictionary-based sentiment analysis, for instance. There is an example that explains the workflow.
  • The as.regions-function has been turned into an as.regions-method to have a more generic tool.
  • The size_coi-slot of the context-object included the node; the node (i.e. matches for queries) is excluded now from the count of the size_coi.
  • Some refactoring of the context-method, so that full use of data.table speeds up things.
  • highlight-method allows definitions of terms to be highlighted to be passed in via three dots (...); no explicit list necessary.
  • highlight-method implemented now for class kwic.
  • A coerce-method to turn a kwic-object into an htmlwidget has been singled out from the show,kwic-method. Now it is possible to generate an htmlwidget from a kwic object, and to include the widget in an Rmarkdown document.
  • The script configure.win has been removed so that installation works on Windows without an installation of Rtools.
  • When calling use(), the registry directory is reset for CQP, so that the new corpus is available for using it with CQP syntax.
  • new coerce-method to turn textstat-objects into an htmlwidget (DataTable), very useful for Rmarkdown documents such as slides.
  • A new argument height for the html()-method allows a scroll box to be defined. Useful for embedding full text output in an Rmarkdown document.
  • A new as.character,kwic-method
  • Bug removed from s_attributes,partition-method: "fast track" was activated without preconditions.
  • As a matter of consistency, the argument 'meta' has been renamed to s_attributes for the kwic,context-method, and for the enrich,kwic-method
  • To avoid confusion (with argument s_attributes), the argument s_attribute to check for integrity within a struc has been renamed into boundary.
  • A new knit_print()-method for textstat- and kwic-objects, to offer a seamless inclusion of analyses in Rmarkdown documents.
  • Bug removed that would swallow metadata/s-attributes display in kwic output after highlighting.
  • A new subset,kwic-method offers functionality to filter concordances based on metadata.
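
A hedged sketch of the highlight-method for kwic objects mentioned above, with the terms to highlight passed in via three dots. It assumes that the REUTERS sample corpus shipped with polmineR is available; the query and the highlighted terms are arbitrary illustrations.

```r
library(polmineR)
use("polmineR") # activate the sample corpora included in the package

# Concordances for a query in the sample corpus
hits <- kwic("REUTERS", query = "barrel")

# Colour name = terms to highlight, passed via ... (no explicit list needed)
hits_highlighted <- highlight(hits, yellow = c("oil", "price"))

# Display in the viewer when working interactively
if (interactive()) hits_highlighted
```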

polmineR 0.7.9

  • new as.list,bundle-method for convenience, to access slot objects
  • as.bundle is more generic now, so that any kind of object can be coerced to a bundle now
  • as.speeches-method turned into function that allows partition and corpus as input
  • is.partition-function introduced
  • sAttributes,partition-method in line with RcppCWB requirements (no negative values of strucs)
  • count repaired for multiple p-attributes
  • bug removed causing a crash for as.markdown-method when cutoff is larger than number of tokens
  • polmineR will now work with a temporary registry in the temporary session directory
  • a (new) registry_move() function is used to copy files to the tmp registry
  • the (new) registry() function will get the temporary registry directory
  • the use() function will add the registry file of a package to the tmp registry
  • a bug removed that has prevented the name<- method from working properly for bundle objects
  • new partition_bundle,partition_bundle-method introduced
  • naming of methods and functions, classes and most arguments moved to snake_case, maintaining backwards compatibility
  • utility function getObjects not exported any more
  • for count,partition_bundle-method, column 'partition' will be a character vector now (not factor)
  • new argument 'type' added to partition_bundle
  • new method 'get_type' introduced to make getting corpus type more robust
  • bug removed that has caused a crash when cutoff is larger than number of tokens in a partition when calling get_token_stream
  • count-method will now return count-object if query is NULL, making it easier to write pipes

polmineR 0.7.8

  • upon loading the package, check that data directories are set correctly in registry files to make sure that sample data in pre-compiled packages can be used
  • startup messages adjusted slightly

polmineR 0.7.7

  • removed deprecated classes: dispersion, Textstat (reference class), Partition (reference class)
  • divide-method moved to package polmineR.misc
  • bug removed: size of ngrams object was always 1
  • dotplot-method added for featuresNgrams
  • sample corpus GermaParlMini added to the package (replacing suggested package polmineR.sampleCorpus)
  • configuration mechanism added to set path to data directory in registry file upon installation
  • class hits now inherits from class 'textstat', exposing a set of generic functions (such as dim, nrow etc.); slot 'dt' changed to 'stat' for this purpose
  • count,partitionBundle and hits,partitionBundle: cqp parameter added
  • RegistryFile class replaced by a set of lightweight functions (corpus_...)
  • encode-method moved to cwbtools package
  • getTerms,character-method and terms,partition-method merged
  • examples using EUROPARL corpus have been replaced by REUTERS corpus (including vignette)
  • param id2str has been renamed to decode in all functions to avoid unwanted behavior
  • robust indexing of bundle objects for subsetting
  • optional settings have been cleaned
  • reliance on cwb command line tools removed
  • encoding issue with names of partitionBundle solved

polmineR 0.7.6

  • functionality of matches-method (breakdown of frequencies of matches) integrated into count-method (new param breakdown)
  • corpus REUTERS included (as data for testsuite)
  • adjust data directory of REUTERS corpus upon loading package
  • a pkgdown-generated website is included in the docs directory
  • consistent use of .message helper function to make shiny app work
  • bug removed for count-method when options("polmineR.cwb-lexdecode") is TRUE and options("polmineR.Rcpp") is FALSE
  • if CORPUS_REGISTRY is not defined, the registry directory in the package will be used, making REUTERS corpus available
  • getSettings-function removed, was not sufficiently useful, and was superseded by template mechanism
  • new class 'count' introduced to organize results from count operations
  • at startup, default template is assigned for corpora without explicitly defined templates to make read() work in a basic fashion
  • new cpos,hits-method to support highlight method
  • tooltips-method to reorder functionality of html/highlight/tooltip-methods
  • param charoffset added to html-method
  • coerce-method from partition to json and vice versa, potentially useful for storing partitions
  • sAttributes2cpos to work properly with nested xml
  • partition,partition-method reworked to work properly with nested XML
  • encoding of return value of sAttributes will be locale
  • references added to methods count, kwic, cooccurrences, features.
  • as.DocumentTermMatrix,character-method reworked to allow for subsetting and divergence of strucs and struc_str
  • html,partition-method has new option beautify, to remove whitespace before interpunctuation
  • output error removed in html,partition-method (that misinterprets `` as code block)
  • the class Corpus now has a slot sAttribute to keep/manage a data.table with corpus positions and struc values, and there is a new partition,Corpus-method. In combination, it will be a lot faster to derive a partition, particularly if you need to do that repeatedly
  • a new function install.cwb() provides a convenient way to install CWB in the package
  • added a missing encoding conversion for the count method

polmineR 0.7.5

  • class 'Regions' renamed to class 'regions' as a matter of consistency
  • data type of slot cpos of class 'regions' is a matrix now
  • rework and improved documentation for decode- and encode-methods
  • new functions copy.corpus and rename.corpus
  • as.DocumentTermMatrix-method checks for strucs with value -1
  • improved as.speeches-method: reordering of speeches, default values
  • blapply-method: verbose output will be suppressed if progress is TRUE

polmineR 0.7.4

  • applying stoplists and positivelists working again for context-method
  • matches-method to learn about matches for CQP queries replacing frequencies-method
  • Rework of enrich-method, including documentation.
  • param 'neighbor' dropped from kwic,context-method; params positivelist and negativelist offer equivalent functionality
  • highlight-method for (newly exported) kwic-method (for validation purposes)
  • performance improvement for partitionBundle,character-method
  • a new Labels class and label method for generating test data
  • bug removed for partitionBundle,character-class, and performance improved
  • Improved explanation of the installation procedure for Mac in the package vignette
  • for context-method: param sAttribute working again to check boundaries of match regions
  • sample-method for objects of class kwic and context
  • kwic, cpos, and context method will accept queries of length > 1
  • use-function and resetRegistry-function reworked
  • more explicit startup message to get info about version, registry and interface
  • encoding issues solved for size-method, hits-method and dispersion-method
  • use-function will now work for users working with polmineR.Rcpp as interface

polmineR 0.7.3

  • new installed.corpora() convenience function to list all data packages with corpora
  • view-method and show-method for cooccurrences-objects now successfully redirect output to RStudio viewer
  • data.table-style indexing of objects inheriting from textstat-class
  • for windows compatibility, as.corpusEnc/as.nativeEnc for encoding conversion
  • performance gain for size-method by using polmineR.Rcpp
  • dissect-method dropped (replaced by size)
  • improved documentation of size-method
  • labels for cooccurrences-output
  • cooccurrencesBundle-class and cooccurrence-method for bundle restored
  • as.data.table for cooccurrencesBundle-class
  • count-method for whole corpus for pAttribute > 1
  • functionality of meta-method merged into sAttributes-method (meta-method dropped)
  • speed improvements for generating html output for reading
  • previously unexported highlight-method now exported, and more robust than before (using xml2)
  • progress bars for multicore operations now generated by pbapply package
  • starting to use testthat for unit testing

polmineR 0.7.2

  • updated documentation of partition-method.
  • documentation of hits-method improved
  • use-method: default value for pkg is NULL (return to default registry), function more robust
  • Rework for parsing the registry
  • rework of templates, are part of options now (see ?setTemplate, ?getTemplate)
  • experimental use of polmineR.Rcpp-package for fast counts for whole corpus
  • new convenience function install.corpus to install CWB corpus wrapped into R data package
  • adjustments to make package compatible with polmineR.shiny
  • cpos-method to get hits more robust if there are no matches for string
  • hits-method removes NAs
  • compare-method renamed to features-method
  • warnings caused by startup on windows removed

polmineR 0.7.1

  • size-method now allows for a param 'sAttribute'
  • hits-method reworked, allows for named query vectors
  • first version that can be installed on windows

polmineR 0.7.0

  • rcqp package moved to suggests, to facilitate installation

polmineR 0.6.3

  • more generic implementation of as.markdown-method to prepare use of templates
  • LICENSE file updated
  • getTokenStream,character-method: new default behavior for params left and right
  • use of templates for as.markdown-method
  • Regions and TokenStream class (not for frontend use, so far)
  • getTermFrequencies-method merged into count-method
  • Corpus class introduced
  • decode- and encode-methods introduced

polmineR 0.6.2

  • refactoring of context-method to prepare more consistent usage
  • progress bar for context-method (using blapply)
  • progress bar for partitionBundle (using blapply)
  • more coherent naming of parameters in partitionBundle-method
  • partitionBundle,character-method debugged and more robust
  • usage of blapply in as.speeches-method
  • hits-method: parameter cqp defaults to FALSE, size defaults to FALSE
  • new parameter cqp for dispersion-method
  • aggregation for dispersion-method when length(sAttribute) == 1
  • bugfix for ngrams-method, sample code for the method

polmineR 0.6.1

  • configure file removed to avoid unwanted bugs

polmineR 0.6.0

  • this is the first version that passes all CRAN tests and that is available via CRAN
  • the 'rcqp' package remains the interface to the CWB, but usage of rcqp functions is wrapped into a new CQI.rcqp (R6) class. CQI.perl and CQI.cqpserver are introduced as alternative interfaces to prepare portability to Windows systems
  • code in the vignette and method examples will be executed conditionally, if rcqp and the polmineR.sampleCorpus are available
  • the polmineR.sampleCorpus package is available in a drat repo at www.github.com/PolMine
  • a series of bug fixes

polmineR 0.5.6

  • slot tf renamed to stat, class is data.table now
  • keyness_method moved to data.tables

polmineR 0.5.3

  • renamed collocations to cooccurrences, seems more appropriate

polmineR 0.5.2

  • multicore for term frequency counts (param for partition)

polmineR 0.5.0

  • renamed xxxCluster to xxxBundle, bundle-superclass introduced
  • slot label/labels renamed to name/names
  • name/names-method instead of label/labels

polmineR 0.4.57

  • debugging of tf,partition-method: There was an error, if all hits for a query in a corpus were obtained outside of the partition
  • tf,context-method to provide quick access to number of query results
  • view-method as a wrapper for View
  • multicore for tf-method if cqp=TRUE

polmineR 0.4.56

  • pAttributes-method for character, and partition objects for easy access to inspect available p-attributes
  • (hidden) helper functions .parseRegistry and .parseInfoFile
  • automatic detection of corpus type, if .info file is available
  • def parameter of partition-method may be NULL, and anchor element will be read from .info-file

polmineR 0.4.49

  • cooccurrences method for partition objects

polmineR 0.4.46

  • multicore for keyness,partitionCluster-method

polmineR 0.4.45

  • bubblegraph-method removed, turned into an independent package (available at github.com/ablaette/bubblegraph)
  • tag-method using the treetag function from the koRpus package

polmineR 0.4.41

  • DataTables-functionality moved to DataTablesR-package which is imported

polmineR 0.4.40

  • browse-method for textstat objects: use DataTables.js

polmineR 0.4.39

  • reshape collocations-class-objects with trim(object, reshape=TRUE)
  • collocationsReshaped-class
  • keynessCollocations-class

polmineR 0.4.37

  • turned partition in a character-method
  • verbose=FALSE really implemented for

polmineR 0.4.36

  • context,collocations-method
  • call-slot in most objects
  • as.TermDocumentMatrix,collocations-method finalized

polmineR 0.4.35

  • textstat-class introduced to serve as superclass for keyness- and context- classes
  • chisquare-, ll-, pmi-methods for statistics
  • collocations-method introduced
  • multicore for partition-constructor (parallel preparation of tf-lists)
  • parallelisation of as.sparseMatrix,collocations-method
  • introduction of keyness,collocations-method

polmineR 0.4.34

  • plot-method for partitionCluster
  • rm.blank-functionality extended

polmineR 0.4.33

  • code re-ordered

polmineR 0.4.32

  • meta,character-method to learn about a corpus without first generating a partition
  • rudimentary barplot-method for partitionCluster
  • functionality to remove empty rows in DocumentTermMatrix upon construction
  • tf/idf-weighting included in as.TermDocumentMatrix
  • summary-method for keynessCluster-class
  • [[- method for keynessCluster-class

polmineR 0.4.31

  • some changes to context method:
    • whether to use multicore can be stated explicitly
    • stopwords renamed to stoplist, and a positivelist is introduced
  • individual documentation for partitionCluster enrich-method
  • NULL object returned for partition call if s-attribute/value-combination not available
  • call to dispersion (2dim): metadata are set up if not available
  • as.data.frame-method for context and keyness class to access statistics table more easily

polmineR 0.4.30

  • minor bug fixes

polmineR 0.4.29

  • controls() function for setting drillingControls
  • mail concordances
  • rework of kwic as a S4 method for partition and context class
  • speeches method
  • tf for partitionCluster improved

polmineR 0.4.28

  • reorganization of the code in files so that shift to S4 methods is reflected
  • documentation for trim method
  • documentation for enrich method
  • addPos method integrated into enrich method [to be checked]
  • addPos is kept as a method, but not exported into namespace
  • set up of missing metadata for dispersion
  • warning if labels are missing in tf method for partitionCluster
  • Encoding of partition labels adjusted to encoding of console
  • adjust encoding for input to partitionCluster
  • speeches method is drafted at end of partition.R [final development, debugging]
  • context method: no explicit statement of posFilter required [to be checked]
  • methods for adjusting crosstab objects fused into trim,crosstab-method [BUG, needs to be checked]

polmineR 0.4.27

  • context and contextCluster functions turned into methods

polmineR 0.4.26

  • extended export functionality: mail statistics
  • parameters for partition and partitionCluster call simplified (tf, meta)
  • sAttributes method for character vectors to get sAttributes of corpus

polmineR 0.4.25

  • html method to inspect partitions

polmineR 0.4.24

  • tf method for partition and partitionCluster
  • summary for partitionCluster
  • selective setup of metadata for speeding up things

polmineR 0.4.23

  • stopwords option introduced in context function for filtering and brute disambiguation
  • cqpQuery class thrown out again - it does not improve usability
  • partitionCluster and contextCluster are now S4 classes, with some more methods
  • adjust function is now trim method for contextCluster objects

polmineR 0.4.22

  • zoom function added to specify partitions
  • partitionCluster faster as it relies on zoom function
  • context function will work on character vectors and cqpQuery object
  • some modifications of backend functions

polmineR 0.4.21

  • technical update: automatic generation of NAMESPACE file with roxygen

polmineR 0.4.20

  • new cqpQuery-class introduced
  • distribution-function renamed as dispersion
  • trim function for sorting tables

polmineR 0.4.19

  • documentation streamlined, package fully roxygenized

polmineR 0.4.18

  • using options (not exported list drillingControls)
  • getting Encoding of corpus from registry

polmineR 0.4.17

  • method 'addPos' added for partition object, and keyness objects

polmineR 0.4.16

  • inclusion of sample Data

polmineR 0.4.15

  • reduction of dependencies for publication of the package on CRAN

polmineR 0.4.14

  • adapting partition and distribution functions to cope with nested xml

polmineR 0.4.13

  • wordCloud visualization
  • usability of concordances improved

polmineR 0.4.12

  • partitionMerge changed so that it will use full functionality of partition-function
  • helper functions for distribution on two dimensions improved, tremendous gain in speed
  • started to fill in sample code into vignette

polmineR 0.4.12

  • new function for frequency counts at partition setup

polmineR 0.4.11

  • inclusion of shiny apps
  • function 'distribution' as wrapper for functions to inspect distribution

polmineR 0.4.10

  • improved usability of the package, started to use lowerCamelCase

polmineR 0.4.9

  • partition can now be called without explicitly stating a label
  • partition does not require sAttributes to be set, function .sattributes2cpos streamlined
  • no labels in partition.cluster
  • in context, the pos.filter can also be set as an exclusion

polmineR 0.4.8

  • started to make functions more usable by shortening function names, 'partition.init' is now 'partition'
  • xterm highlighting for collocates
  • context can be called with parameters given by drillingControls

polmineR 0.4.7

  • xtermStyle used for kwic output on console

  • partition object can be indexed

polmineR 0.4.6

  • partition.init expanded to allow for the generation of partitions with specified start and end dates

  • wordscore analysis adapted so that it can be really used for performing wordscore analysis

polmineR 0.4.5

  • export functionality to tm introduced with as.TermDocumentMatrix.partitioncluster

polmineR 0.4.4

  • combine.collocates improved (new columns for plotting)

polmineR 0.4.3

  • bug-fix for partition.merge

polmineR 0.4.1

  • partition.init can be used without setting up frequency lists and metadata. This may be useful, if a quick partition.init is desired and term frequencies and/or metadata are not needed.
  • partition.init can handle sattribute-lists with length == 1. partition.init will still not work, if no sattributes are given.
  • query.crosstab has been renamed to crosstab
  • crosstab will accept special characters (transforming them to .)
