Conditional Random Fields for Labelling Sequential Data in Natural Language Processing

Wraps the 'CRFsuite' library <https://github.com/chokkan/crfsuite>, allowing users to fit a Conditional Random Field model and to apply it to existing data. The implementation focuses on Natural Language Processing, where this R package allows you to easily build and apply models for named entity recognition, text chunking, part-of-speech tagging, intent recognition, or classification of any category you have in mind. Besides training, the package includes a small web application that helps you construct training data.


This repository contains an R package which wraps the CRFsuite C/C++ library (https://github.com/chokkan/crfsuite), allowing the following:

  • Fit a Conditional Random Field model (1st-order linear-chain Markov)
  • Use the fitted model to get predictions on new data
  • The focus of the implementation is in the area of Natural Language Processing where this R package allows you to easily build and apply models for named entity recognition, text chunking, part of speech tagging, intent recognition or classification of any category you have in mind.

If you are unfamiliar with Conditional Random Field (CRF) models, this tutorial is an excellent introduction: http://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf
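
As a brief reminder of what a first-order linear-chain CRF models, here is the usual textbook formulation. The notation is standard and is not taken from this package's documentation: x is the observed token sequence of length T, y is a label sequence (with y_0 a fixed start symbol), f_k are feature functions and λ_k are the weights learned during training.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

"First-order" means that each feature function sees only the current label and the one before it, while it may inspect the observed attributes at any position.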

Installation

  • The package is on CRAN, so just install it with the command install.packages("crfsuite")
  • For installing the development version of this package: devtools::install_github("bnosac/crfsuite", build_vignettes = TRUE)

Model building and tagging

For detailed documentation on how to build your own CRF tagger for NER or chunking, see the vignette.

library(crfsuite)
vignette("crfsuite-nlp", package = "crfsuite")

Short example

library(crfsuite)
 
## Get example training data + enrich with token and part of speech 2 words before/after each token
x <- ner_download_modeldata("conll2002-nl")
x <- crf_cbind_attributes(x, terms = c("token", "pos"), by = c("doc_id", "sentence_id"), 
                          from = -2, to = 2, ngram_max = 3, sep = "-")
 
## Split in train/test set
crf_train <- subset(x, data == "ned.train")
crf_test <- subset(x, data == "testa")
 
## Build the crf model
attributes <- grep("token|pos", colnames(x), value=TRUE)
model <- crf(y = crf_train$label, 
             x = crf_train[, attributes], 
             group = crf_train$doc_id, 
             method = "lbfgs", options = list(max_iterations = 25, feature.minfreq = 5, c1 = 0, c2 = 1)) 
model
 
## Use the model to score on existing tokenised data
scores <- predict(model, newdata = crf_test[, attributes], group = crf_test$doc_id)
 
table(scores$label)
 B-LOC B-MISC  B-ORG  B-PER  I-LOC I-MISC  I-ORG  I-PER      O 
   261    211    182    693     24    205    209    605  35297 
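
A natural next step is to compare the predictions with the gold labels. The following is a minimal base R sketch on toy vectors, not part of the package. In the example above, `gold` would presumably be `crf_test$label` and `predicted` would be `scores$label`, assuming both are in the same row order.

```r
## Toy gold and predicted labels (illustrative values only)
gold      <- c("B-PER", "I-PER", "O", "B-LOC", "O", "O", "B-ORG", "O")
predicted <- c("B-PER", "I-PER", "O", "O",     "O", "O", "B-ORG", "B-LOC")

## Use a shared set of levels so the confusion matrix is square
lev <- sort(union(gold, predicted))
confusion <- table(gold      = factor(gold, levels = lev),
                   predicted = factor(predicted, levels = lev))

## Token-level accuracy
accuracy <- mean(gold == predicted)

## Per-label precision and recall
## (NaN when a label is never predicted or never present)
tp        <- diag(confusion)
precision <- tp / colSums(confusion)
recall    <- tp / rowSums(confusion)

print(confusion)
print(round(data.frame(precision, recall), 2))
print(accuracy)
```

Because most tokens carry the label O, token-level accuracy tends to look optimistic; per-label precision and recall are usually more informative for NER.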

Build custom CRFsuite models

The package itself does not contain any models to do NER or Chunking. It's a package which facilitates creating your own CRF model for doing Named Entity Recognition or Chunking on your own data with your own categories.

To make it easier to create training data from your own text, this R package includes a Shiny app that lets you tag your own chunks of text with your own categories. More details can be found in the vignette: vignette("crfsuite-nlp", package = "crfsuite").
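
To make the target format concrete: the labels in the short example follow the B-/I-/O convention, where B- marks the first token of a chunk, I- a continuation and O a token outside any chunk. Below is a minimal base R sketch of turning token-level categories into such labels. The data frame, its column names and the helper `as_bio()` are hypothetical and are not part of the package.

```r
## Hypothetical tokenised data: one row per token. 'entity' holds your own
## category (NA = not part of any chunk); 'chunk_id' identifies which tokens
## belong to the same annotated chunk (assumed unique across the whole data)
tokens <- data.frame(
  doc_id   = c(1, 1, 1, 1, 1, 1),
  token    = c("Anna", "Smith", "travelled", "to", "Brussels", "."),
  entity   = c("PER", "PER", NA, NA, "LOC", NA),
  chunk_id = c(1, 1, NA, NA, 2, NA),
  stringsAsFactors = FALSE
)

## Hypothetical helper: B- for the first token of a chunk,
## I- for the remaining tokens of that chunk, O otherwise
as_bio <- function(entity, chunk_id) {
  first_in_chunk <- !is.na(chunk_id) & !duplicated(chunk_id)
  ifelse(is.na(entity), "O",
         ifelse(first_in_chunk, paste0("B-", entity), paste0("I-", entity)))
}

tokens$label <- as_bio(tokens$entity, tokens$chunk_id)
print(tokens[, c("doc_id", "token", "label")])
```

The resulting `label` column plays the same role as `crf_train$label` in the short example.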

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be

News

CHANGES IN crfsuite VERSION 0.2

  • Fix for as.crf when loaded from file and adding more arguments than just the file
  • Added txt_feature, a simple feature extractor that identifies whether a word is capitalised, an email address, a URL or a number
  • src/cqdb/src/lookup3.c, fix address sanitizer issue

CHANGES IN crfsuite VERSION 0.1.1

  • Change use of posix_memalign to memalign on Solaris

CHANGES IN crfsuite VERSION 0.1

  • Uses CRFsuite (https://github.com/chokkan/crfsuite) version 0.12 commit dc5b6c7b726de90ca63cbf269e6476e18f1dd0d9
  • Uses liblbfgs (https://github.com/chokkan/liblbfgs) commit dc5b6c7b726de90ca63cbf269e6476e18f1dd0d9
  • Allows building a CRF model, predicting with it and easily adding attributes
  • Added flexdashboard app to easily get chunks with labels

0.2 by Jan Wijffels, 4 months ago


Browse source code at https://github.com/cran/crfsuite


Authors: Jan Wijffels [aut, cre, cph] (R wrapper), BNOSAC [cph] (R wrapper), Naoaki Okazaki [aut, ctb, cph] (CRFsuite library (BSD licensed), libLBFGS library (MIT licensed), Constant Quark Database software (BSD licensed)), Bob Jenkins [aut, ctb] (File src/cqdb/src/lookup3.c (Public Domain)), Jorge Nocedal [aut, ctb, cph] (libLBFGS library (MIT licensed)), Jesse Long [aut, ctb, cph] (RumAVL library (MIT licensed))



License: BSD_3_clause + file LICENSE


Imports: Rcpp, data.table, utils, tools

Suggests: udpipe, knitr

Linking to: Rcpp

System requirements: GNU make

