Approximate String Matching and String Distance Functions

Implements an approximate string matching version of R's native 'match' function. Can calculate various string distances based on edits (Damerau-Levenshtein, Hamming, Levenshtein, optimal string alignment), q-grams (q-gram, cosine, Jaccard distance) or heuristic metrics (Jaro, Jaro-Winkler). An implementation of soundex is provided as well. Distances can be computed between character vectors while taking proper care of encoding, or between integer vectors representing generic sequences.
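
As a rough illustration of the description above, the following sketch calls a few of the package's exported functions (stringdist, amatch, ain, phonetic). The example strings are made up for illustration, and the snippet assumes the package is already installed.

```r
library(stringdist)

# Edit-based distances between two strings
stringdist("kitten", "sitting", method = "lv")   # Levenshtein
stringdist("ca", "abc", method = "osa")          # optimal string alignment
stringdist("ca", "abc", method = "dl")           # full Damerau-Levenshtein

# q-gram based and heuristic distances
stringdist("abcde", "abdce", method = "qgram", q = 2)
stringdist("martha", "marhta", method = "jw", p = 0.1)

# Approximate versions of match() and %in%:
# maxDist sets how far away a string may be and still count as a match
amatch("leia", c("uhura", "leela"), maxDist = 5)
ain("leia", c("uhura", "leela"), maxDist = 5)

# Soundex codes for a character vector
phonetic(c("Robert", "Rupert"))
```

The "ca"/"abc" pair is a convenient check that the osa and dl methods differ: optimal string alignment does not allow a substring to be edited more than once, so it needs one more edit than the full Damerau-Levenshtein distance.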


News

version 0.9.4.2

  • bugfix in stringdistmatrix(a): value of p for jw-distance was ignored (thanks to Max Fritsche)
  • bugfix in stringdistmatrix(a): would segfault on q-gram distances with input of more than ~7k strings and q>1 (thanks to Connor McKay)
  • bugfix in jaccard distance: distance not always correct when passing multiple strings (thanks to Robert Carlson)

version 0.9.4.1

  • stringdistmatrix(a) now outputs long vectors (issue #45, thanks to Wouter Touw). For stringdistmatrix(a,b) this was already the case, but the length of rows and columns remains restricted to 2^31-1 since long input vectors are not supported (yet).
  • bugfix in osa/dl/lv distances w/unequal edit weights (thanks to Nathalia Potocka)

version 0.9.4

  • bugfix: edge case for zero-size lower tridiagonal dist matrices (caused UBSAN to fire, but gave correct results).
  • bugfix in jw distance: not symmetric for certain cases (thanks to github user gtumuluri)

version 0.9.3

  • new function for tokenizing integer sequences: seq_qgrams
  • new function for matching integer sequences: seq_amatch
  • new functions computing distances between integer sequences: seq_dist, seq_distmatrix
  • q-gram based distances are now always 0 when q=0 (used to be Inf if at least one of the arguments was not the empty string)
  • stringdist, stringdistmatrix now emit a warning when given a 'list' argument
  • small c-side code optimizations
  • bugfix in dl, lv, osa distance: weights were not taken into account properly (thanks to Zach Price)
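
The integer-sequence functions introduced in this version can be sketched as follows. The sequences are arbitrary illustrative values, and arguments are passed as lists of integer vectors, with one list element per sequence.

```r
library(stringdist)

# Two integer sequences that differ by one adjacent transposition
a <- list(c(1L, 2L, 3L))
b <- list(c(1L, 3L, 2L))

seq_dist(a, b, method = "lv")    # Levenshtein: the swap costs two substitutions
seq_dist(a, b, method = "osa")   # OSA: the swap counts as a single transposition

# Approximate matching of a sequence against a table of sequences
tbl <- list(c(9L, 9L, 9L), c(1L, 3L, 2L))
seq_amatch(a, tbl, method = "osa", maxDist = 1)

# Pairwise distances between two lists of sequences
seq_distmatrix(tbl, b, method = "osa")

# Tabulate the 2-grams of an integer sequence
seq_qgrams(x = c(1L, 2L, 3L, 1L, 2L), q = 2)
```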

version 0.9.2

  • Update fixing some errors (missing documentation, tests) in the 0.9.1 release.
  • Fixed a few possible memory leaks.

version 0.9.1

  • Argument 'useNames' of 'stringdistmatrix' now accepts 'none', 'strings', and 'names'
  • New function 'stringsim' computes string similarities between 0 and 1 based on 'stringdist'
  • Calling 'stringdistmatrix' with a single argument returns an object of class 'dist'
  • Argument 'cluster' to stringdistmatrix is phased out. It is now ignored with a message.
  • Specifying 'ncores' was already ignored but now also causes a warning
  • internal: rewrite of the R/C interface, saving about 1/3 of C-code, making extending easier
  • bugfix in stringdistmatrix: output was transposed when length(a)==1 (Thanks to github user cpoonolly)
  • Safer core detection to avoid a failure under Cygwin (thanks to Lauri Koobas)
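
A minimal sketch of the interface changes listed for this version, using made-up example strings: the single-argument form of stringdistmatrix, the 'useNames' argument, and the new stringsim function.

```r
library(stringdist)

x <- c("foo", "bar", "boo")

# One argument: an object of class 'dist', usable with e.g. hclust()
d <- stringdistmatrix(x)
class(d)

# Two arguments: a full matrix, here labelled by the strings themselves
m <- stringdistmatrix(x, c("baz", "foo"), useNames = "strings")
m

# Similarity scaled to [0, 1]; identical strings give 1
stringsim("foo", "foo")
stringsim("foo", "boo", method = "lv")
```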

version 0.9.0

  • C-code underlying stringdist and amatch now automatically uses multithreading based on OpenMP. The default number of threads is governed by options('sd_num_thread').
  • stringdist, stringdistmatrix, amatch and ain gain an nthread argument, which can override the default maximum number of threads.
  • Argument 'maxDist' is phased out for 'stringdist' and 'stringdistmatrix'. Specifying it causes a message.
  • Argument 'ncores' is phased out for 'stringdistmatrix'. It is now ignored and specifying it causes a message.
  • bugfix in amatch/dl. In certain cases, the best match went undetected.
  • Documentation improved and rearranged with string metrics, encoding, and parallelization now documented as separate topics.
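
The threading controls described above might be used as in the sketch below. The input strings and the thread counts are illustrative assumptions only; the result should not depend on the number of threads.

```r
library(stringdist)

a <- c("foo", "bar", "baz")
b <- "boo"

# Package-wide default for the maximum number of threads
getOption("sd_num_thread")

# Override the default for a single call
d1 <- stringdist(a, b, method = "lv", nthread = 1)

# Or change the default for subsequent calls, then restore it
old <- options(sd_num_thread = 2)
d2 <- stringdist(a, b, method = "lv")
options(old)

identical(d1, d2)
```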

version 0.8.2

  • Fixed a few warnings issued by the CLANG compiler (thanks to Brian Ripley). This fixes a bug in amatch/jaccard
  • Fixed a bug in stringdist/osa, dl: NA incorrectly returned (thanks to Lauri Koobas).

version 0.8.1

  • stringdistmatrix returns dimensionless matrix when both arguments have length zero (thanks to Richie Cotton)
  • stringdistmatrix gains argument 'useNames' (thanks to Richie Cotton)
  • Package now 'Imports' parallel rather than 'Depends' on it.
  • bugfix in optimal string alignment distance: the number of transpositions was sometimes overcounted (thanks to Frank Binder)
  • rearranged the documentation.

version 0.8.0

  • Added soundex-based string distance (thanks to Jan van der Laan)
  • New function 'phonetic' translates strings to phonetic codes using soundex (thanks to Jan van der Laan)
  • New function 'printable_ascii' detects non-printable ascii or non-ascii characters.
  • Precision issue: cosine distance between equal strings would be O(1e-16) instead of 0.0 (thanks to Ben Haller).
  • Code cleaning: somewhat better performance when maxDist is unspecified in stringdist. It remains deprecated.
  • Row names in the output array of 'qgrams' are now in system native encoding (used to be utf8 for all systems).
  • updated CITATION with page number info as the R Journal is now out.

version 0.7.3

  • bugfix in jw-distance: out-of-range access in C-code caused R to crash in some cases (thanks to Carol Gan)
  • bugfix in dl distance: in some cases, distances could be one unit too high.
  • Updated CITATION file: paper to appear in The R Journal vol 6 (2014).
  • Some updates in documentation.

version 0.7.2

  • function 'qgrams' gains .list argument
  • bugfix in multicore option of stringdistmatrix
  • bugfix in substitution weight of DL-distance (undercounted when w4 != 1 in some cases)
  • bugfix in dl.c: C-function read outside of array.

version 0.7.0

  • added useBytes option: up to ~3-fold speed gain at the cost of possible encoding-dependent results.
  • new memory allocation method for q-grams increases speed between ~5% and ~30% depending on q and input string.
  • function 'qgrams' gains useNames option.
  • jaro-winkler distance gains weight argument.
  • C-code optimization in edit-based distances: 10~20% speed increase depending on input.
  • bugfix in amatch: sometimes NA was erroneously returned.
  • bugfix in amatch/lcs: hamming distance method was called erroneously.

version 0.6.1

  • bugfix in parallel version of stringdistmatrix: parameter p was not passed (thanks to Ricardo Saporta)
  • bugfix in lv/osa/dl: maxDist ignored in certain cases

version 0.6.0

  • added amatch function: approximate matching version of 'match'
  • added ain function: approximate matching version of '%in%'
  • qgrams now accepts arbitrary number of arguments. Outputs array, not table
  • added cosine distance
  • added Jaccard distance
  • added Jaro and Jaro-Winkler distances
  • small performance tweaks in underlying C code
  • Edge case in stringdistmatrix: output is now always of class matrix
  • Default maxDist is now Inf (this is only to make it more intuitive and does not break previous code)
  • BREAKING CHANGE: output -1 is replaced by Inf for all distance methods

version 0.5.0

  • added qgram counting function 'qgrams'
  • faster edge case handling in osa method.
  • edge case in lv/osa/dl methods: distance returned length(b) instead of -1 when length(a) == 0, maxDist < length(b).
  • bugfix in lv/osa/dl method: maxDist returned when length(a) > maxDist > 0 (thanks to Daniel Reckhard).
  • Hamming distance (method='h') now returns -1 for strings of unequal lengths (used to emit error).
  • added longest common substring distance (method='lcs').
  • added qgram distance method.
  • stringdistmatrix gains cluster argument.

version 0.4.2

  • Fix in error message for hamming distance
  • Workaround for system-dependent translation of utf8 NA characters

version 0.4.0

  • First release

install.packages("stringdist")

0.9.4.4 by Mark van der Loo, 4 months ago


https://github.com/markvanderloo/stringdist


Report a bug at https://github.com/markvanderloo/stringdist/issues


Browse source code at https://github.com/cran/stringdist


Authors: Mark van der Loo [aut, cre], Jan van der Laan [ctb], R Core Team [ctb], Nick Logan [ctb]


Documentation:   PDF Manual  


Task views: Official Statistics & Survey Methodology


GPL-3 license


Imports parallel

Suggests testthat


Imported by TSTr, bcRep, deductive, diffrprojects, fuzzyjoin, lingtypology, lintr, qdap, sjmisc, tcR.

Depended on by brewdata, vwr.

Suggested by rlist, sprint.
