Discover Probable Duplicates in Plant Genetic Resources Collections

Provides functions to aid the identification of probable/possible duplicates in Plant Genetic Resources (PGR) collections using 'passport databases' comprising information records of each constituent sample. These include methods for cleaning the data, creation of a searchable Key Word in Context (KWIC) index of keywords associated with sample records, and the identification of nearly identical records with similar information by fuzzy, phonetic and semantic matching of keywords.


J. Aravind¹, J. Radhamani¹, Kalyani Srinivasan¹, B. Ananda Subhash² and R. K. Tyagi¹
  1. ICAR-National Bureau of Plant Genetic Resources, New Delhi, India
  2. Centre for Development of Advanced Computing, Thiruvananthapuram, Kerala, India



The R package PGRdup was developed as a tool to aid genebank managers in the identification of probable duplicate accessions from plant genetic resources (PGR) passport databases.

This package primarily implements a workflow designed to fetch groups or sets of germplasm accessions with similar passport data, particularly in fields associated with accession names, within or across PGR passport databases.

The functions in this package are primarily built using the R packages data.table, igraph, stringdist and stringi.

Installation

The package can be installed from CRAN, or the development version from GitHub, as follows:

# Install from CRAN
install.packages('PGRdup', dependencies=TRUE)
 
# Install development version from GitHub (requires the devtools package)
devtools::install_github("aravind-j/PGRdup")

Workflow

The series of steps involved in the workflow, along with the associated functions, are illustrated below:

Step 1

Function(s) :

  • DataClean
  • MergeKW
  • MergePrefix
  • MergeSuffix

Use these functions for the appropriate data standardisation of the relevant fields in the passport databases to harmonize punctuation, leading zeros, prefixes, suffixes etc. associated with accession names.
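The kind of standardisation described above can be sketched in base R. This is only an illustration under assumed rules (upper-casing, punctuation to spaces, dropping leading zeros, joining a prefix to the following token); the helper names clean_names and merge_prefix and the sample accession names are hypothetical, and the code is not the implementation of DataClean or MergePrefix.

```r
# Illustrative base-R sketch; NOT the package's DataClean/MergePrefix code.
clean_names <- function(x) {
  x <- toupper(x)                                   # ignore case differences
  x <- gsub("[[:punct:]]+", " ", x)                 # punctuation -> space
  x <- gsub("\\b0+(?=[0-9])", "", x, perl = TRUE)   # drop leading zeros
  x <- gsub("\\s+", " ", x)                         # collapse whitespace runs
  trimws(x)
}

# Join a known prefix to the token that follows it (hypothetical helper).
merge_prefix <- function(x, prefix) {
  gsub(paste0("\\b", prefix, " (?=\\S)"), prefix, x, perl = TRUE)
}

# Hypothetical variants of the same accession name.
raw <- c("ec-0012345", "EC 12345", " Ec.012345 ")
cleaned <- clean_names(raw)
merged <- merge_prefix(cleaned, "EC")
print(cleaned)
print(merged)
```

After cleaning, the three variants collapse to a single spelling, which is what makes the later keyword matching tractable.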

Step 2

Function(s) :

  • KWIC

Use this function to extract the information in the relevant fields as keywords or text strings in the form of a searchable Keyword in Context (KWIC) index.
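To make the idea of a keyword index concrete, here is a minimal base-R sketch. It assumes whitespace-separated keywords and a toy database with hypothetical columns id and name; the helper build_keyword_index is invented for illustration, and the object returned by the KWIC function is structured differently.

```r
# Illustrative base-R sketch of a keyword index; not the KWIC implementation.
build_keyword_index <- function(ids, text) {
  tokens <- strsplit(trimws(text), "\\s+")
  data.frame(
    id       = rep(ids, lengths(tokens)),     # record each keyword came from
    keyword  = unlist(tokens),                # the keyword itself
    position = sequence(lengths(tokens)),     # position within the field
    context  = rep(text, lengths(tokens)),    # full field text as context
    stringsAsFactors = FALSE
  )
}

# Hypothetical toy records: a primary key and one cleaned name field.
db <- data.frame(
  id   = c("A1", "A2", "A3"),
  name = c("EC12345 SHULAMITH", "SHULAMITH", "NC 4"),
  stringsAsFactors = FALSE
)
index <- build_keyword_index(db$id, db$name)

# Look up every record that contains a given keyword.
hits <- index[index$keyword == "SHULAMITH", "id"]
print(hits)
```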

Step 3

Function(s) :

  • ProbDup

Execute fuzzy, phonetic and semantic matching of keywords to identify probable duplicate sets either within a single KWIC index or between two indexes using this function. For fuzzy matching, the Levenshtein edit distance is used, while for phonetic matching, the Double Metaphone algorithm is used. For semantic matching, synonym sets or 'synsets' of accession names can be supplied as an input and the text strings in such sets will be treated as being identical for matching. Various options to tweak the matching strategies used are also available in this function.
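The fuzzy and semantic strategies can be illustrated with base R alone. The sketch below uses utils::adist, which computes a generalized Levenshtein distance, in place of whatever ProbDup does internally; the keywords, the distance threshold of 2 and the synset are all assumptions chosen for the example. Phonetic matching is left out because base R has no Double Metaphone encoder.

```r
# Illustrative sketch only; not the ProbDup implementation.
keywords <- c("SHULAMITH", "SHULAMIT", "SHULAMITTH", "CHICO", "GAJAH")

# Fuzzy matching: pairwise edit distances between all keywords.
d <- utils::adist(keywords)
dimnames(d) <- list(keywords, keywords)

max_dist <- 2  # assumed threshold for this example
fuzzy <- which(d <= max_dist & upper.tri(d), arr.ind = TRUE)
fuzzy_pairs <- data.frame(
  kw1  = keywords[fuzzy[, "row"]],
  kw2  = keywords[fuzzy[, "col"]],
  dist = d[fuzzy],
  stringsAsFactors = FALSE
)
print(fuzzy_pairs)

# Semantic matching: map every member of a synonym set to one representative,
# so that the members compare as identical.
synsets <- list(c("CHICO", "GAJAH"))  # hypothetical synset
canonical <- keywords
for (s in synsets) canonical[canonical %in% s] <- s[1]
print(canonical)
```

Restricting the comparison to the upper triangle avoids reporting each pair twice and skips the trivial self-matches on the diagonal.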

Step 4

Function(s) :

  • DisProbDup
  • ReviewProbDup
  • ReconstructProbDup

Inspect, revise and improve the retrieved sets using these functions. If considerable intersections exist between the initially identified sets, then DisProbDup may be used to get the disjoint sets. The identified sets may be subjected to clerical review after transforming them into an appropriate spreadsheet format which contains the raw data from the original database(s) using ReviewProbDup and subsequently converted back using ReconstructProbDup.
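One way to picture the step from intersecting sets to disjoint sets is to merge any sets that share a member. The base-R sketch below does this for plain character vectors; the helper merge_overlapping and the accession IDs are hypothetical, and DisProbDup itself operates on ProbDup objects and may work differently.

```r
# Illustrative sketch; DisProbDup itself is not reproduced here.
merge_overlapping <- function(sets) {
  merged <- list()
  for (s in sets) {
    # Which of the already merged sets share at least one member with s?
    hit <- vapply(merged, function(m) any(s %in% m), logical(1))
    # Absorb all of them into s, then keep the rest unchanged.
    s <- unique(c(s, unlist(merged[hit])))
    merged <- c(merged[!hit], list(sort(s)))
  }
  merged
}

# Hypothetical intersecting sets of accession IDs.
sets <- list(c("A1", "A2"), c("A2", "A3"), c("B1", "B2"), c("A3", "A4"))
disjoint <- merge_overlapping(sets)
print(disjoint)
```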

Adjuncts

Function(s) :

  • ValidatePrimKey
  • DoubleMetaphone
  • ParseProbDup
  • AddProbDup
  • SplitProbDup
  • MergeProbDup
  • ViewProbDup
  • KWCounts
  • read.genesys

Use these helper functions if needed. ValidatePrimKey can be used to check whether a column chosen in a data.frame as the primary key/ID conforms to the constraints of absence of duplicates and NULL values.
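The two constraints on a primary key can be checked by hand in base R, as in the sketch below. It assumes that missing values are represented as NA; the helper check_prim_key and the toy data are hypothetical and this is not the source of ValidatePrimKey.

```r
# Illustrative check; not the ValidatePrimKey source.
check_prim_key <- function(df, key) {
  values <- df[[key]]
  list(
    has_missing    = anyNA(values),                       # any NA entries?
    has_duplicates = anyDuplicated(values) > 0,           # any repeated IDs?
    duplicated_ids = unique(values[duplicated(values)])   # which ones repeat
  )
}

# Hypothetical key column with one duplicate and one missing value.
db <- data.frame(id = c("A1", "A2", "A2", NA), stringsAsFactors = FALSE)
res <- check_prim_key(db, "id")
print(res)
```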

DoubleMetaphone is an implementation of the Double Metaphone phonetic algorithm in R and is used for phonetic matching.

ParseProbDup and AddProbDup work with objects of class ProbDup. The former can be used to parse the probable duplicate sets in a ProbDup object to a data.frame, while the latter can be used to add these sets as data fields to the passport databases. SplitProbDup can be used to split an object of class ProbDup according to set counts. MergeProbDup can be used to merge together two objects of class ProbDup. ViewProbDup can be used to plot summary visualizations of the probable duplicate sets retrieved in an object of class ProbDup.

KWCounts can be used to compute keyword counts from PGR passport database fields (columns), which can give a rough indication of the completeness of the data.
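As a rough illustration of how keyword counts hint at data completeness, the base-R sketch below tallies whitespace-separated keywords across chosen columns. The helper count_keywords, the column names and the placeholder value "UNKNOWN" are assumptions for the example; KWCounts has its own interface and options.

```r
# Illustrative sketch; not the KWCounts implementation.
count_keywords <- function(df, fields) {
  text <- unlist(df[fields], use.names = FALSE)
  tokens <- unlist(strsplit(trimws(as.character(text)), "\\s+"))
  tokens <- tokens[!is.na(tokens) & nzchar(tokens)]   # drop empty/missing
  sort(table(tokens), decreasing = TRUE)
}

# Hypothetical passport fields.
db <- data.frame(
  name  = c("SHULAMITH", "NC 4", "UNKNOWN"),
  donor = c("UNKNOWN", "NC", "UNKNOWN"),
  stringsAsFactors = FALSE
)
counts <- count_keywords(db, c("name", "donor"))
print(counts)
```

A placeholder such as "UNKNOWN" topping the counts suggests that many records lack real information in those fields.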

read.genesys can be used to import PGR data in a Darwin Core - germplasm zip archive downloaded from the Genesys database into the R environment.

Tips

  • Use fread from the data.table package to rapidly read large files instead of read.csv or read.table in base R.
  • In case the PGR passport data is in any DBMS, use the appropriate R-database interface packages to get the required table as a data.frame in R.
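The first tip can be tried with a small round trip, sketched below. It assumes the data.table package is installed; the file contents and column names are made up, and a temporary file stands in for a real passport database export.

```r
# Minimal sketch of reading a delimited file with data.table::fread.
library(data.table)

# Write a tiny, hypothetical passport table to a temporary CSV file.
tmp <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = c("A1", "A2"), name = c("NC 4", "SHULAMITH")),
  tmp,
  row.names = FALSE
)

db <- fread(tmp)          # fread returns a data.table
db <- as.data.frame(db)   # convert to a plain data.frame if one is needed
unlink(tmp)               # remove the temporary file
print(db)
```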

Note

  • The ProbDup function can be memory hungry with large passport databases. In such cases, ensure that the system has sufficient memory for smooth functioning (See ?ProbDup).

Detailed tutorial

For a detailed tutorial on how to use this package, type:

browseVignettes(package = 'PGRdup')

What's new

To see what's new in this version, type:

news(package='PGRdup')

Links

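CRAN page: https://cran.r-project.org/package=PGRdup

GitHub page: https://github.com/aravind-j/PGRdup

GitHub website: https://aravind-j.github.io/PGRdup/

Zenodo DOI: https://doi.org/10.5281/zenodo.841963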

Citing PGRdup

To cite the methods in the package use:

citation("PGRdup")

To cite the R package 'PGRdup' in publications use:

  Aravind, J., J. Radhamani, Kalyani Srinivasan, B. Ananda
  Subhash, and R. K. Tyagi (2018).  PGRdup: Discover Probable
  Duplicates in Plant Genetic Resources Collections. R package
  version 0.2.3.3, https://cran.r-project.org/package=PGRdup,
  https://doi.org/10.5281/zenodo.841963.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {PGRdup: Discover Probable Duplicates in Plant Genetic Resources Collections},
    author = {{Aravind J} and {Radhamani J} and {Kalyani Srinivasan} and {Ananda Subhash B} and {Rishi Kumar Tyagi}},
    note = {R package version 0.2.3.3},
    url = {https://cran.r-project.org/package=PGRdup},
    doi = {10.5281/zenodo.841963},
    year = {2018},
  }

This free and open-source software implements academic research by
the authors and co-workers. If you use it, please support the
project by citing the package.

News

PGRdup 0.2.3.3

OTHER NOTES:

  • Use of packages in Suggests (e.g. microbenchmark) made conditional to avoid problems when they are not available for an OS.

PGRdup 0.2.3.2

UPDATED FUNCTIONS:

  • DoubleMetaphone - Fixed memory leak issues in underlying C code.

OTHER NOTES:

  • Minor corrections in vignette.
  • Added welcome message.
  • Added version history to vignette.
  • Replaced all 1:nrow() and 1:length() usage in functions with seq_len(nrow()) and seq_len(length()) respectively.
  • Added package to github.
  • Added package documentation website (https://aravind-j.github.io/PGRdup/) as a github page with pkgdown.
  • Added copyright to [email protected] along with original contributors for the underlying C code for DoubleMetaphone.

PGRdup 0.2.3.1

OTHER NOTES:

  • Registered native routines in the C code for DoubleMetaphone.

PGRdup 0.2.3

UPDATED FUNCTIONS:

  • KWCounts - Fixed error in case of a large number of exceptions and fixed bug regarding non-exact removal of keyword exceptions.
  • ProbDup - Changed code that selects columns by a character vector using the with=FALSE argument to the new preferred syntax in data.table.
  • ViewProbDup - Fixed error 'formal argument "axis.ticks.y" matched by multiple actual arguments'.
  • ViewProbDup - Fixed bug where the function stopped when some of the factor names in the select argument were not present in factor.db1 and/or factor.db2. It now gives a warning and stops only when none of the factor names in select are present in factor.db*.

OTHER NOTES:

  • Added rmarkdown to Suggests field in DESCRIPTION, as prompted by Jan Górecki.

PGRdup 0.2.2.1

OTHER NOTES:

  • Fixed memory access error in src/fdouble_metaphone.c (thanks to Prof. Brian Ripley).

PGRdup 0.2.2

NEW FUNCTIONS:

  • read.genesys - Convert 'Darwin Core - Germplasm' zip archive to a flat file.
  • ViewProbDup - Visualize the probable duplicate sets retrieved in a ProbDup object.

UPDATED FUNCTIONS:

  • ReconstructProbDup - Fixed bug regarding failure to retrieve db2 fields when method "c" is used.
  • ProbDup - Updated code after bugfix in stringdist package (stringdistmatrix: output was transposed when length(a)==1).

OTHER NOTES:

  • Changed the contact email addresses of four authors (including maintainer) in DESCRIPTION.
  • Updated the vignette and README.md with the details of new functions.

PGRdup 0.2.1

NEW FUNCTIONS:

  • SplitProbDup - Split an object of class ProbDup.
  • MergeProbDup - Merges two objects of class ProbDup.
  • KWCounts - Generates keyword counts from database fields.
  • print.KWIC - Prints summary of an object of class KWIC to console.
  • print.ProbDup - Prints summary of an object of class ProbDup to console.

UPDATED FUNCTIONS:

  • ProbDup - Modified the phonetic matching for better handling of strings with digits.
  • ProbDup - Fixed throwing of error when no duplicate sets are retrieved.
  • ProbDup - Fixed out-of-memory error when a large number of exceptions are supplied.
  • ProbDup - Further converted code to use data.table package for greater efficiency and speed.
  • ProbDup - Fixed bug regarding inconsistent matching when method "b" is used.
  • ProbDup - Reduced the dimensions of the string matching matrices produced for greater efficiency and speed.
  • MergeKW - Modified for better handling of regex special characters.
  • ReconstructProbDup - Modified to ignore sets with counts less than 2 after reconstruction.

OTHER NOTES:

  • Edited README.md formatting.
  • Added diagram, microbenchmark and wordcloud (required for vignette) to Suggests field in DESCRIPTION.
  • Added imports to functions from methods, stats and utils as R CMD check --as-cran now checks code usage (via codetools) with only the base package attached.
  • Dropped the abbreviation PGR in the title in DESCRIPTION as it is mentioned in the description text.

VIGNETTE:

  • Added vignette "An Introduction to PGRdup package".

PGRdup 0.2.0

  • First release




https://cran.r-project.org/package=PGRdup, https://github.com/aravind-j/PGRdup, https://doi.org/10.5281/zenodo.841963, https://aravind-j.github.io/PGRdup/, https://www.rdocumentation.org/packages/PGRdup


Report a bug at https://github.com/aravind-j/PGRdup/issues


Browse source code at https://github.com/cran/PGRdup


Authors: J. Aravind [aut, cre], J. Radhamani [aut], Kalyani Srinivasan [aut], B. Ananda Subhash [aut], R. K. Tyagi [aut], ICAR-NBGPR [cph], Maurice Aubrey [ctb] (Double Metaphone), Kevin Atkinson [ctb] (Double Metaphone), Lawrence Philips [ctb] (Double Metaphone)




GPL-2 | GPL-3 license


Imports: data.table, igraph, stringdist, stringi, ggplot2, grid, gridExtra, methods, utils, stats

Suggests: diagram, wordcloud, microbenchmark, XML, knitr, rmarkdown

