Functions to Work with NCBI Accessions and Taxonomy

Functions for assigning taxonomy to NCBI accession numbers and taxon IDs based on NCBI's accession2taxid and taxdump files. This package allows the user to download NCBI data dumps and create a local database for fast, local taxonomic assignment.



Introduction

taxonomizr provides simple functions to parse NCBI taxonomy files and accession dumps and use them efficiently to assign taxonomy to accession numbers or taxonomic IDs. This is useful, for example, for assigning taxonomy to BLAST results. Everything is done locally after downloading the appropriate files from NCBI using the included functions (see below).

Installation

The package is available on CRAN and can be installed with:

install.packages("taxonomizr")

To install the development version directly from GitHub, use the devtools package and run:

devtools::install_github("sherrillmix/taxonomizr")

To use the library, load it in R:

library(taxonomizr)

Preparation

To avoid constant internet access and slow APIs, the first step in using the package is to download all necessary files from NCBI. This uses a bit of disk space but makes future access reliable and fast.

Note: It is not necessary to manually check for the presence of these files since the functions automatically check to see if their output is present and if so skip downloading/processing. Delete the local files if you would like to redownload or reprocess them.
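The skip-if-present behavior described in the note can be pictured with a small base R sketch. This is only an illustration of the pattern, not the package's code: fetchIfMissing and fakeDownload are hypothetical names, and the "download" is a stand-in that writes a placeholder file so the example runs offline.

```r
# Hypothetical helper (not part of taxonomizr): run `fetch` only when `dest` is absent.
fetchIfMissing <- function(dest, fetch) {
  if (file.exists(dest)) {
    message(dest, " already exists; skipping")
    return(invisible(FALSE))
  }
  fetch(dest)
  invisible(TRUE)
}

# Stand-in for a real download so the sketch runs without network access
fakeDownload <- function(dest) writeLines("placeholder", dest)

target <- file.path(tempdir(), "example.dmp")
unlink(target)                                     # start from a clean state
firstRun <- fetchIfMissing(target, fakeDownload)   # file is missing, so it is created
secondRun <- fetchIfMissing(target, fakeDownload)  # file is found, so the fetch is skipped
unlink(target)                                     # deleting the file forces a refetch next time
```

The last line mirrors the advice above: removing the local file is what triggers a fresh download or reprocessing on the next call.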

Download names and nodes

First, download the necessary names and nodes files from NCBI:

getNamesAndNodes()
## [1] "./names.dmp" "./nodes.dmp"
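For orientation, nodes.dmp records each taxon's parent ID and rank, and names.dmp records the names attached to each ID, so a taxonomy lookup amounts to walking parent links up to the root. The following base R sketch shows that idea on a tiny hand-made table. It is not the package's implementation: toyNodes, toyNames and walkLineage are hypothetical names, and several intermediate ranks are left out of the toy lineage for brevity.

```r
# Toy stand-ins for the parsed files; intermediate ranks are omitted for brevity
toyNodes <- data.frame(
  id     = c(9606, 9605, 9604, 9443, 1),
  parent = c(9605, 9604, 9443, 1, 1),
  rank   = c("species", "genus", "family", "order", "no rank"),
  stringsAsFactors = FALSE
)
toyNames <- c("9606" = "Homo sapiens", "9605" = "Homo", "9604" = "Hominidae",
              "9443" = "Primates", "1" = "root")

# Hypothetical helper: follow parent links from `id`, recording the name at each rank
walkLineage <- function(id, nodes, taxNames) {
  lineage <- character(0)
  repeat {
    row <- nodes[nodes$id == id, ]
    if (nrow(row) != 1) break      # unknown ID: stop walking
    lineage[row$rank] <- taxNames[[as.character(id)]]
    if (row$parent == id) break    # the root node is its own parent
    id <- row$parent
  }
  lineage
}

lineage <- walkLineage(9606, toyNodes, toyNames)
print(lineage)
```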

Download accession to taxa files

Then download the accession to taxonomic ID conversion files from NCBI. Note: this is a large download (several gigabytes):

#this is a big download
getAccession2taxid()
## [1] "./nucl_gb.accession2taxid.gz"  "./nucl_est.accession2taxid.gz"
## [3] "./nucl_gss.accession2taxid.gz" "./nucl_wgs.accession2taxid.gz"

If you would also like to identify protein accession numbers, download the prot file from NCBI as well (again, this is a big download):

#this is a big download
getAccession2taxid(types='prot')
## [1] "./prot.accession2taxid.gz"

Convert accessions to database

Then process the downloaded accession files into a more easily accessed form (this could take a while):

read.accession2taxid(list.files('.','accession2taxid.gz$'),'accessionTaxa.sql')
## Reading nucl_est.accession2taxid.gz.
## Reading nucl_gb.accession2taxid.gz.
## Reading nucl_gss.accession2taxid.gz.
## Reading nucl_wgs.accession2taxid.gz.
## Reading in values. This may take a while.
## Adding index. This may also take a while.
## [1] TRUE

Now everything should be ready for processing. All files are cached locally, so the preparation is only required once (or whenever you would like to update the data).

Assigning taxonomy

Finding taxonomy for NCBI accession numbers

First, load the nodes and names files into memory:

taxaNodes<-read.nodes('nodes.dmp')
taxaNames<-read.names('names.dmp')

Then we are ready to convert NCBI accession numbers to taxonomic IDs. For example, to find the taxonomic IDs associated with NCBI accession numbers "LN847353.1" and "AL079352.3":

taxaId<-accessionToTaxa(c("LN847353.1","AL079352.3"),"accessionTaxa.sql")
print(taxaId)
## [1] 1313 9606

And to get the taxonomy for those IDs:

getTaxonomy(taxaId,taxaNodes,taxaNames)
##      superkingdom phylum       class      order            
## 1313 "Bacteria"   "Firmicutes" "Bacilli"  "Lactobacillales"
## 9606 "Eukaryota"  "Chordata"   "Mammalia" "Primates"       
##      family             genus           species                   
## 1313 "Streptococcaceae" "Streptococcus" "Streptococcus pneumoniae"
## 9606 "Hominidae"        "Homo"          "Homo sapiens"
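For the BLAST use case mentioned in the introduction, the accession column of a tabular BLAST result can be fed into the same two calls. The sketch below is an assumption-laden illustration: the three rows are made up, the table is cut down to three columns (query, subject accession, percent identity), and the taxonomizr call is guarded so the example still runs when the package or the accessionTaxa.sql database built earlier is not available.

```r
# Made-up rows in a cut-down BLAST tabular layout (query, subject, percent identity)
blastText <- "read1\tLN847353.1\t99.1
read2\tAL079352.3\t97.5
read3\tLN847353.1\t98.0"

hits <- read.table(text = blastText, sep = "\t", stringsAsFactors = FALSE,
                   col.names = c("query", "accession", "pctIdentity"))

# Look up each distinct accession only once
accessions <- unique(hits$accession)

sqlFile <- "accessionTaxa.sql"  # database created in the preparation step
if (requireNamespace("taxonomizr", quietly = TRUE) && file.exists(sqlFile)) {
  ids <- taxonomizr::accessionToTaxa(accessions, sqlFile)
  # Map the per-accession IDs back onto every hit
  hits$taxaId <- ids[match(hits$accession, accessions)]
}
print(hits)
```

Deduplicating before the lookup keeps the database query small when many reads hit the same subject sequence; the resulting IDs can then be passed to getTaxonomy as shown above.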

Finding taxonomy for taxonomic names

To find the IDs for taxonomic names, use getId:

taxaId<-getId(c('Homo sapiens','Bos taurus','Homo'),taxaNames)
print(taxaId)
## [1] "9606" "9913" "9605"

Then, as before, use getTaxonomy to get the taxonomy for those IDs:

getTaxonomy(taxaId,taxaNodes,taxaNames)
##      superkingdom phylum     class      order      family      genus 
## 9606 "Eukaryota"  "Chordata" "Mammalia" "Primates" "Hominidae" "Homo"
## 9913 "Eukaryota"  "Chordata" "Mammalia" NA         "Bovidae"   "Bos" 
## 9605 "Eukaryota"  "Chordata" "Mammalia" "Primates" "Hominidae" "Homo"
##      species       
## 9606 "Homo sapiens"
## 9913 "Bos taurus"  
## 9605 NA
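As the output above shows, some ranks can be NA, either because the ID refers to a higher rank (9605 is a genus, so it has no species) or because the lineage has no entry at that rank. One way to summarize such a matrix is to take the lowest assigned rank in each row. The base R sketch below rebuilds the matrix printed above by hand so that it runs without the NCBI files; lowestAssigned is a hypothetical name chosen for illustration.

```r
ranks <- c("superkingdom", "phylum", "class", "order", "family", "genus", "species")

# The taxonomy matrix printed above, entered by hand
taxa <- matrix(c(
  "Eukaryota", "Chordata", "Mammalia", "Primates", "Hominidae", "Homo", "Homo sapiens",
  "Eukaryota", "Chordata", "Mammalia", NA,         "Bovidae",   "Bos",  "Bos taurus",
  "Eukaryota", "Chordata", "Mammalia", "Primates", "Hominidae", "Homo", NA
), nrow = 3, byrow = TRUE, dimnames = list(c("9606", "9913", "9605"), ranks))

# For each row, keep the last (most specific) non-NA entry
lowestAssigned <- apply(taxa, 1, function(row) {
  known <- which(!is.na(row))
  if (length(known) == 0) NA_character_ else row[[max(known)]]
})
print(lowestAssigned)
```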

0.2.2 by Scott Sherrill-Mix, a year ago


Browse source code at https://github.com/cran/taxonomizr


Authors: Scott Sherrill-Mix [aut, cre]




GPL-2 license


Imports parallel, RSQLite, data.table, R.utils

Suggests testthat, knitr, rmarkdown

