Functions for assigning taxonomy to NCBI accession numbers and taxon IDs based on NCBI's accession2taxid and taxdump files. This package allows the user to download NCBI data dumps and create a local database for fast, local taxonomic assignment.
taxonomizr provides some simple functions to parse NCBI taxonomy files and accession dumps and efficiently use them to assign taxonomy to accession numbers or taxonomic IDs. This is useful, for example, for assigning taxonomy to BLAST results. This is all done locally after downloading the appropriate files from NCBI using included functions (see below).
Once the package is on CRAN, it should install with a simple:
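The command itself is not shown here. Assuming the package is published on CRAN under the name `taxonomizr`, the standard call would be:

```r
# Assumes the package is available on CRAN under the name "taxonomizr"
install.packages("taxonomizr")
```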
To install the development version directly from GitHub, use the devtools library and run:
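A sketch of that call, assuming the development repository is `sherrillmix/taxonomizr` on GitHub (the repository path is not stated in this text, so check it before running):

```r
# Install devtools first if needed: install.packages("devtools")
# Assumed repository path; adjust if the project lives elsewhere
devtools::install_github("sherrillmix/taxonomizr")
```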
To use the library, load it in R:
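Loading follows the usual R convention for an installed package:

```r
# Attach the package so its functions can be called directly
library(taxonomizr)
```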
To avoid constant internet access and slow APIs, the first step in using the package is to download all necessary files from NCBI. This uses a bit of disk space but makes future access reliable and fast.
Note: It is not necessary to manually check for the presence of these files since the functions automatically check to see if their output is present and if so skip downloading/processing. Delete the local files if you would like to redownload or reprocess them.
First, download the necessary names and nodes files from NCBI:
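The call is not shown here. A minimal sketch, assuming the package's `getNamesAndNodes()` helper is run with its defaults, which would place the files in the current working directory as in the output shown:

```r
library(taxonomizr)

# Downloads NCBI's taxdump archive and extracts names.dmp and nodes.dmp.
# Assumption: default arguments write into the current working directory
# and the download is skipped when the files already exist.
getNamesAndNodes()
```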
##  "./names.dmp" "./nodes.dmp"
Then download the accession-to-taxonomic-ID conversion files from NCBI. Note: this is a pretty big download (several gigabytes):
#this is a big download
getAccession2taxid()
##  "./nucl_gb.accession2taxid.gz" "./nucl_est.accession2taxid.gz"
##  "./nucl_gss.accession2taxid.gz" "./nucl_wgs.accession2taxid.gz"
If you would also like to identify protein accession numbers, download the prot file from NCBI as well (again, this is a big download):
#this is a big download
getAccession2taxid(types='prot')
##  "./prot.accession2taxid.gz"
Then process the downloaded accession files into a more easily accessed form (this could take a while):
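The processing call is not shown here. A sketch, assuming the package's `read.accession2taxid()` function takes the downloaded files followed by the path of the SQLite database to create; the output name `accessionTaxa.sql` is an illustrative choice, not something stated in this text:

```r
library(taxonomizr)

# Collect every downloaded accession2taxid file in the working directory
accessionFiles <- list.files(".", "accession2taxid.gz$")

# Assumed output name for the local SQLite database; any writable path works
sqlFile <- "accessionTaxa.sql"

# Parse the dumps into an indexed database (slow on the first run)
read.accession2taxid(accessionFiles, sqlFile)
```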
## Reading nucl_est.accession2taxid.gz.
## Reading nucl_gb.accession2taxid.gz.
## Reading nucl_gss.accession2taxid.gz.
## Reading nucl_wgs.accession2taxid.gz.
## Reading in values. This may take a while.
## Adding index. This may also take a while.
##  TRUE
Now everything should be ready for processing. All files are cached locally, so the preparation is only required once (or whenever you would like to update the data).
First, load the nodes and names files into memory:
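A sketch of the loading step, assuming the package's `read.nodes()` and `read.names()` readers and the default file locations from the download step. The variable names `taxaNodes` and `taxaNames` are chosen to match the `taxaNames` argument used further down:

```r
library(taxonomizr)

# Parent/rank table for every taxonomic ID
taxaNodes <- read.nodes("nodes.dmp")

# Scientific names keyed by taxonomic ID
taxaNames <- read.names("names.dmp")
```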
Then we are ready to convert NCBI accession numbers to taxonomic IDs. For example, to find the taxonomic IDs associated with NCBI accession numbers "LN847353.1" and "AL079352.3":
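The lookup itself is not shown here. A sketch, assuming the package's `accessionToTaxa()` function and a database file named `accessionTaxa.sql` (an assumed name; use whatever path was given when the accession files were processed):

```r
library(taxonomizr)

# Query the local SQLite database built from the accession2taxid files.
# "accessionTaxa.sql" is an assumed file name.
taxaId <- accessionToTaxa(c("LN847353.1", "AL079352.3"), "accessionTaxa.sql")
print(taxaId)
```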
##  1313 9606
And to get the taxonomy for those IDs:
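A sketch of that step, assuming the package's `getTaxonomy()` function takes the IDs followed by the in-memory nodes and names tables. `taxaNodes` and `taxaNames` are assumed to hold the loaded contents of `nodes.dmp` and `names.dmp`:

```r
library(taxonomizr)

# taxaId: the two IDs found for the accessions above
taxaId <- c(1313, 9606)

# Assumes taxaNodes and taxaNames were already loaded from
# nodes.dmp and names.dmp
taxonomy <- getTaxonomy(taxaId, taxaNodes, taxaNames)
print(taxonomy)
```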
##      superkingdom phylum       class      order
## 1313 "Bacteria"   "Firmicutes" "Bacilli"  "Lactobacillales"
## 9606 "Eukaryota"  "Chordata"   "Mammalia" "Primates"
##      family             genus           species
## 1313 "Streptococcaceae" "Streptococcus" "Streptococcus pneumoniae"
## 9606 "Hominidae"        "Homo"          "Homo sapiens"
If you'd like to find IDs for taxonomic names then you can do something like:
taxaId<-getId(c('Homo sapiens','Bos taurus','Homo'),taxaNames)
print(taxaId)
##  "9606" "9913" "9605"
And again, to get the taxonomy for those IDs, use:
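The call is not shown here. Assuming the package's `getTaxonomy()` function, with `taxaId` holding the result of the `getId()` call and `taxaNodes` and `taxaNames` holding the loaded nodes and names tables, it would look something like:

```r
library(taxonomizr)

# Assumes taxaId comes from getId() and that taxaNodes and taxaNames
# were already loaded from nodes.dmp and names.dmp
nameTaxonomy <- getTaxonomy(taxaId, taxaNodes, taxaNames)
print(nameTaxonomy)
```

Ranks that a taxon does not have are reported as `NA`, as with the missing species for the genus-level ID 9605 in the output.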
##      superkingdom phylum     class      order      family      genus
## 9606 "Eukaryota"  "Chordata" "Mammalia" "Primates" "Hominidae" "Homo"
## 9913 "Eukaryota"  "Chordata" "Mammalia" NA         "Bovidae"   "Bos"
## 9605 "Eukaryota"  "Chordata" "Mammalia" "Primates" "Hominidae" "Homo"
##      species
## 9606 "Homo sapiens"
## 9913 "Bos taurus"
## 9605 NA