Biological taxonomies establish conventions by which researchers can
catalogue and systematically compare their work using nomenclature such
as numeric identifiers and binomial names. The ideal taxonomy is
unambiguous and exhaustive; however, no perfect taxonomy exists. The
degree to which a taxonomy is useful to a researcher depends on context
provided by, for example, the taxonomic neighborhood of a species or the
geological timeframe of the study. Collating the most relevant taxonomic
information from multiple taxonomies is hampered by arbitrary assignment
of numeric identifiers by database administrators, ambiguity in
scientific names, and duplication. The NCBI is the go-to resource for
many scientists, but its taxonomy only includes data on species with
sequence data. In contrast, the Global Biodiversity Information Facility
(GBIF) backbone taxonomy references a more extensive list of extinct and
extant species, and it is integrated with 100 other taxonomic databases.
Unfortunately, the GBIF backbone taxonomy excludes the NCBI taxonomy.
Since the NCBI and GBIF use different numeric identifiers, mapping from
one taxonomy to the other has to rely on scientific names, which are
prone to ambiguity and can lead to mapping errors. Additional lineage
information can be used to validate a mapping by recursively comparing
parental taxon names. The goal of taxonbridge is hence to
provide a set of tools for merging the GBIF backbone and NCBI taxonomies
in order to derive a consistent, deduplicated and disambiguated custom
taxonomy for any given study. See the data provenance flow diagram and
scientific poster for more details.
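The idea of validating a name match by recursively comparing parental taxon names can be sketched in base R. This is a minimal illustration and not the taxonbridge implementation: the helper `count_matching_parents()` is hypothetical, and it assumes that each lineage is a character vector ordered from the matched name upward. The example uses the genus name Morus, which is shared by mulberries (plants) and gannets (birds).

```r
# Hypothetical helper (not part of taxonbridge): count how many consecutive
# names agree between two lineages, starting from the first element.
count_matching_parents <- function(lineage_a, lineage_b) {
  # Stop when either lineage is exhausted
  if (length(lineage_a) == 0 || length(lineage_b) == 0) {
    return(0L)
  }
  # Stop at the first disagreement (case-insensitive comparison)
  if (tolower(lineage_a[1]) != tolower(lineage_b[1])) {
    return(0L)
  }
  # Names agree at this level, so recurse into the next parent
  1L + count_matching_parents(lineage_a[-1], lineage_b[-1])
}

# Lineages ordered genus, family, order
mulberry <- c("Morus", "Moraceae", "Rosales")
gannet   <- c("Morus", "Sulidae", "Suliformes")

count_matching_parents(mulberry, mulberry)  # 3: the whole lineage agrees
count_matching_parents(mulberry, gannet)    # 1: only the genus name agrees
```

A low count for two entries that share a scientific name suggests that the names refer to different organisms and should not be merged on the name alone.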
To install taxonbridge from CRAN, type:
install.packages("taxonbridge")
library(taxonbridge)
Note that the version on CRAN might not reflect the most recent
changes made to the development version of taxonbridge.
You can install the development version of taxonbridge with devtools:
install.packages(c("devtools", "rmarkdown"))
devtools::install_github("MoultDB/taxonbridge", build_vignettes = TRUE)
library(taxonbridge)
taxonbridge can also be updated, re-installed or overwritten with
either of the preceding installation options.
See the taxonbridge documentation for detailed descriptions of the
available methods and see the workflow for how to use the methods.
Custom taxonomies in taxonbridge always have the following 29 columns.
All columns have the character data type. Many of the GBIF column names
are also Darwin Core controlled vocabulary terms.
Column name | Description |
---|---|
taxonID | GBIF identifier |
canonicalName | GBIF/NCBI scientific name |
taxonRank | GBIF rank |
parentNameUsageID | GBIF parent ID |
acceptedNameUsageID | GBIF accepted ID |
originalNameUsageID | GBIF original ID |
taxonomicStatus | GBIF taxonomic status |
kingdom | GBIF kingdom name |
phylum | GBIF phylum name |
class | GBIF class name |
order | GBIF order name |
family | GBIF family name |
genericName | GBIF genus name |
specificEpithet | GBIF species name |
infraspecificEpithet | GBIF subspecies name |
from_GBIF | GBIF provenance indicator |
ncbi_id | NCBI identifier |
ncbi_lineage_names | NCBI full lineage names |
ncbi_lineage_ids | NCBI full lineage IDs |
ncbi_rank | NCBI rank |
ncbi_lineage_ranks | NCBI full lineage ranks |
ncbi_kingdom | NCBI kingdom name |
ncbi_phylum | NCBI phylum name |
ncbi_class | NCBI class name |
ncbi_order | NCBI order name |
ncbi_family | NCBI family name |
ncbi_genus | NCBI genus name |
ncbi_species | NCBI species name |
from_NCBI | NCBI provenance indicator |
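As a quick sanity check, the column listing above can be turned into a small schema test in base R. This is only a sketch: `has_expected_schema()` is a hypothetical helper that is not part of taxonbridge, the toy data frame contains placeholder `NA` values rather than real records, and the check deliberately ignores column order because the order is not stated here.

```r
# The 29 column names from the table above
expected_columns <- c(
  "taxonID", "canonicalName", "taxonRank", "parentNameUsageID",
  "acceptedNameUsageID", "originalNameUsageID", "taxonomicStatus",
  "kingdom", "phylum", "class", "order", "family", "genericName",
  "specificEpithet", "infraspecificEpithet", "from_GBIF",
  "ncbi_id", "ncbi_lineage_names", "ncbi_lineage_ids", "ncbi_rank",
  "ncbi_lineage_ranks", "ncbi_kingdom", "ncbi_phylum", "ncbi_class",
  "ncbi_order", "ncbi_family", "ncbi_genus", "ncbi_species", "from_NCBI"
)

# Hypothetical helper: TRUE if a data frame has exactly the expected
# columns (in any order) and every column is of type character.
has_expected_schema <- function(taxonomy) {
  !anyDuplicated(names(taxonomy)) &&
    setequal(names(taxonomy), expected_columns) &&
    all(vapply(taxonomy, is.character, logical(1)))
}

# Toy one-row data frame with placeholder values, for illustration only
toy <- as.data.frame(
  as.list(setNames(rep(NA_character_, length(expected_columns)),
                   expected_columns)),
  stringsAsFactors = FALSE
)

has_expected_schema(toy)
```

The same helper could be applied to any data frame that is expected to follow this layout before it is passed on to downstream analysis.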
A 2000-row example subset of a previously merged taxonomy is bundled
with taxonbridge and can be loaded as follows:
library(taxonbridge)
example_1 <- load_sample()
Want to try more than a sample? Download a larger dataset and load it as follows:
library(taxonbridge)
example_2 <- load_population("path/to/downloaded/dataset")
You can also prepare a dataset yourself, which requires external data and software. The most current NCBI and GBIF taxonomic data can be downloaded as follows:
download_gbif()
download_ncbi()
Once the downloads are complete, the paths to the downloaded files
will be reported to your terminal. A single file is downloaded from the
GBIF (Taxon.tsv) while four files are downloaded from the
NCBI (nodes.dmp, names.dmp, delnodes.dmp and merged.dmp). Take note of
the location of these files. The NCBI files require further parsing with
Taxonkit. Read the download_ncbi() documentation for instructions on how
to parse the NCBI files.
Downloading the GBIF and NCBI taxonomic data, parsing the NCBI files, and merging the taxonomies can easily be carried out in one command if Taxonkit is already installed on your system:
library(taxonbridge)
custom_taxonomy <- load_taxonomies(download_gbif(), download_ncbi(taxonkitpath = "/path/to/taxonkit"))
See the workflow and vignette for more ideas on what to do with loaded
data in taxonbridge.
Global Biodiversity Information Facility (GBIF) backbone taxonomy
National Center for Biotechnology Information (NCBI) taxonomy
To cite taxonbridge in publications use:
Veldsman WP, Campli G, Dind S, Rech de Laval V, Drage HB, Waterhouse RM and Robinson-Rechavi M (2022) Taxonbridge: an R package to create custom taxonomies based on the NCBI and GBIF taxonomies, bioRxiv, 490269. DOI: https://doi.org/10.1101/2022.05.02.490269