https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/
The files in this directory provide a mapping between the accession.version of a nucleotide, protein, WGS, or TSA sequence record and the taxonomy ID (taxid) from the NCBI Taxonomy database.
nucl_wgs.accession2taxid.gz 2023-07-03 03:28 4.5G
nucl_gb.accession2taxid.gz 2023-07-03 03:28 2.1G
prot.accession2taxid.gz 2023-07-03 03:29 7.8G
prot.accession2taxid.FULL.1.gz 2023-07-07 23:18 973M
...
prot.accession2taxid.FULL.gz 2023-07-07 23:20 13G
nucl_wgs.accession2taxid.EXTRA.gz 2023-07-08 14:56 1.6M
pdb.accession2taxid.gz 2023-07-03 03:28 5.4M
dead_nucl.accession2taxid.gz 2023-07-03 03:27 282M
dead_prot.accession2taxid.gz 2023-07-03 03:27 1.0G
dead_wgs.accession2taxid.gz 2023-07-03 03:27 748M
Two sets of files are available for download:
nucl_wgs.accession2taxid.gz
: TaxID mapping for live nucleotide sequence records of type WGS or TSA.
nucl_gb.accession2taxid.gz
: TaxID mapping for live nucleotide sequence records that are not WGS or TSA.
prot.accession2taxid.gz
: TaxID mapping for live protein sequence records that have GI identifiers.
prot.accession2taxid.FULL.gz
: TaxID mapping for all live protein sequence records, including GI-less WGS proteins.
prot.accession2taxid.FULL.NN.gz
: TaxID mapping for all live protein sequence records, split into smaller files containing 400 million rows each.
dead_nucl.accession2taxid.gz
: TaxID mapping for dead nucleotide sequence records that are not WGS or TSA.
dead_wgs.accession2taxid.gz
: TaxID mapping for dead nucleotide sequence records of type WGS or TSA.
dead_prot.accession2taxid.gz
: TaxID mapping for dead protein sequence records.

All files have four columns, separated by TAB characters. The first line of each file is a header line:
accession<TAB>accession.version<TAB>taxid<TAB>gi
Accession
: Accession of the sequence record, without a version, e.g. BA000005.
Accession.version
: Accession of the sequence record together with the version number, e.g. BA000005.3. Some dead sequence records do not have a version number, in which case the value in this column is the accession followed by a dot, e.g. "X53318." (with the trailing dot).
TaxId
: Taxonomy identifier of the source organism for the sequence record, e.g. 9606. If the source organism cannot be mapped to the taxonomy database, the column contains 0.
GI
: GI of the sequence record, e.g. 55417888. NCBI is phasing out the use of GI numbers; see http://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/. Some sequences, such as unannotated WGS and TSA records, already lack a GI. If a sequence record does not have a GI assigned, the column contains na.

Kraken
kraken-build --standard --db db
Downloaded file: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
https://cran.r-project.org/web/packages/taxonomizr/vignettes/usage.html
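
As a rough illustration of the four-column layout described above, the following Python sketch streams a gzipped accession2taxid file and looks up taxids for a handful of accession.version values without loading the whole file into memory. It uses only the standard library. The names `Row`, `parse_rows`, `lookup_taxids`, and `EXPECTED_HEADER` are made up for this example and are not part of any NCBI or Kraken tooling. It assumes the file starts with the header line shown earlier and stops with an error if it does not.

```python
import csv
import gzip
from typing import Dict, Iterable, Iterator, NamedTuple, Optional

# Header line as documented above; files with a different layout are rejected.
EXPECTED_HEADER = ["accession", "accession.version", "taxid", "gi"]


class Row(NamedTuple):
    accession: str
    accession_version: str  # may end with "." for dead records without a version
    taxid: int              # 0 means the source organism could not be mapped
    gi: Optional[int]       # None when the column contains "na"


def parse_rows(lines: Iterable[str]) -> Iterator[Row]:
    """Yield one Row per data line, after checking the header line."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header!r}")
    for fields in reader:
        if not fields:  # tolerate blank lines
            continue
        accession, accession_version, taxid, gi = fields
        yield Row(
            accession,
            accession_version,
            int(taxid),
            None if gi == "na" else int(gi),
        )


def lookup_taxids(path: str, wanted: Iterable[str]) -> Dict[str, int]:
    """Scan a gzipped file once and return {accession.version: taxid}."""
    remaining = set(wanted)
    found: Dict[str, int] = {}
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in parse_rows(handle):
            if row.accession_version in remaining:
                found[row.accession_version] = row.taxid
                remaining.discard(row.accession_version)
                if not remaining:  # stop early once everything is found
                    break
    return found
```

A call such as `lookup_taxids("nucl_gb.accession2taxid.gz", ["BA000005.3"])` would scan a locally downloaded copy of the file. Accessions that never appear are simply absent from the returned dictionary. For repeated lookups against multi-gigabyte files, loading the data into an indexed database once is likely a better fit than rescanning.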