生信小木屋

The files in the directory https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/ provide a mapping between the accession.version of a nucleotide, protein, WGS, or TSA sequence record and the taxonomy ID (taxid) from the NCBI Taxonomy database.

nucl_wgs.accession2taxid.gz           2023-07-03 03:28  4.5G  
nucl_gb.accession2taxid.gz            2023-07-03 03:28  2.1G  
prot.accession2taxid.gz               2023-07-03 03:29  7.8G 

prot.accession2taxid.FULL.1.gz        2023-07-07 23:18  973M  
...
prot.accession2taxid.FULL.gz          2023-07-07 23:20   13G  
nucl_wgs.accession2taxid.EXTRA.gz     2023-07-08 14:56  1.6M  
pdb.accession2taxid.gz                2023-07-03 03:28  5.4M  
dead_nucl.accession2taxid.gz          2023-07-03 03:27  282M  
dead_prot.accession2taxid.gz          2023-07-03 03:27  1.0G  
dead_wgs.accession2taxid.gz           2023-07-03 03:27  748M  

Two sets of files are available for download.

All files have four columns, separated by TAB characters.

The first line of each file is a header line:
accession<TAB>accession.version<TAB>taxid<TAB>gi
  1. Accession: Accession of the sequence record, without a version. e.g. BA000005
  2. Accession.version: Accession of the sequence record together with the version number. e.g. BA000005.3. Some dead sequence records do not have any version number in which case the value in this column will be the accession followed by a dot. e.g. X53318.
  3. TaxId: Taxonomy identifier of the source organism for the sequence record. e.g. 9606. If for some reason the source organism cannot be mapped to the taxonomy database, the column will contain 0.
  4. GI: GI of the sequence record. e.g. 55417888. NCBI is phasing out use of GI numbers, see: http://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/. Some sequences such as unannotated WGS and TSA records already lack a GI. If a sequence record does not have a GI assigned, the column will contain na.
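The column layout above can be read with a few lines of Python. The following is a minimal sketch, not an official parser: the names Record, parse_accession2taxid and lookup_taxids are hypothetical, and it assumes the four-column layout described above with a single header line. It streams the file instead of loading it, since the listed files are several gigabytes compressed.

```python
import gzip
import io
from typing import Dict, Iterable, Iterator, NamedTuple, Optional, TextIO


class Record(NamedTuple):
    accession: str
    accession_version: str
    taxid: int          # 0 means the source organism could not be mapped
    gi: Optional[int]   # None when the column contains "na"


def parse_accession2taxid(handle: TextIO) -> Iterator[Record]:
    """Yield one Record per data line of a four-column accession2taxid stream."""
    header = handle.readline().rstrip("\r\n").split("\t")
    if len(header) != 4:
        raise ValueError(f"expected 4 header columns, got {len(header)}")
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        accession, accession_version, taxid, gi = line.split("\t")
        yield Record(accession, accession_version, int(taxid),
                     None if gi == "na" else int(gi))


def lookup_taxids(path: str, wanted: Iterable[str]) -> Dict[str, int]:
    """Stream a .gz file and return taxids for the wanted accession.version values."""
    wanted_set = set(wanted)
    found: Dict[str, int] = {}
    with gzip.open(path, "rt") as handle:
        for rec in parse_accession2taxid(handle):
            if rec.accession_version in wanted_set:
                found[rec.accession_version] = rec.taxid
                if len(found) == len(wanted_set):
                    break  # stop early once everything has been found
    return found


# Tiny in-memory demo built from the example values quoted above.
# The way the values are combined into rows is invented for illustration only.
sample = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "BA000005\tBA000005.3\t9606\t55417888\n"
    "X53318\tX53318.\t0\tna\n"
)
records = list(parse_accession2taxid(sample))
```

For a real file, a call such as lookup_taxids("nucl_gb.accession2taxid.gz", ["BA000005.3"]) would scan the download once. For repeated lookups, loading the table into a database (as the taxonomizr vignette linked below does in R) is likely a better fit than rescanning.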

Kraken: the file downloaded by kraken-build --standard --db db is ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz

https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz

https://cran.r-project.org/web/packages/taxonomizr/vignettes/usage.html