Using the NCBI Taxonomy database

Last published: 2023-07-09 21:28:01

生信小木屋

The files in the directory https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/ map the accession.version of nucleotide, protein, WGS, and TSA sequence records to taxonomy IDs (taxids) from the NCBI Taxonomy database.

nucl_wgs.accession2taxid.gz           2023-07-03 03:28  4.5G  
nucl_gb.accession2taxid.gz            2023-07-03 03:28  2.1G  
prot.accession2taxid.gz               2023-07-03 03:29  7.8G 

prot.accession2taxid.FULL.1.gz        2023-07-07 23:18  973M  
...
prot.accession2taxid.FULL.gz          2023-07-07 23:20   13G  
nucl_wgs.accession2taxid.EXTRA.gz     2023-07-08 14:56  1.6M  
pdb.accession2taxid.gz                2023-07-03 03:28  5.4M  
dead_nucl.accession2taxid.gz          2023-07-03 03:27  282M  
dead_prot.accession2taxid.gz          2023-07-03 03:27  1.0G  
dead_wgs.accession2taxid.gz           2023-07-03 03:27  748M  

Two sets of files are available for download:

  • The first set contains accession to taxid mapping for live sequence records:
    • nucl_wgs.accession2taxid.gz: TaxID mapping for live nucleotide sequence records of type WGS or TSA.
    • nucl_gb.accession2taxid.gz: TaxID mapping for live nucleotide sequence records that are not WGS or TSA.
    • prot.accession2taxid.gz: TaxID mapping for live protein sequence records which have GI identifiers.
    • prot.accession2taxid.FULL.gz: TaxID mapping for all live protein sequence records, including GI-less WGS proteins.
    • prot.accession2taxid.FULL.NN.gz: TaxID mapping for all live protein sequence records, split into smaller files containing 400 million rows each.
  • The second set contains accession to taxid mapping for dead sequence records:
    • dead_nucl.accession2taxid.gz: TaxID mapping for dead nucleotide sequence records that are not WGS or TSA.
    • dead_wgs.accession2taxid.gz: TaxID mapping for dead nucleotide sequence records of type WGS or TSA.
    • dead_prot.accession2taxid.gz: TaxID mapping for dead protein sequence records.
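
The file layout above can be summarised as a small lookup. The following is only a sketch: the function name accession2taxid_filename and its parameters are hypothetical helpers introduced for illustration, and the has_gi switch simply follows the descriptions of prot.accession2taxid.gz and prot.accession2taxid.FULL.gz given in the list.

```python
def accession2taxid_filename(molecule: str,
                             wgs_or_tsa: bool = False,
                             live: bool = True,
                             has_gi: bool = True) -> str:
    """Return the accession2taxid file expected to hold a given kind of record.

    molecule   -- "nucl" or "prot"
    wgs_or_tsa -- True for WGS or TSA nucleotide records (ignored for proteins)
    live       -- False for dead sequence records
    has_gi     -- False for live proteins without a GI (ignored otherwise)
    """
    if molecule not in ("nucl", "prot"):
        raise ValueError("molecule must be 'nucl' or 'prot'")

    if molecule == "prot":
        if not live:
            return "dead_prot.accession2taxid.gz"
        # Live proteins without a GI are only listed in the FULL file.
        return "prot.accession2taxid.gz" if has_gi else "prot.accession2taxid.FULL.gz"

    if live:
        return "nucl_wgs.accession2taxid.gz" if wgs_or_tsa else "nucl_gb.accession2taxid.gz"
    return "dead_wgs.accession2taxid.gz" if wgs_or_tsa else "dead_nucl.accession2taxid.gz"
```

For example, accession2taxid_filename("nucl", wgs_or_tsa=True) selects the WGS/TSA file for a live nucleotide record.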

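All files have four columns separated by TAB characters, except the prot.accession2taxid.FULL files, which contain only the accession.version and taxid columns.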

The first line of each file is a header line:
accession<TAB>accession.version<TAB>taxid<TAB>gi
  1. Accession: Accession of the sequence record, without a version. e.g. BA000005
  2. Accession.version: Accession of the sequence record together with the version number. e.g. BA000005.3. Some dead sequence records do not have any version number in which case the value in this column will be the accession followed by a dot. e.g. X53318.
  3. TaxId: Taxonomy identifier of the source organism for the sequence record. e.g. 9606. If for some reason the source organism cannot be mapped to the taxonomy database, the column will contain 0.
  4. GI: GI of the sequence record. e.g. 55417888. NCBI is phasing out use of GI numbers, see: http://www.ncbi.nlm.nih.gov/news/03-02-2016-phase-out-of-GI-numbers/. Some sequences such as unannotated WGS and TSA records already lack a GI. If a sequence record does not have a GI assigned, the column will contain na.
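
The column rules above can be turned into a small streaming reader. This is a minimal sketch that assumes the four-column layout described here (it will not work on two-column files); the names Record, parse_line, read_accession2taxid, and lookup_taxids are hypothetical and not part of any NCBI tool. A taxid of 0 is kept as 0 (unmapped organism) and a GI of na becomes None.

```python
import gzip
from typing import Dict, Iterable, Iterator, NamedTuple, Optional


class Record(NamedTuple):
    accession: str
    accession_version: str
    taxid: int          # 0 means the source organism could not be mapped
    gi: Optional[int]   # None when the file contains "na"


def parse_line(line: str) -> Record:
    """Parse one TAB-separated data line of a four-column accession2taxid file."""
    accession, accession_version, taxid, gi = line.rstrip("\r\n").split("\t")
    return Record(accession, accession_version, int(taxid),
                  None if gi == "na" else int(gi))


def read_accession2taxid(path: str) -> Iterator[Record]:
    """Stream records from a gzip-compressed file, skipping the header line."""
    with gzip.open(path, "rt") as handle:
        next(handle, None)  # first line is the header
        for line in handle:
            if line.strip():
                yield parse_line(line)


def lookup_taxids(path: str, wanted: Iterable[str]) -> Dict[str, int]:
    """Map the requested accession.version strings to their taxids."""
    wanted_set = set(wanted)
    found: Dict[str, int] = {}
    for record in read_accession2taxid(path):
        if record.accession_version in wanted_set:
            found[record.accession_version] = record.taxid
    return found
```

Because the files are several gigabytes compressed, the reader streams line by line and keeps only the requested accessions in memory instead of loading the whole table.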

Kraken: the file downloaded by kraken-build --standard --db db is ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz

https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
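
Once a taxid is known, the nodes.dmp file inside taxdump.tar.gz can be used to walk up to the root of the taxonomy. The sketch below is an illustration under stated assumptions: it relies on the taxdump field layout (fields separated by TAB, |, TAB, with tax_id first and parent tax_id second, and the root node listed as its own parent), and the function names and the default local path are hypothetical.

```python
import tarfile
from typing import Dict, Iterable, List


def split_dmp_line(line: str) -> List[str]:
    """Split one line of a taxdump .dmp file into its fields."""
    line = line.rstrip("\r\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the line terminator
    return line.split("\t|\t")


def load_parents(lines: Iterable[str]) -> Dict[int, int]:
    """Build a taxid -> parent taxid map from nodes.dmp lines."""
    parents: Dict[int, int] = {}
    for line in lines:
        if not line.strip():
            continue
        fields = split_dmp_line(line)
        parents[int(fields[0])] = int(fields[1])
    return parents


def lineage(taxid: int, parents: Dict[int, int]) -> List[int]:
    """Return the taxids from the given node up to the root, inclusive."""
    path = [taxid]
    while parents[path[-1]] != path[-1]:  # the root is its own parent
        path.append(parents[path[-1]])
    return path


def load_parents_from_taxdump(path: str = "taxdump.tar.gz") -> Dict[int, int]:
    """Read nodes.dmp straight from a locally downloaded taxdump archive."""
    with tarfile.open(path, "r:gz") as tar:
        member = tar.extractfile("nodes.dmp")
        if member is None:
            raise FileNotFoundError("nodes.dmp is not a regular file in the archive")
        return load_parents(raw.decode("utf-8") for raw in member)
```

A taxid that is absent from nodes.dmp raises KeyError in lineage, which is a deliberate choice here so that taxid 0 (unmapped) records are noticed rather than silently skipped.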

https://cran.r-project.org/web/packages/taxonomizr/vignettes/usage.html
