
BLAST

Last published: 2023-08-08 22:12:04

Study materials

Types of BLAST

BLAST stands for Basic Local Alignment Search Tool. It is used to find regions of similarity between biological sequences.

生信小木屋

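BLASTp searches a protein sequence database with a protein query.
BLASTn searches a nucleotide sequence database with a nucleotide query. These two are the most direct and most commonly used kinds of BLAST.
BLASTx translates a nucleotide query in all six reading frames and searches a protein sequence database with the translations. Why six frames? When the translation start site is unknown, translation may begin at the first base and proceed three bases at a time, or it may begin at the second base, or at the third. It may also begin on the complementary strand, which gives three more possible start positions. Together that makes six candidate protein sequences. Some of the six may correspond to real proteins and others may not, but we cannot tell which in advance, so all six are searched against the database.

The next question: since the query is a nucleotide sequence, why not run BLASTn against a nucleotide database instead of searching a protein database? There are good reasons to do so. For example, the nucleotide database may contain nothing similar to your sequence, or the similar sequences it does contain may carry no useful annotation. In that case you can try a protein database and see whether the translation products of your nucleotide sequence match any protein sequences that come with meaningful annotation. Put another way, when you are not looking for nucleotide sequences similar to your nucleotide sequence, but for proteins similar to the protein it encodes, BLASTx is the tool to use.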
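
The six-frame idea can be sketched in a few lines of shell and awk. This is only an illustration of the concept, not how BLASTx is implemented: the function name `six_frame_translate` and its awk helpers are made up for this example, the input is assumed to contain only uppercase A, C, G and T, and the standard genetic code is assumed.

```bash
# Illustrative six-frame translation (hypothetical helper, standard genetic code assumed).
six_frame_translate() {   # usage: six_frame_translate DNA
    awk -v seq="$1" '
    # Reverse-complement a DNA string made of uppercase A, C, G, T.
    function revcomp(dna,    i, out) {
        for (i = length(dna); i >= 1; i--)
            out = out substr("TGCA", index("ACGT", substr(dna, i, 1)), 1)
        return out
    }
    # Translate codon by codon from the first base; a trailing partial codon is dropped.
    function translate(dna,    i, j, idx, protein) {
        for (i = 1; i + 2 <= length(dna); i += 3) {
            idx = 0
            # Read the codon as a base-4 number with T=0, C=1, A=2, G=3.
            for (j = 0; j < 3; j++)
                idx = idx * 4 + index("TCAG", substr(dna, i + j, 1)) - 1
            protein = protein substr(aas, idx + 1, 1)
        }
        return protein
    }
    BEGIN {
        # Amino acids for the 64 codons, ordered with bases T, C, A, G at each position.
        aas = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
        rc = revcomp(seq)
        # Frames +1..+3 on the given strand, then -1..-3 on the reverse complement.
        for (f = 1; f <= 3; f++) print "+" f, translate(substr(seq, f))
        for (f = 1; f <= 3; f++) print "-" f, translate(substr(rc, f))
    }'
}

six_frame_translate "ATGGCCTAA"
```

Each output line is a frame label followed by its translation, with `*` marking a stop codon. For the toy input, frame +1 reads ATG GCC TAA and gives `MA*`, while the other five frames give different and mostly meaningless peptides, which is why all six have to be searched.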

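Conversely, when you are not looking for proteins similar to the protein in hand, but for nucleotide sequences similar to the nucleotide sequence that encodes it, use tBLASTn. tBLASTn searches a nucleotide database with a protein query; each nucleotide sequence in the database is translated in all six reading frames before it is compared. You might ask: nucleotide databases already annotate which protein a given nucleotide sequence encodes, so why translate every sequence in all six frames? Because what you see is only what has been annotated, and much has not been. Even the existing annotation reflects only what has been worked out so far. Remember that genes can overlap: the annotation may say that a stretch of DNA encodes a certain protein, while an undiscovered gene also uses that same stretch. The protein you are searching with may be the product of exactly that undiscovered gene, so every possible translation of the nucleotide sequence has to be generated for the search to find it.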

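Taking this approach to its limit gives tBLASTx. It translates the nucleotide query in all six reading frames and searches a nucleotide database whose sequences are likewise translated in all six reading frames. Similarities that BLASTn misses at the nucleotide level can sometimes be found this way, because similarity between encoded proteins often persists after the nucleotide sequences themselves have diverged.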

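These three translate-then-search variants are mainly used for newly discovered sequences. For sequences that are already well studied, the first two kinds of BLAST are usually enough. Figure 1 is a schematic of the BLAST variants; it makes it easier to remember which kind of query each variant searches against which kind of database.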
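
The mapping from query type and database type to program name can also be written down as a small lookup. The function below is a hypothetical helper for this article, not part of BLAST+; the labels `nucl`, `prot` and `translated` are conventions chosen here for illustration.

```bash
# Map (query type, database type) to a BLAST program name.
# Types are "nucl" or "prot". Pass "translated" as a third argument to compare
# a nucleotide query with a nucleotide database at the protein level.
choose_blast() {
    case "$1:$2:${3:-direct}" in
        prot:prot:*)          echo blastp ;;
        nucl:nucl:translated) echo tblastx ;;
        nucl:nucl:*)          echo blastn ;;
        nucl:prot:*)          echo blastx ;;
        prot:nucl:*)          echo tblastn ;;
        *) echo "unknown combination: $1 $2" >&2; return 1 ;;
    esac
}

choose_blast nucl prot   # a nucleotide query against a protein database
```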

Besides being classified by what is searched, BLAST can also be divided by search algorithm into standard BLAST, PSI-BLAST, PHI-BLAST, and others.

How BLAST works


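The basic principle of BLAST is simple; the key idea is the segment pair. A segment pair is a pair of subsequences, one from each of the two given sequences, that have equal length and form an exact match without gaps. The boxes in panel A mark two segment pairs. BLAST scans the two sequences from start to end, finds all segment pairs, extends each one for as long as the score stays within an allowed threshold, and finally reports the high-scoring segment pairs (HSPs) (panel B). The cost of this scan grows roughly linearly with n, the sequence length. A full pairwise alignment, by contrast, has to fill an n-by-n table, so its cost grows with n squared. Finding HSPs therefore saves a great deal of time compared with full pairwise alignment, at the price of some accuracy.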
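
The scan-and-extend idea can be tried out on two short strings. The sketch below is a toy version under stated assumptions, not the BLAST implementation: the function name `find_hsps`, the word size of 4, the +1/-2 match/mismatch scores and the drop-off of 3 are all illustrative choices, and no statistics or gapped extension are included.

```bash
# Toy seed-and-extend: exact words shared by two sequences are extended without gaps.
# Word size, scores (+1 match, -2 mismatch) and the drop-off are illustrative assumptions.
find_hsps() {   # usage: find_hsps QUERY SUBJECT [WORD_SIZE] [XDROP]
    awk -v q="$1" -v s="$2" -v w="${3:-4}" -v xdrop="${4:-3}" '
    # Extend the seed at query position qi and subject position sj in both directions,
    # stopping once the score falls more than xdrop below the best score seen.
    function extend(qi, sj,    score, best, left, right, k, key) {
        score = best = w; right = w - 1; left = 0
        for (k = w; qi + k <= length(q) && sj + k <= length(s); k++) {
            score += (substr(q, qi + k, 1) == substr(s, sj + k, 1)) ? 1 : -2
            if (score > best) { best = score; right = k }
            else if (best - score > xdrop) break
        }
        score = best
        for (k = 1; qi - k >= 1 && sj - k >= 1; k++) {
            score += (substr(q, qi - k, 1) == substr(s, sj - k, 1)) ? 1 : -2
            if (score > best) { best = score; left = k }
            else if (best - score > xdrop) break
        }
        # Report each distinct segment pair once: qstart qend sstart send score.
        key = (qi - left) " " (qi + right) " " (sj - left) " " (sj + right) " " best
        if (!(key in seen)) { seen[key] = 1; print key }
    }
    BEGIN {
        # Index every word of length w in the query by its start position.
        for (i = 1; i + w - 1 <= length(q); i++)
            pos[substr(q, i, w)] = pos[substr(q, i, w)] " " i
        # Scan the subject once; each word that also occurs in the query is a seed.
        for (j = 1; j + w - 1 <= length(s); j++) {
            n = split(pos[substr(s, j, w)], starts, " ")
            for (m = 1; m <= n; m++) extend(starts[m], j)
        }
    }'
}

find_hsps AAACCCGGGTTT TTTCCCGGGAAA
```

For these two strings the three overlapping seeds (CCCG, CCGG, CGGG) all extend to the same segment pair, so a single line `4 9 4 9 6` is printed: positions 4 to 9 in both sequences with a score of 6. The subject is scanned once against a word index of the query, which is where the roughly linear cost comes from.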

Database searching with DNA and protein sequences: An introduction


A small example

Download the database

mkdir blastdb && cd blastdb
update_blastdb.pl --passive --decompress 16S_ribosomal_RNA
cd ..

Use blastdbcmd to extract the sequence NR_025000 from the installed database (16S_ribosomal_RNA) into a text file (16S_query.fa)

blastdbcmd -db blastdb/16S_ribosomal_RNA -entry NR_025000 -out 16S_query.fa
head 16S_query.fa
>NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal RNA, partial sequence
GTGCTTAACACATGCAAGTCGAACGGAAAGGCCCCTTCGGGGGTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGA
TCTACCCTGCACTTCGGGATAAGCCTGGGAAACTGGGTCTAATACCGGATAGGACCATGAGATGCATGTCTTATGGTGGA
AAGCTTTTGCGGTGTGGGATGGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTGACGGCCTACCAAGGCGACGACGGGTAG

Run blastn with 16S_query.fa as the query against the database blastdb/16S_ribosomal_RNA

blastn \
	-db blastdb/16S_ribosomal_RNA  \
	-query 16S_query.fa \
	-task blastn \
	-dust no \
	-outfmt "7 delim=, qacc sacc evalue bitscore qcovus pident" \
	-max_target_seqs 5
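  • -task blastn: selects the search algorithm. The options are blastn, blastn-short, dc-megablast, megablast (the default), and rmblastn
  • -dust no: disables DUST low-complexity filtering of the query sequence
  • -outfmt "7 delim=, ...": tabular output with comment lines, comma-delimited, showing only the listed columns
  • -max_target_seqs 5: keeps at most 5 aligned database sequences
  • With no -out option, the results are printed straight to the console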
# BLASTN 2.13.0+
# Query: NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal RNA, partial sequence
# Database: blastdb/16S_ribosomal_RNA
# Fields: query acc., subject acc., evalue, bit score, % query coverage per uniq subject, % identity
# 5 hits found
NR_025000.1,NR_025000,0.0,2383,100,100.000
NR_025000.1,NR_028940,0.0,2334,100,99.243
NR_025000.1,NR_125568,0.0,2320,100,98.940
NR_025000.1,NR_118110,0.0,2302,100,98.637
NR_025000.1,NR_117220,0.0,2302,100,98.637
# BLAST processed 1 queries
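
A comma-delimited table like this is easy to post-process. As a minimal sketch, the hypothetical helper `filter_hits` below keeps only hits at or above a percent-identity cutoff; it assumes the six-column order used in this run (qacc, sacc, evalue, bitscore, qcovus, pident) and skips the `#` comment lines. The sample rows are copied from the output above.

```bash
# Keep hits at or above a percent-identity cutoff from the comma-delimited table.
# Assumed column order: qacc,sacc,evalue,bitscore,qcovus,pident.
filter_hits() {   # usage: filter_hits MIN_PIDENT < table.csv
    awk -F',' -v min="$1" '!/^#/ && NF == 6 && $6 + 0 >= min + 0 { print $2, $6 }'
}

filter_hits 99 <<'EOF'
# BLASTN 2.13.0+
# 5 hits found
NR_025000.1,NR_025000,0.0,2383,100,100.000
NR_025000.1,NR_028940,0.0,2334,100,99.243
NR_025000.1,NR_125568,0.0,2320,100,98.940
EOF
```

With a cutoff of 99 only NR_025000 and NR_028940 are printed, each followed by its percent identity.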

Output format parameter

-outfmt <String>
  alignment view options:
    0 = Pairwise,
    1 = Query-anchored showing identities,
    2 = Query-anchored no identities,
    3 = Flat query-anchored showing identities,
    4 = Flat query-anchored no identities,
    5 = BLAST XML,
    6 = Tabular,
    7 = Tabular with comment lines,
    8 = Seqalign (Text ASN.1),
    9 = Seqalign (Binary ASN.1),
   10 = Comma-separated values,
   11 = BLAST archive (ASN.1),
   12 = Seqalign (JSON),
   13 = Multiple-file BLAST JSON,
   14 = Multiple-file BLAST XML2,
   15 = Single-file BLAST JSON,
   16 = Single-file BLAST XML2,
   17 = Sequence Alignment/Map (SAM),
   18 = Organism Report
  
  Options 6, 7, 10 and 17 can be additionally configured to produce
  a custom format specified by space delimited format specifiers,
  or in the case of options 6, 7, and 10, by a token specified
  by the delim keyword. E.g.: "17 delim=@ qacc sacc score".
  The delim keyword must appear after the numeric output format
  specification.
  The supported format specifiers for options 6, 7 and 10 are:
  	    qseqid means Query Seq-id
  	       qgi means Query GI
  	      qacc means Query accesion
  	   qaccver means Query accesion.version
  	      qlen means Query sequence length
  	    sseqid means Subject Seq-id
  	 sallseqid means All subject Seq-id(s), separated by a ';'
  	       sgi means Subject GI
  	    sallgi means All subject GIs
  	      sacc means Subject accession
  	   saccver means Subject accession.version
  	   sallacc means All subject accessions
  	      slen means Subject sequence length
  	    qstart means Start of alignment in query
  	      qend means End of alignment in query
  	    sstart means Start of alignment in subject
  	      send means End of alignment in subject
  	      qseq means Aligned part of query sequence
  	      sseq means Aligned part of subject sequence
  	    evalue means Expect value
  	  bitscore means Bit score
  	     score means Raw score
  	    length means Alignment length
  	    pident means Percentage of identical matches
  	    nident means Number of identical matches
  	  mismatch means Number of mismatches
  	  positive means Number of positive-scoring matches
  	   gapopen means Number of gap openings
  	      gaps means Total number of gaps
  	      ppos means Percentage of positive-scoring matches
  	    frames means Query and subject frames separated by a '/'
  	    qframe means Query frame
  	    sframe means Subject frame
  	      btop means Blast traceback operations (BTOP)
  	    staxid means Subject Taxonomy ID
  	  ssciname means Subject Scientific Name
  	  scomname means Subject Common Name
  	sblastname means Subject Blast Name
  	 sskingdom means Subject Super Kingdom
  	   staxids means unique Subject Taxonomy ID(s), separated by a ';'
  			 (in numerical order)
  	 sscinames means unique Subject Scientific Name(s), separated by a ';'
  	 scomnames means unique Subject Common Name(s), separated by a ';'
  	sblastnames means unique Subject Blast Name(s), separated by a ';'
  			 (in alphabetical order)
  	sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
  			 (in alphabetical order) 
  	    stitle means Subject Title
  	salltitles means All Subject Title(s), separated by a '<>'
  	   sstrand means Subject Strand
  	     qcovs means Query Coverage Per Subject
  	   qcovhsp means Query Coverage Per HSP
  	    qcovus means Query Coverage Per Unique Subject (blastn only)
  When not provided, the default value is:
  'qaccver saccver pident length mismatch gapopen qstart qend sstart send
  evalue bitscore', which is equivalent to the keyword 'std'
  The supported format specifier for option 17 is:
  	        SQ means Include Sequence Data
  	        SR means Subject as Reference Seq
  Default = `0'
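
When the table is written with `-outfmt 6` or `-outfmt 7` and no custom specifiers, the twelve `std` columns listed above carry no header in the data rows. The sketch below labels them for easier reading; `label_std` is a hypothetical helper, and the sample row is not captured output but was reconstructed from the self-hit of NR_025000.1 reported in this article's default output (100% identity over 1321 bases, E-value 0.0, 2440 bits).

```bash
# Label the 12 default ("std") columns of tab-separated -outfmt 6 / 7 output.
label_std() {   # usage: label_std < table.tsv
    awk -F'\t' '
    BEGIN {
        n = split("qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore", name, " ")
    }
    # Skip comment lines and any row that does not have exactly 12 fields.
    !/^#/ && NF == n {
        line = ""
        for (i = 1; i <= n; i++) line = line (i > 1 ? " " : "") name[i] "=" $i
        print line
    }'
}

# Illustrative row reconstructed from the self-hit described in this article.
printf 'NR_025000.1\tNR_025000.1\t100.000\t1321\t0\t0\t1\t1321\t1\t1321\t0.0\t2440\n' | label_std
```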

Without any output-format option, the default BLAST output consists of three parts

BLASTN 2.13.0+


Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.



Database: 16S ribosomal RNA (Bacteria and Archaea type strains)
           26,807 sequences; 38,858,731 total letters



Query= NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal
RNA, partial sequence

Length=1321
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal...  2440    0.0  
NR_028940.1 Mycobacterium palustre strain E846 16S ribosomal RNA,...  2383    0.0  
NR_125568.1 Mycobacterium europaeum strain DSM 45397 16S ribosoma...  2362    0.0  
NR_113062.1 Mycobacterium simiae strain ATCC 25275 16S ribosomal ...  2346    0.0  
NR_117227.1 Mycobacterium simiae strain ATCC 25275 16S ribosomal ...  2346    0.0  
>NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal RNA, partial 
sequence
Length=1321

 Score = 2440 bits (1321),  Expect = 0.0
 Identities = 1321/1321 (100%), Gaps = 0/1321 (0%)
 Strand=Plus/Plus

Query  1     GTGCTTAACACATGCAAGTCGAACGGAAAGGCCCCTTCGGGGGTACTCGAGTGGCGAACG  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1     GTGCTTAACACATGCAAGTCGAACGGAAAGGCCCCTTCGGGGGTACTCGAGTGGCGAACG  60

Query  61    GGTGAGTAACACGTGGGTGATCTACCCTGCACTTCGGGATAAGCCTGGGAAACTGGGTCT  120
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  61    GGTGAGTAACACGTGGGTGATCTACCCTGCACTTCGGGATAAGCCTGGGAAACTGGGTCT  120

Query  121   AATACCGGATAGGACCATGAGATGCATGTCTTATGGTGGAAAGCTTTTGCGGTGTGGGAT  180
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  121   AATACCGGATAGGACCATGAGATGCATGTCTTATGGTGGAAAGCTTTTGCGGTGTGGGAT  180

Query  181   GGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTGACGGCCTACCAAGGCGACGACGGGTAG  240
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  181   GGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTGACGGCCTACCAAGGCGACGACGGGTAG  240

Query  241   CCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATACGGCCCAGACTCCTACGGGA  300
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  241   CCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATACGGCCCAGACTCCTACGGGA  300

Query  301   GGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGG  360
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  301   GGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGG  360

Query  361   GATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGCAAGTGACGGTACC  420
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  361   GATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGCAAGTGACGGTACC  420

Query  421   TGCAGAAGAAGCACCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGC  480
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  421   TGCAGAAGAAGCACCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGC  480

Query  481   GTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTTGTTCGTGAAA  540
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  481   GTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTTGTTCGTGAAA  540

Query  541   ACCGGGGGCTTAACCCTCGGCGTGCGGGCGATACGGGCAGACTGGAGTACTGCAGGGGAG  600
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  541   ACCGGGGGCTTAACCCTCGGCGTGCGGGCGATACGGGCAGACTGGAGTACTGCAGGGGAG  600

Query  601   ACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAA  660
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  601   ACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAA  660

Query  661   GGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATT  720
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  661   GGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATT  720

Query  721   AGATACCCTGGTAGTCCACGCCGTAAACGGTGGGTACTAGGTGTGGGTTTCCTTCCTTGG  780
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  721   AGATACCCTGGTAGTCCACGCCGTAAACGGTGGGTACTAGGTGTGGGTTTCCTTCCTTGG  780

Query  781   GATCCGTGCCGTAGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGGCCGCAAGGCTAA  840
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  781   GATCCGTGCCGTAGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGGCCGCAAGGCTAA  840

Query  841   AACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGCGGAGCATGTGGATTAATTCGATGC  900
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  841   AACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGCGGAGCATGTGGATTAATTCGATGC  900

Query  901   AACGCGAAGAACCTTACCTGGGTTTGACATGCACAGGACGCGTCTAGAGATAGGCGTTCC  960
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  901   AACGCGAAGAACCTTACCTGGGTTTGACATGCACAGGACGCGTCTAGAGATAGGCGTTCC  960

Query  961   CTTGTGGCCTGTGTGCAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGG  1020
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  961   CTTGTGGCCTGTGTGCAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGG  1020

Query  1021  TTAAGTCCCGCAACGAGCGCAACCCTTGTCTCATGTTGCCAGCGGGTAATGCCGGGGACT  1080
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1021  TTAAGTCCCGCAACGAGCGCAACCCTTGTCTCATGTTGCCAGCGGGTAATGCCGGGGACT  1080

Query  1081  CGTGAGAGACTGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCC  1140
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1081  CGTGAGAGACTGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCC  1140

Query  1141  CCTTATGTCCAGGGCTTCACACATGCTACAATGGCCGGTACAAAGGGCTGCGATGCCGCG  1200
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1141  CCTTATGTCCAGGGCTTCACACATGCTACAATGGCCGGTACAAAGGGCTGCGATGCCGCG  1200

Query  1201  AGGTTAAGCGAATCCTTTTAAAGCCGGTCTCAGTTCGGATCGGGGTCTGCAACTCGACCC  1260
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1201  AGGTTAAGCGAATCCTTTTAAAGCCGGTCTCAGTTCGGATCGGGGTCTGCAACTCGACCC  1260

Query  1261  CGTGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGG  1320
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1261  CGTGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGG  1320

Query  1321  G  1321
             |
Sbjct  1321  G  1321
Lambda      K        H
    1.33    0.621     1.12 

Gapped
Lambda      K        H
    1.28    0.460    0.850 

Effective search space used: 49492368576


  Database: 16S ribosomal RNA (Bacteria and Archaea type strains)
    Posted date:  Feb 4, 2023  5:36 AM
  Number of letters in database: 38,858,731
  Number of sequences in database:  26,807
  
  
Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5
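
The statistics at the end of the report tie the raw score to the bit score. Assuming the usual Karlin-Altschul normalization, bit score = (Lambda x raw score - ln K) / ln 2, and using the gapped Lambda and K printed above, the self-hit's raw score of 1321 can be checked with a one-line calculation. The function name `bit_score` is made up for this sketch.

```bash
# Convert a raw alignment score to a bit score: (lambda * S - ln K) / ln 2.
bit_score() {   # usage: bit_score RAW_SCORE LAMBDA K
    awk -v s="$1" -v lambda="$2" -v k="$3" \
        'BEGIN { printf "%.1f\n", (lambda * s - log(k)) / log(2) }'
}

# Raw score 1321 with the gapped parameters from the report (Lambda 1.28, K 0.460).
bit_score 1321 1.28 0.460
```

This prints 2440.5, in line with the "2440 bits (1321)" shown for the self-hit; the small difference is expected because the report prints Lambda and K rounded to a few digits.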