BLAST
Learning resources
- BLAST
- Download BLAST
- BLAST Command Line Applications User Manual
- BLAST Help
- A High-scoring Segment Pair (HSP) is a local alignment with no gaps that achieves one of the highest alignment scores in a given search.
- Each protein was assigned to a KO by the highest scoring annotated hit(s) containing at least one high-scoring segment pair (HSP) scoring over 60 bits. (S2405-4712(16)30323-4)
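Types of BLAST
BLAST stands for Basic Local Alignment Search Tool. It finds regions of similarity between biological sequences.
BLASTp searches a protein sequence database with a protein query, and BLASTn searches a nucleotide sequence database with a nucleotide query. These are the two most direct and most commonly used kinds of BLAST.
BLASTx translates the nucleotide query in all six reading frames and searches a protein sequence database with the resulting protein sequences. Why six frames? When the translation start site is unknown, translation may begin at the first base and proceed three bases at a time, or it may begin at the second base, or at the third. It may also begin on the complementary strand, which gives three more possible start positions, for six candidate protein sequences in total. Some of these correspond to real proteins and some do not, but we cannot tell which in advance, so all six are searched against the database.
The next question is why, given a nucleotide sequence, one would search a protein database instead of running BLASTn against a nucleotide database. This is useful when the nucleotide database contains nothing similar to your sequence, or when the similar sequences it does contain carry no meaningful annotation. In that case you can try a protein database and see whether the translation products of your sequence match proteins with useful annotation. Put differently, if you are not looking for nucleotide sequences similar to your query but for proteins similar to the protein it encodes, use BLASTx.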
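The six-frame idea described above can be sketched in a few lines of shell and awk. This is only an illustration, not how BLASTx is implemented: the helper name `six_frames` is made up for this example, the standard genetic code is assumed, and any codon containing a base other than A, C, G, or T is printed as `X`.

```shell
#!/bin/sh
# Print the six reading-frame translations of one DNA sequence.
# six_frames is a hypothetical helper, not part of BLAST+.
six_frames() {
  printf '%s\n' "$1" | awk '
    BEGIN {
      bases = "TCAG"
      # Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
      aa = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
    }
    # Reverse complement; unknown bases become N.
    function revcomp(s,   i, c, comp, out) {
      out = ""
      for (i = length(s); i >= 1; i--) {
        c = index("ACGT", substr(s, i, 1))
        comp = c ? substr("TGCA", c, 1) : "N"
        out = out comp
      }
      return out
    }
    # Translate s starting at the given offset (0, 1, or 2).
    function translate(s, offset,   i, a, b, c, res, out) {
      out = ""
      for (i = 1 + offset; i + 2 <= length(s); i += 3) {
        a = index(bases, substr(s, i, 1))
        b = index(bases, substr(s, i + 1, 1))
        c = index(bases, substr(s, i + 2, 1))
        res = (a && b && c) ? substr(aa, (a - 1) * 16 + (b - 1) * 4 + c, 1) : "X"
        out = out res
      }
      return out
    }
    {
      seq = toupper($0)
      rc = revcomp(seq)
      for (f = 0; f < 3; f++) printf "+%d %s\n", f + 1, translate(seq, f)
      for (f = 0; f < 3; f++) printf "-%d %s\n", f + 1, translate(rc, f)
    }'
}

six_frames ATGGCCAAGTAA
```

Each output line is one frame: `+1` to `+3` are read from the given strand, and `-1` to `-3` from its reverse complement. A stop codon is printed as `*`.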
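Conversely, when you are not looking for proteins similar to the protein you have, but for nucleotide sequences similar to the one that encodes it, use tBLASTn. tBLASTn searches a nucleotide database with a protein query; every nucleotide sequence in the database is translated in all six reading frames before it is searched. You might ask: don't nucleotide databases already annotate which protein each sequence encodes, so why translate all six frames? Because annotation covers only what has been annotated so far, and even an annotated entry reflects only what has already been worked out. Remember that genes can overlap: the annotation may say that a stretch of DNA encodes one protein while an undiscovered gene also uses that stretch, and your protein query may be the product of that undiscovered gene. Finding it requires translating every possible product of the nucleotide sequence.
tBLASTx takes this approach to its limit. It translates the nucleotide query in all six reading frames and searches a nucleotide database whose sequences are likewise translated in all six frames. tBLASTx can therefore find similarities that BLASTn misses.
These three translated searches (BLASTx, tBLASTn, and tBLASTx) are mainly used for newly discovered sequences. For sequences that are already well characterized, BLASTp and BLASTn are usually enough. Figure 1 is a schematic of the BLAST variants that makes it easier to remember which kind of query each one searches against which kind of database.
Besides being classified by what is searched, BLAST can also be classified by search algorithm into standard BLAST, PSI-BLAST, PHI-BLAST, and others.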
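How BLAST works
The basic principle of BLAST is simple and rests on the concept of a segment pair: a pair of equal-length subsequences, one from each of the two given sequences, that can be aligned to each other without gaps. The two boxes in Figure A are segment pairs. BLAST scans the two sequences from end to end, finds the segment pairs, extends each one for as long as the score stays within an allowed threshold, and finally reports the high-scoring segment pairs (HSPs) (Figure B). The cost of this procedure grows roughly linearly with n, the sequence length. A full pairwise alignment, by contrast, fills an n-by-n table, so its cost grows as n squared. Finding HSPs is therefore much faster than full pairwise alignment, at the price of some accuracy.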
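The seed-and-extend idea can be tried out on two short sequences. The sketch below is a toy under stated assumptions, not BLAST's actual implementation: `seed_extend` is a made-up helper name, the scoring (+1 match, -2 mismatch), the word size (4), and the drop-off (3) are illustrative defaults, and seeds are found with a plain nested loop, so this toy does not have the near-linear cost described above.

```shell
#!/bin/sh
# Toy seed-and-extend. seed_extend is a hypothetical helper; seeds are
# found here with a plain nested loop purely for illustration.
# Arguments: query, subject, word size k (default 4), drop-off (default 3).
seed_extend() {
  awk -v q="$1" -v s="$2" -v k="${3:-4}" -v drop="${4:-3}" '
    # Extend without gaps from (qi, si) in direction step (+1 or -1).
    # Scores +1 per match and -2 per mismatch, and stops once the running
    # score falls more than "drop" below the best score seen.
    # Returns the best score and leaves its length in ext_len.
    function extend(qi, si, step,   score, best, n) {
      score = 0; best = 0; n = 0; ext_len = 0
      while (qi >= 1 && si >= 1 && qi <= length(q) && si <= length(s)) {
        score += ((substr(q, qi, 1) == substr(s, si, 1)) ? 1 : -2)
        n++
        if (score > best) { best = score; ext_len = n }
        if (best - score > drop) break
        qi += step; si += step
      }
      return best
    }
    BEGIN {
      for (i = 1; i + k - 1 <= length(q); i++)
        for (j = 1; j + k - 1 <= length(s); j++) {
          if (substr(q, i, k) != substr(s, j, k)) continue
          right = extend(i + k, j + k, 1);  rlen = ext_len
          left  = extend(i - 1, j - 1, -1); llen = ext_len
          key = (i - llen) "," (j - llen) "," (k + llen + rlen)
          if (key in seen) continue   # same segment pair reached from another seed
          seen[key] = 1
          printf "q=%d s=%d len=%d score=%d\n", i - llen, j - llen, k + llen + rlen, k + left + right
        }
    }'
}

seed_extend GGATCCATGCAA TTATCCATGCTT
```

Each output line reports one extended segment pair: its 1-based start in the query and in the subject, its length, and its score. Several seeds on the same diagonal usually extend to the same segment pair, which is why duplicates are filtered.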
Database searching with DNA and protein sequences: An introduction
A small example
Download the database
mkdir blastdb && cd blastdb
update_blastdb.pl --passive --decompress 16S_ribosomal_RNA
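Run blastdbcmd
Extract the sequence NR_025000 from the installed database (16S_ribosomal_RNA) into a text file (16S_query.fa). The paths below assume the commands are run from the directory that contains blastdb/.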
blastdbcmd -db blastdb/16S_ribosomal_RNA -entry nr_025000 -out 16S_query.fa
head 16S_query.fa
>NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal RNA, partial sequence
GTGCTTAACACATGCAAGTCGAACGGAAAGGCCCCTTCGGGGGTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGA
TCTACCCTGCACTTCGGGATAAGCCTGGGAAACTGGGTCTAATACCGGATAGGACCATGAGATGCATGTCTTATGGTGGA
AAGCTTTTGCGGTGTGGGATGGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTGACGGCCTACCAAGGCGACGACGGGTAG
Run blastn
Query the database blastdb/16S_ribosomal_RNA with 16S_query.fa:
blastn \
-db blastdb/16S_ribosomal_RNA \
-query 16S_query.fa \
-task blastn \
-dust no \
-outfmt "7 delim=, qacc sacc evalue bitscore qcovus pident" \
-max_target_seqs 5
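- -task blastn: selects the algorithm. The choices are blastn, blastn-short, dc-megablast, megablast (the default), and rmblastn.
- -dust no: turns off DUST filtering of the query sequence.
- -outfmt "7 delim=, ...": custom tabular output with comment lines, using a comma as the delimiter.
- -max_target_seqs 5: keeps at most 5 aligned sequences.
- No -out option is given, so the results are printed directly to the console.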
# BLASTN 2.13.0+
# Query: NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal RNA, partial sequence
# Database: blastdb/16S_ribosomal_RNA
# Fields: query acc., subject acc., evalue, bit score, % query coverage per uniq subject, % identity
# 5 hits found
NR_025000.1,NR_025000,0.0,2383,100,100.000
NR_025000.1,NR_028940,0.0,2334,100,99.243
NR_025000.1,NR_125568,0.0,2320,100,98.940
NR_025000.1,NR_118110,0.0,2302,100,98.637
NR_025000.1,NR_117220,0.0,2302,100,98.637
# BLAST processed 1 queries
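Comma-delimited tabular output like this is easy to post-process with awk. As a small sketch, the hypothetical helper `filter_hits` below keeps only hits at or above a percent-identity cutoff. It assumes the six-column layout requested above, where % identity is column 6, and it skips the `#` comment lines. The sample rows are copied from the output above.

```shell
#!/bin/sh
# Keep hits at or above a percent-identity cutoff from comma-delimited
# "-outfmt 7" output. filter_hits is a hypothetical helper; it assumes
# column 2 is the subject accession and column 6 is % identity.
filter_hits() {
  awk -F, -v min="$1" '!/^#/ && $6 + 0 >= min + 0 { print $2, $6 }'
}

filter_hits 99 <<'EOF'
# Fields: query acc., subject acc., evalue, bit score, % query coverage per uniq subject, % identity
NR_025000.1,NR_025000,0.0,2383,100,100.000
NR_025000.1,NR_028940,0.0,2334,100,99.243
NR_025000.1,NR_125568,0.0,2320,100,98.940
EOF
```

With real data the heredoc would be replaced by a pipe from blastn or by a file written with -out.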
Output format options
-outfmt <String>
alignment view options:
0 = Pairwise,
1 = Query-anchored showing identities,
2 = Query-anchored no identities,
3 = Flat query-anchored showing identities,
4 = Flat query-anchored no identities,
5 = BLAST XML,
6 = Tabular,
7 = Tabular with comment lines,
8 = Seqalign (Text ASN.1),
9 = Seqalign (Binary ASN.1),
10 = Comma-separated values,
11 = BLAST archive (ASN.1),
12 = Seqalign (JSON),
13 = Multiple-file BLAST JSON,
14 = Multiple-file BLAST XML2,
15 = Single-file BLAST JSON,
16 = Single-file BLAST XML2,
17 = Sequence Alignment/Map (SAM),
18 = Organism Report
Options 6, 7, 10 and 17 can be additionally configured to produce
a custom format specified by space delimited format specifiers,
or in the case of options 6, 7, and 10, by a token specified
by the delim keyword. E.g.: "17 delim=@ qacc sacc score".
The delim keyword must appear after the numeric output format
specification.
The supported format specifiers for options 6, 7 and 10 are:
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accesion
qaccver means Query accesion.version
qlen means Query sequence length
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
sgi means Subject GI
sallgi means All subject GIs
sacc means Subject accession
saccver means Subject accession.version
sallacc means All subject accessions
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
sseq means Aligned part of subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive-scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive-scoring matches
frames means Query and subject frames separated by a '/'
qframe means Query frame
sframe means Subject frame
btop means Blast traceback operations (BTOP)
staxid means Subject Taxonomy ID
ssciname means Subject Scientific Name
scomname means Subject Common Name
sblastname means Subject Blast Name
sskingdom means Subject Super Kingdom
staxids means unique Subject Taxonomy ID(s), separated by a ';'
(in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
scomnames means unique Subject Common Name(s), separated by a ';'
sblastnames means unique Subject Blast Name(s), separated by a ';'
(in alphabetical order)
sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
(in alphabetical order)
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
sstrand means Subject Strand
qcovs means Query Coverage Per Subject
qcovhsp means Query Coverage Per HSP
qcovus means Query Coverage Per Unique Subject (blastn only)
When not provided, the default value is:
'qaccver saccver pident length mismatch gapopen qstart qend sstart send
evalue bitscore', which is equivalent to the keyword 'std'
The supported format specifier for option 17 is:
SQ means Include Sequence Data
SR means Subject as Reference Seq
Default = `0'
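With no output format option, the default blastn output has three parts: a header with a one-line summary of each hit, the pairwise alignments, and the search statistics.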
BLASTN 2.13.0+
Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.
Database: 16S ribosomal RNA (Bacteria and Archaea type strains)
26,807 sequences; 38,858,731 total letters
Query= NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal
RNA, partial sequence
Length=1321
Score E
Sequences producing significant alignments: (Bits) Value
NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal... 2440 0.0
NR_028940.1 Mycobacterium palustre strain E846 16S ribosomal RNA,... 2383 0.0
NR_125568.1 Mycobacterium europaeum strain DSM 45397 16S ribosoma... 2362 0.0
NR_113062.1 Mycobacterium simiae strain ATCC 25275 16S ribosomal ... 2346 0.0
NR_117227.1 Mycobacterium simiae strain ATCC 25275 16S ribosomal ... 2346 0.0
>NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal RNA, partial
sequence
Length=1321
Score = 2440 bits (1321), Expect = 0.0
Identities = 1321/1321 (100%), Gaps = 0/1321 (0%)
Strand=Plus/Plus
Query 1 GTGCTTAACACATGCAAGTCGAACGGAAAGGCCCCTTCGGGGGTACTCGAGTGGCGAACG 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1 GTGCTTAACACATGCAAGTCGAACGGAAAGGCCCCTTCGGGGGTACTCGAGTGGCGAACG 60
Query 61 GGTGAGTAACACGTGGGTGATCTACCCTGCACTTCGGGATAAGCCTGGGAAACTGGGTCT 120
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 61 GGTGAGTAACACGTGGGTGATCTACCCTGCACTTCGGGATAAGCCTGGGAAACTGGGTCT 120
Query 121 AATACCGGATAGGACCATGAGATGCATGTCTTATGGTGGAAAGCTTTTGCGGTGTGGGAT 180
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 121 AATACCGGATAGGACCATGAGATGCATGTCTTATGGTGGAAAGCTTTTGCGGTGTGGGAT 180
Query 181 GGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTGACGGCCTACCAAGGCGACGACGGGTAG 240
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 181 GGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTGACGGCCTACCAAGGCGACGACGGGTAG 240
Query 241 CCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATACGGCCCAGACTCCTACGGGA 300
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 241 CCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATACGGCCCAGACTCCTACGGGA 300
Query 301 GGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGG 360
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 301 GGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGG 360
Query 361 GATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGCAAGTGACGGTACC 420
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 361 GATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGCAAGTGACGGTACC 420
Query 421 TGCAGAAGAAGCACCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGC 480
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 421 TGCAGAAGAAGCACCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGC 480
Query 481 GTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTTGTTCGTGAAA 540
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 481 GTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTTGTTCGTGAAA 540
Query 541 ACCGGGGGCTTAACCCTCGGCGTGCGGGCGATACGGGCAGACTGGAGTACTGCAGGGGAG 600
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 541 ACCGGGGGCTTAACCCTCGGCGTGCGGGCGATACGGGCAGACTGGAGTACTGCAGGGGAG 600
Query 601 ACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAA 660
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 601 ACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAA 660
Query 661 GGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATT 720
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 661 GGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATT 720
Query 721 AGATACCCTGGTAGTCCACGCCGTAAACGGTGGGTACTAGGTGTGGGTTTCCTTCCTTGG 780
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 721 AGATACCCTGGTAGTCCACGCCGTAAACGGTGGGTACTAGGTGTGGGTTTCCTTCCTTGG 780
Query 781 GATCCGTGCCGTAGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGGCCGCAAGGCTAA 840
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 781 GATCCGTGCCGTAGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGGCCGCAAGGCTAA 840
Query 841 AACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGCGGAGCATGTGGATTAATTCGATGC 900
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 841 AACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGCGGAGCATGTGGATTAATTCGATGC 900
Query 901 AACGCGAAGAACCTTACCTGGGTTTGACATGCACAGGACGCGTCTAGAGATAGGCGTTCC 960
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 901 AACGCGAAGAACCTTACCTGGGTTTGACATGCACAGGACGCGTCTAGAGATAGGCGTTCC 960
Query 961 CTTGTGGCCTGTGTGCAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGG 1020
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 961 CTTGTGGCCTGTGTGCAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGG 1020
Query 1021 TTAAGTCCCGCAACGAGCGCAACCCTTGTCTCATGTTGCCAGCGGGTAATGCCGGGGACT 1080
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1021 TTAAGTCCCGCAACGAGCGCAACCCTTGTCTCATGTTGCCAGCGGGTAATGCCGGGGACT 1080
Query 1081 CGTGAGAGACTGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCC 1140
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1081 CGTGAGAGACTGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCC 1140
Query 1141 CCTTATGTCCAGGGCTTCACACATGCTACAATGGCCGGTACAAAGGGCTGCGATGCCGCG 1200
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1141 CCTTATGTCCAGGGCTTCACACATGCTACAATGGCCGGTACAAAGGGCTGCGATGCCGCG 1200
Query 1201 AGGTTAAGCGAATCCTTTTAAAGCCGGTCTCAGTTCGGATCGGGGTCTGCAACTCGACCC 1260
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1201 AGGTTAAGCGAATCCTTTTAAAGCCGGTCTCAGTTCGGATCGGGGTCTGCAACTCGACCC 1260
Query 1261 CGTGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGG 1320
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1261 CGTGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGG 1320
Query 1321 G 1321
|
Sbjct 1321 G 1321
Lambda K H
1.33 0.621 1.12
Gapped
Lambda K H
1.28 0.460 0.850
Effective search space used: 49492368576
Database: 16S ribosomal RNA (Bacteria and Archaea type strains)
Posted date: Feb 4, 2023 5:36 AM
Number of letters in database: 38,858,731
Number of sequences in database: 26,807
Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5
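The report above gives both a bit score and a raw score for the top hit (2440 bits, raw score 1321), together with the Lambda and K parameters. As a rough check, the standard Karlin-Altschul conversion, bit score = (Lambda x raw score - ln K) / ln 2, can be evaluated with awk. The helper name `bit_score` is made up for this sketch, and the choice of the gapped Lambda and K is an assumption; the printed parameters are rounded, so a small difference from the reported value is expected.

```shell
#!/bin/sh
# Convert a raw alignment score to a bit score with the Karlin-Altschul
# relation S' = (lambda * S - ln K) / ln 2.
# bit_score is a hypothetical helper, not a BLAST+ command.
# Arguments: raw score, Lambda, K.
bit_score() {
  awk -v s="$1" -v lambda="$2" -v k="$3" \
    'BEGIN { printf "%.1f\n", (lambda * s - log(k)) / log(2) }'
}

# Raw score 1321 with the gapped Lambda and K printed in the report.
bit_score 1321 1.28 0.460
```

The result lands within one bit of the 2440 bits shown in the report.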