Study materials
BLAST stands for Basic Local Alignment Search Tool. It finds regions of similarity between biological sequences.
blast
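Conversely, when you are not looking for proteins similar to the protein sequence you have in hand, but for nucleotide sequences similar to the one that encodes it, you need tBLASTn. tBLASTn searches a nucleotide database with a protein query: every nucleotide sequence in the database is translated in all six reading frames, and the resulting protein sequences are what get searched. You may ask: doesn't the nucleotide database already annotate which protein a given sequence encodes? Why translate all six possible frames and search every one of them? Because what you see is only what has been annotated, and plenty has not been. Even the existing annotations reflect only what has been worked out so far. Remember that genes can overlap: the annotation may say that a stretch of DNA encodes one protein, yet an undiscovered gene may use the same stretch, and your query protein may be exactly the product of that undiscovered gene. The only way to find it is to translate every possible reading frame of the nucleotide sequences before searching.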
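To make the "six reading frames" idea concrete, here is a minimal Python sketch of six-frame translation using the standard genetic code. It only illustrates the concept and is not how BLAST implements it; the function names and the frame labels (`+1` ... `-3`) are assumptions chosen for this example.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate three frames on the given strand and three on its reverse complement."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

A protein query is then compared against all six of these translations for every database sequence, which is why a coding region can be found even when it was never annotated.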
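Pushed to its limit, this approach becomes tBLASTx. It translates the nucleotide query in all six reading frames and searches a nucleotide database whose sequences are likewise translated in all six reading frames. Similarities that BLASTn cannot find may therefore still be found with tBLASTx.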
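These three translate-then-search variants (BLASTx, tBLASTn, and tBLASTx) are mainly used for newly discovered sequences. For sequences that have already been studied thoroughly, the first two kinds of BLAST (BLASTn and BLASTp) are enough. Figure 1 is a schematic of the BLAST variants that makes it easier to remember which kind of query each one searches against which kind of database.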
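The query/database pairing of the five programs can also be written down as a small lookup. The following Python sketch is only a memory aid; the function name `choose_program` and its `translate_both` flag are hypothetical and not part of any BLAST tool.

```python
def choose_program(query_type, db_type, translate_both=False):
    """Pick a BLAST program from the query and database sequence types.

    query_type and db_type are "nucleotide" or "protein".
    translate_both only matters for nucleotide vs. nucleotide searches:
    True selects tblastx (compare six-frame translations of both sides),
    False selects plain blastn.
    """
    if query_type == "protein" and db_type == "protein":
        return "blastp"
    if query_type == "nucleotide" and db_type == "protein":
        return "blastx"   # query is translated in six frames
    if query_type == "protein" and db_type == "nucleotide":
        return "tblastn"  # database is translated in six frames
    if query_type == "nucleotide" and db_type == "nucleotide":
        return "tblastx" if translate_both else "blastn"
    raise ValueError(f"unknown sequence types: {query_type!r}, {db_type!r}")


print(choose_program("protein", "nucleotide"))
print(choose_program("nucleotide", "nucleotide", translate_both=True))
```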
Besides being classified by what is searched, BLAST can also be classified by search algorithm into standard BLAST, PSI-BLAST, PHI-BLAST, and others.
Database searching with DNA and protein sequences: An introduction
Download the database
mkdir blastdb && cd blastdb
update_blastdb.pl --passive --decompress 16S_ribosomal_RNA
cd ..
Use blastdbcmd to extract the sequence NR_025000 from the installed database (16S_ribosomal_RNA) into a text file (16S_query.fa)
blastdbcmd -db blastdb/16S_ribosomal_RNA -entry nr_025000 -out 16S_query.fa
head 16S_query.fa
>NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal RNA, partial sequence
GTGCTTAACACATGCAAGTCGAACGGAAAGGCCCCTTCGGGGGTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGA
TCTACCCTGCACTTCGGGATAAGCCTGGGAAACTGGGTCTAATACCGGATAGGACCATGAGATGCATGTCTTATGGTGGA
AAGCTTTTGCGGTGTGGGATGGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTGACGGCCTACCAAGGCGACGACGGGTAG
Run blastn with 16S_query.fa as the query against the database blastdb/16S_ribosomal_RNA
blastn \
  -db blastdb/16S_ribosomal_RNA \
  -query 16S_query.fa \
  -task blastn \
  -dust no \
  -outfmt "7 delim=, qacc sacc evalue bitscore qcovus pident" \
  -max_target_seqs 5
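-task blastn: use the traditional blastn algorithm instead of the default megablast
-dust no: turn off DUST filtering of low-complexity regions in the query
-outfmt "7 delim=, ...": tabular output with comment lines, comma-separated, limited to the listed columns
-max_target_seqs 5: keep at most 5 aligned database sequences
-out: write the results to the named file (not used above, so the results are printed to the terminal)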
# BLASTN 2.13.0+
# Query: NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal RNA, partial sequence
# Database: blastdb/16S_ribosomal_RNA
# Fields: query acc., subject acc., evalue, bit score, % query coverage per uniq subject, % identity
# 5 hits found
NR_025000.1,NR_025000,0.0,2383,100,100.000
NR_025000.1,NR_028940,0.0,2334,100,99.243
NR_025000.1,NR_125568,0.0,2320,100,98.940
NR_025000.1,NR_118110,0.0,2302,100,98.637
NR_025000.1,NR_117220,0.0,2302,100,98.637
# BLAST processed 1 queries
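Because the comma-delimited format 7 output is plain text, it is easy to load into a script. The sketch below parses a result in Python, assuming the same six columns that were requested with -outfmt above; the function name `parse_blast_table` is an assumption for this example, and the sample text is an excerpt of the output shown above.

```python
import csv

# Column order must match the specifiers passed to -outfmt.
FIELDS = ["qacc", "sacc", "evalue", "bitscore", "qcovus", "pident"]
NUMERIC = ("evalue", "bitscore", "qcovus", "pident")


def parse_blast_table(text, fields=FIELDS, delimiter=","):
    """Parse delimited tabular BLAST output, skipping '#' comment lines."""
    data_lines = [
        line for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    hits = []
    for values in csv.reader(data_lines, delimiter=delimiter):
        row = dict(zip(fields, values))
        for key in NUMERIC:
            if key in row:
                row[key] = float(row[key])
        hits.append(row)
    return hits


SAMPLE = """# BLASTN 2.13.0+
# 5 hits found
NR_025000.1,NR_025000,0.0,2383,100,100.000
NR_025000.1,NR_028940,0.0,2334,100,99.243
# BLAST processed 1 queries
"""

hits = parse_blast_table(SAMPLE)
for hit in hits:
    print(hit["sacc"], hit["bitscore"], hit["pident"])
```

If the -outfmt column list changes, `FIELDS` has to be changed to match, since the data lines themselves carry no column names.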
-outfmt <String>
  alignment view options:
    0 = Pairwise,
    1 = Query-anchored showing identities,
    2 = Query-anchored no identities,
    3 = Flat query-anchored showing identities,
    4 = Flat query-anchored no identities,
    5 = BLAST XML,
    6 = Tabular,
    7 = Tabular with comment lines,
    8 = Seqalign (Text ASN.1),
    9 = Seqalign (Binary ASN.1),
    10 = Comma-separated values,
    11 = BLAST archive (ASN.1),
    12 = Seqalign (JSON),
    13 = Multiple-file BLAST JSON,
    14 = Multiple-file BLAST XML2,
    15 = Single-file BLAST JSON,
    16 = Single-file BLAST XML2,
    17 = Sequence Alignment/Map (SAM),
    18 = Organism Report

  Options 6, 7, 10 and 17 can be additionally configured to produce
  a custom format specified by space delimited format specifiers,
  or in the case of options 6, 7, and 10, by a token specified by the delim keyword.
  E.g.: "17 delim=@ qacc sacc score".
  The delim keyword must appear after the numeric output format specification.
  The supported format specifiers for options 6, 7 and 10 are:
    qseqid means Query Seq-id
    qgi means Query GI
    qacc means Query accesion
    qaccver means Query accesion.version
    qlen means Query sequence length
    sseqid means Subject Seq-id
    sallseqid means All subject Seq-id(s), separated by a ';'
    sgi means Subject GI
    sallgi means All subject GIs
    sacc means Subject accession
    saccver means Subject accession.version
    sallacc means All subject accessions
    slen means Subject sequence length
    qstart means Start of alignment in query
    qend means End of alignment in query
    sstart means Start of alignment in subject
    send means End of alignment in subject
    qseq means Aligned part of query sequence
    sseq means Aligned part of subject sequence
    evalue means Expect value
    bitscore means Bit score
    score means Raw score
    length means Alignment length
    pident means Percentage of identical matches
    nident means Number of identical matches
    mismatch means Number of mismatches
    positive means Number of positive-scoring matches
    gapopen means Number of gap openings
    gaps means Total number of gaps
    ppos means Percentage of positive-scoring matches
    frames means Query and subject frames separated by a '/'
    qframe means Query frame
    sframe means Subject frame
    btop means Blast traceback operations (BTOP)
    staxid means Subject Taxonomy ID
    ssciname means Subject Scientific Name
    scomname means Subject Common Name
    sblastname means Subject Blast Name
    sskingdom means Subject Super Kingdom
    staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
    sscinames means unique Subject Scientific Name(s), separated by a ';'
    scomnames means unique Subject Common Name(s), separated by a ';'
    sblastnames means unique Subject Blast Name(s), separated by a ';' (in alphabetical order)
    sskingdoms means unique Subject Super Kingdom(s), separated by a ';' (in alphabetical order)
    stitle means Subject Title
    salltitles means All Subject Title(s), separated by a '<>'
    sstrand means Subject Strand
    qcovs means Query Coverage Per Subject
    qcovhsp means Query Coverage Per HSP
    qcovus means Query Coverage Per Unique Subject (blastn only)
  When not provided, the default value is:
  'qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore',
  which is equivalent to the keyword 'std'
  The supported format specifier for option 17 is:
    SQ means Include Sequence Data
    SR means Subject as Reference Seq
  Default = `0'
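When blastn is run without any output-format options, the default output has three parts: a header with the list of hits, the pairwise alignments, and a footer with the search statistics.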
BLASTN 2.13.0+

Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb Miller (2000), "A greedy algorithm for aligning DNA sequences", J Comput Biol 2000; 7(1-2):203-14.

Database: 16S ribosomal RNA (Bacteria and Archaea type strains)
           26,807 sequences; 38,858,731 total letters

Query= NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal RNA, partial sequence

Length=1321
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal...  2440    0.0
NR_028940.1 Mycobacterium palustre strain E846 16S ribosomal RNA,...  2383    0.0
NR_125568.1 Mycobacterium europaeum strain DSM 45397 16S ribosoma...  2362    0.0
NR_113062.1 Mycobacterium simiae strain ATCC 25275 16S ribosomal ...  2346    0.0
NR_117227.1 Mycobacterium simiae strain ATCC 25275 16S ribosomal ...  2346    0.0
>NR_025000.1 Mycobacterium kubicae strain CDC 941078 16S ribosomal RNA, partial sequence
Length=1321

 Score = 2440 bits (1321),  Expect = 0.0
 Identities = 1321/1321 (100%), Gaps = 0/1321 (0%)
 Strand=Plus/Plus

Query  1     GTGCTTAACACATGCAAGTCGAACGGAAAGGCCCCTTCGGGGGTACTCGAGTGGCGAACG  60
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1     GTGCTTAACACATGCAAGTCGAACGGAAAGGCCCCTTCGGGGGTACTCGAGTGGCGAACG  60

Query  61    GGTGAGTAACACGTGGGTGATCTACCCTGCACTTCGGGATAAGCCTGGGAAACTGGGTCT  120
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  61    GGTGAGTAACACGTGGGTGATCTACCCTGCACTTCGGGATAAGCCTGGGAAACTGGGTCT  120

Query  121   AATACCGGATAGGACCATGAGATGCATGTCTTATGGTGGAAAGCTTTTGCGGTGTGGGAT  180
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  121   AATACCGGATAGGACCATGAGATGCATGTCTTATGGTGGAAAGCTTTTGCGGTGTGGGAT  180

Query  181   GGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTGACGGCCTACCAAGGCGACGACGGGTAG  240
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  181   GGGCCCGCGGCCTATCAGCTTGTTGGTGGGGTGACGGCCTACCAAGGCGACGACGGGTAG  240

Query  241   CCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATACGGCCCAGACTCCTACGGGA  300
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  241   CCGGCCTGAGAGGGTGTCCGGCCACACTGGGACTGAGATACGGCCCAGACTCCTACGGGA  300

Query  301   GGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGG  360
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  301   GGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGG  360

Query  361   GATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGCAAGTGACGGTACC  420
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  361   GATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGCAAGTGACGGTACC  420

Query  421   TGCAGAAGAAGCACCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGC  480
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  421   TGCAGAAGAAGCACCGGCCAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGC  480

Query  481   GTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTTGTTCGTGAAA  540
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  481   GTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTTGTTCGTGAAA  540

Query  541   ACCGGGGGCTTAACCCTCGGCGTGCGGGCGATACGGGCAGACTGGAGTACTGCAGGGGAG  600
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  541   ACCGGGGGCTTAACCCTCGGCGTGCGGGCGATACGGGCAGACTGGAGTACTGCAGGGGAG  600

Query  601   ACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAA  660
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  601   ACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAA  660

Query  661   GGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATT  720
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  661   GGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATT  720

Query  721   AGATACCCTGGTAGTCCACGCCGTAAACGGTGGGTACTAGGTGTGGGTTTCCTTCCTTGG  780
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  721   AGATACCCTGGTAGTCCACGCCGTAAACGGTGGGTACTAGGTGTGGGTTTCCTTCCTTGG  780

Query  781   GATCCGTGCCGTAGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGGCCGCAAGGCTAA  840
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  781   GATCCGTGCCGTAGCTAACGCATTAAGTACCCCGCCTGGGGAGTACGGCCGCAAGGCTAA  840

Query  841   AACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGCGGAGCATGTGGATTAATTCGATGC  900
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  841   AACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGCGGAGCATGTGGATTAATTCGATGC  900

Query  901   AACGCGAAGAACCTTACCTGGGTTTGACATGCACAGGACGCGTCTAGAGATAGGCGTTCC  960
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  901   AACGCGAAGAACCTTACCTGGGTTTGACATGCACAGGACGCGTCTAGAGATAGGCGTTCC  960

Query  961   CTTGTGGCCTGTGTGCAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGG  1020
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  961   CTTGTGGCCTGTGTGCAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGG  1020

Query  1021  TTAAGTCCCGCAACGAGCGCAACCCTTGTCTCATGTTGCCAGCGGGTAATGCCGGGGACT  1080
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1021  TTAAGTCCCGCAACGAGCGCAACCCTTGTCTCATGTTGCCAGCGGGTAATGCCGGGGACT  1080

Query  1081  CGTGAGAGACTGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCC  1140
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1081  CGTGAGAGACTGCCGGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCC  1140

Query  1141  CCTTATGTCCAGGGCTTCACACATGCTACAATGGCCGGTACAAAGGGCTGCGATGCCGCG  1200
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1141  CCTTATGTCCAGGGCTTCACACATGCTACAATGGCCGGTACAAAGGGCTGCGATGCCGCG  1200

Query  1201  AGGTTAAGCGAATCCTTTTAAAGCCGGTCTCAGTTCGGATCGGGGTCTGCAACTCGACCC  1260
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1201  AGGTTAAGCGAATCCTTTTAAAGCCGGTCTCAGTTCGGATCGGGGTCTGCAACTCGACCC  1260

Query  1261  CGTGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGG  1320
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1261  CGTGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGG  1320

Query  1321  G  1321
             |
Sbjct  1321  G  1321
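The "Identities" and "Gaps" figures in the alignment header are simple counts over the aligned Query/Sbjct strings. The Python sketch below reproduces that counting for a pair of aligned strings, assuming gaps are written as "-" as in BLAST's pairwise view; the function name `alignment_stats` is an assumption, and the second example is a made-up toy alignment, not BLAST output.

```python
def alignment_stats(query_aln, subject_aln):
    """Count identities and gap columns in two aligned strings of equal length."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have the same length")
    length = len(query_aln)
    columns = list(zip(query_aln, subject_aln))
    identities = sum(1 for q, s in columns if q == s and q != "-")
    gaps = sum(1 for q, s in columns if q == "-" or s == "-")
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "pident": 100.0 * identities / length if length else 0.0,
    }


# First 60 aligned columns of the self-hit shown above: every column matches.
block = "GTGCTTAACACATGCAAGTCGAACGGAAAGGCCCCTTCGGGGGTACTCGAGTGGCGAACG"
print(alignment_stats(block, block))

# Toy alignment with one mismatch and one gap (illustrative only).
toy = alignment_stats("ACGT-A", "ACCTTA")
print(toy)
```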
Lambda      K        H
    1.33    0.621     1.12

Gapped
Lambda      K        H
    1.28    0.460    0.850

Effective search space used: 49492368576

  Database: 16S ribosomal RNA (Bacteria and Archaea type strains)
    Posted date:  Feb 4, 2023  5:36 AM
  Number of letters in database: 38,858,731
  Number of sequences in database:  26,807

Matrix: blastn matrix 1 -2
Gap Penalties: Existence: 0, Extension: 2.5
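The Lambda and K values in this footer connect the raw score to the bit score and the E-value reported for each hit. The Python sketch below applies the standard Karlin-Altschul relations, S' = (lambda * S - ln K) / ln 2 and E = (search space) * 2^(-S'), to the numbers printed above. It assumes the gapped Lambda and K apply to this alignment and uses the rounded values from the footer, so treat the result as approximate; the function names are assumptions for this example.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score: (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, search_space):
    """Expected number of chance hits scoring at least `bits` in the search space."""
    return search_space * 2.0 ** (-bits)


# Values taken from the output above: raw score 1321, gapped Lambda 1.28, K 0.460.
bits = bit_score(1321, lam=1.28, k=0.460)
e_value = expect_value(bits, search_space=49492368576)

print(f"bit score ~ {bits:.1f}")  # close to the 2440 bits reported for the self-hit
print(f"E-value   = {e_value}")   # underflows to 0.0, matching "Expect = 0.0"
```

With a bit score this high the E-value is far below the smallest representable floating-point number, which is why both this sketch and BLAST show it as 0.0.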