Article: https://academic.oup.com/bioinformatics/article/39/1/btad041/6994180
Source code: https://github.com/RAHenriksen/NGSNGS
NGSNGS can simulate ancient and modern DNA data as either single-end (SE) or paired-end (PE) reads, and can store the reads in several output formats, including .fasta, .fastq and the Sequence Alignment/Map formats (.bam, .sam, .cram).
The simulated sequences are sampled (with replacement) from a reference DNA genome, which can represent a haploid genome, polyploid assemblies or even population haplotypes and allows the user to simulate known variable sites directly.
Docker build
FROM ubuntu
RUN apt-get update && apt-get install curl -y && apt-get clean
COPY amplicon /bin/amplicon
COPY ngsngs /bin/ngsngs
# docker run --rm -it ubuntu bash
# docker build -t wybioinfo/ngsngs .
Input file
$ zcat Test_Examples/Mycobacterium_leprae.fa.gz | head
>NZ_CP029543.1 Mycobacterium leprae strain MRHRU-235-G chromosome, complete genome
ATGTTTGTACCGCACGCCAAAAAGCCCGAAATTTACGAGAACCAGAGAGATACGTCGTTGGCCGATGACCTTAGTCTAGG
TTTCACCACGGTTTGGAACGCAGTCGTCTCCGAACTCAACGGCGAATCCAACACAGACGACGAAGCCACCAACGACAGCA
CCCTAGTCACTCCGCTAACTCCTCAGCAAAGAGCATGGCTAAATCTGGTTCAACCACTCACCATCATCGAGGGATTTGCT
CTTTTATCGGTGCCCAGCAGCTTTGTCCAAAATGAAATTGAACGTCATCTACGAACGCCAATCACCGATGCACTCAGCCG
TCGACTCGGACAACAGATACAGCTCGGAGTCCGTATCGCACCGCCCTCTACCGACCATATTGACGACAATTCCTCGTCAG
CCGACGTCCTTCTAACCGACGATTGCGGCACAGATACAGACGAAAATTACGGGGAGCCTCTTACAGGCGAGTACCAGGGT
TTGCCAACCTACTTCACCGAACGTCCGCACCATACCGAATCAACCGTCACGGGAGGTACCAGCCTTAATCGCCGTTACAC
CTTCGAAACGTTCGTTATTGGCGCGTCGAATCGGTTCGCGCATGCTGCCGCGCTAGCGATAGCCGAAGCACCGGCCCGAG
CCTACAACCCCCTTTTCATTTGGGGCGAGTCAGGTCTTGGCAAAACCCACCTATTGCACGCCGCCGGGAACTACGCACAA
Generate a single-end 150 bp FASTA file
docker run --rm -it \
  -v $PWD:$PWD -w $PWD \
  --user $(id -u):$(id -g) \
  wybioinfo/ngsngs \
  ngsngs \
  -i Test_Examples/Mycobacterium_leprae.fa.gz \
  -r 5 \
  -t 2 \
  -s 1 \
  -l 150 \
  -seq SE \
  -f fa \
  -o test2
View the results
$ head test2.fa
>T0_RID50_S0_NZ_CP029543.1:3101496-3101645_length:150_mod0000 F0 R1
AGATACGCGGCATCAGCTGGCAGTAAAACGCCATATGGCCATACGGCATTAGTCACCACCGGCGGCGGGCATCACCGAACTACAAGACCAAATCGGACGGCCACTGATTAGCTTCTTAGCTTCGTCCAGCAGTACCTGTCCCAGGAATTT
>T0_RID114_S0_NZ_CP029543.1:1247017-1247166_length:150_mod0000 F0 R1
CGCCGGGCTGCGGCTGTTGTTTCTCGTGCAGGGTTGCATTAGCGTGCGAGGCCGCACGCGGAGCAACCGAATTTCAGGCCCCTTTAAAGCAAATTAGGGCTTGTTCGTTGCCCCATGCGGCTGAGTGGTTCGATGTTCGTGTGTTCATCC
>T1_RID5_S1_NZ_CP029543.1:2689729-2689878_length:150_mod0000 F0 R1
CCTTACCAGACTGCGAGTTGACCCGGATCACCGCCTCGTAGGTGCGGCCGACGTCGCGCGGGTCGATCGGCAGGTACGGTACTTGCCACAGGATGTCGTCGACATCGGAGTCGGCGGCGTCCGCATCGATCTTCATCTGGTCCAGGCCCT
>T1_RID112_S0_NZ_CP029543.1:2730683-2730832_length:150_mod0000 F0 R1
GGCGCCCAGTCAACTTCGATTCCAGCCGACCAGAGTTCACCCAGTGCGCGCAGGAACGTGTCGTGGTCATCAACGTTCTGAAGCGGATGGCGCATGAGCCGAACAGCGCGGTGCCCACTCGACCACCTCGGGTGACGCATCGCCGAACCG
>T1_RID99_S1_NZ_CP029543.1:1857937-1858086_length:150_mod0000 F0 R1
TTTATGCACCTCGACCACTACCGCCGCGGGTATGGTAGCGACGCCCTGTAGACGCTGATCAACTGGCTGTTCATCGAAACAGACCGACCGTCGCATCACGATTGACCTAGCCTTTGGACAATGCCGGGGCTATTCAGTGTTACGAATCTG
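The read names in this output appear to encode where each read was sampled from: the contig name, a start-end interval and the read length (for the first read, 3101645 - 3101496 + 1 = 150). As a minimal sketch, assuming every name follows the pattern seen here, the hypothetical helper parse_read_name (not part of NGSNGS) pulls out those four fields with sed. The T, RID, S and mod fields are matched but not interpreted.

```shell
# Hypothetical helper: extract "contig start end length" from a read name
# of the form seen in the output above. Assumes that exact naming pattern.
parse_read_name() {
  local name="${1#[>@]}"   # drop a leading '>' (FASTA) or '@' (FASTQ)
  name="${name%% *}"       # keep only the first whitespace-delimited token
  # If the pattern does not match, sed prints the name unchanged.
  printf '%s\n' "$name" | sed -E \
    's/^T[0-9]+_RID[0-9]+_S[0-9]+_(.+):([0-9]+)-([0-9]+)_length:([0-9]+)_mod[0-9]+$/\1 \2 \3 \4/'
}

parse_read_name '>T0_RID50_S0_NZ_CP029543.1:3101496-3101645_length:150_mod0000 F0 R1'
```

For the first record this prints the contig NZ_CP029543.1 followed by 3101496, 3101645 and 150, which can then be compared against the reference or fed to other tools.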
Use -seq PE to generate paired-end FASTA files. The output is written as two files:
test2_R1.fa
test2_R2.fa
When the output format is one of <fq, fq.gz, sam, bam, cram>, quality profiles must be provided: -q1 for SE, and both -q1 and -q2 for PE.
docker run --rm -it \
  -v $PWD:$PWD -w $PWD \
  --user $(id -u):$(id -g) \
  wybioinfo/ngsngs \
  ngsngs \
  -i Test_Examples/Mycobacterium_leprae.fa.gz \
  -r 5 \
  -t 2 \
  -s 1 \
  -l 150 \
  -seq PE \
  -q1 Test_Examples/AccFreqL150R1.txt \
  -q2 Test_Examples/AccFreqL150R2.txt \
  -f fq \
  -o test2
Alternatively, use -qs 40: a fixed quality score for both read pairs in FASTQ or Sequence Alignment/Map formats. It overrides the quality profiles.
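With the common Phred+33 FASTQ encoding, a quality score q is stored as the ASCII character with code q + 33, so a fixed score of 40 would be written as the letter I at every position. The sketch below assumes NGSNGS writes Phred+33 qualities; phred_to_char is a hypothetical helper used only for illustration.

```shell
# Hypothetical helper: convert a Phred score to its FASTQ character,
# assuming the Phred+33 offset.
phred_to_char() {
  # Build an octal escape for ASCII code (score + 33) and print it.
  printf "\\$(printf '%03o' "$(( $1 + 33 ))")"
}

phred_to_char 40   # 40 + 33 = 73, the ASCII code of 'I'
echo
```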
@T0_RID50_S0_NZ_CP029543.1:3101496-3101645_length:150_mod0000 F0 R2
AAATTCCTGGGACAGGTACTGCTGGACGAAGCTAAGAAGCTAATCAGTGGCCGTCCGATTTGGTCTTGTAGTTCGGTGATGCCCGCCGCCGGTGGTGACTAATGCCGTATGGCCATATGGCGTTTTACTGCCAGCTGATGCCGCGTATCT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@T0_RID139_S1_NZ_CP029543.1:703272-703421_length:150_mod0000 F0 R2
GCCGCGATGGGTGCGCTGGAGATCGATCCGGTTTTCGCCGAACGCAGTGTCAACGAGGGCTTCTCCGGTGGCGAAAAGAAGCGCCATGAGATCCTGCAACTGGAACTGCTCAAGCCTAAAATCGCCATCTTAGACGAGACCGATTCCGGG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
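In these records, the R2 read T0_RID50 carries the same coordinates (3101496-3101645) as the single-end read T0_RID50 from the FASTA run, and its sequence looks like the reverse complement of that read. The sketch below is one way to check this by hand with the standard rev and tr utilities. revcomp is a hypothetical helper name, and whether this relationship holds for NGSNGS paired-end output in general is an assumption drawn only from this example, where the read length equals the 150 bp sampled interval.

```shell
# Hypothetical helper: reverse-complement a DNA string (A<->T, C<->G).
# Characters other than ACGT/acgt (for example N) are left as they are.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# First 11 bases of the single-end read T0_RID50 shown earlier.
revcomp AGATACGCGGC
```

This prints GCCGCGTATCT, which matches the last 11 bases of the R2 read T0_RID50 above.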