Article: https://academic.oup.com/bioinformatics/article/39/1/btad041/6994180
Source code: https://github.com/RAHenriksen/NGSNGS
NGSNGS can simulate ancient and modern DNA data as either single-end (SE) or paired-end (PE) reads, and can store the reads in several output formats, including .fasta, .fastq and the Sequence Alignment/Map formats (.bam, .sam, .cram).
The simulated sequences are sampled (with replacement) from a reference genome, which can represent a haploid genome, a polyploid assembly or even population haplotypes, so the user can simulate known variable sites directly.
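To make "sampled with replacement" concrete, here is a minimal Python sketch. It is not NGSNGS's implementation: the function name `sample_reads`, the uniform choice of start positions and the fixed read length are all assumptions made for illustration.

```python
import random


def sample_reads(reference, read_length, n_reads, seed=1):
    """Draw n_reads fixed-length reads from reference, with replacement.

    Assumption: start positions are uniform over all valid offsets.
    The real simulator may use a different sampling scheme.
    """
    last_start = len(reference) - read_length
    if last_start < 0:
        raise ValueError("read_length exceeds the reference length")
    rng = random.Random(seed)  # seeded so that runs are reproducible
    reads = []
    for _ in range(n_reads):
        # randint is inclusive on both ends; the same start may be drawn
        # more than once, which is what "with replacement" means here.
        start = rng.randint(0, last_start)
        reads.append((start, reference[start:start + read_length]))
    return reads


if __name__ == "__main__":
    toy_reference = "ATGTTTGTACCGCACGCCAAAAAGCCCGAAATTTACGAGAACC"
    for start, seq in sample_reads(toy_reference, read_length=10, n_reads=5):
        print(start, seq)
```

Because every draw is independent, two reads may overlap or even be identical, just as several simulated reads can cover the same region of the reference.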
Building the Docker image
FROM ubuntu
RUN apt-get update && apt-get install curl -y && apt-get clean
COPY amplicon /bin/amplicon
COPY ngsngs /bin/ngsngs
# docker run --rm -it ubuntu bash
# docker build -t wybioinfo/ngsngs .
Input file
$ zcat Test_Examples/Mycobacterium_leprae.fa.gz | head
>NZ_CP029543.1 Mycobacterium leprae strain MRHRU-235-G chromosome, complete genome
ATGTTTGTACCGCACGCCAAAAAGCCCGAAATTTACGAGAACCAGAGAGATACGTCGTTGGCCGATGACCTTAGTCTAGG
TTTCACCACGGTTTGGAACGCAGTCGTCTCCGAACTCAACGGCGAATCCAACACAGACGACGAAGCCACCAACGACAGCA
CCCTAGTCACTCCGCTAACTCCTCAGCAAAGAGCATGGCTAAATCTGGTTCAACCACTCACCATCATCGAGGGATTTGCT
CTTTTATCGGTGCCCAGCAGCTTTGTCCAAAATGAAATTGAACGTCATCTACGAACGCCAATCACCGATGCACTCAGCCG
TCGACTCGGACAACAGATACAGCTCGGAGTCCGTATCGCACCGCCCTCTACCGACCATATTGACGACAATTCCTCGTCAG
CCGACGTCCTTCTAACCGACGATTGCGGCACAGATACAGACGAAAATTACGGGGAGCCTCTTACAGGCGAGTACCAGGGT
TTGCCAACCTACTTCACCGAACGTCCGCACCATACCGAATCAACCGTCACGGGAGGTACCAGCCTTAATCGCCGTTACAC
CTTCGAAACGTTCGTTATTGGCGCGTCGAATCGGTTCGCGCATGCTGCCGCGCTAGCGATAGCCGAAGCACCGGCCCGAG
CCTACAACCCCCTTTTCATTTGGGGCGAGTCAGGTCTTGGCAAAACCCACCTATTGCACGCCGCCGGGAACTACGCACAA
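Before simulating, it can help to confirm which sequences the reference contains and how long they are. The sketch below uses only the Python standard library; the helper names `fasta_lengths` and `fasta_lengths_gz` are hypothetical and not part of NGSNGS.

```python
import gzip


def fasta_lengths(lines):
    """Return {sequence_name: length} for FASTA text given as lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first whitespace-delimited token of the header.
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def fasta_lengths_gz(path):
    """Same as fasta_lengths, reading a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return fasta_lengths(handle)
```

For the file shown above, the call would be `fasta_lengths_gz("Test_Examples/Mycobacterium_leprae.fa.gz")`.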
Generating single-end 150 bp reads as a FASTA file
docker run --rm -it \
-v $PWD:$PWD -w $PWD \
--user $(id -u):$(id -g) \
wybioinfo/ngsngs \
ngsngs \
-i Test_Examples/Mycobacterium_leprae.fa.gz \
-r 5 \
-t 2 \
-s 1 \
-l 150 \
-seq SE \
-f fa \
-o test2
Inspecting the output
$ head test2.fa
>T0_RID50_S0_NZ_CP029543.1:3101496-3101645_length:150_mod0000 F0 R1
AGATACGCGGCATCAGCTGGCAGTAAAACGCCATATGGCCATACGGCATTAGTCACCACCGGCGGCGGGCATCACCGAACTACAAGACCAAATCGGACGGCCACTGATTAGCTTCTTAGCTTCGTCCAGCAGTACCTGTCCCAGGAATTT
>T0_RID114_S0_NZ_CP029543.1:1247017-1247166_length:150_mod0000 F0 R1
CGCCGGGCTGCGGCTGTTGTTTCTCGTGCAGGGTTGCATTAGCGTGCGAGGCCGCACGCGGAGCAACCGAATTTCAGGCCCCTTTAAAGCAAATTAGGGCTTGTTCGTTGCCCCATGCGGCTGAGTGGTTCGATGTTCGTGTGTTCATCC
>T1_RID5_S1_NZ_CP029543.1:2689729-2689878_length:150_mod0000 F0 R1
CCTTACCAGACTGCGAGTTGACCCGGATCACCGCCTCGTAGGTGCGGCCGACGTCGCGCGGGTCGATCGGCAGGTACGGTACTTGCCACAGGATGTCGTCGACATCGGAGTCGGCGGCGTCCGCATCGATCTTCATCTGGTCCAGGCCCT
>T1_RID112_S0_NZ_CP029543.1:2730683-2730832_length:150_mod0000 F0 R1
GGCGCCCAGTCAACTTCGATTCCAGCCGACCAGAGTTCACCCAGTGCGCGCAGGAACGTGTCGTGGTCATCAACGTTCTGAAGCGGATGGCGCATGAGCCGAACAGCGCGGTGCCCACTCGACCACCTCGGGTGACGCATCGCCGAACCG
>T1_RID99_S1_NZ_CP029543.1:1857937-1858086_length:150_mod0000 F0 R1
TTTATGCACCTCGACCACTACCGCCGCGGGTATGGTAGCGACGCCCTGTAGACGCTGATCAACTGGCTGTTCATCGAAACAGACCGACCGTCGCATCACGATTGACCTAGCCTTTGGACAATGCCGGGGCTATTCAGTGTTACGAATCTG
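Each read name records where the read came from, which makes it possible to check reads against the reference. The sketch below parses the names shown above. The field meanings are my reading of the pattern (T as thread, RID as read ID, S as strand, then contig:start-end, fragment length and a modification code) and should be checked against the NGSNGS documentation; `parse_read_name` is a hypothetical helper.

```python
import re
from typing import NamedTuple


class ReadOrigin(NamedTuple):
    thread: int    # assumed meaning of the T field
    read_id: int   # assumed meaning of the RID field
    strand: int    # assumed meaning of the S field
    contig: str
    start: int
    end: int
    length: int
    mod: str       # kept as text, e.g. "0000"


# The contig name may itself contain underscores (NZ_CP029543.1), so it is
# matched as "everything up to the colon that precedes start-end".
_NAME = re.compile(
    r"^[>@]?T(\d+)_RID(\d+)_S(\d+)_(\S+?):(\d+)-(\d+)_length:(\d+)_mod(\d+)"
)


def parse_read_name(header):
    """Parse a simulated read header line into a ReadOrigin."""
    match = _NAME.match(header)
    if match is None:
        raise ValueError(f"unrecognised read name: {header!r}")
    t, rid, strand, contig, start, end, length, mod = match.groups()
    return ReadOrigin(int(t), int(rid), int(strand), contig,
                      int(start), int(end), int(length), mod)
```

In the output above the coordinates are inclusive, so `end - start + 1` equals the value in the length field (for example, 3101645 - 3101496 + 1 = 150).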
With -seq PE, a pair of FASTA files is generated instead:
test2_R1.fa
test2_R2.fa
When the output format is one of fq, fq.gz, sam, bam or cram,
quality profiles must be supplied: -q1 for SE, and both -q1 and -q2 for PE.
docker run --rm -it \
-v $PWD:$PWD -w $PWD \
--user $(id -u):$(id -g) \
wybioinfo/ngsngs \
ngsngs \
-i Test_Examples/Mycobacterium_leprae.fa.gz \
-r 5 \
-t 2 \
-s 1 \
-l 150 \
-seq PE \
-q1 Test_Examples/AccFreqL150R1.txt \
-q2 Test_Examples/AccFreqL150R2.txt \
-f fq \
-o test2
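Alternatively, use -qs 40, which the help text describes as:
Fixed quality score, for both read pairs in fastq or sequence alignment map formats. It overwrites the quality profiles.
With a fixed score of 40, every base in the R2 output has quality character I: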
@T0_RID50_S0_NZ_CP029543.1:3101496-3101645_length:150_mod0000 F0 R2
AAATTCCTGGGACAGGTACTGCTGGACGAAGCTAAGAAGCTAATCAGTGGCCGTCCGATTTGGTCTTGTAGTTCGGTGATGCCCGCCGCCGGTGGTGACTAATGCCGTATGGCCATATGGCGTTTTACTGCCAGCTGATGCCGCGTATCT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@T0_RID139_S1_NZ_CP029543.1:703272-703421_length:150_mod0000 F0 R2
GCCGCGATGGGTGCGCTGGAGATCGATCCGGTTTTCGCCGAACGCAGTGTCAACGAGGGCTTCTCCGGTGGCGAAAAGAAGCGCCATGAGATCCTGCAACTGGAACTGCTCAAGCCTAAAATCGCCATCTTAGACGAGACCGATTCCGGG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
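The rows of `I` characters are the fixed quality score written in FASTQ encoding. The sketch below assumes the standard Phred+33 offset, which is consistent with `-qs 40` producing `I` (ASCII 73). The helper names are hypothetical.

```python
PHRED_OFFSET = 33  # assumption: standard Sanger / Phred+33 encoding


def phred33_scores(quality):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(char) - PHRED_OFFSET for char in quality]


def fixed_quality_string(score, length):
    """Build the quality string for a constant Phred score."""
    return chr(score + PHRED_OFFSET) * length


def error_probability(score):
    """Convert a Phred score Q to an error probability, 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


if __name__ == "__main__":
    print(set(phred33_scores("IIII")))       # the only score present is 40
    print(fixed_quality_string(40, 10))      # ten 'I' characters
    print(error_probability(40))             # about 1 error in 10,000 bases
```

A fixed score of 40 therefore marks every simulated base as having a nominal error probability of 10^-4.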
Gffread can extract all transcripts from a genome and a GTF file.