NGSNGS

Paper: https://academic.oup.com/bioinformatics/article/39/1/btad041/6994180
Source code: https://github.com/RAHenriksen/NGSNGS

NGSNGS can simulate ancient and modern DNA data as either single-end (SE) or paired-end (PE) reads, and can store the reads in several output formats, including .fasta, .fastq and the Sequence Alignment/Map formats (.bam, .sam, .cram).

The simulated sequences are sampled (with replacement) from a reference DNA genome, which can represent a haploid genome, polyploid assemblies or even population haplotypes; this lets the user simulate known variable sites directly.
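
To make "sampled with replacement" concrete, here is a toy sketch in shell and awk. It is not how NGSNGS is implemented. The function name sample_reads, its arguments and the read naming are made up for illustration, and it assumes an uncompressed FASTA file holding a single sequence. Each read start is drawn independently and uniformly, so the same position can be drawn more than once.

```shell
sample_reads() {
    # $1: FASTA file (one sequence), $2: number of reads, $3: read length, $4: random seed
    awk -v n="$2" -v len="$3" -v seed="$4" '
        /^>/ { next }              # skip the header line
        { ref = ref $0 }           # join the wrapped sequence lines
        END {
            srand(seed)
            span = length(ref) - len + 1   # number of valid start positions
            if (span < 1) exit 1           # reference shorter than one read
            for (i = 0; i < n; i++) {
                # independent uniform draw: positions may repeat (with replacement)
                start = int(rand() * span) + 1
                printf ">read%d_%d-%d\n%s\n", i, start, start + len - 1, substr(ref, start, len)
            }
        }' "$1"
}

# Usage (hypothetical file name): sample_reads reference.fa 5 150 1
```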

Building the Docker image

FROM ubuntu
RUN apt-get update && apt-get install curl -y && apt-get clean
COPY amplicon /bin/amplicon
COPY ngsngs /bin/ngsngs

# docker run --rm -it  ubuntu bash
# docker build -t wybioinfo/ngsngs .

Input file

$ zcat Test_Examples/Mycobacterium_leprae.fa.gz | head
>NZ_CP029543.1 Mycobacterium leprae strain MRHRU-235-G chromosome, complete genome
ATGTTTGTACCGCACGCCAAAAAGCCCGAAATTTACGAGAACCAGAGAGATACGTCGTTGGCCGATGACCTTAGTCTAGG
TTTCACCACGGTTTGGAACGCAGTCGTCTCCGAACTCAACGGCGAATCCAACACAGACGACGAAGCCACCAACGACAGCA
CCCTAGTCACTCCGCTAACTCCTCAGCAAAGAGCATGGCTAAATCTGGTTCAACCACTCACCATCATCGAGGGATTTGCT
CTTTTATCGGTGCCCAGCAGCTTTGTCCAAAATGAAATTGAACGTCATCTACGAACGCCAATCACCGATGCACTCAGCCG
TCGACTCGGACAACAGATACAGCTCGGAGTCCGTATCGCACCGCCCTCTACCGACCATATTGACGACAATTCCTCGTCAG
CCGACGTCCTTCTAACCGACGATTGCGGCACAGATACAGACGAAAATTACGGGGAGCCTCTTACAGGCGAGTACCAGGGT
TTGCCAACCTACTTCACCGAACGTCCGCACCATACCGAATCAACCGTCACGGGAGGTACCAGCCTTAATCGCCGTTACAC
CTTCGAAACGTTCGTTATTGGCGCGTCGAATCGGTTCGCGCATGCTGCCGCGCTAGCGATAGCCGAAGCACCGGCCCGAG
CCTACAACCCCCTTTTCATTTGGGGCGAGTCAGGTCTTGGCAAAACCCACCTATTGCACGCCGCCGGGAACTACGCACAA

Generate single-end 150 bp reads in FASTA format

docker run --rm -it \
    -v $PWD:$PWD -w $PWD \
    --user $(id -u):$(id -g) \
    wybioinfo/ngsngs \
    ngsngs \
    -i Test_Examples/Mycobacterium_leprae.fa.gz  \
    -r 5 \
    -t 2 \
    -s 1 \
    -l 150 \
    -seq SE \
    -f fa \
    -o test2

Inspect the result

$ head test2.fa 
>T0_RID50_S0_NZ_CP029543.1:3101496-3101645_length:150_mod0000 F0 R1
AGATACGCGGCATCAGCTGGCAGTAAAACGCCATATGGCCATACGGCATTAGTCACCACCGGCGGCGGGCATCACCGAACTACAAGACCAAATCGGACGGCCACTGATTAGCTTCTTAGCTTCGTCCAGCAGTACCTGTCCCAGGAATTT
>T0_RID114_S0_NZ_CP029543.1:1247017-1247166_length:150_mod0000 F0 R1
CGCCGGGCTGCGGCTGTTGTTTCTCGTGCAGGGTTGCATTAGCGTGCGAGGCCGCACGCGGAGCAACCGAATTTCAGGCCCCTTTAAAGCAAATTAGGGCTTGTTCGTTGCCCCATGCGGCTGAGTGGTTCGATGTTCGTGTGTTCATCC
>T1_RID5_S1_NZ_CP029543.1:2689729-2689878_length:150_mod0000 F0 R1
CCTTACCAGACTGCGAGTTGACCCGGATCACCGCCTCGTAGGTGCGGCCGACGTCGCGCGGGTCGATCGGCAGGTACGGTACTTGCCACAGGATGTCGTCGACATCGGAGTCGGCGGCGTCCGCATCGATCTTCATCTGGTCCAGGCCCT
>T1_RID112_S0_NZ_CP029543.1:2730683-2730832_length:150_mod0000 F0 R1
GGCGCCCAGTCAACTTCGATTCCAGCCGACCAGAGTTCACCCAGTGCGCGCAGGAACGTGTCGTGGTCATCAACGTTCTGAAGCGGATGGCGCATGAGCCGAACAGCGCGGTGCCCACTCGACCACCTCGGGTGACGCATCGCCGAACCG
>T1_RID99_S1_NZ_CP029543.1:1857937-1858086_length:150_mod0000 F0 R1
TTTATGCACCTCGACCACTACCGCCGCGGGTATGGTAGCGACGCCCTGTAGACGCTGATCAACTGGCTGTTCATCGAAACAGACCGACCGTCGCATCACGATTGACCTAGCCTTTGGACAATGCCGGGGCTATTCAGTGTTACGAATCTG
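
Each read header above embeds the contig name, the sampled interval and the read length. A small sketch for pulling those fields out with sed follows. The helper name parse_read_headers is hypothetical and the pattern is inferred only from the headers shown here, not from a documented format specification. The other header fields (T, RID, S, mod, F, R) are matched but not reported.

```shell
# Print "contig start end length" for every NGSNGS-style read header on stdin.
# Sequence lines do not match the pattern and are silently skipped.
parse_read_headers() {
    sed -nE 's/^[>@]T[0-9]+_RID[0-9]+_S[0-9]+_(.+):([0-9]+)-([0-9]+)_length:([0-9]+)_mod.*$/\1 \2 \3 \4/p'
}

# Demo on the first header from the listing above
printf '%s\n' '>T0_RID50_S0_NZ_CP029543.1:3101496-3101645_length:150_mod0000 F0 R1' | parse_read_headers
# prints: NZ_CP029543.1 3101496 3101645 150
```

In this example the interval is inclusive at both ends: 3101645 - 3101496 + 1 = 150, which agrees with the length field.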

Use -seq PE to generate paired-end FASTA files:

test2_R1.fa
test2_R2.fa

When the output format is one of <fq, fq.gz, sam, bam, cram>, a quality profile must be supplied: -q1 for SE, and both -q1 and -q2 for PE.

docker run --rm -it \
    -v $PWD:$PWD -w $PWD \
    --user $(id -u):$(id -g) \
    wybioinfo/ngsngs \
    ngsngs \
    -i Test_Examples/Mycobacterium_leprae.fa.gz  \
    -r 5 \
    -t 2 \
    -s 1 \
    -l 150 \
    -seq PE \
    -q1 Test_Examples/AccFreqL150R1.txt \
    -q2 Test_Examples/AccFreqL150R2.txt \
    -f fq \
    -o test2

Alternatively, use -qs 40: a fixed quality score applied to both reads of a pair in FASTQ or Sequence Alignment/Map output. It overrides the quality profiles.
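
Assuming the usual Phred+33 (Sanger) encoding for FASTQ quality strings, a score Q is written as the ASCII character with code Q + 33. A fixed score of 40 therefore gives code 73, the letter I. The helper name phred_to_char below is hypothetical and only illustrates the arithmetic.

```shell
phred_to_char() {
    # Inner printf: ASCII code (Q + 33) as a 3-digit octal number.
    # Outer printf: interpret that octal escape as a single character.
    printf "\\$(printf '%03o' "$(( $1 + 33 ))")"
}

phred_to_char 40; echo   # prints: I
```

This is consistent with the quality lines in the listing that follows, which consist entirely of I.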

@T0_RID50_S0_NZ_CP029543.1:3101496-3101645_length:150_mod0000 F0 R2
AAATTCCTGGGACAGGTACTGCTGGACGAAGCTAAGAAGCTAATCAGTGGCCGTCCGATTTGGTCTTGTAGTTCGGTGATGCCCGCCGCCGGTGGTGACTAATGCCGTATGGCCATATGGCGTTTTACTGCCAGCTGATGCCGCGTATCT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@T0_RID139_S1_NZ_CP029543.1:703272-703421_length:150_mod0000 F0 R2
GCCGCGATGGGTGCGCTGGAGATCGATCCGGTTTTCGCCGAACGCAGTGTCAACGAGGGCTTCTCCGGTGGCGAAAAGAAGCGCCATGAGATCCTGCAACTGGAACTGCTCAAGCCTAAAATCGCCATCTTAGACGAGACCGATTCCGGG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
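
In the listings in this document, read T0_RID50 covers the same interval (3101496-3101645) as R1 in the single-end FASTA output and as R2 in the FASTQ records above, and the R2 sequence is the reverse complement of the R1 sequence. A quick check, using a hypothetical helper revcomp built from rev and tr (it handles only A, C, G, T and N, in upper or lower case):

```shell
revcomp() {
    # Reverse the string, then swap each base for its complement.
    printf '%s\n' "$1" | rev | tr 'ACGTNacgtn' 'TGCANtgcan'
}

# First 21 bases of R1 read T0_RID50 from the FASTA listing
revcomp AGATACGCGGCATCAGCTGGC
# prints GCCAGCTGATGCCGCGTATCT, the last 21 bases of the R2 record above
```

Presumably this happens because the fragment length (-l 150) equals the read length, so both mates span the whole fragment. That explanation is inferred from the output rather than taken from the NGSNGS documentation.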

Simulators that do not take the data distribution into account

RNA-seq

gffread can extract all transcripts from a genome and a GTF file.
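
A minimal sketch of that extraction step. It uses gffread's -g option (genome FASTA) and -w option (write a FASTA with the spliced exons of each transcript). The wrapper function extract_transcripts and all file names are placeholders. Using the resulting transcript FASTA as the -i reference for ngsngs is a possible next step, not a workflow described in the source.

```shell
extract_transcripts() {
    # $1: genome FASTA, $2: GTF/GFF annotation, $3: output transcript FASTA
    gffread -w "$3" -g "$1" "$2"
}

# Usage (placeholder file names):
# extract_transcripts genome.fa annotation.gtf transcripts.fa
```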

In metagenomics