Simulating next-generation sequencing (NGS) reads

Last published: 2024-09-07 19:42:59

NGSNGS

Paper: https://academic.oup.com/bioinformatics/article/39/1/btad041/6994180
Source code: https://github.com/RAHenriksen/NGSNGS

NGSNGS can simulate ancient and modern DNA data as either single-end (SE) or paired-end (PE) reads, and can store the reads in several output formats, including .fasta, .fastq and the Sequence Alignment/Map formats (.bam, .sam, .cram).

The simulated sequences are sampled (with replacement) from a reference DNA genome, which can represent a haploid genome, polyploid assemblies or even population haplotypes and allows the user to simulate known variable sites directly.

Docker build

FROM ubuntu
RUN apt-get update && apt-get install curl -y && apt-get clean
COPY amplicon /bin/amplicon
COPY ngsngs /bin/ngsngs

# docker run --rm -it  ubuntu bash
# docker build -t wybioinfo/ngsngs .

Input file

$ zcat Test_Examples/Mycobacterium_leprae.fa.gz | head
>NZ_CP029543.1 Mycobacterium leprae strain MRHRU-235-G chromosome, complete genome
ATGTTTGTACCGCACGCCAAAAAGCCCGAAATTTACGAGAACCAGAGAGATACGTCGTTGGCCGATGACCTTAGTCTAGG
TTTCACCACGGTTTGGAACGCAGTCGTCTCCGAACTCAACGGCGAATCCAACACAGACGACGAAGCCACCAACGACAGCA
CCCTAGTCACTCCGCTAACTCCTCAGCAAAGAGCATGGCTAAATCTGGTTCAACCACTCACCATCATCGAGGGATTTGCT
CTTTTATCGGTGCCCAGCAGCTTTGTCCAAAATGAAATTGAACGTCATCTACGAACGCCAATCACCGATGCACTCAGCCG
TCGACTCGGACAACAGATACAGCTCGGAGTCCGTATCGCACCGCCCTCTACCGACCATATTGACGACAATTCCTCGTCAG
CCGACGTCCTTCTAACCGACGATTGCGGCACAGATACAGACGAAAATTACGGGGAGCCTCTTACAGGCGAGTACCAGGGT
TTGCCAACCTACTTCACCGAACGTCCGCACCATACCGAATCAACCGTCACGGGAGGTACCAGCCTTAATCGCCGTTACAC
CTTCGAAACGTTCGTTATTGGCGCGTCGAATCGGTTCGCGCATGCTGCCGCGCTAGCGATAGCCGAAGCACCGGCCCGAG
CCTACAACCCCCTTTTCATTTGGGGCGAGTCAGGTCTTGGCAAAACCCACCTATTGCACGCCGCCGGGAACTACGCACAA

Generate a single-end 150 bp FASTA file

docker run --rm -it \
    -v $PWD:$PWD -w $PWD \
    --user $(id -u):$(id -g) \
    wybioinfo/ngsngs \
    ngsngs \
    -i Test_Examples/Mycobacterium_leprae.fa.gz  \
    -r 5 \
    -t 2 \
    -s 1 \
    -l 150 \
    -seq SE \
    -f fa \
    -o test2
  • -r : Number of reads to simulate.
  • -t : Number of sampling threads, default = 1.
  • -s : Random seed, default = current calendar time (s).
  • -l : Fixed length of simulated fragments.
  • -seq : Simulate single-end or paired-end reads (SE/PE).
  • -f : File format of the simulated output reads (fa/fa.gz/fq/fq.gz/sam/bam/cram).

View the results

$ head test2.fa 
>T0_RID50_S0_NZ_CP029543.1:3101496-3101645_length:150_mod0000 F0 R1
AGATACGCGGCATCAGCTGGCAGTAAAACGCCATATGGCCATACGGCATTAGTCACCACCGGCGGCGGGCATCACCGAACTACAAGACCAAATCGGACGGCCACTGATTAGCTTCTTAGCTTCGTCCAGCAGTACCTGTCCCAGGAATTT
>T0_RID114_S0_NZ_CP029543.1:1247017-1247166_length:150_mod0000 F0 R1
CGCCGGGCTGCGGCTGTTGTTTCTCGTGCAGGGTTGCATTAGCGTGCGAGGCCGCACGCGGAGCAACCGAATTTCAGGCCCCTTTAAAGCAAATTAGGGCTTGTTCGTTGCCCCATGCGGCTGAGTGGTTCGATGTTCGTGTGTTCATCC
>T1_RID5_S1_NZ_CP029543.1:2689729-2689878_length:150_mod0000 F0 R1
CCTTACCAGACTGCGAGTTGACCCGGATCACCGCCTCGTAGGTGCGGCCGACGTCGCGCGGGTCGATCGGCAGGTACGGTACTTGCCACAGGATGTCGTCGACATCGGAGTCGGCGGCGTCCGCATCGATCTTCATCTGGTCCAGGCCCT
>T1_RID112_S0_NZ_CP029543.1:2730683-2730832_length:150_mod0000 F0 R1
GGCGCCCAGTCAACTTCGATTCCAGCCGACCAGAGTTCACCCAGTGCGCGCAGGAACGTGTCGTGGTCATCAACGTTCTGAAGCGGATGGCGCATGAGCCGAACAGCGCGGTGCCCACTCGACCACCTCGGGTGACGCATCGCCGAACCG
>T1_RID99_S1_NZ_CP029543.1:1857937-1858086_length:150_mod0000 F0 R1
TTTATGCACCTCGACCACTACCGCCGCGGGTATGGTAGCGACGCCCTGTAGACGCTGATCAACTGGCTGTTCATCGAAACAGACCGACCGTCGCATCACGATTGACCTAGCCTTTGGACAATGCCGGGGCTATTCAGTGTTACGAATCTG
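
The read names above appear to encode the source contig, the sampled interval and the fragment length. As a rough sketch, the helper below pulls those fields out of a name. The function name parse_read_name is made up for this example, and the field layout is inferred from the output shown here, not taken from the NGSNGS documentation.

```shell
# Hypothetical helper: split an NGSNGS read name into "contig start end length".
# The name layout is an assumption based on the example output in this post.
parse_read_name() {
    # Strip the leading '>' or '@' and everything after the first space.
    name=${1#[>@]}
    name=${name%% *}
    # Capture the contig, the interval and the length; ignore the other fields.
    printf '%s\n' "$name" |
        sed -E 's/^T[0-9]+_RID[0-9]+_S[0-9]+_(.+):([0-9]+)-([0-9]+)_length:([0-9]+)_.*$/\1 \2 \3 \4/'
}

parse_read_name '>T0_RID50_S0_NZ_CP029543.1:3101496-3101645_length:150_mod0000 F0 R1'
```

For the first record, the interval 3101496-3101645 spans 150 positions when both ends are counted, which agrees with -l 150.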

Use -seq PE to generate paired-end FASTA files

test2_R1.fa
test2_R2.fa
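
In the example data in this post, the R2 record for RID50 is the reverse complement of the R1 record for the same fragment, which is what one would expect when the fragment length equals the read length. A small sketch for checking this on other read pairs follows; revcomp is a name invented here, and it only handles the plain A/C/G/T alphabet.

```shell
# Hypothetical helper: reverse-complement a DNA string (A/C/G/T only).
revcomp() {
    printf '%s\n' "$1" | tr 'ACGTacgt' 'TGCAtgca' |
        awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }'
}

# First 32 bases of the R1 read for RID50, and last 32 bases of its R2 read.
r1_head=AGATACGCGGCATCAGCTGGCAGTAAAACGCC
r2_tail=GGCGTTTTACTGCCAGCTGATGCCGCGTATCT
[ "$(revcomp "$r1_head")" = "$r2_tail" ] && echo "R2 is the reverse complement of R1"
```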

When the output format is fq, fq.gz, sam, bam or cram, a quality profile must be supplied: -q1 for SE, and both -q1 and -q2 for PE.

docker run --rm -it \
    -v $PWD:$PWD -w $PWD \
    --user $(id -u):$(id -g) \
    wybioinfo/ngsngs \
    ngsngs \
    -i Test_Examples/Mycobacterium_leprae.fa.gz  \
    -r 5 \
    -t 2 \
    -s 1 \
    -l 150 \
    -seq PE \
    -q1 Test_Examples/AccFreqL150R1.txt \
    -q2 Test_Examples/AccFreqL150R2.txt \
    -f fq \
    -o test2
  • -q1 : Read quality profile for single-end reads (SE) or the first read of a pair (PE), for fastq or sequence alignment map formats.
  • -q2 : Read quality profile for the second read of a pair (PE), for fastq or sequence alignment map formats.

Alternatively, use -qs 40: a fixed quality score for both reads of a pair in fastq or sequence alignment map formats. It overrides the quality profiles.

@T0_RID50_S0_NZ_CP029543.1:3101496-3101645_length:150_mod0000 F0 R2
AAATTCCTGGGACAGGTACTGCTGGACGAAGCTAAGAAGCTAATCAGTGGCCGTCCGATTTGGTCTTGTAGTTCGGTGATGCCCGCCGCCGGTGGTGACTAATGCCGTATGGCCATATGGCGTTTTACTGCCAGCTGATGCCGCGTATCT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@T0_RID139_S1_NZ_CP029543.1:703272-703421_length:150_mod0000 F0 R2
GCCGCGATGGGTGCGCTGGAGATCGATCCGGTTTTCGCCGAACGCAGTGTCAACGAGGGCTTCTCCGGTGGCGAAAAGAAGCGCCATGAGATCCTGCAACTGGAACTGCTCAAGCCTAAAATCGCCATCTTAGACGAGACCGATTCCGGG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
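
The quality strings above consist entirely of the character I. Under the standard Phred+33 encoding used by FASTQ, that character corresponds to a score of 40, so this output looks like the result of a fixed-score run with -qs 40 (an inference from the data, since the post does not say which command produced it). A minimal sketch of the decoding, with phred33 as a made-up helper name:

```shell
# Hypothetical helper: decode one FASTQ quality character (Phred+33).
phred33() {
    # A leading quote makes printf emit the character's numeric code.
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

phred33 I
```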

Simulators that do not account for the data distribution

RNA-seq

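  • Generate a genome
  • Generate a GTF file for the genome
  • Generate samples for different treatments (i.e. higher or lower read counts for specific genes)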

gffread can extract all transcripts from a genome and a GTF file
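
The first two steps of the RNA-seq plan could be sketched as below: one random contig plus a GTF with a single one-exon transcript. Everything here is a toy assumption (the names chrT, geneA, txA and toy, the sizes, the uniform base composition); a realistic test would start from a real genome and annotation.

```shell
# Hypothetical helper: print a random contig named chrT as FASTA.
# $1 = contig length, $2 = random seed
make_toy_genome() {
    awk -v n="$1" -v seed="$2" 'BEGIN {
        srand(seed)
        split("A C G T", base, " ")
        print ">chrT"
        line = ""
        for (i = 1; i <= n; i++) {
            line = line base[int(rand() * 4) + 1]
            # Wrap sequence lines at 60 bases and flush the last partial line.
            if (i % 60 == 0 || i == n) { print line; line = "" }
        }
    }'
}

# Hypothetical helper: print a GTF with one single-exon transcript on chrT.
# $1 = start, $2 = end (1-based, inclusive)
make_toy_gtf() {
    printf 'chrT\ttoy\ttranscript\t%s\t%s\t.\t+\t.\tgene_id "geneA"; transcript_id "txA";\n' "$1" "$2"
    printf 'chrT\ttoy\texon\t%s\t%s\t.\t+\t.\tgene_id "geneA"; transcript_id "txA";\n' "$1" "$2"
}

# Example usage (writes files, so left commented out):
#   make_toy_genome 300 1 > toy.fa
#   make_toy_gtf 11 90 > toy.gtf
#   gffread -w transcripts.fa -g toy.fa toy.gtf   # requires gffread to be installed
```

The third step would then amount to simulating a different number of reads (-r) from each transcript sequence in each sample; how to split the transcripts into separate inputs is left open here.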

In metagenomics

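  • Generate genomes for several bacteria
  • Generate reads for several bacteria
  • Build a custom MetaPhlAn3 marker database
  • Run MetaPhlAn3 on the generated reads with the custom marker database to identify and quantify the bacteria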
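
For the read-generation step, one simple option is to fix a total read count and split it across genomes by relative abundance, then run the simulator once per genome with the resulting -r value. The sketch below only does the arithmetic; allocate_reads and the genome names are invented for illustration, and the rounded counts may not add up exactly to the total.

```shell
# Hypothetical helper: split a total read count by relative abundance.
# $1 = total reads; stdin = lines of "<genome name> <weight>"
allocate_reads() {
    awk -v total="$1" '
        { name[NR] = $1; w[NR] = $2; sum += $2 }
        END {
            # Round each share to the nearest whole read.
            for (i = 1; i <= NR; i++)
                printf "%s %d\n", name[i], int(total * w[i] / sum + 0.5)
        }'
}

printf 'genomeA 5\ngenomeB 3\ngenomeC 2\n' | allocate_reads 1000
```

Because the expected abundances are known in advance, the MetaPhlAn3 estimates can afterwards be compared against them directly.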