miRNA生物信息数据分析流程初探(一)
snakemake-qiaseq-mirna流程
在github的snakemake-qiaseq-mirnaSL:snakemake_qiaseq_mirna流程中,MiRNA的分析步骤如下:
- Step 1: Read trimming and UMI detection
- cutadaptSL:cutadapt
- umi_toolsSL:umi_tools
- Step 2: Read alignment
- Step 3: Deduplication
- Step 4: Read counting
第一步: clone Github的仓库
git clone https://github.com/jounikuj/snakemake-qiaseq-mirna.git
要测试snakemake流程,可以先使用其提供的测试数据。在该仓库中的其测试数据在.test
目录,可以使用snakemake -np --directory .test
命令直接运行,测试流程是否正确,并且查看流程图。
.
├── config.yaml
├── fastq
├── LICENSE.md
├── README.md
├── samples.tsv
├── Snakefile
├── .test
│ ├── config.yaml
│ ├── fastq
│ │ ├── sampleA_S1_L001_R1_001.fastq.gz
│ │ ├── sampleA_S1_L002_R1_001.fastq.gz
│ │ ├── sampleA_S1_L003_R1_001.fastq.gz
│ │ ├── sampleA_S1_L004_R1_001.fastq.gz
│ │ ├── sampleB_S2_L001_R1_001.fastq.gz
│ │ ├── sampleB_S2_L002_R1_001.fastq.gz
│ │ ├── sampleB_S2_L003_R1_001.fastq.gz
│ │ ├── sampleB_S2_L004_R1_001.fastq.gz
│ │ ├── sampleC_S3_L001_R1_001.fastq.gz
│ │ ├── sampleC_S3_L002_R1_001.fastq.gz
│ │ ├── sampleC_S3_L003_R1_001.fastq.gz
│ │ └── sampleC_S3_L004_R1_001.fastq.gz
│ ├── samples.tsv
│ └── .snakemake
│ ├── auxiliary
│ ├── conda
│ ├── conda-archive
│ ├── incomplete
│ ├── locks
│ ├── log
│ ├── metadata
│ ├── shadow
│ └── singularity
└── workflow
├── envs
│ └── env.yaml
├── rules
│ ├── alignment.smk
│ ├── common.smk
│ ├── counting.smk
│ ├── deduplication.smk
│ ├── resources.smk
│ ├── targets.smk
│ └── trimming.smk
├── schemas
│ ├── config.schema.yaml
│ └── samples.schema.yaml
└── scripts
└── merge_read_counts.py
第二步: 查看流程图
在生成流程图的时候,可能需要安装以下软件
sudo apt install graphviz
生成流程图的具体命令是:
snakemake --dag --directory .test | dot -Tpng > dag.png
第三步:用一个样本解释miRNA的流程
使用下面命令生成sampleA
的流程图
snakemake --dag --directory .test results/sampleA/sampleA_counts_dedup.tsv | dot -Tpng > dag_single.png
3.1 数据库下载及建立索引
get_pirna_sequences : piRNA数据下载
curl -s http://www.regulatoryrna.org/database/piRNA/download/archive/v2.0/fasta/hsa.fa.gz | zcat - > resources/pirna.fasta 2> resources/logs/get_pirna_index.log
get_pirna_index:使用bowtie2建立piRNA数据的index
bowtie2-build --threads 1 --quiet resources/pirna.fasta resources/pirna > resources/logs/get_pirna_index.log
get_mirna_sequences : miRNA数据下载
curl -s ftp://mirbase.org/pub/mirbase/CURRENT/mature.fa.gz | zcat - | seqkit grep -r -p ^hsa | seqkit seq --rna2dna > resources/mirna_mature.fasta 2> resources/logs/get_mirna_sequences.log
curl -s ftp://mirbase.org/pub/mirbase/CURRENT/hairpin.fa.gz | zcat - | seqkit grep -r -p ^hsa | seqkit seq --rna2dna > resources/mirna_hairpin.fasta 2> resources/logs/get_mirna_sequences.log
cat resources/mirna_mature.fasta resources/mirna_hairpin.fasta > resources/mirna.fasta
get_mirna_index : 使用bowtie2
建立miRNA数据的index
bowtie2-build --threads 1 --quiet resources/mirna.fasta resources/mirna > resources/logs/get_mirna_index.log
3.2 下机fastq数据质控
merge_fastq_files : 合并单个样本所有的fastq文件到一个文件中
cat fastq/sampleA_S1_L001_R1_001.fastq.gz \
fastq/sampleA_S1_L002_R1_001.fastq.gz \
fastq/sampleA_S1_L003_R1_001.fastq.gz \
fastq/sampleA_S1_L004_R1_001.fastq.gz > results/sampleA/fastq/sampleA_untrimmed.fastq.gz
trim_illumina_adapters : 从reads的3'端去除Illumina的通用接头序列(Adapter sequences)
cutadapt -a AGATCGGAAGAGCACACGTCTGAAC \
--discard-untrimmed \
-m 20 -q 10 -j 1 \
-o results/sampleA/fastq/sampleA_illumina_trimmed.fastq.gz \
--quiet results/sampleA/fastq/sampleA_untrimmed.fastq.gz > results/sampleA/logs/trim_illumina_adapters.log
trim_umi_sequences : 从reads的3'端去除UMI
(Unique Molecular Identifier)序列,并将其添加到reads的头部
umi_tools extract --3prime \
--stdin=results/sampleA/fastq/sampleA_illumina_trimmed.fastq.gz \
--bc-pattern=NNNNNNNNNNNN \
--stdout=results/sampleA/fastq/sampleA_umi_trimmed.fastq.gz > results/sampleA/logs/trim_umi_sequences.log
拓展阅读:单细胞数据分析中的UMI序列
trim_5_adapter : 从reads 5'端去除接头
cutadapt -g GTTCAGAGTTCTACAGTCCGACGATC \
-m 20 -j 1 \
-o results/sampleA/fastq/sampleA_5adapter_trimmed.fastq.gz \
--quiet results/sampleA/fastq/sampleA_umi_trimmed.fastq.gz > results/sampleA/logs/trim_5_adapter.log
trim_3_adapter: 从reads 3'端去除接头,丢弃未修剪的reads并执行一些质量调整
cutadapt -a AACTGTAGGCACCATCAAT \
--discard-untrimmed \
-m 20 -q 10 -j 1 \
-o results/sampleA/fastq/sampleA_trimmed.fastq.gz \
--quiet results/sampleA/fastq/sampleA_5adapter_trimmed.fastq.gz > results/sampleA/logs/trim_3_adapter.lo
3.3 比对reads到数据库
map_mirna_sequences: 将reads比对到mirBase数据库序列,直接转换为排序的.bam
文件,并生成.bai
的索引文件。没有比对上的reads将保存为新的fastq
文件,随后与piRNA的数据库比对。
bowtie2 --very-sensitive-local \
-p 1 -x resources/mirna \
--un-gz results/sampleA/fastq/sampleA_mirna_unmapped.fastq.gz \
results/sampleA/fastq/sampleA_trimmed.fastq.gz | samtools sort - > results/sampleA/bam/sampleA_mirna.bam 2> results/sampleA/logs/map_mirna_sequences.log
samtools index results/sampleA/bam/sampleA_mirna.bam results/sampleA/bam/sampleA_mirna.bai
map_pirna_sequences: 将reads比对到piRNAdb数据库序列,直接转换为排序的.bam
文件,并生成.bai
的索引文件。有比对上的reads将保存为新的fastq
文件,随后与参考基因组比对(本文没有执行该步骤)。
bowtie2 --very-sensitive-local -p 1 \
-x resources/pirna \
--un-gz results/sampleA/fastq/sampleA_unmapped.fastq.gz \
results/sampleA/fastq/sampleA_mirna_unmapped.fastq.gz | samtools sort - > results/sampleA/bam/sampleA_pirna.bam 2> results/sampleA/logs/map_pirna_sequences.log
samtools index results/sampleA/bam/sampleA_pirna.bam results/sampleA/bam/sampleA_pirna.bai
3.4 合并bam文件
merge_bam_files: 将单个 样本BAM 文件合并为一个BAM文件,以减少处理文件的数量并简化后续工作流。
samtools merge --threads 1 \
results/sampleA/bam/sampleA_merged.bam \
results/sampleA/bam/sampleA_mirna.bam \
results/sampleA/bam/sampleA_pirna.bam > results/sampleA/logs/merge_bam_files.log
samtools index results/sampleA/bam/sampleA_merged.bam results/sampleA/bam/sampleA_merged.bai
3.5 去除PCR重复
deduplicate_reads: 使用UMItools dedup命令根据先前识别的UMI序列,去除重复的reads
umi_tools dedup \
-I results/sampleA/bam/sampleA_merged.bam \
-S results/sampleA/bam/sampleA_merged_dedup.bam > results/sampleA/logs/deduplicate_reads.log
samtools index results/sampleA/bam/sampleA_merged_dedup.bam results/sampleA/bam/sampleA_merged_dedup.bai
3.6 获得count表
count_deduplicated_read_counts: 统计BAM文件中去除重复reads的计数,并从输出中提取相关列。
samtools idxstats results/sampleA/bam/sampleA_merged_dedup.bam | \
awk 'BEGIN{OFS='\t'} $3 != 0' - | \
cut -f1,3 -| sed '1 i\Molecule Count' - > results/sampleA/sampleA_counts_dedup.tsv