Reference genomes and annotation files

Last updated: 2024-11-30 11:50:06

Counting the bases in a genome

grep -v "^>" "${testData}/RNA-seq/genomic/chr22_with_ERCC92.fa" | perl -ne 'chomp $_; $bases{$_}++ for split //; if (eof){print "$_ $bases{$_}\n" for sort keys %bases}'

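GRCh37 and GRCh38 are both human genome assemblies from the Genome Reference Consortium (GRC). GRCh38 (also known as "build 38") was released in 2013, four years after GRCh37 came out in 2009, so it can be regarded as a newer version that updates the earlier assembly.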

  • GRCh38 Genome Reference Consortium Human Build 38 Organism:

Downloading reference genomes

GENCODE

  • GTF file download
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.annotation.gtf.gz

  • Genome sequence (GRCh38.p14)
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/GRCh38.p14.genome.fa.gz
  • Transcript sequences
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.transcripts.fa.gz
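
The GENCODE URLs above differ only in the release number, so they can be built from a single variable. The sketch below assumes that other human releases follow the same naming pattern as release 47; the variable names are made up for illustration. The genome FASTA is left out because its file name also carries the assembly patch (GRCh38.p14), which does not follow from the release number.

```shell
# Release number taken from the commands above; change it to target another release
release=47
base_url="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_${release}"

# Assumed naming pattern: gencode.v<release>.<content>.gz
gtf_url="${base_url}/gencode.v${release}.annotation.gtf.gz"
transcripts_url="${base_url}/gencode.v${release}.transcripts.fa.gz"

# Print the URLs; to download, pass each one to wget, e.g. wget -c "$gtf_url"
echo "$gtf_url"
echo "$transcripts_url"
```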

Ensembl

Human genome download page: https://asia.ensembl.org/Homo_sapiens/Info/Index

  • GTF file download
wget https://ftp.ensembl.org/pub/release-113/gtf/homo_sapiens/Homo_sapiens.GRCh38.113.gtf.gz

  • Genome sequence (FASTA)
wget https://ftp.ensembl.org/pub/release-113/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

  • Transcript sequences (cDNA)
wget https://ftp.ensembl.org/pub/release-113/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

Notes

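  • GENCODE and Ensembl name chromosomes differently: GENCODE uses a "chr" prefix (chr1), Ensembl does not (1)
  • GENCODE includes the version in both the transcript FASTA IDs and the GTF IDs (ID.version)
    • >ENST00000424770
    • gene_id "ENSMUSG00000102693.1"; transcript_id "ENSMUST00000193812.1";
  • Ensembl includes the version in the transcript FASTA IDs, but its GTF stores the ID and the version in separate fields
    • >ENST00000390473.1 cdna chromosome:GRCh38:14:22450089:22450139:1 gene:ENSG00000211825.1
    • gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2";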
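
When files from the two sources are mixed, these differences usually have to be reconciled first. The following is a minimal sketch, not a complete converter: the helper `to_ensembl_chrom` is a hypothetical name, the transcript ID is copied from the GENCODE example above, and only the "chr" prefix and the mitochondrial name (chrM in GENCODE, MT in Ensembl) are handled.

```shell
# Versioned transcript ID copied from the GENCODE GTF example above
gencode_id="ENSMUST00000193812.1"

# Drop the ".version" suffix so the ID matches an unversioned transcript_id field
unversioned_id="${gencode_id%%.*}"

# Hypothetical helper: map a GENCODE-style chromosome name to an Ensembl-style one
to_ensembl_chrom() {
  case "$1" in
    chrM) echo "MT" ;;          # the mitochondrial genome is named differently
    chr*) echo "${1#chr}" ;;    # strip the "chr" prefix
    *)    echo "$1" ;;          # leave anything else untouched
  esac
}

echo "$unversioned_id"
to_ensembl_chrom chr22
to_ensembl_chrom chrM
```

Stripping the version loses information, so it is safer to do this only when both files are known to come from matching releases.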

Comparing hg19 and hg38 sequences

library(tidyverse)
library("scales")
library(rtracklayer)

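# Load the annotation; switch to the commented line to plot GENCODE v39 instead
# gtf_data <- import('reference/gencode.v39.annotation.gtf')
gtf_data <- import('reference/gencode.v19.annotation.gtf.gz')

# Convert the GRanges object to a data frame and cache it as TSV,
# so later runs can start from read_tsv() and skip the slow GTF import
gtf_data <- as.data.frame(gtf_data)
write_tsv(gtf_data, file = "gtf_data.tsv")
gtf_data <- read_tsv("gtf_data.tsv")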
chrom_order <- c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", 
                 "chr8", "chr9", "chr10", "chr11", "chr12", "chr13", "chr14", 
                 "chr15", "chr16", "chr17", "chr18", "chr19", "chr20", "chr21", 
                 "chr22", "chrX", "chrY", "chrM")
chrom_key <- setNames(object = as.character(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 
                                              12, 13, 14, 15, 16, 17, 18, 19, 20, 
                                              21, 22, 23, 24, 25)), 
                      nm = chrom_order)
chrom_order <- factor(x = chrom_order, levels = rev(chrom_order))
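# Approximate each chromosome's size by the span its annotations cover
# (max end minus min start); this is not the true sequence length
chrom_sizes2 <- gtf_data |>
    mutate(chromosome = seqnames) |>
    group_by(chromosome) |>
    summarise(size = max(end) - min(start))
chrom_sizes2$chromosome <- factor(x = chrom_sizes2$chromosome, levels = chrom_order)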

# Keep only protein-coding features; they are drawn on top of the chromosome outlines
sample_cns <- gtf_data |>
    filter(gene_type == "protein_coding") |>
    select(chromosome = seqnames, start, end, gene_type)
sample_cns$chromosome <- factor(x = sample_cns$chromosome, levels = chrom_order)
ggplot(data = chrom_sizes2) + 
    # base rectangles for the chroms, with numeric value for each chrom on the x-axis
    geom_rect(aes(xmin = as.numeric(chromosome) - 0.2, 
                  xmax = as.numeric(chromosome) + 0.2, 
                  ymax = size, ymin = 0), 
              colour="black", fill = "white") + 
    # rotate the plot 90 degrees
    coord_flip() +
    theme(axis.text.x = element_text(colour = "black"), 
          panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(), 
          panel.background = element_blank(),legend.position="bottom") +
    scale_x_discrete(name = "chromosome", limits = names(chrom_key)) +
    # filled rectangles marking the protein-coding features on each chromosome
    geom_rect(data = sample_cns, aes(xmin = as.numeric(chromosome) - 0.2, 
                                     xmax = as.numeric(chromosome) + 0.2, 
                                     ymax = end, ymin = start)) +
    labs(title = "gencode.GRCh37.p13.v19")
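
The chromosome sizes in the R code are the spans covered by the annotation (largest end minus smallest start), which can also be checked straight from a GTF on the command line. The sketch below assumes a plain tab-separated GTF; the three-line `toy.gtf` and its coordinates are invented for illustration.

```shell
# Hypothetical GTF excerpt (tab-separated; coordinates invented)
tmpdir=$(mktemp -d)
printf 'chr1\tHAVANA\tgene\t100\t500\t.\t+\t.\tgene_id "g1";\n'  >  "$tmpdir/toy.gtf"
printf 'chr1\tHAVANA\tgene\t900\t2000\t.\t-\t.\tgene_id "g2";\n' >> "$tmpdir/toy.gtf"
printf 'chr2\tHAVANA\tgene\t50\t300\t.\t+\t.\tgene_id "g3";\n'   >> "$tmpdir/toy.gtf"

# Per chromosome (column 1): track the smallest start (column 4) and largest end (column 5)
spans=$(awk -F '\t' '
  !/^#/ {
    s = $4 + 0; e = $5 + 0                       # force numeric comparison
    if (!($1 in lo) || s < lo[$1]) lo[$1] = s
    if (!($1 in hi) || e > hi[$1]) hi[$1] = e
  }
  END { for (c in lo) print c, hi[c] - lo[c] }   # annotated span per chromosome
' "$tmpdir/toy.gtf" | sort)
echo "$spans"

rm -r "$tmpdir"
```

For a real annotation, the compressed file can be streamed into the same awk program with `gzip -dc`.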
