Question

Highly mapped to introns

3.9 years ago

Gotumbtai ▴ 10

Hi,

I am analyzing RNA-seq data from human blood samples. After mapping with STAR, I checked the read distribution using RSeQC read_distribution. Usually, more than 80% of reads map to exons. This time, however, only a few percent mapped to exons, even though the STAR output showed that more than 90% of reads were uniquely mapped. I am wondering whether this result is correct or whether my RSeQC settings were wrong.

The command I used: read_distribution.py -i my.bam -r hg19_Ensembl_gene.bed

The BAM files were output by STAR and sorted with samtools. The BED file was downloaded from https://sourceforge.net/projects/rseqc/files/BED/Human_Homo_sapiens/. The reference genome used for mapping was Homo_sapiens.GRCh38.dna.primary_assembly.fa and the annotation file was Homo_sapiens.GRCh38.104.gtf.

One of the RSeQC outputs is shown below: (image not included)

The MultiQC plot is shown below: (image not included)

Thank you for your help!!

distribution read RSeQC RNA-seq exon intron • 1.6k views

ADD COMMENT • link updated 3.9 years ago by rpolicastro 13k • written 3.9 years ago by Gotumbtai ▴ 10

Do you know the library type that was used? E.g. if this is total RNA (not poly-A enriched), then you may simply have lots of immature transcripts and non-coding transcripts. There also seems to be a bit of genomic DNA contamination ("other_intergenic").

ADD REPLY • link 3.9 years ago by Friederike 9.0k

rpolicastro · Answer 1 · 2021-09-14

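I think your problem is that your BED file doesn't match the genome and GTF you mapped with. The BED file is built on hg19, whereas your genome and annotation are GRCh38, so the transcript coordinates differ. The chromosome names differ too: chr1 in the BED file versus 1 in the GTF. For comparison, my $gtf is the release 104 file, the same as yours.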

zcat hg19_Ensembl_gene.bed.gz | head
chr1    66999065        67210057        ENST00000237247 0       +       67000041        67208778        0       27      25,123,64,25,84,57,55,176,12,12,25,52,86,93,75,501,81,128,127,60,112,156,133,203,65,165,1302,   0,863,92464,99687,100697,106394,109427,110161,127130,134147,137612,138561,139898,143621,146295,148486,150724,155765,156807,162051,185911,195881,200365,205952,207275,207889,209690,

grep ENST00000237247 $gtf
1       havana  transcript      66533383        66744374        .       +       .       gene_id "ENSG00000118473"; gene_version "23"; transcript_id "ENST00000237247"; transcript_version "10"; gene_name "SGIP1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "SGIP1-201"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic"; transcript_support_level "5";

1       havana  exon    66533383        66533407        .       +       .       gene_id "ENSG00000118473"; gene_version "23"; transcript_id "ENST00000237247"; transcript_version "10"; exon_number "1"; gene_name "SGIP1"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "SGIP1-201"; transcript_source "havana"; transcript_biotype "protein_coding"; exon_id "ENSE00001454196"; exon_version "1"; tag "basic"; transcript_support_level "5";
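
The manual check above (same transcript ID, different chromosome name and start coordinate) can be scripted so it is easy to repeat for any BED/GTF pair. The following is only a rough sketch: the function names `seq_names`, `shared_seq_count`, `bed_tx_span` and `gtf_tx_span` are hypothetical helpers, not part of RSeQC or STAR. It assumes a BED file with the transcript ID in column 4 and an Ensembl-style GTF with `transcript` rows.

```shell
#!/bin/sh
# Hypothetical helpers for a quick BED-vs-GTF consistency check.
# None of these are RSeQC or STAR commands.

# Stream a file to stdout, decompressing it if the name ends in .gz.
read_maybe_gz() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac
}

# Print the unique sequence (chromosome) names found in column 1.
# Skips GTF comment lines and BED "track"/"browser" header lines.
seq_names() {
    read_maybe_gz "$1" |
        awk '!/^(#|track|browser)/ && NF >= 3 { print $1 }' |
        LC_ALL=C sort -u
}

# Count the sequence names that two annotation files have in common.
# Each list is already unique, so a name that appears twice in the
# concatenation is present in both files.
shared_seq_count() {
    { seq_names "$1"; seq_names "$2"; } |
        LC_ALL=C sort | uniq -d | wc -l | tr -d ' '
}

# Print "chrom start end" for one transcript ID in a BED file.
# BED starts are 0-based, so 1 is added to make the start comparable
# with the 1-based coordinates used in GTF.
bed_tx_span() {
    read_maybe_gz "$1" |
        awk -v id="$2" '$4 == id { print $1, $2 + 1, $3 }'
}

# Print "chrom start end" for the transcript row of one transcript ID
# in a GTF file.
gtf_tx_span() {
    read_maybe_gz "$1" |
        awk -F'\t' -v id="$2" '
            $3 == "transcript" && index($9, "transcript_id \"" id "\"") {
                print $1, $4, $5
            }'
}
```

After sourcing these functions, `shared_seq_count hg19_Ensembl_gene.bed.gz "$gtf"` reports how many chromosome names the two files share. A count of zero would point to a naming mismatch such as chr1 versus 1. Running `bed_tx_span` and `gtf_tx_span` with ENST00000237247 prints the two spans side by side. If they disagree, the BED file has to be replaced with one derived from the same annotation used for mapping. One possible route, offered as an assumption rather than something stated in this thread, is converting the GTF with the UCSC utilities gtfToGenePred and genePredToBed.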