Question

What is the advantage of mapping RNA-Seq against the genome?

0

2.7 years ago

PK ▴ 130

Hi all,

I have a basic question. What is the advantage of mapping RNA-Seq reads against the genome, and how is it done? I have read a couple of posts but could not find a clear answer.

I assume that if I map against the transcriptome, I will not get information about introns and intergenic regions. Is there any gtf file for genome wide mapping?

Can you please suggest some links or other material to read?

Thanks

RNA-SEQ mapping • 2.3k views

ADD COMMENT • link 2.7 years ago by PK ▴ 130

score 2 · Answer 1 · 2022-10-29

2

2.7 years ago

GenoMax 152k

See the answers in this past thread: Could you explain the difference between STAR, KALLISTO, SALMON etc. to experimental Biologist/non-bioinformatician

You would want to align against the genome if you are interested in identifying previously unknown parts of the genome that may be transcribed. By choosing a transcriptome (a collection of transcripts) you are restricting yourself to "known" regions (this is perfectly acceptable for normal uses and is thus the basis for tools like salmon and kallisto).

Alignments to the genome will need significantly more resources in terms of time/memory than transcriptome alignments/mapping.

Is there any gtf file for genome wide mapping?

A GTF file for a "genome" should by definition have information about the features from the entire genome.

ADD COMMENT • link 2.7 years ago by GenoMax 152k

1

However, it is worth noting that if you are mapping to the transcriptome, you need to use a proper transcript quantification program, like salmon, kallisto or RSEM, rather than simple read counting. This is because most transcribed regions of the genome are part of multiple transcripts. This means that if you use a simple read aligner, like BWA, Bowtie2 or HISAT2, to align to the transcriptome, the vast majority of reads will map to multiple locations (and be ignored by most read counters under default settings, or counted twice under alternate settings).

If you are using a simple align-and-count strategy, genomic alignment allows you to distinguish a read that maps to one location in the genome, where that location is part of multiple transcripts, from a read that maps to multiple locations in the genome and whose identity therefore cannot be established.
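To make the distinction concrete, here is a toy sketch in Python. The sequences, the names `tx_A`/`tx_B` and the helper `locate` are all made up for illustration, and plain substring search stands in for a real aligner. A read from a shared exon hits one place in the genome but two entries in the transcriptome.

```python
# Toy sequences (hypothetical): one gene with three exons and two introns.
exon1, intron1 = "ACGTTGCA", "GTAAGTCCAG"
exon2, intron2 = "GGATCCAT", "GTGAGTTTAG"
exon3 = "TTCAAACC"

genome = exon1 + intron1 + exon2 + intron2 + exon3

# Two transcripts of the same gene that share exon1.
transcriptome = {
    "tx_A": exon1 + exon2,
    "tx_B": exon1 + exon3,
}


def locate(read, references):
    """Return the names of all reference sequences that contain the read."""
    return sorted(name for name, seq in references.items() if read in seq)


read = "CGTTGC"  # lies entirely inside the shared exon1

genome_hits = genome.count(read)              # number of places in the genome
transcript_hits = locate(read, transcriptome)  # transcripts containing the read

print(genome_hits, transcript_hits)
```

Against the genome the read is unique, so a simple counter can assign it to the gene. Against the transcriptome the same read is a multi-mapper, which is exactly the case a plain read counter drops or double counts.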

Salmon, kallisto, RSEM etc. don't suffer from this problem because they use the reads to estimate expression with an EM model, rather than just counting reads.
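As a rough illustration of the EM idea, here is a minimal Python sketch. It is not how salmon, kallisto or RSEM are actually implemented: it ignores transcript length, fragment bias and mapping quality, and the function name `em_abundances` and the toy reads are assumptions for this example only. Each read is shared out among its compatible transcripts in proportion to the current abundance estimates, and the estimates are then updated from those fractional counts.

```python
from collections import defaultdict


def em_abundances(read_compat, n_iter=200):
    """Toy EM: estimate relative transcript abundances.

    read_compat: list of sets, one per read, holding the transcripts that
    read is compatible with. All transcripts are treated as equal length.
    """
    transcripts = sorted(set().union(*read_compat))
    # Start from a uniform guess.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    n_reads = len(read_compat)

    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: split each read among its compatible transcripts,
        # weighted by the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: new abundances are the normalised fractional counts.
        theta = {t: counts[t] / n_reads for t in transcripts}
    return theta


# Hypothetical data: 6 reads fall in an exon shared by A and B,
# 3 reads are unique to A and 1 read is unique to B.
reads = [{"A", "B"}] * 6 + [{"A"}] * 3 + [{"B"}] * 1
theta = em_abundances(reads)
print(theta)
```

For this toy input the estimates settle at roughly 0.75 for A and 0.25 for B: the uniquely assignable reads decide how the six ambiguous reads are split, so none of them are thrown away or counted twice.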

ADD REPLY • link 2.7 years ago by i.sudbery 22k

0

Hi,

Thanks for the explanation and valuable links. The reason I'm asking is that I could not find any intron or intergenic features in the 3rd column of the GTF. Moreover, when I use featureCounts it counts at the exon or gene level, which means I'm losing the information about introns and intergenic regions, right? Correct me if I'm wrong. I'm using STAR for aligning, with the hg38 genome and GTF, and I have the alignIntronMin and alignIntronMax parameters set. The BAM file has a good number of reads in the introns. How can I extract the counts that belong to introns or intergenic regions?

ADD REPLY • link 2.7 years ago by PK ▴ 130

1

All of the standard bulk RNA-seq quantification pipelines discard reads mapping to intronic and intergenic regions. Depending on your biological question, there is an argument to be made for counting intronic reads, although intronic reads are generally considered a better reporter of the rate of transcription than of steady-state RNA levels, as the primary transcript is usually considered short-lived. Either way, I'm not aware of an easy way of doing this, other than doing some GTF-fu to add the intronic regions to the GTF records (for something like STAR/featureCounts) or to the FASTA (for something like salmon).
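One possible form of that GTF-fu, sketched in Python under some assumptions: the input is a standard 9-column, tab-separated GTF with `exon` records carrying `gene_id` and `transcript_id` attributes, and the function name `intron_records` and the `derived` source label are made up here. Introns are taken as the gaps between consecutive exons of the same transcript.

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def intron_records(gtf_lines):
    """Build GTF 'intron' lines from the exon lines of each transcript.

    GTF coordinates are 1-based and inclusive, so the intron between
    exons (a, b) and (c, d) spans b + 1 to c - 1.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        gene = GENE_ID.search(fields[8])
        tx = TRANSCRIPT_ID.search(fields[8])
        if gene is None or tx is None:
            continue  # skip records without the attributes we rely on
        key = (fields[0], fields[6], gene.group(1), tx.group(1))
        exons[key].append((int(fields[3]), int(fields[4])))

    records = []
    for (chrom, strand, gene, tx), spans in sorted(exons.items()):
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start > prev_end + 1:  # ignore touching/overlapping exons
                attrs = f'gene_id "{gene}"; transcript_id "{tx}";'
                records.append("\t".join([
                    chrom, "derived", "intron",
                    str(prev_end + 1), str(next_start - 1),
                    ".", strand, ".", attrs,
                ]))
    return records


example = [
    'chr1\tsrc\tgene\t100\t400\t.\t+\t.\tgene_id "g1";',
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tsrc\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
]
introns = intron_records(example)
print("\n".join(introns))
```

The resulting lines could be appended to the annotation and counted with something like featureCounts using `-t intron`. Note that an intron of one transcript can overlap an exon of another transcript of the same gene, so for strictly intronic counts you would probably also want to subtract all exonic intervals first.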

As for intergenic reads... how would you like to count them? featureCounts reports a total number of reads that didn't overlap any annotations.

Be careful if there are a lot of intergenic reads. If you look at your BAM files in IGV and there is a continuous low-level read depth across the whole genome, roughly the same as in intronic regions, this might suggest that you have genomic DNA contamination in your RNA prep. I'd expect around 1/3 of reads from a total RNA prep to map to exons, with a good chunk of the remainder mapping to introns, and 50% to 2/3 of reads to map to exons for a poly-A sample.

ADD REPLY • link 2.7 years ago by i.sudbery 22k

0

Thanks. Yes, so it is the case that introns are not considered in standard RNA-Seq quantification. First, I apologise for the silly question. I want to quantify the introns for two reasons. First, to quantify intron retention (only). Second, since I have total RNA-seq, it is more likely to contain intronic reads than poly-A selected data, right? I checked the intergenic regions and they look fine (good suggestion from you). So I would assume the intronic reads come from pre-mRNAs that are not yet fully spliced, and they might carry some information related to, for example, gene length or biotype.

As for intergenic reads... how would you like to count them? featureCounts reports a total number of reads that didn't overlap any annotations.

We can compute the intergenic regions, right? For example with bedtools complement or some other tool.
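For what it's worth, the complement idea can be sketched in a few lines of Python. This is only meant to show the logic, not to replace bedtools; the function name `intergenic_regions` and the toy coordinates are hypothetical, and intervals are assumed to be 0-based, half-open as in BED.

```python
def intergenic_regions(genes, chrom_sizes):
    """Return the intervals of each chromosome not covered by any gene.

    genes: iterable of (chrom, start, end), 0-based half-open (BED-style).
    chrom_sizes: dict mapping chromosome name to its length.
    """
    by_chrom = {chrom: [] for chrom in chrom_sizes}
    for chrom, start, end in genes:
        if chrom in by_chrom:
            by_chrom[chrom].append((start, end))

    gaps = []
    for chrom in sorted(by_chrom):
        cursor = 0  # end of the region covered so far
        for start, end in sorted(by_chrom[chrom]):
            if start > cursor:
                gaps.append((chrom, cursor, start))
            cursor = max(cursor, end)  # handles overlapping genes
        if cursor < chrom_sizes[chrom]:
            gaps.append((chrom, cursor, chrom_sizes[chrom]))
    return gaps


# Toy input: two overlapping genes and one separate gene on a 1 kb contig.
genes = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 500, 600)]
gaps = intergenic_regions(genes, {"chr1": 1000})
print(gaps)
```

The intervals that come out could be written to a BED or SAF-style file and used as features for counting, in the same way as the annotated genes.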

ADD REPLY • link 2.7 years ago by PK ▴ 130

0

Yep, as I said, there is no standard way to do it, but you can use some GTF-fu to create them yourself.