Question

Estimation of RNA-seq protocol from bam files

0

4.8 years ago

JJ ▴ 710

Dear all,

As I am working with public data, I would like to verify the stated metadata and estimate whether the RNA-seq data comes from a total RNA protocol or one with a polyA enrichment step. How would you recommend estimating this? Is there a tool available for it? I was thinking about using the proportion of exonic and intronic reads (Qualimap output) as a measure, but that probably varies quite a bit between datasets. Any other suggestions? Thanks for your input.

Best,

RNA-Seq • 1.0k views

ADD COMMENT • link updated 4.8 years ago by yhoogstrate ▴ 150 • written 4.8 years ago by JJ ▴ 710

1

The approach sounds reasonable. I would, though, try to get some published data generated with one or the other method as "positive controls", and then see whether the comparison gives you enough confidence to call your sample polyA-enriched or rRNA-depleted.
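
To make the positive-control idea concrete, here is a minimal sketch. It assumes you have already computed an intronic read fraction for your sample and for at least two control datasets per protocol. The function name and the numbers in the usage line are placeholders for illustration, not real measurements, and a z-score against a handful of controls is only a rough guide.

```python
from statistics import mean, stdev


def compare_to_controls(sample_fraction, polya_controls, total_controls):
    """Compare a sample's intronic fraction with two control groups.

    Returns the label of the closer group and the z-score of the sample
    against each group. Needs at least two controls per group.
    """
    scores = {}
    groups = (("polyA", polya_controls), ("total", total_controls))
    for label, values in groups:
        mu = mean(values)
        sd = stdev(values)
        if sd == 0:
            raise ValueError(f"control group {label!r} has zero variance")
        scores[label] = (sample_fraction - mu) / sd
    # The group with the smallest absolute z-score is the closer one.
    closest = min(scores, key=lambda label: abs(scores[label]))
    return closest, scores


# Placeholder fractions, purely illustrative.
closest, scores = compare_to_controls(0.33, [0.05, 0.07, 0.06], [0.30, 0.35, 0.40])
print(closest, scores)
```

If the sample sits far from both groups, I would treat that as "cannot call" rather than forcing a label.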

ADD REPLY • link 4.8 years ago by ATpoint 85k

1

If the data is public, you could try to look the information up in the associated publication, or write to the submitter and ask.

ADD REPLY • link 4.8 years ago by GenoMax 147k

0

Thanks for your input. I extracted the information from the associated publications, but it is not always well described. I will try to contact the submitters; however, in general I am looking for a way to confirm the extracted information.

ADD REPLY • link 4.8 years ago by JJ ▴ 710

score 1 · Answer 1 · 2020-02-01

If you run an alignment that reports discordant reads and browse to CDR1as (the circRNA at the CDR1 locus), you will find quite a number of back-splice junctions in ribo-minus/random-primed data but not in polyA+ data (circRNAs have no polyA tail, so polyA selection removes them). I must admit that I only know this works in human data, and I am not sure whether that is what you are aiming for.
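
As a rough sketch of how the back-splice check could be scripted: a junction counts as a back-splice when the donor site lies downstream of the acceptor site in transcript orientation, on the same chromosome. The tuple layout `(donor_chrom, donor_pos, acceptor_chrom, acceptor_pos, strand)` and the region in the usage lines are hypothetical; adapt them to whatever your aligner reports and to the real coordinates of the locus in your genome build.

```python
def is_back_splice(donor_chrom, donor_pos, acceptor_chrom, acceptor_pos, strand):
    """True if the junction joins a downstream donor to an upstream acceptor."""
    if donor_chrom != acceptor_chrom:
        return False  # inter-chromosomal chimeras are not back-splices
    if strand == "+":
        return donor_pos > acceptor_pos
    if strand == "-":
        return donor_pos < acceptor_pos
    raise ValueError(f"unexpected strand: {strand!r}")


def count_back_splices(junctions, region):
    """Count back-splice junctions with both ends inside region (chrom, start, end)."""
    chrom, start, end = region
    count = 0
    for d_chrom, d_pos, a_chrom, a_pos, strand in junctions:
        inside = (
            d_chrom == chrom
            and start <= d_pos <= end
            and start <= a_pos <= end
        )
        if inside and is_back_splice(d_chrom, d_pos, a_chrom, a_pos, strand):
            count += 1
    return count


# Hypothetical junctions and region, for illustration only.
junctions = [
    ("chrX", 2000, "chrX", 1000, "+"),  # donor downstream of acceptor: back-splice
    ("chrX", 1000, "chrX", 2000, "+"),  # ordinary linear splice
    ("chrX", 1000, "chr1", 500, "+"),   # inter-chromosomal, ignored
]
print(count_back_splices(junctions, ("chrX", 500, 2500)))
```

Comparing this count (ideally normalised to library size) between known ribo-minus and polyA+ datasets would show whether it separates the two in your data.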

Intronic content can be used as well, though the intronic/exonic ratio typically differs per gene. If I remember correctly (I am not at my workstation right now), there is a paper from 2014(?) in which these overall differences between genes are visualised rather well and which provides statistics on intron/exon/intergenic mapping percentages.
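
For the genome-wide version of this measure, a minimal sketch, assuming you already have exonic, intronic and intergenic read counts from a QC tool. The function name and the counts in the usage line are placeholders, and no cutoff is implied; where to draw the line between protocols would have to come from your own control data.

```python
def mapping_fractions(exonic, intronic, intergenic):
    """Return exonic/intronic/intergenic fractions and the intron/exon ratio."""
    total = exonic + intronic + intergenic
    if total == 0:
        raise ValueError("no reads counted")
    return {
        "exonic": exonic / total,
        "intronic": intronic / total,
        "intergenic": intergenic / total,
        # Undefined when there are no exonic reads at all.
        "intron_exon_ratio": intronic / exonic if exonic else float("inf"),
    }


# Placeholder counts, purely illustrative.
print(mapping_fractions(800, 150, 50))
```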

I once had samples with DNA contamination in the RNA-seq (yes...), and that gives a bit of background signal in the introns, which may falsely mark a dataset as 'total RNA' in the intron/exon ratio test.
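
One possible way to account for this, sketched under a strong assumption: genomic DNA contributes roughly uniform coverage, so the read density in intergenic regions can serve as an estimate of the background and be subtracted from the intronic density. The function name and the numbers are hypothetical, and unannotated transcription in "intergenic" regions would inflate the background estimate.

```python
def corrected_intronic_density(intronic_reads, intronic_bp, intergenic_reads, intergenic_bp):
    """Return (background-corrected intronic density, background), in reads per kb.

    Assumes intergenic read density approximates uniform gDNA background.
    """
    if intronic_bp <= 0 or intergenic_bp <= 0:
        raise ValueError("region lengths must be positive")
    intronic_density = intronic_reads * 1000 / intronic_bp
    background = intergenic_reads * 1000 / intergenic_bp
    # Clamp at zero: a negative density has no meaning.
    return max(intronic_density - background, 0.0), background


# Placeholder numbers, purely illustrative.
print(corrected_intronic_density(5000, 1_000_000, 2000, 2_000_000))
```

A background that is high relative to the intronic density would be a hint to distrust the intron/exon ratio for that dataset.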

I am curious what other tricks people will suggest :)