I am looking for a bioinformatics tool that can detect RNA contamination in DNA data (specifically exome data). Although RNA contamination is rare, it's something we'd like to check for in our samples nonetheless.
Edit: some ideas:
- Check for rRNA contamination, e.g. with Picard's CollectRnaSeqMetrics, perhaps after mapping with a splice-aware aligner and first filtering for reads that span a junction.
- Picard's CollectGcBiasMetrics tool. While this won't give me a definitive contamination percentage, a contaminated sample may show differences against other samples that do not have significant RNA contamination.
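One way to use the GC-bias idea without a definitive percentage is to compare a per-sample summary metric across the cohort and flag outliers. The sketch below uses a robust (median/MAD) outlier test; the sample names and metric values are made up for illustration, and which Picard column you use (e.g. a GC-dropout value from the CollectGcBiasMetrics summary) is an assumption.

```python
# Hypothetical sketch: flag samples whose GC-bias summary metric deviates
# from the rest of the cohort. All sample names and values are invented.
from statistics import median

# e.g. a GC-dropout-style column pulled from per-sample Picard summaries
gc_dropout = {
    "sampleA": 1.2, "sampleB": 1.4, "sampleC": 1.1,
    "sampleD": 1.3, "sampleE": 6.8,  # suspicious sample
}

vals = list(gc_dropout.values())
med = median(vals)
mad = median(abs(v - med) for v in vals)  # median absolute deviation

def flag_outliers(metrics, cutoff=3.5):
    """Return samples whose modified z-score (0.6745 * |v - med| / MAD)
    exceeds the cutoff; robust to the outlier inflating the spread."""
    if mad == 0:
        return []
    return [s for s, v in metrics.items()
            if 0.6745 * abs(v - med) / mad > cutoff]

print(flag_outliers(gc_dropout))  # only the deviant sample is flagged
```

A median/MAD test is used instead of a plain z-score because with a handful of samples, a single contaminated sample inflates the standard deviation enough to hide itself.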
How would you detect that, since all sequencing happens on DNA (with the exception of direct RNA sequencing, which is possible with Nanopore)? RNA can't be sequenced directly with Illumina, and if it is converted to cDNA it would no longer be recognizable as RNA.
FAQ #8 should be useful.
I guess you could look for sequences that span a splice junction. Do lab people put RNase into DNA preps?
Not sure how my previous reply didn't show up, so I may be double-posting. I would think you would see a lot of breakpoints coinciding with known splice sites, and maybe there's a statistical test for whether breakpoints show a preference for splice sites. Splice-like events in DNA data could come from retrotransposons, but I don't think retrotransposons alone would produce that many splice-presenting features all over the exome/genome. My purpose here is not to identify each feature that looks like it might be a splice event, but to get an overall estimate of the likelihood of RNA contamination as a QC metric. I see what you mean about wet-lab protocols minimizing RNA contamination, but you also want to know that the RNase worked at all.
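The statistical test alluded to here could be as simple as a one-sided binomial test: given n observed breakpoints and a background probability p of a random breakpoint landing in a splice-site window, is the observed count at splice sites surprisingly high? A stdlib-only sketch, with all counts and the background rate invented for illustration:

```python
# Illustrative one-sided binomial test for breakpoint enrichment at
# known splice sites. The counts and background rate below are made up.
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_breakpoints = 500       # total read breakpoints observed in the sample
at_splice_sites = 60      # breakpoints falling in splice-site windows
p_background = 0.05       # assumed chance of hitting such a window at random

pval = binom_sf(at_splice_sites, n_breakpoints, p_background)
print(f"P(>= {at_splice_sites} of {n_breakpoints} at splice sites) = {pval:.3g}")
```

With these toy numbers the expected count under the null is 25, so 60 hits gives a very small p-value; in practice the background rate would need to be estimated from the fraction of the genome covered by splice-site windows.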
This would be more simply handled on the experimental bench side of things rather than informatically. If you don't want any RNA contamination, then asking the provider to do an RNase step (which some may already do) should take care of this issue.
Wouldn't you want to know whether the RNase treatment even worked, even if your provider says they performed it?
Sure. I am outside my zone of expertise here, but there are methods on the experimental side to check whether that treatment worked.
See: https://www.thermofisher.com/order/catalog/product/Q32852#/Q32852
I never really performed such an analysis; I am just thinking out loud:
Map the reads with any RNA-seq aligner (STAR, HISAT2, etc.) and quantify the number of spliced reads over well-annotated, long introns.
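The counting step above can be sketched without pysam by reading SAM records directly: a read whose CIGAR string contains an N (skipped region) operation spans a splice junction. The records below are fabricated toy data; in practice the input would be the aligner's output restricted to long, well-annotated introns.

```python
# Minimal sketch: fraction of mapped reads whose alignment spans a splice
# junction, detected via the 'N' CIGAR operation. Toy SAM records only.
import re

def is_spliced(cigar):
    """True if the CIGAR string contains an N (skipped-region) operation."""
    return bool(re.search(r"\d+N", cigar))

def spliced_fraction(sam_lines):
    """Fraction of mapped reads that span a junction."""
    total = spliced = 0
    for line in sam_lines:
        if line.startswith("@"):            # header line
            continue
        cigar = line.rstrip("\n").split("\t")[5]
        if cigar == "*":                    # unmapped read
            continue
        total += 1
        if is_spliced(cigar):
            spliced += 1
    return spliced / total if total else 0.0

# Fabricated SAM records (only the CIGAR column matters here)
sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t100M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t200\t60\t50M2000N50M\t*\t0\t0\t*\t*",  # junction-spanning
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",                   # unmapped
    "r4\t0\tchr1\t300\t60\t100M\t*\t0\t0\t*\t*",
]
print(spliced_fraction(sam))  # 1 of 3 mapped reads
```

In a genuine DNA library the junction-spanning fraction should be near zero, so even a modest value would suggest RNA-derived reads.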
Another feature to look at would be how homogeneous the exome coverage is; one would expect more variable coverage if there is RNA contamination.
For both of these metrics, it would help to have known-good exome samples to establish a baseline.
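The coverage-homogeneity idea could be summarized as a single number per sample, e.g. the coefficient of variation of mean per-exon depth, compared against the baseline samples. The depth values below are invented to show the shape of the comparison:

```python
# Sketch: coefficient of variation (CV) of per-exon coverage as a
# homogeneity metric. Higher CV = more uneven, RNA-like coverage.
# The depth values below are fabricated for illustration.
from statistics import mean, stdev

def coverage_cv(per_exon_depth):
    """stdev / mean of per-exon mean depths."""
    m = mean(per_exon_depth)
    return stdev(per_exon_depth) / m if m else float("inf")

clean_exome = [100, 95, 110, 105, 98, 102]   # baseline-like, even coverage
suspect     = [100, 20, 300, 40, 15, 250]    # spiky, expression-like profile

print(f"clean CV:   {coverage_cv(clean_exome):.3f}")
print(f"suspect CV: {coverage_cv(suspect):.3f}")
```

The baseline samples mentioned above would define the normal range of this CV, so a contaminated sample shows up as the one whose value sits well outside it.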
I also wonder how to detect RNA contamination in DNA-seq (WGS). When I checked the deletions in my WGS file through the IGV software, I found some RNA-splicing-like features (peaks in exon regions) in my data. I guess some RNA may have contaminated my DNA, but how can I compute the percentage of RNA contamination? Does anyone have good ideas?