10 months ago
DKA
Hi everyone,
My RNA samples were contaminated with DNA, and RNA-seq was run on them anyway. After sequencing, I used SeqMonk to estimate the contamination levels, which varied widely between samples, from 1.7% to 10.5%.
My initial step was to try SeqMonk's built-in correction for DNA contamination. However, this approach unfortunately resulted in no differentially expressed genes (DEGs) being identified when analyzed with DESeq2.
Given these challenges, I have two questions:
- Can DESeq2 still be used if I have the estimated DNA contamination values from SeqMonk? Could incorporating these values manually into DESeq2 potentially improve the analysis and help identify DEGs despite the contamination? If so, what specific approaches or methods can I employ?
- Are there alternative methods for removing DNA reads from RNA-seq data aside from the approach used in SeqMonk? I'm interested in exploring other options and their potential effectiveness in my situation.
Thanks a lot.
I'm admittedly not familiar with SeqMonk, but if you use a standard alignment algorithm, the contaminating reads should be filtered out. Reads may well map to the genome during alignment, but during counting only the alignments in genic regions are counted. So I don't really see a need to filter out DNA contamination.
Just do your standard alignment and run DESeq2.
Thank you for your opinion. Wouldn't reads that originate from DNA but map to exonic regions affect the quantification and the identification of DEGs? On top of that, the estimated percentage of DNA contamination varied between the samples.
If they originate from DNA, they shouldn't map to exonic regions. Aligners can take the "best" mapping and use that. If a read originates from DNA, the best mapping should not be an exon.
Thanks. Can you bear with me, please? Why wouldn't reads originating from DNA map to exonic regions? How would the aligner tell that these reads originate from DNA? Surely the contaminating DNA produces reads in exonic regions as well as in intronic and intergenic regions.
OK, let's put it this way: If a read aligns perfectly well to both an exonic region and an intergenic region (which will be very few reads aligning equally well to both), you can't tell whether that particular read is a truly exonic read or if it's a contaminating read anyway no matter what you do or how you decontaminate. So what's the big deal?
Just align -> DESeq2. If your results look funky, then come back here. Stop obsessing over 10% of your reads coming from intergenic regions (most of which will probably not be counted). In the time that you took obsessing over it, you could have already done a DESeq2 analysis.
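To put a rough number on how many contaminating reads could end up counted, here is a back-of-the-envelope sketch. It assumes gDNA reads are spread uniformly over the genome, so the share landing in exons equals the share of the genome that is exonic. The 3% exonic fraction is an assumption for illustration only, not a value from this thread; substitute the figure for your own organism and annotation.

```python
def expected_exonic_gdna_fraction(contamination, exonic_genome_fraction):
    """Fraction of all reads expected to be gDNA reads that land in exons,
    assuming gDNA reads are distributed uniformly over the genome."""
    return contamination * exonic_genome_fraction


# Contamination range reported in the original post.
contamination_levels = [0.017, 0.105]

# ASSUMPTION: exons cover about 3% of the genome (illustrative value only).
EXONIC_GENOME_FRACTION = 0.03

for c in contamination_levels:
    frac = expected_exonic_gdna_fraction(c, EXONIC_GENOME_FRACTION)
    print(f"contamination {c:.1%}: about {frac:.3%} of all reads "
          f"would be gDNA reads falling in exons")
```

Under these assumptions the aggregate share is small, but it is not spread evenly in relative terms: a lowly expressed gene with long exons gets a larger proportion of its counts from gDNA than a highly expressed one.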
Thank you. Things are clearer now. I have already done the DESeq2 analysis, and the pathway analysis gave interesting results. I am worried about the reliability of these DEGs because the estimated DNA contamination varied markedly between the samples. Specifically, I am concerned that some genes were called as DEGs because of DNA reads falling in their exonic regions.
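One way to check this concern is to ask whether a gene's expression simply tracks the per-sample contamination estimate. The sketch below computes a Pearson correlation between log-scaled normalized counts and the contamination fraction, and flags genes above a cutoff. The gene names, counts, and the 0.9 threshold are all hypothetical; in practice the inputs would be DESeq2-normalized counts for your DEGs and the SeqMonk estimates.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def genes_tracking_contamination(norm_counts, contamination, threshold=0.9):
    """Return {gene: r} for genes whose log2(count + 1) correlates with
    per-sample contamination at |r| >= threshold."""
    flagged = {}
    for gene, values in norm_counts.items():
        log_values = [math.log2(v + 1) for v in values]
        r = pearson(log_values, contamination)
        if abs(r) >= threshold:
            flagged[gene] = r
    return flagged


# HYPOTHETICAL data: four samples with their estimated contamination fractions.
contamination = [0.017, 0.030, 0.062, 0.105]
norm_counts = {
    "geneA": [10, 18, 35, 60],       # rises with contamination
    "geneB": [500, 480, 510, 495],   # roughly flat
}

flagged = genes_tracking_contamination(norm_counts, contamination)
print(flagged)
```

A flagged gene is only a candidate for a closer look. If contamination happens to differ between your conditions, true DEGs will correlate with it too, so this check cannot separate the two on its own.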
A few things:
A single RNA-seq experiment is unlikely to be sufficient to tell you any new biology. E.g. if you're studying cancer, does the result hold across different cancer types? E.g. if you've discovered something novel, does the result hold true when you try other assays (in situs, perturbations, etc.)?
There should be some biological variability between samples; and if you're getting significant adjusted p-values, great!
You said the built-in correction for DNA contamination didn't give you any significant DEGs. You should probably look at why this is; for example, was the correction throwing away 80% of your reads? It might be good to look at the gene-level differences between the two methods. More generally speaking, if two methods give you different results, you should probably investigate why that is.
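A minimal sketch of that comparison, assuming you can export per-gene counts for one sample both with and without the correction. The function name and the toy counts are hypothetical; it reports the overall fraction of reads retained and ranks genes by how much of their count the correction removed.

```python
def compare_count_tables(raw, corrected):
    """Compare per-gene counts before and after a correction.

    Returns (fraction of total counts retained,
             list of (gene, fraction of counts removed), largest loss first).
    """
    total_raw = sum(raw.values())
    total_corrected = sum(corrected.get(g, 0) for g in raw)
    retained = total_corrected / total_raw if total_raw else 0.0

    removed = {}
    for gene, r in raw.items():
        c = corrected.get(gene, 0)
        removed[gene] = (r - c) / r if r > 0 else 0.0

    ranked = sorted(removed.items(), key=lambda kv: kv[1], reverse=True)
    return retained, ranked


# HYPOTHETICAL counts for a single sample.
raw = {"geneA": 1000, "geneB": 40, "geneC": 200}
corrected = {"geneA": 950, "geneB": 5, "geneC": 180}

retained, ranked = compare_count_tables(raw, corrected)
print(f"fraction of counts retained: {retained:.3f}")
for gene, frac in ranked:
    print(f"{gene}: {frac:.1%} of counts removed")
```

If the retained fraction is very low, or the losses are concentrated in low-count genes, that would go some way toward explaining why the corrected data yielded no DEGs.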
We just developed an R package to tackle this problem. Do you have stranded or unstranded RNA-seq data contaminated with gDNA? If you would like to try our tool, you can reply to me.
Please post information about your tool as a tools category post. Don't wait/ask other users to ask you.
I have unstranded RNA-seq data. Please share the tool with me, if you do not mind.
Any updates on the R package?
Based on a search it may be this package: https://bioconductor.org/packages/devel/bioc/vignettes/gDNAx/inst/doc/gDNAx.html