Hi everybody,
I am new to RNA-seq. I am working on an RNA-seq data set from the mouse genome. I have run some quality checks using FastQC and the FASTX toolkit, and found that the data set is full of rRNA reads.
I would like to filter these reads out, but I am not sure how to go about it. I have also read this post, but it didn't help much.
I have downloaded the rRNA.gtf table from UCSC using this method. I have also downloaded the ncRNA table from Ensembl.
As I did the mapping against the Ensembl GRCm38 version, I guess it is better to work with the Ensembl file. But how do I do the filtering?
Is there a way to filter the rRNA reads before the mapping?
I know I can convert the GTF files into a GenomicRanges object. But will that help me remove the reads mapped to these coordinates after the mapping? If so, how?
The last option, as I read in the above-mentioned post, is to take the list of rRNA Ensembl IDs and simply filter the results from featureCounts or htseq-count to remove those rRNAs.
Is this option better than the other two?
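For the third option, the list of rRNA gene IDs has to come from somewhere. A minimal sketch of pulling it out of an annotation file, assuming an Ensembl-style GTF whose attribute column carries `gene_id` and `gene_biotype` (older releases may encode the biotype differently); the function name and the example IDs are made up for illustration:

```python
import re

# Assumption: Ensembl-style GTF, 9 tab-separated columns, attributes such as
# gene_id "..."; gene_biotype "..."; in the last column.
GENE_ID = re.compile(r'gene_id "([^"]+)"')
BIOTYPE = re.compile(r'gene_biotype "([^"]+)"')

def rrna_gene_ids(gtf_lines, biotypes=("rRNA", "Mt_rRNA")):
    """Return the set of gene IDs whose gene_biotype is listed in `biotypes`."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):          # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue                      # only look at gene-level records
        gene = GENE_ID.search(fields[8])
        biotype = BIOTYPE.search(fields[8])
        if gene and biotype and biotype.group(1) in biotypes:
            ids.add(gene.group(1))
    return ids

# Tiny in-memory example with hypothetical gene IDs
example = [
    '#!genome-build GRCm38\n',
    '1\tensembl\tgene\t100\t200\t.\t+\t.\tgene_id "GENE_A"; gene_biotype "rRNA";\n',
    '1\tensembl\tgene\t300\t900\t.\t-\t.\tgene_id "GENE_B"; gene_biotype "protein_coding";\n',
    'MT\tensembl\tgene\t70\t1024\t.\t+\t.\tgene_id "GENE_C"; gene_biotype "Mt_rRNA";\n',
]
print(sorted(rrna_gene_ids(example)))
```

With a real file this would be called as `rrna_gene_ids(open(path))`, where `path` points at the downloaded GTF.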
Thanks in advance for any information
Tomas
Yeah. Filtering some reads in advance of sophisticated normalization could skew the experiment. You want to get rid of rRNA before sequencing, then hand all read data to the analysis software unmolested.
Yes, I know that. The sequencing facility did run rRNA depletion twice, but there is still a lot of residual rRNA in the FASTQ files I have. So now I need to figure out what to do with it.
Thanks Devon, I will try it without removing them, and then just filter them out of the list of read counts.
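Filtering at the count level, as described here, could look roughly like the sketch below. It assumes a tab-separated count table with the gene ID in the first column (the layout featureCounts and htseq-count both write); `drop_genes` and the IDs in the example are hypothetical names, not part of either tool.

```python
def drop_genes(count_lines, ids_to_drop):
    """Yield the lines of a count table, skipping rows whose first column
    (the gene ID) is in `ids_to_drop`. Comment lines are passed through."""
    for line in count_lines:
        if line.startswith("#"):
            yield line
            continue
        gene_id = line.split("\t", 1)[0].rstrip("\n")
        if gene_id not in ids_to_drop:
            yield line

# Tiny in-memory example with hypothetical gene IDs and counts
table = [
    "# comment line written by the counting tool\n",
    "Geneid\tsample1\tsample2\n",
    "GENE_A\t90000\t120000\n",   # pretend this one is an rRNA gene
    "GENE_B\t150\t175\n",
]
filtered = list(drop_genes(table, {"GENE_A"}))
print("".join(filtered), end="")
```

The mapping itself is left untouched; only the rows for the listed genes disappear from the table handed to the downstream analysis.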
I think in ribosome profiling it might be useful to remove the rRNA before mapping. In Ribo-seq data, rRNA might still make up about 20-30% of the reads. Mapping these together with the rest of the non-rRNA reads might skew the downstream quantification.
My zebrafish data also has ribosomal RNA contamination, and my boss insists that I use Cufflinks. As the Cufflinks manual says: "Tells Cufflinks to ignore all reads that could have come from transcripts in this GTF file. We recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates." So I think -M is the right way? Then I tried to get complete rRNA/tRNA GTF files. At first I thought rmsk from the UCSC Table Browser would be enough, but I found this post in the UCSC Google Groups: "the rmsk table contains coordinates for various repeat families. Some of these repeat families are derived from specific RNA families, such as rRNA or tRNA. These differ from the actual rRNA and tRNAs, the coordinates for which are found in different tracks." So I am confused: how should I get the right rRNA/tRNA GTF files to use in Cufflinks?
For cufflinks you'll want the repeatmasker track from UCSC (just extract the rRNA and tRNA sites).
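Extracting the rRNA and tRNA sites from a RepeatMasker dump could be sketched as below. This assumes the "all fields" layout of the UCSC rmsk table (genoName, genoStart, genoEnd, strand, repName and repClass in columns 6, 7, 8, 10, 11 and 12, with 0-based starts); the function name and the way the GTF attributes are built are my own choices, and whether the result is exactly what Cufflinks expects for -M has not been checked here.

```python
def rmsk_to_gtf(rmsk_lines, classes=("rRNA", "tRNA")):
    """Yield GTF lines for rmsk rows whose repClass is in `classes`."""
    for line in rmsk_lines:
        if line.startswith("#"):              # header line of the table dump
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 13:
            continue                          # not a full rmsk row
        chrom, start, end, strand = f[5], int(f[6]), int(f[7]), f[9]
        rep_name, rep_class = f[10], f[11]
        if rep_class not in classes:
            continue
        # Assumed attribute scheme: one "transcript" per repeat instance
        attrs = (f'gene_id "{rep_name}"; '
                 f'transcript_id "{rep_name}_{chrom}_{start}";')
        # GTF coordinates are 1-based, so the 0-based start is shifted by one
        yield "\t".join([chrom, "rmsk", "exon", str(start + 1), str(end),
                         ".", strand, ".", attrs])

# Made-up rows in the assumed column order
rows = [
    "#bin\tswScore\tmilliDiv\tmilliDel\tmilliIns\tgenoName\tgenoStart\tgenoEnd\t"
    "genoLeft\tstrand\trepName\trepClass\trepFamily\trepStart\trepEnd\trepLeft\tid\n",
    "585\t1000\t0\t0\t0\tchr1\t99\t200\t-1000\t+\tLSU-rRNA_Hsa\trRNA\trRNA\t1\t100\t0\t1\n",
    "585\t900\t0\t0\t0\tchr1\t500\t800\t-400\t-\tL1Md_A\tLINE\tL1\t1\t300\t0\t2\n",
]
gtf = list(rmsk_to_gtf(rows))
print("\n".join(gtf))
```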
Dear Devon, Karl et al.
I came across an issue with rRNA contamination in my RNA-seq data (poly-A sequencing), and I found several posts about it. The answers I found on what to do next were contradictory or not very straightforward. For example, here you recommend not removing ribosomal RNAs from the mapping, as this would introduce a bias in the analysis, whereas I am under the impression that others recommend removing them from the analysis (please see here). I tried removing rRNA from my reads using SortMeRNA (using all the eukaryotic and prokaryotic databases in this program), but the GC distribution per sample did not change. In one sample, however, instead of 3 overrepresented sequences mapping to a ribosomal protein (rpS29, each making up 10-15% of the reads), I now get only one sequence mapping to the same protein, at a similar level of representation (~10-15% of the reads).
Is there an objective way to assess if this rRNA contamination will be a problem for my differential expression analysis?
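One rough diagnostic, offered as a sketch and not as a definitive answer, is to look at what fraction of each sample's counts falls on rRNA genes and whether that fraction differs much between samples or conditions. The function name, the nested-dict input layout and the IDs below are hypothetical, and the number only reflects reads that were actually assigned to annotated rRNA genes.

```python
def rrna_fraction(counts, rrna_ids):
    """counts: {sample: {gene_id: count}}.
    Return {sample: fraction of that sample's counts assigned to rRNA genes}."""
    fractions = {}
    for sample, genes in counts.items():
        total = sum(genes.values())
        rrna = sum(c for g, c in genes.items() if g in rrna_ids)
        fractions[sample] = rrna / total if total else 0.0
    return fractions

# Made-up counts for two samples
counts = {
    "sample1": {"GENE_A": 10, "GENE_B": 90},
    "sample2": {"GENE_A": 50, "GENE_B": 50},
}
fractions = rrna_fraction(counts, {"GENE_A"})
print(fractions)
```

A fraction that is similar across all samples is less likely to drive differences between conditions than one that varies strongly from sample to sample, but this alone does not prove the differential expression results are unaffected.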