Hi everyone, I ran RNA-seq (75 bp paired-end; 50 million reads per sample) on some samples to assess differential expression. The initial FastQC report flagged sequence duplication. Despite that, I did the alignment and imported the resulting BAM files into SeqMonk for further analysis. The RNA-Seq QC report in SeqMonk indicated rRNA contamination, so I went back and removed the rRNA reads using SortMeRNA. In some cases the rRNA read files are huge. However, when I run FastQC on the files without rRNA reads, I see no change in the number of reads or in the sequence duplication levels compared to the original raw files. I don't understand how this is possible. Suggestions and solutions, please. Thanks.
Sumit
I don't think rRNA contamination is a problem (unless it's a lot), but that might also depend on what you want to do with the data.
In my samples it ranges from 5%-25%. I think it is a significant amount.
How much rRNA contamination? 1%, 5%, 10%, 50%? In general, there is always some small percentage of rRNA left. You only need to worry if it is so much that it reduces the number of non-rRNA reads below the minimum recommended for the analysis you want to perform. For example, rules of thumb for differential gene expression and differential transcript expression are 10 million and 20 million reads, respectively (I am not 100% sure about the 20 million figure).
The other point is that FastQC will almost always report duplication in RNA-seq samples, because some genes and transcripts are very highly expressed.
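To make the rule of thumb concrete, here is a rough arithmetic sketch. It assumes the 50 million figure from the original post is the per-sample read count that the thresholds refer to, and it uses the 10 million and 20 million thresholds quoted above (the second one is uncertain, as noted). The function and constant names are made up for illustration.

```python
# Rule-of-thumb minimums quoted in the reply above (the 20M figure is uncertain).
MIN_READS_GENE_LEVEL = 10_000_000
MIN_READS_TRANSCRIPT_LEVEL = 20_000_000


def non_rrna_reads(total_reads, rrna_fraction):
    """Estimate how many reads remain after discarding the rRNA fraction."""
    if not 0.0 <= rrna_fraction <= 1.0:
        raise ValueError("rrna_fraction must be between 0 and 1")
    return round(total_reads * (1.0 - rrna_fraction))


if __name__ == "__main__":
    total = 50_000_000  # per-sample depth stated in the original post
    for fraction in (0.05, 0.25):  # contamination range reported in the thread
        remaining = non_rrna_reads(total, fraction)
        print(
            f"{fraction:.0%} rRNA: {remaining:,} reads left;",
            "gene-level OK" if remaining >= MIN_READS_GENE_LEVEL else "gene-level LOW",
            "/",
            "transcript-level OK"
            if remaining >= MIN_READS_TRANSCRIPT_LEVEL
            else "transcript-level LOW",
        )
```

Under these assumptions, even 25% contamination would leave the sample above both thresholds.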
rRNA contamination in my samples ranges from 5% to 25%. With this level of contamination, I should see a difference in the number of reads before and after processing the files with SortMeRNA. My concern is that I do not see any difference, even though the SortMeRNA output files contain a significant number of reads that match rRNA.
You will have to explain yourself better. What are the command lines you are using? If a sample has 25% of rRNA contamination and you use SortMeRNA on it, you should have two outputs, one with 75% of the reads, free of rRNA, and another with 25% of your reads, all rRNA.
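One way to check that expectation independently of FastQC is to count the records in the input and in both outputs. This is a minimal sketch, assuming plain or gzipped FASTQ files with exactly four lines per record; the function names are hypothetical and the paths have to be supplied by you.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming 4 lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def check_split(total, rrna, other):
    """Summarise whether the rRNA and non-rRNA outputs add up to the input."""
    return {
        "total": total,
        "rrna": rrna,
        "other": other,
        "rrna_fraction": rrna / total if total else 0.0,
        "consistent": rrna + other == total,
    }


if __name__ == "__main__":
    # Illustrative numbers only: a 25% contaminated sample of 1000 reads.
    print(check_split(total=1000, rrna=250, other=750))
```

If the non-rRNA output has the same count as the input, the file given to FastQC is probably not the filtered one, or the split did not happen.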
Here is the image of the FastQC Basic Statistics prior to removal of rRNA reads:
SeqMonk RNASeq QC report suggesting rRNA contamination:
I used the following command to remove rRNA reads:
Following is the result section from the log file generated from the above command:
Results:
By database:
home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-bac-16s-id90.fasta 0.18%
/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-bac-23s-id98.fasta 0.18%
/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-arc-16s-id95.fasta 0.03%
/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-arc-23s-id98.fasta 0.01%
/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-euk-18s-id95.fasta 13.60%
/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/silva-euk-28s-id98.fasta 23.96%
/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/rfam-5s-database-id98.fasta 0.01%
/home/svsgrc/Sumit/RNASeq/sortmerna-2.1b/rRNA_databases/rfam-5.8s-database-id98.fasta 0.06%
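For reference, the per-database percentages above can be added up to get the overall rRNA fraction for this sample. This is a small sketch, assuming each line ends in a percentage and that a read is counted against only one database; the database paths are shortened to their file names for readability, and the helper name is made up.

```python
# Per-database percentages copied from the log above (paths shortened to file names).
BY_DATABASE = """\
silva-bac-16s-id90.fasta 0.18%
silva-bac-23s-id98.fasta 0.18%
silva-arc-16s-id95.fasta 0.03%
silva-arc-23s-id98.fasta 0.01%
silva-euk-18s-id95.fasta 13.60%
silva-euk-28s-id98.fasta 23.96%
rfam-5s-database-id98.fasta 0.01%
rfam-5.8s-database-id98.fasta 0.06%
"""


def total_rrna_percent(log_text):
    """Sum the trailing percentage on every '<database> <value>%' line."""
    total = 0.0
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].endswith("%"):
            total += float(parts[1].rstrip("%"))
    return round(total, 2)


if __name__ == "__main__":
    print(f"Total rRNA: {total_rrna_percent(BY_DATABASE):.2f}%")
```

Almost all of the total comes from the eukaryotic 18S and 28S databases.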
Following is the FastQC report on the file generated using SortMeRNA for the same sample:
As you can see, a significant number of reads match eukaryotic 18S and 28S rRNA. Still, there is no change in the number of reads in the second FastQC report. I hope the problem is clear now.
The problem was clear from the start; how you reached your conclusions was the blurry part.
It seems SortMeRNA is failing to split the files. Did you examine L-C-M6_rRNA-R1.fastq? Is it empty? Which version of SortMeRNA are you using? What is the result if you run it with --paired_out?
I have never used SeqMonk, but it seems to be under-reporting contamination: the graphics show less than 25% for (what I presume are) all samples, but SortMeRNA reported 38% for the sample whose output you provided.
I don't know what is going on, but let me suggest two tools to help you. MultiQC aggregates reports from many tools, including FastQC, so it is easy to compare before and after results. BBDuk is really fast at quantifying and removing contamination, being almost as sensitive as SortMeRNA but taking a fraction of the time.