One of the first things I learnt about how RNA-seq data should be analysed, or at least about the workflow to follow, is that the first step is always to check the quality of the reads/samples with FastQC or similar tools. Then, based on that report, you decide whether you need to remove adapters, trim for quality, etc. After all that, it is time to align to the genome/transcriptome and obtain your quantification for the subsequent downstream analyses (DGE/DTE, enrichment, etc.).
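(Just to make the workflow I mean explicit, here is a minimal sketch of how I picture chaining those steps from Python; the file names, directories, index name and adapter sequence are made-up placeholders, and the exact tools/options obviously depend on the data and on what the QC report shows.)

```python
# Minimal sketch of the QC -> trim -> quantify workflow described above (single-end example).
# File names, directories, index name and adapter sequence are hypothetical placeholders.
import os
import subprocess

sample = "sample_R1.fastq.gz"
trimmed = "sample_R1.trimmed.fastq.gz"
os.makedirs("qc_reports", exist_ok=True)

# 1) Quality report with FastQC
subprocess.run(["fastqc", sample, "-o", "qc_reports"], check=True)

# 2) Adapter/quality trimming with cutadapt (only if the FastQC report calls for it)
subprocess.run(
    ["cutadapt",
     "-a", "AGATCGGAAGAGC",   # TruSeq adapter, just as an example
     "-q", "20",              # quality-trim 3' ends
     "-o", trimmed, sample],
    check=True,
)

# 3) Quantification against a transcriptome index with salmon
subprocess.run(
    ["salmon", "quant",
     "-i", "human_txome_index",
     "-l", "A",
     "-r", trimmed,
     "-o", "quant/sample"],
    check=True,
)
# The resulting quant.sf then feeds the downstream DGE/DTE analysis (e.g. tximport + DESeq2 in R).
```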
In the quality control step, apart from FastQC, I usually use FastQ Screen to check for possible contamination. Working with this type of data, I hadn't encountered contamination from other species until now. My samples (well, samples from public data that I want to use) only have reads mapping to PhiX and E. coli. So I asked myself whether it even makes sense to filter out or remove that contamination… since I will align to the human genome and those reads will not be aligned anyway.
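(For reference, this is the kind of quick check I mean: a small sketch that pulls the per-genome percentages out of a FastQ Screen *_screen.txt report. It assumes the usual tab-delimited layout with columns such as Genome, %One_hit_one_genome and %Multiple_hits_one_genome; the exact column names can vary between FastQ Screen versions, so treat it as a sketch.)

```python
# Sketch: summarise per-genome hit percentages from a FastQ Screen *_screen.txt report.
# Assumes the usual tab-delimited layout; column names may differ between versions.
import csv

def screen_percentages(path):
    with open(path) as handle:
        # Drop the version comment line, blank lines and the trailing "%Hit_no_genomes" line.
        lines = [l for l in handle
                 if l.strip() and not l.startswith(("#", "%Hit_no_genomes"))]
    reader = csv.DictReader(lines, delimiter="\t")
    return {
        row["Genome"]: {
            "one_hit_one_genome": float(row["%One_hit_one_genome"]),
            "multiple_hits_one_genome": float(row["%Multiple_hits_one_genome"]),
        }
        for row in reader
    }

# Usage (hypothetical file name):
# for genome, pct in screen_percentages("SRR000001_screen.txt").items():
#     print(genome, pct)
```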
I have tried to find out what is recommended (or at least what people usually do in these cases), and I found different approaches:
Some people say to align ALL the data to the genome/transcriptome and then extract the reads that align to the genome, leaving the contamination behind. --> Post.
Others say that, before mapping the reads to estimate expression levels, you should remove the contaminant reads by mapping them against contaminant sequences (bacterial or viral genomes, rRNA). To confirm the contaminant reads, if possible, take the reference genome (or the nearest reference species) and map the candidate contaminant reads against it as well. Reads that map to both the contaminant and the nearest species are retained for mapping to my transcriptome, while reads that hit only the contaminant are thrown away. --> Post. (Both approaches are sketched below.)
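To make the two approaches concrete, here is a rough pysam-based sketch of how I understand them (the BAM file names and the read-name-set comparison are my own illustration, not taken from either post): the first function keeps whatever aligned to the target genome and drops the rest, the second flags reads that hit a contaminant reference but not the target/nearest species.

```python
# Sketch of the two filtering strategies described above, using pysam.
# BAM file names are hypothetical; mate/pair handling is ignored for brevity.
import pysam

def keep_genome_aligned(in_bam, out_bam):
    """Approach 1: align everything, then keep only the reads that mapped to the target genome."""
    with pysam.AlignmentFile(in_bam, "rb") as src, \
         pysam.AlignmentFile(out_bam, "wb", template=src) as dst:
        for read in src:
            if not read.is_unmapped:
                dst.write(read)

def contaminant_only_read_names(contaminant_bam, target_bam):
    """Approach 2: reads that hit the contaminant reference but not the target species get discarded."""
    def mapped_names(path):
        with pysam.AlignmentFile(path, "rb") as bam:
            return {r.query_name for r in bam if not r.is_unmapped}
    return mapped_names(contaminant_bam) - mapped_names(target_bam)

# Usage sketch:
# keep_genome_aligned("sample.human.bam", "sample.human.mapped.bam")
# to_discard = contaminant_only_read_names("sample.phix_ecoli.bam", "sample.human.bam")
```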
However, if the reads belong to a species (e.g. bacteria) different from the one I am working with (human), those reads will not be mapped, so… removal would not be necessary(?)
Or, at most, if I remove those reads, the only advantage I will get is a better alignment %? (But of course I will have a lower total number of reads, so of course the % will be better.)
On the other hand, I also found this post, which says that removing contamination is recommended to avoid chimeric assemblies and to reduce assembly time, memory requirements, and fragmentation. But I didn't find anything about differential expression analysis and whether it makes a difference there or not.
What do you think about this?
Any comment or literature references that you can give me will be much appreciated.
Thanks in advance
What % of the reads belong to that category?
Only if you are looking at alignment % as a fraction of the total number of reads. The reads aligning to human should stay more or less constant irrespective of the contamination (if it is truly bacterial).
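A quick toy calculation (the numbers are made up) of what I mean:

```python
# Toy numbers (hypothetical) showing why the human-aligned count stays put
# while the alignment *rate* changes once contaminant reads are dropped.
total_reads = 20_000_000
human_aligned = 19_000_000
contaminant = 600_000          # e.g. PhiX / E. coli reads

rate_before = human_aligned / total_reads                 # 0.95
rate_after = human_aligned / (total_reads - contaminant)  # ~0.979

print(f"alignment rate before removal: {rate_before:.1%}")
print(f"alignment rate after removal:  {rate_after:.1%}")
# The same 19,000,000 human reads feed the quantification either way.
```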
Thanks very much for your quick reply!
According to what FastQ Screen gives me (One hit/one genome & Multiple hits/one genome):
The maximum % that I have seen in my samples: 0.64 & 3.38
The minimum %: 0.08 & 0.40
The maximum %: 2.26 & 0.00
The minimum %: 1.38 & 0.00
There should be no disadvantage to not removing these reads. They should not align to the genome, and even if a few did by chance, they would not change the overall result.
But if you want to be super careful then remove them.
If the % were higher, do you think it would be better to remove them?
I guess there is no "number" or cutoff where you can say, "okay, this is too much contamination, maybe I should remove those reads or even discard the sample".
I think these things are good to think about, but don't lose too much sleep over it. Those microorganisms were likely spiked into the sample during library prep to increase library diversity. There isn't any reason to believe that they would be in higher abundance in one group versus another, so I would not expect it to have an effect on differential expression analysis. It sounds like you have a good understanding about the effects of doing something versus not doing something about it. Pick an approach and keep in mind the possible limitations/advantages in the subsequent analysis.
Thanks for your quick reply :)
I have checked the differences between controls and the condition group (whether there was much more contamination in one group than the other) and I haven't seen any difference... so I guess there will not be an effect.
However, if this weren't the case (if I could see differences between control and condition, or some samples with more contamination than others regardless of group), would removal of contaminants be recommended? Or should I simply not take that sample into account in my analyses?
Because I still do not know what the advantage of removing the reads would be... if those contaminant reads wouldn't align to my genome (in this case, human)... unless some reads could end up aligning incorrectly(?)
I cannot provide you with a specific recommendation because it is not a simple question. The advantage of removing these sequences is that it gives you a cleaner and smaller dataset to work with for downstream analysis. The disadvantage of removing these sequences is that it's going to cost you additional time to do the bioinformatic analysis and you risk removing sequences you want to keep if you are not careful. Again, use your best judgement and weigh the trade-offs of your decision. Good luck!