I have RNA-seq reads from some human patients. After trimming the reads with Trim Galore and cutadapt, I still have over-represented sequences, and FastQC reports no possible source for them. So I BLASTed these over-represented sequences against microbes using NCBI BLAST. Many of them have hits to bacteria (scores above 35), sometimes when using blast and sometimes when using blastn. Does this mean my data is contaminated? And what can I do to fix this?
Thanks
Thank you. What do you mean when you say you don't worry much about over-represented sequences? I am getting a low alignment rate, so do you think this possible contamination is the reason, or should I not worry about it?
Low alignment could certainly be because of the contamination. Once you separate out the contamination you can re-assess. Over-represented sequences may be genuinely present if some RNAs are very highly expressed. I was suggesting that instead of worrying about them now you could proceed with your RNA-seq analysis.
If there are problems with the results you can backtrack. Perhaps you are already in that mode? Can you post what aligner you have used and what the alignment stats look like?
Yes, I have alignments of 70% and was rechecking what's going on. Thank you so much for the clarifications.
70% is not great but it is not the end of the world especially if you really have bacterial contamination. You should worry about that though. Is the contamination present in all samples or just some?
I have a lot of samples and am checking them one by one. So far, all the ones I have checked have hits to bacteria, but I am still going through them.
The presence of extraneous contamination will cast doubt on the results, even though you can still generate them by following the normal alignment/count workflow (most bacterial reads should not align).
If the contamination is pervasive then you should track down the source and try to take corrective action. Since these are patient samples, you will probably end up working with what you have until you figure out what is going on.
What kind of samples are these? Are there any reasons why you would be seeing bacterial DNA/RNA? (if these were cells that are expected to be infected, for example) Do you always get the same hits, i.e., do the same bacteria come up in the possibly contaminated samples? If that's the case, you should probably discuss this with the people who processed the samples. But apart from that, as Genomax pointed out, 70% alignment is not abysmal and the majority of non-human sequences will simply not align, particularly not to coding gene regions (with some viral sequences you could end up with aberrant hits in non-coding regions, so if you're worried about that, too, just keep it in mind and stay cautious for your downstream interpretations)
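To see whether the same sequences keep coming up across samples without opening every report by hand, something like the sketch below might help. It assumes you have the unzipped FastQC output for each sample and reads the "Overrepresented sequences" module out of each fastqc_data.txt. The folder layout (fastqc_out/<sample>/fastqc_data.txt) and the function names are made up for illustration, so adjust them to your setup.

```python
from collections import Counter
from pathlib import Path


def overrepresented(fastqc_text):
    """Return the sequences listed in the 'Overrepresented sequences' module."""
    seqs = []
    in_module = False
    for line in fastqc_text.splitlines():
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
        elif line.startswith(">>END_MODULE"):
            in_module = False
        elif in_module and line and not line.startswith("#"):
            # Columns are: sequence, count, percentage, possible source
            seqs.append(line.split("\t")[0])
    return seqs


def tally(reports):
    """Count in how many samples each sequence is flagged.

    `reports` maps a sample name to the text of its fastqc_data.txt.
    """
    counts = Counter()
    for text in reports.values():
        # set() so a sequence is counted at most once per sample
        counts.update(set(overrepresented(text)))
    return counts


def load_reports(root):
    """Read every <root>/<sample>/fastqc_data.txt (hypothetical layout)."""
    return {p.parent.name: p.read_text() for p in Path(root).glob("*/fastqc_data.txt")}
```

Calling tally(load_reports("fastqc_out")).most_common(20) would then list the sequences flagged in the most samples. Those are the ones worth BLASTing first, and if the same handful shows up everywhere that already tells you something about the source.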
Actually, I have now used BLAT (https://genome.ucsc.edu/cgi-bin/hgBlat) and found that these over-represented sequences are human-derived, so contamination is probably not the explanation. They did align to microbes at first, but only with a score of 35, whereas against human the score is 100%. I will have to find another reason for the low alignment rate, then!
If you use STAR for the alignment, simply make sure you write the unaligned reads to a separate file (it has its own option for this). Are those human sequences from genes encoding ribosomal RNA, or from other transcripts you expect to be highly abundant (e.g., globin for blood samples)? You could then run FastQC on those unaligned reads again; maybe that will give you a better idea.
If they are all globin genes (whole-blood RNA-seq), what can we do (or should we do anything?) to address the high duplication before doing differential expression?
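For reference, the STAR option in question is --outReadsUnmapped Fastx. Below is a rough sketch of how the call could be wrapped in Python. The paths, prefix and thread count are placeholders, star_command and run_star are just illustrative helper names, and the other settings are one common choice rather than a recommendation for your data.

```python
import subprocess


def star_command(genome_dir, fastq_files, prefix, threads=8):
    """Build a STAR command that also writes unmapped reads to separate files."""
    cmd = [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastq_files,
        "--outFileNamePrefix", prefix,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        # Reads that fail to map go to <prefix>Unmapped.out.mate1 (and mate2)
        "--outReadsUnmapped", "Fastx",
    ]
    # Assumption: gzipped input is decompressed on the fly with zcat
    if all(f.endswith(".gz") for f in fastq_files):
        cmd += ["--readFilesCommand", "zcat"]
    return cmd


def run_star(*args, **kwargs):
    """Run STAR and raise if it exits with an error."""
    subprocess.run(star_command(*args, **kwargs), check=True)
```

With a prefix such as "out/sampleA_", the unmapped reads should end up in out/sampleA_Unmapped.out.mate1 (plus mate2 for paired-end data), which you can then feed to FastQC. Those files have no .fastq extension, so you may need to tell FastQC the format explicitly.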