Question

RNAseq reads contamination

0

7.1 years ago

Sharon ▴ 610

I have RNAseq reads for some human patients. I still have over-represented sequences after trimming my reads with Trim Galore and cutadapt. These over-represented sequences have no possible hits in FastQC, so I ran them through NCBI BLAST against microbial sequences. Many of them have hits to bacteria (scores above 35), sometimes with blast and sometimes with blastn. Does this mean my data is contaminated? And what can I do to fix this?

Thanks

RNA-Seq • 3.8k views

ADD COMMENT • link updated 5 months ago by 5HT2a ▴ 10 • written 7.1 years ago by Sharon ▴ 610

score 2 · Accepted Answer · 2017-10-24

2

7.1 years ago

GenoMax 147k

Since your dataset is not a bacterial one, it is possible that you have some sort of contamination. You can use bbsplit.sh from BBMap to bin your reads into usable and discard bins using the human genome reference.
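As a rough illustration of the binning step, here is a minimal Python sketch that only assembles a bbsplit.sh command line for paired-end reads. The file names and the helper `build_bbsplit_command` are hypothetical, and the parameters (`in1`, `in2`, `ref`, `basename`, `outu1`, `outu2`) follow the BBSplit usage text, so check `bbsplit.sh --help` for your BBMap version before running anything.

```python
import subprocess  # only needed if you uncomment the run() call below


def build_bbsplit_command(reads_1, reads_2, human_reference, out_dir):
    """Assemble a bbsplit.sh argument list for one paired-end sample.

    All paths are placeholders (assumptions); nothing is executed here.
    """
    return [
        "bbsplit.sh",
        f"in1={reads_1}",
        f"in2={reads_2}",
        f"ref={human_reference}",
        # Reads that map to the reference ("usable" bin);
        # '%' is replaced by the reference name.
        f"basename={out_dir}/mapped_%.fq.gz",
        # Reads that map to no reference ("discard" bin, possible contamination).
        f"outu1={out_dir}/unmapped_1.fq.gz",
        f"outu2={out_dir}/unmapped_2.fq.gz",
    ]


command = build_bbsplit_command(
    "sample_R1.fq.gz", "sample_R2.fq.gz", "human_genome.fa", "bbsplit_out"
)
print(" ".join(command))
# subprocess.run(command, check=True)  # uncomment once the paths are real
```

Keeping the command as a list makes it easy to loop over many samples and to inspect what would be run before launching it.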

Note: I would not worry too much about the "over-represented" sequences part from FastQC. Just do your regular RNAseq analysis.

ADD COMMENT • link 7.1 years ago by GenoMax 147k

0

Thank you. What do you mean by not worrying much about over-represented sequences? I am getting a low alignment rate, so do you think this possible contamination is the reason, or should I not worry about it?

ADD REPLY • link 7.1 years ago by Sharon ▴ 610

1

Low alignment could certainly be because of the contamination. Once you separate out the contamination you can re-assess. Over-represented sequences may be genuinely present if you have some RNAs that are highly over-expressed. I was suggesting that instead of worrying about them now you could proceed with your RNAseq analysis.

If there are problems with the results you can backtrack. Perhaps you are already in that mode? Can you post what aligner you have used and what the alignment stats look like?

ADD REPLY • link 7.1 years ago by GenoMax 147k

0

Yes, my alignment rate is 70% and I was rechecking what is going on. Thank you so much for the clarifications.

ADD REPLY • link 7.1 years ago by Sharon ▴ 610

0

70% is not great, but it is not the end of the world, especially if you really have bacterial contamination. You should worry about that, though. Is the contamination present in all samples or just some?

ADD REPLY • link 7.1 years ago by GenoMax 147k

0

I have too many samples and am checking them one by one. So far, all the ones I have checked have hits to bacteria, but I am still going through them.

ADD REPLY • link 7.1 years ago by Sharon ▴ 610

1

The presence of extraneous contamination will cast doubt on the results, even though you could still generate them by following the normal alignment/count workflow (most bacterial data should not align).

If the contamination is pervasive then you should track down the source and try to take corrective action. Since these are patient samples, you will probably end up working with what you have until you figure out what is going on.

ADD REPLY • link 7.1 years ago by GenoMax 147k

1

What kind of samples are these? Is there any reason why you would be seeing bacterial DNA/RNA (for example, if these were cells that are expected to be infected)? Do you always get the same hits, i.e., do the same bacteria come up in the possibly contaminated samples? If that's the case, you should probably discuss this with the people who processed the samples. Apart from that, as GenoMax pointed out, 70% alignment is not abysmal, and the majority of non-human sequences will simply not align, particularly not to coding gene regions. With some viral sequences you could end up with aberrant hits in non-coding regions, so if you're worried about that too, keep it in mind and stay cautious in your downstream interpretations.
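To see whether the same sequences come up across samples without checking them one by one, something like the following Python sketch could help. It assumes you have the `fastqc_data.txt` file that FastQC writes for each sample, and that the over-represented sequences sit in a tab-separated module opened by `>>Overrepresented sequences` and closed by `>>END_MODULE`. The function names and sample labels are made up for illustration.

```python
from collections import defaultdict


def overrepresented_sequences(fastqc_data_text):
    """Return the sequences listed in the 'Overrepresented sequences' module."""
    sequences = []
    in_module = False
    for line in fastqc_data_text.splitlines():
        if line.startswith(">>Overrepresented sequences"):
            in_module = True
        elif line.startswith(">>END_MODULE"):
            in_module = False
        elif in_module and line and not line.startswith("#"):
            # Columns: sequence, count, percentage, possible source.
            sequences.append(line.split("\t")[0])
    return sequences


def samples_per_sequence(reports):
    """Map each over-represented sequence to the sorted samples reporting it.

    `reports` maps a sample label to the text of its fastqc_data.txt.
    """
    seen = defaultdict(set)
    for sample, text in reports.items():
        for sequence in overrepresented_sequences(text):
            seen[sequence].add(sample)
    return {sequence: sorted(samples) for sequence, samples in seen.items()}
```

Sequences that show up in most or all samples are the ones worth running through BLAST or BLAT once, rather than repeating the search per sample.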

ADD REPLY • link 7.1 years ago by Friederike 9.0k

0

Actually, I have now used BLAT (https://genome.ucsc.edu/cgi-bin/hgBlat) and found that these over-represented sequences are human-derived, so contamination is no longer the likely explanation. They first aligned to microbes, but only with a score of 35, whereas against human the score is 100%. I should find another reason for the low alignment rate, then!

ADD REPLY • link 7.1 years ago by Sharon ▴ 610

1

If you use STAR for the alignment, simply make sure you retrieve the unaligned reads in a separate file (it has its own option for this). Are those human sequences from genes encoding ribosomal RNA or other transcripts you expect to be highly abundant (e.g., globin for blood samples)? You could then run FastQC on those unaligned reads again; maybe that will give you a better idea.
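For a quick look before re-running FastQC, the most frequent sequences among the unaligned reads can be tallied with a few lines of Python. This is a minimal sketch, assuming the unaligned reads are in plain 4-line FASTQ (STAR writes them to files such as `Unmapped.out.mate1` when run with `--outReadsUnmapped Fastx`). The helper `top_sequences` is hypothetical and is only a crude stand-in for FastQC's over-represented sequences module, not a reimplementation of it.

```python
from collections import Counter
from itertools import islice


def top_sequences(fastq_lines, n=10):
    """Return the n most common read sequences from an iterable of FASTQ lines."""
    # In a 4-line FASTQ record, the sequence is the second line.
    sequences = (line.strip() for line in islice(fastq_lines, 1, None, 4))
    return Counter(sequences).most_common(n)


# Tiny in-memory example; for real data, pass an open file handle instead,
# e.g. open("Unmapped.out.mate1") (path is an assumption).
example = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",
    "@read2", "ACGTACGT", "+", "IIIIIIII",
    "@read3", "GGCCGGCC", "+", "IIIIIIII",
]
print(top_sequences(example, n=1))  # [('ACGTACGT', 2)]
```

The top few sequences can then be checked against rRNA or globin transcripts with BLAT or BLAST.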

ADD REPLY • link 7.1 years ago by Friederike 9.0k

0

If we have all globin genes (whole blood RNAseq), what can we do (or should we do anything?) to address the high duplication before doing differential expression?

ADD REPLY • link 5 months ago by 5HT2a ▴ 10