RNAseq: remove contaminants before or after mapping reads?
10.6 years ago

I have a de novo transcriptome for an ant species of about 100k transcripts assembled by Trinity. I've quantified expression using Sailfish and performed some analyses to identify genes with significant expression patterns.

Then, I became aware of the program DeconSeq to remove contaminants. I ran the program and removed ~5k of my transcripts, and spot-checking showed they all BLAST to bacteria or human. All good so far. Then, I re-ran Sailfish on the "cleaned" transcriptome of about 99k transcripts. My naive expectation was that this should barely affect the results - the reads that mapped to the "contaminants" shouldn't map at all. Instead, I find substantial changes, with about twice as many genes having significant expression patterns. I checked the unmapped ratio and it stays the same between mappings to the two transcriptome files.

So - my conundrum is how to deal with mapping reads to known contaminants?

Do I:

a) Map reads to the complete transcriptome, including contaminants, and then remove the known contaminants. In this case, it seems that I risk incorrectly mapping reads to the contaminants and losing information (false negatives), analogous to a Type II (consumer's) error.

b) Map reads to the "cleaned" transcriptome. In this case, I risk finding significant changes in expression that are due to incorrectly mapping 'contaminant' reads to true transcripts (false positives), analogous to a Type I (producer's) error.
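
To make option (a) concrete, here is a minimal sketch of the post-hoc removal step. It assumes the quantification results have already been loaded into a plain dict of transcript name to TPM; the function name `drop_contaminants` and the transcript names are hypothetical, and rescaling the remaining TPMs to sum to one million is my own assumption, not something either tool prescribes.

```python
def drop_contaminants(tpm, contaminant_ids, renormalise=True):
    """Remove contaminant transcripts from a {transcript: TPM} table.

    If renormalise is True, the remaining values are rescaled so that
    they again sum to one million (an assumption, see the text above).
    """
    kept = {t: v for t, v in tpm.items() if t not in contaminant_ids}
    total = sum(kept.values())
    if renormalise and total > 0:
        kept = {t: v * 1e6 / total for t, v in kept.items()}
    return kept


# Hypothetical abundances estimated against the complete transcriptome.
tpm = {"ant_A": 200000.0, "ant_B": 300000.0, "bact_X": 500000.0}
clean_tpm = drop_contaminants(tpm, {"bact_X"})
```

The reads stay assigned to the contaminants during estimation and only the reported table changes, which is the point of option (a).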

Any thoughts appreciated!

Cross-posted on the Sailfish user group.

read-mapping contaminants expression RNAseq • 12k views

this post was cited in:

https://www.nature.com/articles/s41598-017-19010-5 "Angiogenesis and evading immune destruction are the main related transcriptomic characteristics to the invasive process of oral tongue cancer" doi:10.1038/s41598-017-19010-5

10.6 years ago

Responses from the Sailfish Google users' group:


I strongly favor your choice a:

a) Map reads to the complete transcriptome, including contaminants, and then remove the known contaminants. In this case, it seems that I risk incorrectly mapping reads to the contaminants and losing information (false negatives), analogous to a Type II (consumer's) error.

Trinity has already assembled your reads into contigs. Take advantage of that. Sailfish will assign reads best if the transcriptome is more complete (without regard to species of origin; it doesn't know about the species of origin). The main cause of false positives is when the true source of the k-mers involved is not in the index.

Presumably, you used all of the data for your Trinity run. Now, you're asking a different question, about differential expression. Giving Sailfish all of the information you have (your complete transcriptome) will give you the best result from the EM algorithm. The EM algorithm will do better when it knows about all of the possible sources -- I can't see any reason why it would map reads to the contaminants if it has a better choice.
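
A toy illustration of this point, assuming a heavily simplified model: each read is represented only by the set of transcripts it is compatible with, and a bare-bones EM splits ambiguous reads in proportion to current abundance. This is not Sailfish's implementation (Sailfish works on k-mers), and the names `em_counts`, `ant_A` and `bact_X` are made up for the example.

```python
def em_counts(reads, index, n_iter=100):
    """Toy EM: return the expected read count per transcript in `index`.

    `reads` is a list of sets; each set holds the transcripts a read is
    compatible with. Transcripts absent from `index` are invisible.
    """
    theta = {t: 1.0 / len(index) for t in index}
    counts = {t: 0.0 for t in index}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in index}
        for compatible in reads:
            usable = [t for t in compatible if t in theta]
            total = sum(theta[t] for t in usable)
            if total == 0.0:
                continue  # no source in the index: read stays unassigned
            for t in usable:
                # E-step: split the read in proportion to abundance
                counts[t] += theta[t] / total
        assigned = sum(counts.values())
        if assigned == 0.0:
            break
        # M-step: abundances are the normalised expected counts
        theta = {t: c / assigned for t, c in counts.items()}
    return counts


# 10 reads unique to the ant transcript, 40 unique to a bacterial
# contaminant, and 20 that are compatible with both.
reads = [{"ant_A"}] * 10 + [{"bact_X"}] * 40 + [{"ant_A", "bact_X"}] * 20
full = em_counts(reads, ["ant_A", "bact_X"])
cleaned = em_counts(reads, ["ant_A"])
```

With both transcripts in the index, `ant_A` ends up with about 14 reads, because most of the shared reads go to the more abundant contaminant. With the contaminant removed from the index, all 20 shared reads have nowhere else to go and `ant_A` gets 30.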


Steve makes good points here, and I think his proposed solution may currently be the best way to go (if there are contaminants in your transcriptome, then that's where the reads should be assigned). If you have a particular desire / need to remove these contaminants from your target transcript set, another potential solution I can think of (though it would require a little bit more work) would be to actually align (e.g. using BWA / Bowtie) the raw reads to just the contaminant transcripts. Since there are a relatively small number of these in your experiment, this should still be relatively fast (much faster than mapping to the entire transcriptome). These reads could then be removed from the read files, and the estimation done on the cleaned transcriptome and filtered reads. I would anticipate that this approach and your option a may yield similar results, but it's difficult to say for certain.
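
The read-removal step could be sketched as follows, assuming the IDs of the reads that aligned to the contaminant transcripts have already been collected into a set (for example, from the aligner's output). The function name `filter_fastq` is hypothetical, the sketch assumes plain four-line FASTQ records, and for paired-end data both mates would need to be dropped together.

```python
from itertools import islice


def filter_fastq(lines, contaminant_ids):
    """Yield the lines of every FASTQ record whose read ID is not in
    `contaminant_ids`. Assumes exactly four lines per record."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        # Header looks like "@read_id optional description"
        read_id = record[0][1:].split()[0]
        if read_id not in contaminant_ids:
            yield from record


# Tiny in-memory example; with real data, `lines` would be an open file.
fastq = [
    "@r1 some description\n", "ACGT\n", "+\n", "IIII\n",
    "@r2\n", "TTTT\n", "+\n", "IIII\n",
]
kept = list(filter_fastq(fastq, {"r1"}))
```

The filtered reads would then be quantified against the cleaned transcriptome.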


thanks for following up, it is great advice

10.6 years ago
Prakki Rama ★ 2.7k
  1. "Mapping reads to contaminant sequences" - If the reads are free of contamination, no reads will map to the contaminant sequences, so you can safely remove the contaminant contigs from the assembly by applying a coverage cutoff.
  2. "Incorrectly mapping 'contaminant' reads to true transcripts" - Again, if the reads are clean (free of contamination), then there is no need to worry about the expression levels of the true transcripts.

Now, if I were you, before mapping the reads to estimate expression levels, I would remove the contaminant reads by mapping them against contaminant sequences (bacterial and viral genomes, or rRNA). To confirm that they really are contaminant reads, I would, if possible, also map them against a reference genome or the sequence of the nearest reference species. Reads that map to both the contaminant set and the nearest species I would retain and map to my transcriptome; reads that hit only the contaminant set I would throw away. Even if contaminants are removed from the assembly, there is always a chance of misassembly when the assembly was built from reads that include contaminants.
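
The keep/discard rule above can be written down with plain sets. This is only a sketch: it assumes the read IDs hitting the contaminant database and those hitting the nearest reference species have already been extracted from the two mappings, and the function name `reads_to_keep` is hypothetical.

```python
def reads_to_keep(all_reads, contaminant_hits, reference_hits):
    """Apply the rule: discard reads that hit only the contaminant set,
    keep everything else (including reads that hit both)."""
    discard = contaminant_hits - reference_hits
    return all_reads - discard


all_reads = {"r1", "r2", "r3", "r4"}
contaminant_hits = {"r2", "r3"}   # mapped to bacterial/viral/rRNA sequences
reference_hits = {"r3", "r4"}     # mapped to the nearest reference species
kept = reads_to_keep(all_reads, contaminant_hits, reference_hits)
```

Here `r2` is dropped because it hits only the contaminant set, while `r3` is kept because it also maps to the reference species.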


Thanks for the comments. Mapping reads to possible contaminants seems like a valid strategy and was also suggested on the Sailfish user group, though Rob Patro thought this would likely be quite close to my suggestion (a) above.
