Question

RNAseq: remove contaminants before or after mapping reads?

11.0 years ago

johnstantongeddes ▴ 410

I have a de novo transcriptome for an ant species of about 100k transcripts assembled by Trinity. I've quantified expression using Sailfish and performed some analyses to identify genes with significant expression patterns.

Then I became aware of the program DeconSeq for removing contaminants. I ran the program and removed ~5k of my transcripts, and spot-checking showed they all BLAST to bacteria or human. All good so far. Then I re-ran Sailfish on the "cleaned" transcriptome of about 99k transcripts. My naive expectation was that this should barely affect the results: the reads that mapped to the "contaminants" shouldn't map at all. Instead, I find substantial changes, with about twice as many genes having significant expression patterns. I checked the unmapped ratio and it stays the same between mappings to the two transcriptome files.

So - my conundrum is how to deal with mapping reads to known contaminants?

Do I:

a) Map reads to the complete transcriptome, including contaminants, and then remove the known contaminants. In this case, it seems that I risk incorrectly mapping reads to the contaminants and losing information (false negatives), analogous to a Type I (producer's) error.

b) Map reads to the "cleaned" transcriptome. This seems analogous to a Type II (consumer's) error where I risk finding significant changes in expression that are due to incorrectly mapping 'contaminant' reads to true transcripts (false positives).

Any thoughts appreciated!

cross-posted on Sailfish user group here

read-mapping contaminants expression RNAseq • 12k views

updated 3.6 years ago by Ram 45k • written 11.0 years ago by johnstantongeddes ▴ 410

this post was cited in:

https://www.nature.com/articles/s41598-017-19010-5 "Angiogenesis and evading immune destruction are the main related transcriptomic characteristics to the invasive process of oral tongue cancer" doi:10.1038/s41598-017-19010-5

7.3 years ago by Pierre Lindenbaum 166k

Ram · Accepted Answer · 2014-05-02

Responses from the Sailfish Google users' group:

I strongly favor your choice a:

a) Map reads to the complete transcriptome, including contaminants, and then remove the known contaminants. In this case, it seems that I risk incorrectly mapping reads to the contaminants and losing information (false negatives), analogous to a Type I (producer's) error.

Trinity has already assembled your reads into contigs. Take advantage of that. Sailfish will assign reads best if the transcriptome is more complete (without regard to species of origin; it doesn't know about the species of origin). The main cause of false positives is when the true source of the k-mers involved is not in the index.

Presumably, you used all of the data for your Trinity run. Now, you're asking a different question, about differential expression. Giving Sailfish all of the information you have (your complete transcriptome) will give you the best result from the EM algorithm. The EM algorithm will do better when it knows about all of the possible sources -- I can't see any reason why it would map reads to the contaminants if it has a better choice.
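To make the point about missing sources concrete, here is a toy sketch of EM-style read assignment. It is not Sailfish's implementation: it ignores transcript length and k-mers entirely, and the transcript names (`ant_A`, `ant_B`, `bact_1`) and read counts are invented for illustration. Each read is represented only by the set of transcripts it is compatible with.

```python
def em_abundance(reads, transcripts, n_iter=200):
    """Toy EM: return expected read counts per transcript.

    reads       -- list of sets; each set holds the transcripts a read is compatible with
    transcripts -- names of the transcripts present in the index
    Transcript length is ignored (a simplifying assumption).
    """
    transcripts = list(transcripts)
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: split each read among its compatible transcripts in the index
        for compat in reads:
            usable = [t for t in compat if t in theta]
            total = sum(theta[t] for t in usable)
            if total == 0:
                continue  # read has no source in this index: unmapped
            for t in usable:
                counts[t] += theta[t] / total
        # M-step: re-estimate relative abundances from the expected counts
        assigned = sum(counts.values())
        if assigned == 0:
            break
        theta = {t: counts[t] / assigned for t in transcripts}
    return counts


# Hypothetical data: ant_B shares some sequence with the contaminant bact_1.
reads = (
    [{"ant_A"}] * 20
    + [{"ant_B"}] * 5
    + [{"bact_1"}] * 80
    + [{"ant_B", "bact_1"}] * 50   # ambiguous reads
)

full = em_abundance(reads, ["ant_A", "ant_B", "bact_1"])
cleaned = em_abundance(reads, ["ant_A", "ant_B"])
print(round(full["ant_B"], 2), round(cleaned["ant_B"], 2))
```

In this made-up case the ambiguous reads mostly go to `bact_1` when it is in the index, because the uniquely mapping reads show it is the more abundant source. With `bact_1` removed, all 50 ambiguous reads have nowhere to go but `ant_B`, whose estimated count rises several-fold. Transcripts with no shared sequence (`ant_A`) are unaffected.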

Steve makes good points here, and I think his proposed solution may currently be the best way to go (if there are contaminants in your transcriptome, then that's where the reads should be assigned). If you have a particular desire / need to remove these contaminants from your target transcript set, another potential solution I can think of (though it would require a little bit more work) would be to actually align (e.g. using BWA / Bowtie) the raw reads to just the contaminant transcripts. Since there are a relatively small number of these in your experiment, this should still be relatively fast (much faster than mapping to the entire transcriptome). These reads could then be removed from the read files, and the estimation done on the cleaned transcriptome and filtered reads. I would anticipate that this approach and your option a may yield similar results, but it's difficult to say for certain.
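The read-filtering step in that alternative could look roughly like the sketch below. It assumes you have already collected the IDs of the reads that aligned to the contaminant transcripts (for example, from the aligner's output) into a set; the function name `filter_fastq` and the example reads are hypothetical, and the parser assumes plain 4-line FASTQ records.

```python
from itertools import islice


def filter_fastq(lines, contaminant_ids):
    """Yield 4-line FASTQ records whose read ID is not in contaminant_ids.

    lines           -- iterable of FASTQ lines (e.g. an open file handle)
    contaminant_ids -- set of read IDs that aligned to contaminant transcripts
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break  # end of input (or a truncated final record)
        # Header looks like "@read_id optional description"
        read_id = record[0][1:].split()[0]
        if read_id not in contaminant_ids:
            yield record


# Small in-memory example with invented reads.
fastq_lines = [
    "@r1 sample\n", "ACGT\n", "+\n", "IIII\n",
    "@r2 sample\n", "TTGA\n", "+\n", "IIII\n",
    "@r3 sample\n", "GGCC\n", "+\n", "IIII\n",
]
kept = list(filter_fastq(fastq_lines, {"r2"}))
kept_ids = [rec[0][1:].split()[0] for rec in kept]
print(kept_ids)
```

For paired-end data, both mates would need to be dropped together so the two files stay in sync; this sketch handles a single file only.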

Ram · Accepted Answer · 2014-05-02

"Mapping reads to contaminant sequences" - If the reads are free of contamination, the contaminant sequences will obviously not have reads mapped to them. You can then safely remove the contaminated assembled sequences by applying a coverage cutoff.
"Incorrectly mapping 'contaminant' reads to true transcripts" - Again, if the reads are clean (free of contamination), then there is no need to worry about the expression levels of true transcripts.

Now, if I were you, I would remove the contaminant reads before mapping to estimate expression levels, by mapping the reads against contaminant sequences (bacterial or viral genomes, or rRNA). To confirm that they are contaminant reads, I would, if possible, map them against a reference genome or the sequence of the nearest reference species. Reads that map to both the contaminant and the nearest species I would retain for mapping to my transcriptome; reads that hit only the contaminant I would throw away. Even if contaminants are removed from the assembly afterwards, an assembly built with contaminant reads always carries a risk of misassembly.
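The keep/discard rule described here comes down to a set difference. A minimal sketch, assuming the read IDs hitting the contaminant database and those hitting the nearest reference species have already been collected into two sets (the function name and the IDs are invented for illustration):

```python
def reads_to_discard(contaminant_hits, reference_hits):
    """Return IDs of reads that hit only the contaminant sequences.

    Reads that also hit the reference (or nearest species) are retained,
    since they may come from genuinely conserved sequence.
    """
    return set(contaminant_hits) - set(reference_hits)


contaminant_hits = {"r2", "r5", "r9"}   # hypothetical IDs
reference_hits = {"r1", "r5", "r7"}     # hypothetical IDs
discard = reads_to_discard(contaminant_hits, reference_hits)
print(sorted(discard))
```

Here `r5` hits both databases and is kept, while `r2` and `r9` hit only the contaminant and would be dropped before quantification.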