I have a de novo transcriptome of about 100k transcripts for an ant species, assembled with Trinity. I quantified expression with Sailfish and ran some analyses to identify genes with significant expression patterns.
Then I became aware of DeconSeq, a program for removing contaminants. I ran it and removed ~5k of my transcripts; spot-checking showed they all BLAST to bacteria or human. All good so far. I then re-ran Sailfish on the "cleaned" transcriptome of about 99k transcripts. My naive expectation was that this would barely affect the results: reads that previously mapped to the "contaminants" simply shouldn't map at all. Instead, I find substantial changes, with roughly twice as many genes showing significant expression patterns. I checked the unmapped-read fraction, and it is essentially the same for both references.
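For reference, this is roughly how I'd compare the two quantification runs transcript by transcript (a minimal sketch, not my actual commands; the file paths are placeholders, and the `Name`/`TPM` columns assume a newer Sailfish/Salmon-style quant.sf, so older Sailfish output needs adjusted column names):

```python
import pandas as pd

# Compare TPM estimates between the two Sailfish runs. Assumes quant.sf
# files with 'Name' and 'TPM' columns (newer Sailfish/Salmon layout);
# older Sailfish versions use '#'-prefixed headers and different column
# names, so adjust read_csv accordingly. Paths are placeholders.
full = pd.read_csv("full/quant.sf", sep="\t")[["Name", "TPM"]]
clean = pd.read_csv("cleaned/quant.sf", sep="\t")[["Name", "TPM"]]

merged = full.merge(clean, on="Name", suffixes=("_full", "_clean"))
merged["abs_delta"] = (merged["TPM_clean"] - merged["TPM_full"]).abs()

# Transcripts whose abundance estimates shifted most after cleaning:
print(merged.sort_values("abs_delta", ascending=False).head(20))
```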
So my conundrum: how should I handle reads that map to known contaminants?
Do I:
a) Map reads to the complete transcriptome, contaminants included, and then discard the known contaminants before downstream analysis (see the sketch after this list). Here I risk reads incorrectly mapping to the contaminants and losing information: false negatives, analogous to a Type II (omission, or producer's) error.
b) Map reads to the "cleaned" transcriptome. Here I risk contaminant-derived reads incorrectly mapping to true transcripts and creating spurious expression changes: false positives, analogous to a Type I (commission, or consumer's) error.
Any thoughts appreciated!
Cross-posted to the Sailfish user group here.