I would like to know whether there is a clever protocol for separating reads in an Illumina exome sequencing dataset from a human tumour heterotransplanted into an immunodeficient rodent (BALB/c strain).
The sequenced exome sample would contain both reads from the human cancer cells and reads from the surrounding mouse cells, the latter captured because of cross-species sequence annealing during the exome enrichment protocol.
Is there any clever way of separating the read sets from human and mouse in such a case?
If you use an aligner that supports references larger than 4 GB and lets you pull out uniquely mapped reads (e.g. BWA 0.6+), you could combine the mouse and human genomes into a single reference FASTA file.
You'd have to prefix the sequence names, so that human chromosome 1 becomes hg19chr1 and mouse chromosome 1 becomes mm9chr1.
Then, when you pull out uniquely mapped reads, you'll know which organism they came from.
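For what it's worth, the prefixing step could be sketched roughly like this in Python. The function names `prefix_fasta` and `build_combined_reference` are made up for illustration, and the sketch assumes plain uncompressed FASTA input; the combined file would still need to be indexed by the aligner afterwards.

```python
import io


def prefix_fasta(in_handle, out_handle, prefix):
    """Copy a FASTA stream, prepending `prefix` to every sequence name."""
    for line in in_handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # ">chr1" becomes ">hg19chr1" when prefix is "hg19"
            out_handle.write(">" + prefix + line[1:] + "\n")
        else:
            out_handle.write(line + "\n")


def build_combined_reference(sources, out_handle):
    """Concatenate several genomes into one reference.

    `sources` is an iterable of (prefix, file-like) pairs, for example
    [("hg19", human_fasta), ("mm9", mouse_fasta)].
    """
    for prefix, handle in sources:
        prefix_fasta(handle, out_handle, prefix)
```

With real files you would pass open file handles instead of in-memory strings, e.g. `open("hg19.fa")` and `open("mm9.fa")` (hypothetical file names).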
You will not be able to use this to separate reads that map equally well to either reference.
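To make the unique versus ambiguous distinction concrete, here is a minimal sketch that bins individual SAM records by species using the reference-name prefix. It assumes the hg19/mm9 prefixes from the answer above and treats a low mapping quality as "could be either species"; the `min_mapq` cutoff of 20 and the function name `classify_sam_line` are assumptions for illustration, not anything BWA prescribes.

```python
# Maps the reference-name prefix to a species label (prefixes as in the answer).
PREFIXES = {"hg19": "human", "mm9": "mouse"}


def classify_sam_line(line, min_mapq=20):
    """Return 'human', 'mouse', 'ambiguous', 'unmapped' or 'unknown'
    for one SAM alignment record (a tab-separated text line)."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])   # bitwise FLAG
    rname = fields[2]       # reference sequence name
    mapq = int(fields[4])   # mapping quality

    if flag & 0x4 or rname == "*":
        return "unmapped"
    if mapq < min_mapq:
        # Reads that align equally well in both genomes end up here.
        return "ambiguous"
    for prefix, species in PREFIXES.items():
        if rname.startswith(prefix):
            return species
    return "unknown"
```

Reads in the "ambiguous" bin are still aligned; they just cannot be assigned to one species by this criterion alone.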
+David Quigley, that would only happen if a miscall happened to make a read from human more like mouse, right? BWA should still be able to find the correct mapping in human, and infer it's correct by pairing, right?
If you have paired data (which you probably do) you're going to get hosed when one read in the pair maps to human Chr1 and the other maps to mouse Chr2. BWA will penalize the alignment score because the apparent read gap distance is huge. Just something to keep in mind.
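If you want to see how many pairs are affected, a rough sketch like the following could flag records whose mate landed in the other genome. It again assumes the hg19/mm9 name prefixes; `species_prefix` and `is_cross_species_pair` are hypothetical helper names, and it only looks at the RNAME and RNEXT columns of a single SAM record.

```python
def species_prefix(ref_name, prefixes=("hg19", "mm9")):
    """Return the genome prefix of a reference name, or None if none matches."""
    for prefix in prefixes:
        if ref_name.startswith(prefix):
            return prefix
    return None


def is_cross_species_pair(sam_line):
    """True if a paired read and its mate are mapped to different genomes."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname, rnext = fields[2], fields[6]

    # Require a paired read (0x1) with both the read (0x4) and its mate (0x8) mapped.
    if not flag & 0x1 or flag & 0x4 or flag & 0x8:
        return False
    # "=" in RNEXT means the mate is on the same reference sequence.
    if rnext == "=" or rname == "*" or rnext == "*":
        return False
    return species_prefix(rname) != species_prefix(rnext)
```

Counting the records for which this returns True gives a quick idea of how often the human Chr1 / mouse Chr2 situation actually occurs in your data.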
Good and interesting question, but do you need to separate the reads based on species? Could you not succeed in the goals of exome sequencing with mouse and human reads mixed, then sorting out species based on alignments to something like RefSeq mRNAs? I would think that would all work fine.
I would filter beforehand for common repeats like the human Alu, which is known to be expressed as mRNA. Mouse B1 elements can be filtered as well.
He wants to assign all reads to either mouse or human; what you suggest is identifying a few unique sequence snippets to see which species are present (which is already known in this case).
+1 for this solution, which is how we have dealt with reference-based mapping from hybrid genome sequences.
So if there are reads that map equally well to both species, will their coverage be split between the two? That would probably show up as sudden drops in coverage for exons that are highly conserved between mouse and human, is that right?
Somewhat, but those reads will still be mapped; you'll just have to pull them out another way.