I have to analyze NGS data after a targeted enrichment (Agilent SureSelect) of a xenografted tumor. We know that there is contamination with murine stroma of around 20%. How can I handle this to be sure that the annotated mutations are human-specific?
Thanks,
Let me rephrase your question a bit to make it clearer: you took human tumor cells, implanted them into lab mice, let the tumors grow, harvested them, and performed next-gen exome sequencing to detect variants. Now your reads are, understandably, contaminated with mouse DNA. What to do? I haven't had the 'luck' of getting such contaminated data myself, so here is what I would do if I were asked to analyse such a data set and wanted to publish the results:
The best way of removing contamination is to avoid it in the first place (if possible).
I don't believe there is any reliable way to remove contamination, especially of highly similar sequences. To salvage this case I would try to apply rigorous filtering:
Align the reads against both the mouse and the human reference genomes.
Remove those reads that align to the mouse reference as well as, or better than, to the human reference.
Check the alignment positions and discard all reads that align to non-exonic or intergenic regions; they should not be there anyway.
Run SNP detection; I don't think copy number variation detection is feasible.
After detecting a SNP, align the genomic sequence flanking its position against mouse using e.g. FASTA or SSEARCH. If the mouse sequence is highly similar, don't report it.
That way you will possibly be quite specific; the question is whether you will have many reads left.
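To make the second filtering step concrete, here is a minimal sketch in Python. It assumes you have already aligned the reads to both references and pulled one alignment score per read out of each result (for example the AS tag that many aligners write). The function name, the `min_margin` parameter and the toy scores are all made up for illustration; this is not a tested pipeline.

```python
from typing import Dict, Optional, Set


def human_specific_reads(
    human_scores: Dict[str, int],
    mouse_scores: Dict[str, Optional[int]],
    min_margin: int = 1,
) -> Set[str]:
    """Return the names of reads that align strictly better to human.

    human_scores: read name -> alignment score against the human reference.
    mouse_scores: read name -> alignment score against the mouse reference;
                  a missing entry or None means the read did not align to mouse.
    min_margin:   hypothetical threshold; the human score must exceed the
                  mouse score by at least this much for the read to be kept.
    """
    keep = set()
    for name, h_score in human_scores.items():
        m_score = mouse_scores.get(name)
        # Keep reads with no mouse alignment, or with a clearly better human score.
        if m_score is None or h_score - m_score >= min_margin:
            keep.add(name)
    return keep


# Toy scores (invented): read2 ties, read3 is better in mouse, read4 has no mouse hit.
human = {"read1": 100, "read2": 95, "read3": 80, "read4": 99}
mouse = {"read1": 60, "read2": 95, "read3": 90}
kept = human_specific_reads(human, mouse)
print(sorted(kept))
```

With the default margin of 1, ties are discarded, which matches "as well or better to mouse". A larger margin is stricter but costs more reads.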
Humans are quite different from mouse at the nucleotide level (just blastn NM_000546.4, Homo sapiens tumor protein p53 (TP53), transcript variant 1, mRNA, against mouse RefSeq).
Even with 80% similarity there are hardly any 60 bp long identical fragments. So if your reads are long enough, it is very unlikely that the tumor will mutate into an exact mouse sequence. And we are talking here only about exons, but (correct me if I am wrong) you should get some "dangling" intronic sequences flanking the exons as well. Unless you land in some very peculiar parts of the genome, similarity drops there, so such reads should not cross-map.
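If you want to check the "hardly any identical fragments" point on your own targets, a rough way is to look for exact k-mers shared between a human sequence and its mouse counterpart. The sketch below is only an illustration under stated assumptions: the function name and the toy sequences are made up, it compares one strand only (no reverse complement), and in practice you would set k to your read length (e.g. 60) and load real sequences.

```python
def shared_kmers(seq_a: str, seq_b: str, k: int = 60) -> set:
    """Return every length-k substring that occurs exactly in both sequences."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Collect all k-mers of each sequence, then intersect the two sets.
    kmers_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    kmers_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    return kmers_a & kmers_b


# Toy sequences (invented) that differ by a single base in the middle.
human_seq = "ACGTACGTTAGC"
mouse_seq = "ACGTACCTTAGC"

short_hits = shared_kmers(human_seq, mouse_seq, k=5)
long_hits = shared_kmers(human_seq, mouse_seq, k=8)
print(sorted(short_hits), sorted(long_hits))
```

In the toy case short k-mers are still shared on either side of the mismatch, but once k is long enough to span it nothing is shared, which is the same reasoning as for long reads.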
I've been working on this problem and am looking for data sets such as yours, with a known degree of contamination, to test it on. Take a look and contact me if you are interested in pursuing this: https://github.com/Lythimus/PARSES/