I'm currently working with chimpanzee and bonobo data, and I am curious to know whether the datasets are contaminated with human DNA.
I know that a potential issue is that the chimpanzee and human reference genomes are generally quite similar, and that the PanPan3 reference assembly is of lower quality than the human genome. So I am going to align the reads to all of the reference genomes to identify some of the differences.
I have a few initial ideas on how to identify human contamination, such as aligning to species-specific Alu elements, or examining the alignments to the mitochondrial DNA. Differences between the species should be easier to identify there because the mitochondrial genome is short, and if a single sample contains multiple mitochondrial (mt) haplotypes, the reads would have to come from different individuals.
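As a rough sketch of the mitochondrial idea (not a tested pipeline): given a set of mtDNA positions where human and chimpanzee/bonobo carry different fixed alleles, one could count how many read bases support each allele and take the human share as a crude contamination estimate. The site table, positions, and the `human_fraction` helper below are all hypothetical and only illustrate the bookkeeping. Real diagnostic sites would have to be derived from an alignment of the mitochondrial references, and the observations would come from a pileup of the reads.

```python
from collections import Counter

# Hypothetical diagnostic sites: positions where human and Pan mtDNA are
# assumed to carry different fixed alleles. These values are made up.
DIAGNOSTIC_SITES = {
    100: ("A", "G"),  # position: (human allele, Pan allele)
    250: ("C", "T"),
    731: ("G", "A"),
}


def human_fraction(observations, sites=DIAGNOSTIC_SITES):
    """Return (human_count, pan_count, fraction_human) over diagnostic sites.

    observations: iterable of (position, base) pairs, one per read base
    covering a site, e.g. parsed from a pileup.
    """
    counts = Counter()
    for pos, base in observations:
        if pos not in sites:
            continue  # not a diagnostic position, so uninformative
        human_allele, pan_allele = sites[pos]
        base = base.upper()
        if base == human_allele:
            counts["human"] += 1
        elif base == pan_allele:
            counts["pan"] += 1
        else:
            counts["other"] += 1  # sequencing error or unexpected variant
    informative = counts["human"] + counts["pan"]
    fraction = counts["human"] / informative if informative else None
    return counts["human"], counts["pan"], fraction


# Toy input: one read base per tuple.
obs = [(100, "G"), (100, "G"), (100, "A"), (250, "T"),
       (731, "a"), (731, "A"), (999, "C")]
print(human_fraction(obs))
```

Because mitochondrial coverage is usually much higher than nuclear coverage, the counts pooled over sites may still be informative even when the nuclear read depth is low.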
I know tools such as Kraken can be used to identify contamination, but I'm curious whether anyone could suggest other ways to identify potential human contamination in samples from species closely related to humans, such as chimpanzees or bonobos.
Likely not useful when species are so closely related.
Do you have your own sequence data for the species you mention? Take a look at RemoveHuman.sh from the BBMap suite, which is described in this thread. It is going to be difficult to decide what is human contamination when you are working with related sequences such as these.
Thanks for your reply, I'll check that out. Yes, I am aware that it's going to be difficult since they are closely related. Yes, I have sequence data for both bonobos and chimpanzees. Another problem is that the general read depth is below 5. Do you know of any conceptual approaches I could try to examine potential contamination?
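One conceptual approach that might still work at low depth is competitive mapping: align each read to both the Pan and the human reference, compare the alignment scores, and count how many reads clearly prefer the human reference. The sketch below is an illustration under assumptions, not an established method. The `classify_reads` helper, the `min_margin` threshold, and the toy scores are hypothetical, and the scores are assumed to have been extracted beforehand (for example from the `AS` tag that many aligners write).

```python
def classify_reads(pan_scores, human_scores, min_margin=5):
    """Label each read 'pan', 'human', or 'ambiguous'.

    pan_scores / human_scores: dicts mapping read name -> alignment score
    against the respective reference. A read missing from one dict is
    treated as unaligned to that reference. min_margin is an arbitrary,
    hypothetical threshold for calling a clear preference.
    """
    labels = {}
    for read in set(pan_scores) | set(human_scores):
        pan = pan_scores.get(read)
        human = human_scores.get(read)
        if human is None:
            labels[read] = "pan"        # aligned only to the Pan reference
        elif pan is None:
            labels[read] = "human"      # aligned only to the human reference
        elif human - pan >= min_margin:
            labels[read] = "human"
        elif pan - human >= min_margin:
            labels[read] = "pan"
        else:
            labels[read] = "ambiguous"  # scores too close to call
    return labels


# Toy scores for four reads.
pan = {"r1": 100, "r2": 80, "r3": 95}
human = {"r1": 90, "r2": 100, "r3": 96, "r4": 60}
labels = classify_reads(pan, human)
n_human = sum(1 for label in labels.values() if label == "human")
print(labels, n_human / len(labels))
```

Most reads will likely end up ambiguous because the genomes are so similar, so the interesting signal is the ratio of clearly human to clearly Pan reads, ideally compared against a sample believed to be clean. Differences in assembly quality between the references could bias the scores, so any cutoff would need to be checked empirically.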