Hello there, I have about 500 samples that were whole-genome sequenced on two different platforms: Illumina HiSeq 2000 (240 individuals) and Illumina HiSeq X Ten (260 individuals).
After mapping to the reference genome to get the BAM files, I applied two methods (a genotype likelihood approach and a single-read sampling approach) for PCA, but the results were confusing: the samples clustered by platform (the HiSeq 2000 samples together, and likewise the HiSeq X Ten samples) rather than by population.
I'm not sure why there is such a large difference between platforms in the PCA results. Could anyone with experience handling this kind of data give us some suggestions for the analysis? And how can this bias be corrected?
How reliable are your variant calls? What coverage do you have?
If your calls are not very reliable (maybe because the coverage is low), then platform biases will play a stronger role. Try to get as clean a set of variants as possible. Which method did you use to call variants?
Hello, many thanks for your response! The sequencing coverage on both platforms is 5X, and I called SNPs using the GATK Best Practices pipeline and the ANGSD genotype likelihood approach. I have one more question: if the sequencing coverage on both platforms were increased to 30X, would that correct the platform bias?
I would expect that increasing the coverage (and with it the quality of the calls) would reduce platform biases. Even with 5X you should be able to test this by analysing only those calls that are high quality on both platforms. There should be enough good calls. You may want to restrict your analysis to homozygotes, which are easier to call. For example, take all sites covered by at least 8 reads in which every read supports the alternative allele. You can also look at the calls that are shared within one platform only and analyse their substitution signatures (G>T, etc.). Maybe this gives some hint.
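A minimal sketch of that filter and signature tally, in case it helps. It assumes you have already extracted, for each site, the reference base, the alternative base, the total read depth and the number of reads supporting the alternative allele (for instance from the DP and AD fields of a VCF). The tuple layout, the function names and the example sites are all made up for illustration; they are not part of GATK or ANGSD.

```python
from collections import Counter


def is_confident_hom_alt(depth, alt_reads, min_depth=8):
    """True if the site has at least min_depth reads and all of them support the alt allele."""
    return depth >= min_depth and alt_reads == depth


def signature_counts(sites, min_depth=8):
    """Count substitution types (e.g. 'G>T') among confident homozygous-alt SNP sites.

    `sites` is an iterable of (ref, alt, depth, alt_reads) tuples; this layout
    is an assumption for the sketch, not a standard format.
    """
    counts = Counter()
    for ref, alt, depth, alt_reads in sites:
        # Keep single-base substitutions only; indels are skipped.
        if len(ref) != 1 or len(alt) != 1:
            continue
        if is_confident_hom_alt(depth, alt_reads, min_depth):
            counts[f"{ref}>{alt}"] += 1
    return counts


# Hypothetical sites called only in samples from one platform.
platform_only_sites = [
    ("G", "T", 10, 10),   # kept
    ("G", "T", 9, 9),     # kept
    ("A", "G", 12, 12),   # kept
    ("C", "T", 8, 7),     # dropped: one read does not support the alt allele
    ("A", "G", 5, 5),     # dropped: depth below 8
    ("AT", "A", 10, 10),  # dropped: indel
]

counts = signature_counts(platform_only_sites)
for substitution, n in counts.most_common():
    print(substitution, n)
```

Running the same tally separately on the calls private to each platform and comparing the proportions would show whether one substitution type is over-represented on one of them.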
Thanks a lot! I will try this out, following your suggestions.