Hi everyone, I have a next-generation sequencing data set containing two haplotypes: one accounts for 95% of the sequencing data and the other for only 5%. I used assembly methods to obtain the dominant haplotype, and then mapped the raw reads back to the assembled contigs. Since the sequence similarity between the two haplotypes is high, I am wondering how I can tell whether a mismatch between a read and the reference comes from a sequencing error or from the minor haplotype.
Of course, the sequencing error rate and the degree of sequence difference between the two haplotypes are different, and sequencing errors tend to be random. Still, I need a probability or statistical model for this problem so I can arrive at a theoretically sound solution. For example, if multiple mapped reads share the same mismatch at the same position of the reference, that mismatch very likely comes from the minor haplotype. Perhaps a hypothesis-testing method?
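To make the idea concrete, here is a toy sketch of the kind of binomial test I have in mind: under the null hypothesis that all mismatches at a site are independent sequencing errors, the mismatch count at a site of depth n is Binomial(n, e). The depth, mismatch count, and error rate below are made-up numbers for illustration.

```python
from math import comb

def binom_pvalue(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the probability of seeing
    at least k mismatches at a site of depth n if every mismatch
    were an independent sequencing error with per-base rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical site: depth 1000, 40 reads carry the same mismatch,
# per-base error rate 0.1% (roughly Phred 30).
pval = binom_pvalue(40, 1000, 0.001)

# A tiny p-value rejects "sequencing error alone"; with a 5% minor
# haplotype we would expect ~50 mismatching reads at this depth,
# so 40 is consistent with a true minor-haplotype variant.
```

A site where the null is rejected would then be a candidate minor-haplotype difference rather than noise; with many sites tested, some multiple-testing correction would also be needed.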
The base quality scores at the location in question are also useful, as are other factors such as whether the reads indicating the minor allele are properly paired and occur in both orientations. I suggest you use a variant caller that models some of these things and see what it reports. If your data had different barcodes for the different haplotypes this would be straightforward, but it's not clear to me what you mean by data with 95% of one haplotype and 5% from the other.
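To illustrate how base qualities can be folded in, here is a minimal sketch of a per-site likelihood-ratio comparison between "the minor allele is real at frequency f" and "errors only" (f = 0). The error model is deliberately simplified (each read is wrong with its Phred-implied probability, ignoring which wrong base is produced), and the pileup numbers are invented; real variant callers model much more than this.

```python
from math import log

def phred_to_err(q: int) -> float:
    """Phred quality score -> probability the base call is wrong."""
    return 10 ** (-q / 10.0)

def loglik(alt_quals, ref_quals, f):
    """Log-likelihood of a pileup under minor-allele frequency f.
    A read shows the alternate base either because it samples the
    minor haplotype and is called correctly, or because it samples
    the major haplotype and is miscalled (and vice versa for reads
    showing the reference base)."""
    ll = 0.0
    for q in alt_quals:
        e = phred_to_err(q)
        ll += log(f * (1 - e) + (1 - f) * e)
    for q in ref_quals:
        e = phred_to_err(q)
        ll += log((1 - f) * (1 - e) + f * e)
    return ll

# Hypothetical pileup: 40 Q30 reads show the mismatch, 960 show
# the reference. Compare f = 5% against f = 0 (errors only).
alt_quals = [30] * 40
ref_quals = [30] * 960
llr = loglik(alt_quals, ref_quals, 0.05) - loglik(alt_quals, ref_quals, 0.0)
# A strongly positive log-likelihood ratio favours a real
# minor-haplotype allele over sequencing error alone.
```

The same comparison run on a pileup with no alternate reads comes out negative, favouring the error-only model, which is the behaviour you want from such a test.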
Thanks for your suggestions. The 95% and 5% are the percentages of reads corresponding to the dominant and minor haplotypes, respectively. Unfortunately, there are no barcodes for the two haplotypes.
I guess what I don't understand is how you have this ratio of haplotypes. Is this a combination of two strains of bacteria in a culture, for example?
Yes, something like that: two strains of HIV-1 virus were combined for sequencing.
Ah, I see. Yes, a variant-caller capable of handling low-frequency variants or arbitrary ploidy should help in this situation.