Hi,
I am doing amplicon sequencing (100 000x) on several targets in a bacterium. I expect most of the cultures to be clonal; however, some will be heterogeneous populations. Using the paired-end short reads, I want to determine the haplotypes of the bacteria present. The first step is to determine the phase of the mutations. I've searched extensively but can't find a tool that incorporates the information from read 2.
==========Read 1========...==========Read 2========
-----C------------------...------------------------
-----C------------------...------------------------
-----C------------------...------------------------
-----A------------------...------------------------
-----A------------------...------------------------
-----A------------------...------------------------
-----A--------G---------...---------G--------------
-----A--------G---------...---------G--------------
-----A--------G---------...---------G--------------
-----A--------G---------...---------G--------------
==========Read 1========...==========Read 2========
---------G--------------...----------T-------------
---------G--------------...----------T-------------
---------G--------------...----------T-------------
------------------------...----------T-------------
------------------------...----------T-------------
The idea is to use the information from both reads of a pair for phasing, which would give:
-----C------------------------------------------????????????????????????
-----A------------------------------------------????????????????????????
-----A--------G------------------G------------------------T-------------
????????????????????????----------------------------------T-------------
Maybe I could iterate over the mutations with pysam and determine the phase?
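A minimal sketch of what that iteration could look like, assuming the per-base observations have already been pulled out of the BAM. The function name `fragment_patterns` and the `(fragment_name, ref_pos, base)` triple format are hypothetical, and `?` marks a site the fragment does not cover. Both mates share a fragment name, so read 1 and read 2 end up in the same pattern.

```python
from collections import Counter, defaultdict

def fragment_patterns(observations, sites):
    """Count per-fragment allele patterns across the given variant sites.

    observations: iterable of (fragment_name, ref_pos, base), pooled from
        read 1 and read 2. With pysam these could plausibly come from
        AlignedSegment.query_name, get_aligned_pairs(matches_only=True)
        and query_sequence (an assumption, not tested here).
    sites: ordered list of reference positions to phase.
    Returns a Counter mapping a pattern string to the number of fragments.
    """
    index = {pos: i for i, pos in enumerate(sites)}
    per_fragment = defaultdict(lambda: ["?"] * len(sites))
    for name, pos, base in observations:
        if pos not in index:
            continue  # not a site of interest
        slot = per_fragment[name]
        i = index[pos]
        if slot[i] == "?":
            slot[i] = base
        elif slot[i] != base:
            slot[i] = "N"  # mates overlap here and disagree
    return Counter("".join(p) for p in per_fragment.values())

# Toy input: two fragments carry A..G..G, one carries C..T with no read 2 coverage.
obs = [
    ("f1", 5, "A"), ("f1", 14, "G"), ("f1", 60, "G"),
    ("f2", 5, "A"), ("f2", 14, "G"), ("f2", 60, "G"),
    ("f3", 5, "C"), ("f3", 14, "T"),
]
patterns = fragment_patterns(obs, [5, 14, 60])
```

Keeping the reference base at covered sites (rather than only the mutations) matters, because "covered and reference" and "not covered" mean different things for phasing.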
I'd also like the frequencies, and then to determine the most likely haplotypes of the distinct bacterial populations. I'm not sure how to do that either, but since I expect a clonal population, I suppose I would use the frequencies to find the minimal number of distinct haplotypes.
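One rough way to sketch the "minimal number of haplotypes" idea is a greedy pass over pattern counts, where each pattern is a string with one character per variant site and `?` for sites a fragment does not cover. This is only a heuristic under stated assumptions, not an exact minimum: the names `compatible` and `greedy_haplotypes` and the `min_freq` error cutoff are hypothetical, and partial patterns are never merged into longer ones.

```python
from collections import Counter

def compatible(a, b):
    """True if two patterns agree at every site where both are known."""
    return all(x == y or "?" in (x, y) for x, y in zip(a, b))

def greedy_haplotypes(pattern_counts, min_freq=0.01):
    """Greedily explain patterns with as few haplotypes as possible.

    Returns a dict mapping haplotype pattern to estimated frequency.
    """
    total = sum(pattern_counts.values())
    # Most informative patterns first (fewest '?'), then most abundant.
    ordered = sorted(pattern_counts.items(),
                     key=lambda kv: (kv[0].count("?"), -kv[1]))
    haplotypes = {}  # pattern -> (possibly fractional) fragment count
    for pattern, n in ordered:
        if n / total < min_freq:
            continue  # assumed to be sequencing or PCR error
        matches = [h for h in haplotypes if compatible(h, pattern)]
        if not matches:
            haplotypes[pattern] = float(n)  # needs a new haplotype
            continue
        # Share the fragments among compatible haplotypes by current weight.
        weight = sum(haplotypes[h] for h in matches)
        shares = {h: n * haplotypes[h] / weight for h in matches}
        for h, share in shares.items():
            haplotypes[h] += share
    assigned = sum(haplotypes.values())
    return {h: c / assigned for h, c in haplotypes.items()}

# Toy counts: the partial pattern "C??" can only belong to "CT?".
freqs = greedy_haplotypes(Counter({"AGG": 6, "CT?": 3, "C??": 1}))
```

In this toy case the partial `C??` fragments are absorbed by `CT?`, so two haplotypes explain all the data. A partial pattern compatible with several haplotypes is split in proportion to their current counts, which is a design choice rather than a statistically justified estimate.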
I feel there must be tools already developed, maybe for microbiome analyses, that I'm missing? Otherwise, any thoughts on how to tackle this?