Hi,
I am looking at gene cassette insertions in diploid yeast samples. One gene cassette is known to be heterozygous in a specific sample (called s0), which was confirmed by PCR. I made phased assemblies (hifiasm and flye+hapdup), but I found this cassette in both haplotypes of both assemblies, which suggests it is homozygous. The assemblies are pretty good and contiguous.
So I looked at the reads directly, since they should still carry the real heterozygous information:
- I `BLAST` the gene cassette sequence against the s0 reads and extract the IDs of the reads that match with >99% identity
- I align the s0 reads against the phased assembly of s0 and identify the region where the reads found by BLAST align
- Then I extract all the reads aligned in this region and compute a proportion: reads from BLAST / all reads in the region
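To make the calculation concrete, here is a minimal Python sketch of the three steps above. It is only an illustration under assumptions that are not stated in the post: the BLAST output is tabular (`-outfmt 6`) with the cassette as query and the reads as subjects (so the read ID is column 2 and percent identity is column 3), and the reads of the region are available as SAM text lines, for example from `samtools view` on that region. The function names and the toy records are hypothetical.

```python
def cassette_read_ids(blast_lines, min_identity=99.0):
    """Collect IDs of reads that hit the cassette above an identity cutoff.

    Assumes BLAST tabular output (-outfmt 6) with the cassette as query,
    so column 2 is the read ID and column 3 is the percent identity.
    """
    ids = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[2]) > min_identity:
            ids.add(fields[1])
    return ids


def region_read_ids(sam_lines):
    """Collect IDs of mapped reads from SAM text lines for one region."""
    ids = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.split("\t")
        if int(fields[1]) & 0x4:
            continue  # FLAG bit 0x4 means the read is unmapped
        ids.add(fields[0])  # a set counts each read once
    return ids


def cassette_proportion(cassette_ids, region_ids):
    """Fraction of the region's reads that also carry the cassette."""
    if not region_ids:
        return 0.0
    return len(cassette_ids & region_ids) / len(region_ids)


# Toy records (made up for illustration, not real data).
blast_lines = [
    "cassette\tread1\t99.8\t1500\t3\t0\t1\t1500\t200\t1699\t0.0\t2700",
    "cassette\tread2\t97.0\t1500\t45\t0\t1\t1500\t10\t1509\t0.0\t2300",
    "cassette\tread3\t99.5\t1500\t7\t0\t1\t1500\t900\t2399\t0.0\t2650",
]
sam_lines = [
    "@HD\tVN:1.6",
    "read1\t0\thap1_ctg\t100\t60\t*\t*\t0\t0\t*\t*",
    "read3\t16\thap1_ctg\t150\t60\t*\t*\t0\t0\t*\t*",
    "read4\t0\thap1_ctg\t120\t60\t*\t*\t0\t0\t*\t*",
    "read5\t0\thap1_ctg\t300\t60\t*\t*\t0\t0\t*\t*",
    "read6\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]

with_cassette = cassette_read_ids(blast_lines)
in_region = region_read_ids(sam_lines)
proportion = cassette_proportion(with_cassette, in_region)
print(f"{len(with_cassette & in_region)} / {len(in_region)} = {proportion:.2f}")
```

In this sketch the denominator is every mapped read in the region, as in the steps above. Reads that overlap the region but do not reach the insertion site cannot contain the cassette, so how the region is delimited will directly affect the proportion.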
I get 27% for haplotype 1 and 29% for haplotype 2. These proportions are probably too low for the assemblers to distinguish the two phases, so they treat the region as homozygous.
I would like to know whether this method is reliable, because I have the same problem for every gene cassette in all of my samples (some of them are expected to be heterozygous, but I only find homozygous ones).
Best