Question

Is vg toolkit suitable for haplotype calling?

0

Entering edit mode

12 weeks ago

Pau • 0

Hello, I am interested in obtaining and quantifying the haplotypes (of only a gene) present in a FASTQ file. Currently, I have high-confidence variant calls in a vcf (performed from short read WGS) but it is unphased. I do have cDNA ONT long reads spanning the whole regions that I would like to use to obtain the haplotypes.

So far I have build a vg graph with a reference assembly (fasta) and an unphased vcf, after indexing I have mapped my long-reads FASTQ files to the graph. My expectation was that after this I could extract the haplotypes and later quantify them. However, now after going through the github issues I am not sure if it is possible to extract the haplotypes without an already phased vcf.

I would appreciate any recommendation or insights. Thanks a lot.

pangenome haplotype vg • 946 views

ADD COMMENT • link updated 12 weeks ago by Jordan M Eizenga ▴ 740 • written 12 weeks ago by Pau • 0

score 1 · Answer 1 · 2025-04-22

1

Entering edit mode

12 weeks ago

Jordan M Eizenga ▴ 740

When you say "extract the haplotypes", are you referring to the read alignments themselves, or do you mean inferring the underlying haplotypes based on the read alignments? The former is certainly something that vg can do, the latter is not something that has been built into it.

ADD COMMENT • link 12 weeks ago by Jordan M Eizenga ▴ 740

0

Entering edit mode

What I mean is: from a set of long reads and a pangenome with no haplotype paths (because it was built with unphased vcf) obtain the different haplotypes a sample has (each of which should be represented by different groups of long reads). In other words, I see it like adding the haplotype paths to the pangenome based on my long reads. With this then I'd like to quantify how many reads are aligning to each haplotype path.

I am not really sure what do you mean with "read alignments themselves" in this context. Could I extract the set of nodes that each read is using for the alignment, to build the haplotypes manually somehow?

Thanks

ADD REPLY • link 12 weeks ago by Pau • 0

1

Entering edit mode

The read alignment is a path through the graph, so in that sense, I think the read alignment might be the path you are looking for. For downstream inference, however, the main tool that's around is rpvg, which is designed around the assumption that you already have haplotypes that you can project transcripts onto. Apart from rpvg, you could do simple transcript counting by hacking on GAF output, but not the more sophisticated EM-based methods that account for gene families with similar sequences.

Have I also gathered correctly that you also do not have existing annotations for this genome? If not, VG might not be the best choice for you. vg mpmap, which is the spliced aligner in VG, relies more heavily on existing annotations than most linear reference spliced aligners do. I would expect relatively poor detection of splice junctions without good existing annotations.

ADD REPLY • link 11 weeks ago by Jordan M Eizenga ▴ 740

0

Entering edit mode

Just to be more explicit: I am working with a polyploid monoexonic gene whose boundaries are well annotated. I have many cDNA ONT reads (full of technical errors) that I want to use to obtain the different haplotypes found in each sample. Due to the error rate of ONT I can only trust the variants that I called using short read WGS.

1) If with my (unphased) vcf and my fasta reference I construct a pangenome of this gene and then map the ONT reads, from the .gam can I obtain a series of sequences representing the different paths that my reads follow without being affected by the technical errors? In other words, if my sample has 4 different haplotypes, with the alignment can I obtain 4 single haplotype sequences strictly following the graph sequences or I will get many more sequences due to the errors in the reads?

2) My plan was, after obtaining the different haplotypes, do a MSA and use it to create a pangenome with the haplotypes embedded so I can use rpvg to quantify how many reads have generated each haplotype. Would this be correct?

Thanks for the patience and help!

ADD REPLY • link 11 weeks ago by Pau • 0

1

Entering edit mode

If your reads have errors that look similar to variants, you can certainly get alignment paths that correspond to non-biological transcript paths. You haven't provided any phasing data, so the graph can't really have any knowledge of what the allowable combinations of variants are. Ideally, there are fewer non-biological paths than there are biological ones, but it's a non-trivial inference problem to sort out what the true paths are. If the error rate, transcript length, and heterozygosity are favorable enough, you might be able to just take the most frequent paths. That won't always be possible though.

ADD REPLY • link 11 weeks ago by Jordan M Eizenga ▴ 740

0

Entering edit mode

So if I manually correct the reads so they only include biological variation (that is also included in the graph) then I can manually extract the paths from the alignments and extract the haplotypes, right?