Hello, I am interested in obtaining and quantifying the haplotypes (of only a gene) present in a FASTQ file. Currently, I have high-confidence variant calls in a vcf (performed from short read WGS) but it is unphased. I do have cDNA ONT long reads spanning the whole regions that I would like to use to obtain the haplotypes.
So far I have build a vg graph with a reference assembly (fasta) and an unphased vcf, after indexing I have mapped my long-reads FASTQ files to the graph. My expectation was that after this I could extract the haplotypes and later quantify them. However, now after going through the github issues I am not sure if it is possible to extract the haplotypes without an already phased vcf.
I would appreciate any recommendation or insights. Thanks a lot.
What I mean is: from a set of long reads and a pangenome with no haplotype paths (because it was built with unphased vcf) obtain the different haplotypes a sample has (each of which should be represented by different groups of long reads). In other words, I see it like adding the haplotype paths to the pangenome based on my long reads. With this then I'd like to quantify how many reads are aligning to each haplotype path.
I am not really sure what do you mean with "read alignments themselves" in this context. Could I extract the set of nodes that each read is using for the alignment, to build the haplotypes manually somehow?
Thanks
The read alignment is a path through the graph, so in that sense, I think the read alignment might be the path you are looking for. For downstream inference, however, the main tool that's around is
rpvg
, which is designed around the assumption that you already have haplotypes that you can project transcripts onto. Apart fromrpvg
, you could do simple transcript counting by hacking on GAF output, but not the more sophisticated EM-based methods that account for gene families with similar sequences.Have I also gathered correctly that you also do not have existing annotations for this genome? If not, VG might not be the best choice for you.
vg mpmap
, which is the spliced aligner in VG, relies more heavily on existing annotations than most linear reference spliced aligners do. I would expect relatively poor detection of splice junctions without good existing annotations.Just to be more explicit: I am working with a polyploid monoexonic gene whose boundaries are well annotated. I have many cDNA ONT reads (full of technical errors) that I want to use to obtain the different haplotypes found in each sample. Due to the error rate of ONT I can only trust the variants that I called using short read WGS.
1) If with my (unphased) vcf and my fasta reference I construct a pangenome of this gene and then map the ONT reads, from the .gam can I obtain a series of sequences representing the different paths that my reads follow without being affected by the technical errors? In other words, if my sample has 4 different haplotypes, with the alignment can I obtain 4 single haplotype sequences strictly following the graph sequences or I will get many more sequences due to the errors in the reads?
2) My plan was, after obtaining the different haplotypes, do a MSA and use it to create a pangenome with the haplotypes embedded so I can use rpvg to quantify how many reads have generated each haplotype. Would this be correct?
Thanks for the patience and help!
If your reads have errors that look similar to variants, you can certainly get alignment paths that correspond to non-biological transcript paths. You haven't provided any phasing data, so the graph can't really have any knowledge of what the allowable combinations of variants are. Ideally, there are fewer non-biological paths than there are biological ones, but it's a non-trivial inference problem to sort out what the true paths are. If the error rate, transcript length, and heterozygosity are favorable enough, you might be able to just take the most frequent paths. That won't always be possible though.
So if I manually correct the reads so they only include biological variation (that is also included in the graph) then I can manually extract the paths from the alignments and extract the haplotypes, right?
Yes, error-free reads will generally follow the path of the transcript.