Hi,
I am trying to look at SV allele frequency differences from pooled sequencing data. I constructed a graph from a VCF of SVs and mapped short-read data from the pools to it using vg giraffe. I then ran vg pack with -Q 5, followed by vg call with the -v option to genotype the SVs in the original VCF. Finally, I extracted the ref/alternate allele frequencies from the AD field in the resulting VCF.
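For concreteness, the AD extraction step looks roughly like the sketch below. It assumes a single-sample VCF in which AD is listed as comma-separated ref,alt read depths; the function names `alt_fraction` and `iter_alt_fractions` are my own illustrative helpers, not part of vg.

```python
def alt_fraction(format_field, sample_field):
    """Return the alternate-read fraction from the AD entry, or None if unusable."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    if "AD" not in keys:
        return None
    idx = keys.index("AD")
    if idx >= len(values):
        return None
    try:
        depths = [int(x) for x in values[idx].split(",")]
    except ValueError:
        # Missing value such as "." in the AD entry
        return None
    total = sum(depths)
    if len(depths) < 2 or total == 0:
        return None
    # The first depth is the reference allele; the rest are alternate alleles
    return sum(depths[1:]) / total


def iter_alt_fractions(lines):
    """Yield (chrom, pos, id, alt_fraction) for each record of a single-sample VCF."""
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue
        yield fields[0], int(fields[1]), fields[2], alt_fraction(fields[8], fields[9])
```

Sites with no coverage or a missing AD value come back as None rather than 0, so they can be dropped instead of being counted as reference-only.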
I get quite wild swings in frequency between my pools for some SVs, which could be real. What makes me suspicious is this: I made an in silico pool from some individually sequenced samples and compared the alt-allele read support in this pool with the alt-allele read support from the individual samples. For a subset of SVs, the proportion of alternate alleles in the pool tended to disagree with the support from the individual-based files (see graph 1). These sites (in the top left corner of this graph) are disproportionately highly variable between my real poolseq files, making me suspect some kind of error in estimating the alternate/reference read frequencies for these sites. There are also some strange genotypes at these sites: they are much more likely than the background SVs to have 0/0 calls with more than 50% alternate read support. While I'm not interested in the genotype calls from the pools, this does suggest something weird might be going on somewhere. I'm a bit stumped, so any ideas on what might be going on would be much appreciated.
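To make the comparison explicit, here is a minimal sketch of the per-site check, assuming the expected pool frequency is the read-weighted alt fraction obtained by summing the individual samples' AD counts. The function names and the `max_diff` threshold of 0.2 are arbitrary illustrative choices, not anything defined by vg.

```python
def expected_alt_fraction(individual_ads):
    """Read-weighted alt fraction from a list of (ref_depth, alt_depth) pairs."""
    ref = sum(r for r, _ in individual_ads)
    alt = sum(a for _, a in individual_ads)
    total = ref + alt
    return alt / total if total else None


def flag_site(pool_gt, pool_ad, individual_ads, max_diff=0.2):
    """Summarise one SV: pool vs individual alt support, plus two suspicion flags."""
    pool_ref, pool_alt = pool_ad
    pool_total = pool_ref + pool_alt
    pool_frac = pool_alt / pool_total if pool_total else None
    expected = expected_alt_fraction(individual_ads)
    if pool_frac is None or expected is None:
        return None
    diff = pool_frac - expected
    return {
        "pool_frac": pool_frac,
        "expected_frac": expected,
        "diff": diff,
        # Pool and individual samples disagree by more than the chosen threshold
        "discordant": abs(diff) > max_diff,
        # Homozygous-reference call despite majority alternate read support
        "homref_high_alt": pool_gt in ("0/0", "0|0") and pool_frac > 0.5,
    }
```

For example, `flag_site("0/0", (3, 7), [(9, 1), (8, 2)])` describes a site where the pool has 70% alt reads, the individuals together have 15%, and the pool is still called 0/0, which is the pattern I am seeing at the suspicious sites.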
vg version v1.48.0 "Gallipoli" Compiled with g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 on Linux Linked against libstd++ 20210601