I am working on raw vcf from c.a. 200 samples run in a 4lanes Novaseq. For variant calling I used Lofreq since it is precise and consistent for low vaf snv. Since going low adds noise, I need to manually filter some variants. I am interested in snv observed at different proportions on different lanes. Since from literature it is known that some artifact are lane-specific, I am interested testing whethere some mutations appear prevalently in specific lanes. However, I tested also reference observations and it appeared that there is an umbalancing also for reference positions (for example: L1:260,L2:276,L3:186,L4:200). The point is that I was supposing that both reference and mutations should be balanced or, at least shoul be reference observations. My question is: how is possible that I observed such an unbalancing in many positions?
Which literature are you referring to? If the same pool ran on all lanes then the sequence should be equivalent.
See https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2016-08-11-2016-04-07/7635-NextSeq-500-Lanes-and-BaseRecalibrator . It has been long observed that the 'empirical' quality scores (measured by non-dbSNP mismatches) vary between lanes of the same flowcell.
As far as I know, GATK package considers the read group (and usually the RG tag is done from lanes) for building the model so it assumes differences per lane. Moreover, it is also plausible that, if the sample is run on multiple lanes, the mutations shoul be equally represented in each lane. However, in a 10-20% of positions tested I observed the unbalancing. Most strickingly, i observed a 5-10% of positions in which the mutation is equally represented in all lanes while the reference bases are not.
Is the complaint here that the depth of coverage at a position is different between the lanes? This is entirely expected from a negative binomial process...
The compliant is that coverage is lane-dependent and also the mutational vaf. By reasoning I thought that both should be comparable and if not, it could be a 'call' of artifact. I mean, it is expected to observe a situation like L1:260,L2:276,L3:186,L4:200 where most of the positions seem balanced?
You should expect per-site coverage to be variable by lane. The appropriate metric is the allelic fraction of the minor allele, which you have not posted.
As an example, this position had a vaf for the mutatoon like L1:8,L2:6,L3:8,L4:7
Update: I am working on c.a 50.000 mutations from 200 FFPE samples run on Novaseq6000. I calculate AFs for all of them for each lane. I observed c.a 15.000 snv having AF diiferences of 5-10% ± 5% among lanes. I obseved no correlation with RNA degradation, GC content and duplicates I had a smaller dataset of FF samples prepared with a different protocol, and I observed c.a 5% of unbalanced snv.
You can compute a simple p-value for your observed difference by running your table of:
through a Pearson chisq test python R. My guess is only a small number of the SNVs will have a p-value of < 0.1/50000 (bonferroni)