Hello!
We are analyzing WGS data from 60 samples (6 groups, 10 samples/group) produced on a HiSeq 4000. The mean coverage per sample is 25x (the lowest sample is 15x).
We have now realized that we need to sequence more samples to better estimate the allele frequencies. Due to budget and technical constraints, we settled on sequencing 90 samples (6 groups, 15 samples/group) at a target coverage of 5x, this time on a NovaSeq platform.
Now each group has 25 samples (10 from the HiSeq 4000 and 15 from the NovaSeq).
Our aim is to do population analysis using SNP allele frequencies after combining the HiSeq 4000 (25x coverage) data and the NovaSeq (5x coverage) data.
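As a rough illustration of why going from 10 to 25 samples per group helps, here is a minimal sketch of the binomial standard error of an allele frequency estimate. It assumes genotypes are known without error, which is optimistic for the 5x samples, and the true frequency of 0.2 is a made-up value for illustration only.

```python
import math

def allele_freq_se(p, n_diploid):
    """Binomial standard error of an allele frequency estimated
    from n_diploid diploid samples (2 * n_diploid chromosomes).
    Assumes error-free genotypes; p is a hypothetical true frequency."""
    return math.sqrt(p * (1.0 - p) / (2 * n_diploid))

# Compare the original group size (10) with the combined group size (25).
for n in (10, 25):
    print(n, round(allele_freq_se(0.2, n), 4))
# 10 0.0894
# 25 0.0566
```

Genotype uncertainty at low coverage would widen these intervals, so the numbers are best read as a lower bound on the sampling noise.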
My plan for the new batch (NovaSeq, 5x) is to run it through the steps of the GATK Best Practices up to HaplotypeCaller,
then combine it with the original batch (HiSeq 4000, 25x) using CombineGVCFs,
and do joint calling with GenotypeGVCFs.
I am working with mouse samples, so I will do VQSR afterwards.
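The plan above could be sketched roughly as follows. This only builds the GATK4 command lines and prints them; nothing is executed. All file names (ref.fa, the BAM and GVCF paths) are hypothetical placeholders, and options such as intervals, resources and the VQSR step are left out.

```python
def haplotypecaller_cmd(ref, bam, out_gvcf):
    """Per-sample calling in GVCF mode (-ERC GVCF)."""
    return ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam,
            "-O", out_gvcf, "-ERC", "GVCF"]

def combine_gvcfs_cmd(ref, gvcfs, out_gvcf):
    """Merge per-sample GVCFs from both batches into one cohort GVCF."""
    cmd = ["gatk", "CombineGVCFs", "-R", ref]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    return cmd + ["-O", out_gvcf]

def genotype_gvcfs_cmd(ref, combined_gvcf, out_vcf):
    """Joint genotyping across all samples in the combined GVCF."""
    return ["gatk", "GenotypeGVCFs", "-R", ref, "-V", combined_gvcf,
            "-O", out_vcf]

# Hypothetical inputs: one sample per batch, for illustration only.
ref = "ref.fa"
novaseq_gvcf = "novaseq_s1.g.vcf.gz"
hiseq_gvcf = "hiseq_s1.g.vcf.gz"

commands = [
    haplotypecaller_cmd(ref, "novaseq_s1.bam", novaseq_gvcf),
    combine_gvcfs_cmd(ref, [hiseq_gvcf, novaseq_gvcf], "cohort.g.vcf.gz"),
    genotype_gvcfs_cmd(ref, "cohort.g.vcf.gz", "cohort.vcf.gz"),
]
for cmd in commands:
    print(" ".join(cmd))
```

Keeping the commands as argument lists makes it easy to hand them to subprocess.run later without shell quoting problems.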
I have basically two questions:
- Is there an issue with doing joint variant calling and VQSR using information from different technologies?
- Would it be better to produce one VCF per batch and then merge them into one final VCF?
A similar thread is found here, but there the data was produced with the same technology. Nonetheless, it is mentioned that different patterns of coverage could potentially create confusion in model building during VQSR.
I know this question does not have a "do this, do that" answer. I would appreciate comments and suggestions.
DISCLAIMER: I posted this question on the GATK forum a while ago (about 2 months), but they haven't had time to address my concerns. EDIT: I added a second question to the post.
Illumina's guidance used to be "sequence produced should always be considered equivalent irrespective of type of sequencer used". This may be more or less true even now that the platforms have expanded from 4-color only to 1-, 2- and 4-color chemistries. We did some benchmarking for RNAseq data, and the results were identical when the same set of libraries was run on HiSeq 4000 and NovaSeq.
Have you done the sequencing for the second round? If not, I suggest that you add a few samples from the earlier HiSeq 4000 batch to the NovaSeq run to get equivalent NovaSeq results. That should allow you to do direct comparisons on that subset of samples. You would want to keep other parameters (coverage etc.) as equivalent as you can to prevent additional batch effects (if any).
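If a few samples were sequenced on both platforms, the direct comparison could be as simple as a genotype concordance check. Below is a minimal sketch, assuming the calls for one sample have already been loaded into dictionaries keyed by (chromosome, position); the sites and genotypes are invented for illustration.

```python
def normalize_gt(gt):
    """Treat 0/1, 1/0 and 0|1 as the same unphased genotype."""
    return tuple(sorted(gt.replace("|", "/").split("/")))

def genotype_concordance(calls_a, calls_b):
    """Fraction of sites called in both sets that have the same genotype.
    Returns None when the two call sets share no sites."""
    shared = [site for site in calls_a if site in calls_b]
    if not shared:
        return None
    matches = sum(
        1 for site in shared
        if normalize_gt(calls_a[site]) == normalize_gt(calls_b[site])
    )
    return matches / len(shared)

# Hypothetical calls for the same sample on the two platforms.
hiseq = {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr2", 50): "0/0"}
novaseq = {("chr1", 100): "1/0", ("chr1", 200): "0/1",
           ("chr2", 50): "0/0", ("chr2", 80): "0/1"}

print(genotype_concordance(hiseq, novaseq))  # 2 of 3 shared sites agree
```

With unequal coverage between the runs, some discordance would come from depth alone, so this would not isolate a platform effect on its own.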
But this is variant calling. There might be base-calling biases between the platforms, which would not affect RNAseq so much.
Possibly. That is the reason for my recommendation that the OP include some past samples in the new NovaSeq run, to test that possibility directly. The sequencing being done is shallow (5x), so the results are not going to identify rare SNPs.
We did not add samples from the last batch into the new batch.
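To put a number on the depth concern, here is a back-of-the-envelope sketch of how often a heterozygous site would show the alternate allele in at least two reads. It assumes a fixed depth, a 50/50 chance for each read to carry either allele, and no sequencing errors; the two-read threshold is an arbitrary illustration and not how GATK decides a call.

```python
from math import comb

def prob_het_detected(depth, min_alt_reads=2):
    """P(at least min_alt_reads alt-supporting reads) at a het site,
    with alt reads ~ Binomial(depth, 0.5). Idealized: fixed depth,
    no sequencing error, no mapping bias."""
    p_too_few = sum(comb(depth, k) for k in range(min_alt_reads)) / 2 ** depth
    return 1.0 - p_too_few

for depth in (5, 25):
    print(depth, round(prob_het_detected(depth), 4))
# 5 0.8125
# 25 1.0
```

Rare alleles are carried mostly by heterozygotes, so under these assumptions roughly one in five such sites would be poorly supported in any single 5x sample.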
So you already did the sequencing?
Yes, we just got the new data.