As I understand it, large sequencing studies coordinate their contributing centers so that they use similar protocols, in order to avoid batch effects.
Although it seems less common, I do see examples of jointly calling samples that were processed with different kits and/or sequencing platforms.
On the other hand, examples of jointly calling samples that were sequenced with different read lengths seem much harder to find. My intuition is that read length would affect quality at the mapping stage so strongly that it would in turn introduce a large bias.
Is joint calling of samples sequenced with different read lengths aggressively avoided, which is why I don’t see many examples? If the answer is no, are there best practices for approaching this type of processing?
Thank you for taking the time to respond.
Let's say you wanted to test for association between rare variants and some disease. You notice there are several separately sequenced cohorts (from different studies) that contain samples with the phenotype of interest.
One option appears to be a meta-analysis of the individual studies' association test results. But as I understand it, this approach may not perform as well with low-coverage data.
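To make the first option concrete, here is a minimal sketch of a fixed-effect, inverse-variance-weighted meta-analysis for a single variant. It assumes each study reports an effect estimate and a standard error for the same variant and allele coding; the function name and the example numbers are hypothetical. Rare-variant meta-analyses in practice often combine per-study score statistics or gene-level tests instead, so treat this only as an illustration of the general idea.

```python
import math
from typing import List, Tuple


def fixed_effect_meta(betas: List[float], ses: List[float]) -> Tuple[float, float, float, float]:
    """Combine per-study effect estimates with inverse-variance weights.

    Returns (pooled_beta, pooled_se, z, two_sided_p).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("need one standard error per effect estimate")
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / (se ** 2) for se in ses]
    total = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled_beta / pooled_se
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z, p


if __name__ == "__main__":
    # Hypothetical per-cohort results for one variant.
    beta, se, z, p = fixed_effect_meta([0.8, 0.5, 1.1], [0.4, 0.6, 0.7])
    print(f"beta={beta:.3f} se={se:.3f} z={z:.2f} p={p:.4f}")
```

Studies with larger standard errors (for example, cohorts with low coverage at the site) contribute less weight, which is one way to see why this approach can lose power when coverage is low.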
A second option might be to “jointly process” the samples through a pipeline like GATK, then perform one association test on the resulting single call set. As I understand it, this second option should help reduce the impact of batch effects from differences in prep, in addition to improving calling, particularly in lower-coverage regions.
By “jointly processing”, I mean a process where one runs HaplotypeCaller to get a gVCF file for each sample, then combines the gVCFs and performs joint genotyping with GenotypeGVCFs.
I have seen this joint analysis done in cases where the contributing cohorts were sequenced on different machines and, of course, at different centers. Instances where joint genotyping is performed on samples from different cohorts that were sequenced with different read lengths seem harder to find, which is how this question originally came about.
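For reference, the workflow described above can be sketched as the following sequence of GATK4 commands, built here as argument lists in Python. This is a minimal sketch, not a recommended configuration: the file names, the sample names, and the helper function names are hypothetical, and CombineGVCFs is assumed as the combining step (larger cohorts often use a different consolidation step).

```python
from typing import Dict, List


def haplotypecaller_cmd(ref: str, bam: str, out_gvcf: str) -> List[str]:
    # Per-sample calling in GVCF mode (-ERC GVCF).
    return ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam,
            "-O", out_gvcf, "-ERC", "GVCF"]


def combine_gvcfs_cmd(ref: str, gvcfs: List[str], out_gvcf: str) -> List[str]:
    # Merge the per-sample gVCFs into one multi-sample gVCF.
    cmd = ["gatk", "CombineGVCFs", "-R", ref]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    return cmd + ["-O", out_gvcf]


def genotype_gvcfs_cmd(ref: str, combined_gvcf: str, out_vcf: str) -> List[str]:
    # Joint genotyping across all samples in the combined gVCF.
    return ["gatk", "GenotypeGVCFs", "-R", ref, "-V", combined_gvcf,
            "-O", out_vcf]


def joint_calling_plan(ref: str, bams: Dict[str, str], prefix: str) -> List[List[str]]:
    """Return the ordered list of commands for the whole workflow."""
    plan = []
    gvcfs = []
    for sample, bam in sorted(bams.items()):
        gvcf = f"{sample}.g.vcf.gz"
        gvcfs.append(gvcf)
        plan.append(haplotypecaller_cmd(ref, bam, gvcf))
    combined = f"{prefix}.g.vcf.gz"
    plan.append(combine_gvcfs_cmd(ref, gvcfs, combined))
    plan.append(genotype_gvcfs_cmd(ref, combined, f"{prefix}.vcf.gz"))
    return plan


if __name__ == "__main__":
    # Hypothetical inputs: two cohorts sequenced with different read lengths.
    bams = {"cohortA_s1": "cohortA_s1.bam", "cohortB_s1": "cohortB_s1.bam"}
    for cmd in joint_calling_plan("ref.fasta", bams, "joint"):
        print(" ".join(cmd))
```

Nothing in these commands depends on read length; the per-sample step runs independently for each BAM, so any read-length effect would enter through the alignments and the resulting per-sample likelihoods rather than through the joint genotyping step itself.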
This might be a better question for the GATK help center, but I'm pretty sure the GATK pipeline will allow this. I doubt anyone has studied it specifically, but you could search for publications on potential variant calling biases arising from libraries with different read lengths. I am sure people have done this kind of joint calling before, especially when calling tumor/normal variants. The additional power you gain from a larger cohort will likely outweigh the potential negatives of batch effects.