Question

Jointly calling samples from different studies, which have different read lengths?

0

4.1 years ago

boxate1618 ▴ 60

As I understand it, large sequencing studies coordinate so that contributing centers use similar protocols to avoid batch effects.

Although seemingly less common, I see examples of jointly calling samples that were processed with different kits and/or sequencing platforms.

On the other hand, examples of jointly calling samples that were sequenced with different read lengths seem much harder to find. My intuition is that read length has such a large impact on quality at the mapping stage that it would in turn introduce large bias.

Is joint calling of samples sequenced with different read lengths aggressively avoided, which is why I don’t see many examples? If the answer is no, then are there best practices for approaching this type of processing?

joint variant gatk calling • 1.3k views

updated 4.0 years ago by heskett ▴ 110 • written 4.1 years ago by boxate1618 ▴ 60

score 0 · Answer 1 · 2021-04-29

0

4.1 years ago

heskett ▴ 110

I'm assuming you're talking about variant calling? Genotyping? Calling tumor mutations? Variant discovery? It would help to explain a little more exactly what you're trying to do, as there's no yes-or-no answer to this question. In general, we try to do the most rigorous analysis possible given the data that can be collected. There might be some regions of the genome where 36 bp single-end reads don't map uniquely while 150 bp paired-end reads do, but it's hard to know whether that could affect your analysis without more details.

4.1 years ago by heskett ▴ 110

0

Thank you for taking the time to respond.

Let's say you wanted to test for association between rare variants and some disease. You notice there are several separately sequenced cohorts (from different studies) that contain samples with the phenotype of interest.

One option appears to be meta-analysis of individual study association test results. But as I understand it, this approach may not perform as well for low coverage.

A second option might be to “jointly process” the samples through a pipeline like GATK, then perform one association test on the resulting single call set. As I understand it, this second option should help reduce the impact of batch effects from differences in prep, in addition to improving calling, particularly in lower-coverage regions.

By jointly processing, I mean running HaplotypeCaller to get a gVCF file for each sample, then combining the gVCFs and performing joint genotyping with GenotypeGVCFs.
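
To make that workflow concrete, here is a minimal sketch in Python that only builds the GATK4 command lines for the three steps and does not run them. The file names (`ref.fasta`, the BAM names, the `gvcf` output directory) and the helper function names are hypothetical, and it assumes the combining step is done with CombineGVCFs; a real pipeline might use a different combining step and would add interval and resource options.

```python
from pathlib import Path
from typing import List


def haplotypecaller_cmd(ref: str, bam: str, out_dir: str = "gvcf") -> List[str]:
    """Per-sample step: run HaplotypeCaller in GVCF mode on one BAM."""
    # Hypothetical convention: the BAM file name (minus .bam) is the sample ID.
    sample = Path(bam).stem
    gvcf = f"{out_dir}/{sample}.g.vcf.gz"
    return ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam,
            "-O", gvcf, "-ERC", "GVCF"]


def combine_gvcfs_cmd(ref: str, gvcfs: List[str], out: str) -> List[str]:
    """Cohort step 1: merge the per-sample gVCFs into one multi-sample gVCF."""
    cmd = ["gatk", "CombineGVCFs", "-R", ref]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    return cmd + ["-O", out]


def genotype_gvcfs_cmd(ref: str, combined: str, out: str) -> List[str]:
    """Cohort step 2: joint genotyping across all samples."""
    return ["gatk", "GenotypeGVCFs", "-R", ref, "-V", combined, "-O", out]


def joint_calling_plan(ref: str, bams: List[str], out_dir: str = "gvcf") -> List[List[str]]:
    """Return every command, in order, for a joint-calling run."""
    per_sample = [haplotypecaller_cmd(ref, bam, out_dir) for bam in bams]
    # Recover each gVCF path from the argument following "-O".
    gvcfs = [cmd[cmd.index("-O") + 1] for cmd in per_sample]
    combined = f"{out_dir}/cohort.g.vcf.gz"
    final = f"{out_dir}/cohort.vcf.gz"
    return per_sample + [
        combine_gvcfs_cmd(ref, gvcfs, combined),
        genotype_gvcfs_cmd(ref, combined, final),
    ]


if __name__ == "__main__":
    # Hypothetical inputs: one sample from each of two studies.
    for cmd in joint_calling_plan("ref.fasta", ["studyA_s1.bam", "studyB_s1.bam"]):
        print(" ".join(cmd))
```

Note that the HaplotypeCaller step runs separately on each sample, so samples from cohorts with different read lengths only come together at the combining and joint genotyping steps.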

I have seen this joint analysis done for cases where contributing cohorts were sequenced on different machines and, of course, at different centers. Instances where joint genotyping is performed on samples from different cohorts that were sequenced with different read lengths seem harder to find, which is how this question originally came about.

4.1 years ago by boxate1618 ▴ 60

0

This might be a better question for the GATK help center, but I'm pretty sure the GATK pipeline will allow this. I doubt anyone has studied it, but you could search for publications on potential variant calling biases arising from libraries with different read lengths. I am sure people have done it before, especially when calling tumor/normal variants. It is likely that the benefit you gain from the additional power of a larger cohort will outweigh the potential negatives of batch effects.

4.0 years ago by heskett ▴ 110