Question

Variant calling from multiple sequencing runs for a single samples

0

Entering edit mode

8.3 years ago

harish.nucl • 0

Hi!

I have 4 libraries with multiple sequencing runs per library. I called the variants for each run individually and then combined them using GATK's CombineVariants as well as CatVariants procedure.

NW_006711278.1  526153  .   T   C   199 .   AC=2;AN=2;DP=7;DP4=0,0,3,4;MQ=60;MQ0F=0;MQSB=1.01283;SGB=-0.636426;VDB=0.792147;TYPE=snp    GT:PL   1/1:226,21,0
NW_006711278.1  526153  .   T   C   172 .   AC=2;AN=2;DP=6;DP4=0,0,4,2;MQ=60;MQ0F=0;MQSB=1;SGB=-0.616816;VDB=0.954032;TYPE=snp  GT:PL   1/1:199,18,0
NW_006711278.1  526153  .   T   C   228 .   AC=2;AN=2;DP=9;DP4=0,0,5,4;MQ=60;MQ0F=0;MQSB=0.974597;SGB=-0.662043;VDB=0.0292194;TYPE=snp  GT:PL   1/1:255,27,0

For example above, for the same position I can see that multiple libraries have been put here. Is there any way to collapse them?

The only reason I wasn't using a large bam file containing all of these libraries was that it was taking too much time to crunch and I had a power failure and didn't want to start it all over again.

Thanks!

Edit: removed ambiguity, power failure overnight and the backup didn't kick in, not a segfault.

variant-calling • 3.3k views

ADD COMMENT • link updated 16 months ago by Ram 45k • written 8.3 years ago by harish.nucl • 0

0

Entering edit mode

As a comment for future readers:

"about 4 libraries"

About?

"I had a power failure"

...or a segmentation fault?

ADD REPLY • link 8.3 years ago by Matteo Schiavinato ★ 3.7k

0

Entering edit mode

... I wasn't using a large bam file containing all of these libraries was that it was taking too much time to crunch

What does that mean? Can you lay out the details of the sequencing a little bit more? How do these different libraries compare? Do you consider them technical replicates? Or was it a single run and you just split them up to parallelize/speed-up the mapping process?

In case you just split up to speed-up the mapping process, I strongly recommend to merge the BAM files that belong together first!

I had a power failure and didn't want to start it all over again

Have you ensured that the files are correct and all the processing was finished ok? I mean... this is you experiment/analysis but if you start with crap (i.e., non-complete files or whatsoever), your results will be crap (i.e., potentially wrong).

ADD REPLY • link 8.3 years ago by Manuel Landesfeind ★ 1.4k

0

Entering edit mode

Can you lay out the details of the sequencing a little bit more?

The sequencing was done in multiple runs, for a couple of insert sizes. For example, an insert size of 250 bases was sequenced in 3 runs.
It is a single sample, for which I have 4 libraries, each having more than a single run. So no, it isn't a replicate.

What does that mean?

I aligned the reads from the multiple runs individually to the reference genome, so I have a couple of bams. I merged the bams from all the insert sizes and runs tagged with their Read-Groups, to call the variants, but since there was a power shutoff and the generator didn't kick in, I couldn't get the calls out. Also since this is a large dataset and it had already taken a couple of days, I wasn't keen to rerun it (although, I have reinitialized the run)
So now, I have the bams, and I called variants from them. I now have the variant calls in over 10 vcfs. I merged them using GATK's CombineVariants and CatVariants modules as well and I'm left with the above output, that I have posted as a example. Is there any way to collapse these variants from multiple library runs into a single call?

Hope that makes it clearer! I apologize if I was too abstruse!

Thanks

ADD REPLY • link 8.3 years ago by harish.nucl • 0

0

Entering edit mode

What do you want to demonstrate with the study? If you are interested in the biological results (i.e., what are the mutations of your biological sample) I recommend combining on the BAM level. This will increase the total coverage and thus gives you a higher accuracy for variant detection. In particular, I would not expect to identify different variants if the only difference is the insert size of your reads.

In my opinion, merging variants from multiple samples only makes sense if you have replicates and want to figure out how many or which variants were found in both.

To sum up: wait for the reanalysis of your merged-BAM data. That should give you better (and easier) results.

ADD REPLY • link 8.3 years ago by Manuel Landesfeind ★ 1.4k

score 1 · Answer 1 · 2017-04-26

1

Entering edit mode

8.3 years ago

Matteo Schiavinato ★ 3.7k

You're looking for bcftools merge: https://samtools.github.io/bcftools/bcftools.html

ADD COMMENT • link 8.3 years ago by Matteo Schiavinato ★ 3.7k

score 0 · Answer 2 · 2017-04-26

From the INFO-field DP4, I deduce, that you have used samtools/bcftools for calling already? If yes, I recommend joint variant calling:

bcftools mpileup --annotate 'FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP' $run_1_lib1.bam $run_1_lib2.bam [ ... and all other bam files you have] | bcftools call ...

Then you do not need to merge afterwards.

Anyway, VCF is a "multi-sample" format and you should make sure that no information is lost when merging data. In particular, it is important to be able to fetch allelic coverage (AD) and total coverage (DP) from the FORMAT fields (columns 10, 11, ...)