Hi!
I have 4 libraries with multiple sequencing runs per library. I called the variants for each run individually and then combined them using GATK's CombineVariants as well as CatVariants procedure.
NW_006711278.1 526153 . T C 199 . AC=2;AN=2;DP=7;DP4=0,0,3,4;MQ=60;MQ0F=0;MQSB=1.01283;SGB=-0.636426;VDB=0.792147;TYPE=snp GT:PL 1/1:226,21,0
NW_006711278.1 526153 . T C 172 . AC=2;AN=2;DP=6;DP4=0,0,4,2;MQ=60;MQ0F=0;MQSB=1;SGB=-0.616816;VDB=0.954032;TYPE=snp GT:PL 1/1:199,18,0
NW_006711278.1 526153 . T C 228 . AC=2;AN=2;DP=9;DP4=0,0,5,4;MQ=60;MQ0F=0;MQSB=0.974597;SGB=-0.662043;VDB=0.0292194;TYPE=snp GT:PL 1/1:255,27,0
For example above, for the same position I can see that multiple libraries have been put here. Is there any way to collapse them?
The only reason I wasn't using a large bam file containing all of these libraries was that it was taking too much time to crunch and I had a power failure and didn't want to start it all over again.
Thanks!
Edit: removed ambiguity, power failure overnight and the backup didn't kick in, not a segfault.
As a comment for future readers:
About?
...or a segmentation fault?
What does that mean? Can you lay out the details of the sequencing a little bit more? How do these different libraries compare? Do you consider them technical replicates? Or was it a single run and you just split them up to parallelize/speed-up the mapping process?
In case you just split up to speed-up the mapping process, I strongly recommend to merge the BAM files that belong together first!
Have you ensured that the files are correct and all the processing was finished ok? I mean... this is you experiment/analysis but if you start with crap (i.e., non-complete files or whatsoever), your results will be crap (i.e., potentially wrong).
I aligned the reads from the multiple runs individually to the reference genome, so I have a couple of bams. I merged the bams from all the insert sizes and runs tagged with their Read-Groups, to call the variants, but since there was a power shutoff and the generator didn't kick in, I couldn't get the calls out. Also since this is a large dataset and it had already taken a couple of days, I wasn't keen to rerun it (although, I have reinitialized the run)
So now, I have the bams, and I called variants from them. I now have the variant calls in over 10 vcfs. I merged them using GATK's CombineVariants and CatVariants modules as well and I'm left with the above output, that I have posted as a example. Is there any way to collapse these variants from multiple library runs into a single call?
Hope that makes it clearer! I apologize if I was too abstruse!
Thanks
What do you want to demonstrate with the study? If you are interested in the biological results (i.e., what are the mutations of your biological sample) I recommend combining on the BAM level. This will increase the total coverage and thus gives you a higher accuracy for variant detection. In particular, I would not expect to identify different variants if the only difference is the insert size of your reads.
In my opinion, merging variants from multiple samples only makes sense if you have replicates and want to figure out how many or which variants were found in both.
To sum up: wait for the reanalysis of your merged-BAM data. That should give you better (and easier) results.