Say I have 24 VCFs (chromosomes 1-Y) and I want to concat them all into one VCF. But 16 of the VCFs have 100 samples and the remaining 8 have 95. Those 95 samples are all present in the 100-sample VCFs.
So is there an easy way to concat (not merge) these VCFs into one?
Because the merge tools do not like the duplicate sample names. And if you use `--force-samples`, you get a second column for the same sample: `NA12878` becomes `2:NA12878`, and that's more of a headache.
I'm looking for a quick fix out of laziness. This can be done with a quick script, but I have split the genome into 10Mb intervals for over 6000 whole genomes. So yeah, if someone else wrote a tool I would prefer that.
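For reference, the "quick script" route could look roughly like the sketch below. It is only an illustration, not an existing tool: the function name `concat_vcfs` is made up, and it assumes uncompressed, tab-delimited VCF text that is already in the desired chromosome order, with the `##` meta lines of the first file being good enough for the whole output. Samples absent from a file are padded with `./.`. It holds everything in memory, so it would need to be made streaming for thousands of genomes.

```python
FIXED = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def concat_vcfs(vcf_texts):
    """Concatenate VCF texts whose sample sets differ, padding absent samples."""
    meta = None        # '##' header lines, taken from the first input only
    all_samples = []   # union of sample names, in first-seen order
    parsed = []        # (sample names, record lines) for each input
    for text in vcf_texts:
        lines = text.splitlines()
        if meta is None:
            meta = [line for line in lines if line.startswith("##")]
        header = next(line for line in lines if line.startswith("#CHROM"))
        samples = header.split("\t")[9:]
        for sample in samples:
            if sample not in all_samples:
                all_samples.append(sample)
        records = [line for line in lines if line and not line.startswith("#")]
        parsed.append((samples, records))

    out = list(meta) + ["\t".join(FIXED + all_samples)]
    for samples, records in parsed:
        column = {name: i for i, name in enumerate(samples)}
        for record in records:
            fields = record.split("\t")
            genotypes = fields[9:]
            # Assumption: a bare "./." is acceptable for samples missing from this file.
            padded = [genotypes[column[s]] if s in column else "./." for s in all_samples]
            out.append("\t".join(fields[:9] + padded))
    return "\n".join(out) + "\n"


# Toy input: two samples on chr8, only one of them on chrY (illustrative data).
header = "\t".join(FIXED)
chr8 = ("##fileformat=VCFv4.2\n" + header + "\tNA12878\tNA12891\n"
        "8\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1\n")
chrY = ("##fileformat=VCFv4.2\n" + header + "\tNA12891\n"
        "Y\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1\n")
merged = concat_vcfs([chr8, chrY])
print(merged)
```

The output keeps one column per sample name, so nothing gets renamed to `2:NA12878`; the trade-off is that the padding is blind and does not check that the FORMAT fields agree between files.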
I was struggling with a similar issue and CombineVariants was the answer, as @RamRS suggested. (Thanks!)
I had `chr8.vcf.gz` and `chrY.vcf.gz` files with some shared samples. More specifically, male samples were shared by both files, but female samples were absent from `chrY.vcf.gz`. After indexing both files with `tabix <filename>`, this is the command that merged both sets perfectly:
Of course, `gatk3` should be `java` plus the path to GATK. The `PRIORITIZE` parameter is meant to prioritize redundant genotypes from one file over the other, with `-priority A,B` referring to the labels assigned in the `--variant` parameters. However, if you have non-overlapping regions (e.g. different chromosomes, as in my case), the chosen priority does not affect the output. The result will have missing genotypes for those samples that are present in one file but not the other, as expected (in my case, female samples have missing genotypes for chromosome Y variants).
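To make the relationship between the labels and `-priority` concrete, here is a small sketch that assembles a GATK3 `CombineVariants` call along the lines described above. It follows the usage shown in the GATK3 documentation and is not necessarily the exact command used here; the helper name `combine_variants_cmd`, the jar path `GenomeAnalysisTK.jar`, the reference `reference.fasta`, and the output name `merged.vcf` are all placeholders.

```python
import shlex


def combine_variants_cmd(gatk_jar, reference, labelled_vcfs, output):
    """Build a GATK3 CombineVariants argument list with PRIORITIZE merging."""
    cmd = ["java", "-jar", gatk_jar, "-T", "CombineVariants", "-R", reference]
    for label, path in labelled_vcfs:
        # Each input gets a label; -priority refers back to these labels.
        cmd += ["--variant:" + label, path]
    cmd += [
        "-o", output,
        "-genotypeMergeOptions", "PRIORITIZE",
        "-priority", ",".join(label for label, _ in labelled_vcfs),
    ]
    return cmd


cmd = combine_variants_cmd(
    "GenomeAnalysisTK.jar",   # placeholder path to the GATK3 jar
    "reference.fasta",        # placeholder reference genome
    [("A", "chr8.vcf.gz"), ("B", "chrY.vcf.gz")],
    "merged.vcf",             # placeholder output name
)
print(shlex.join(cmd))        # paste into a shell, or pass cmd to subprocess.run
```

The order of the labels in the list is what ends up in `-priority`, so swapping the two inputs swaps which file wins when both carry a genotype for the same sample and site.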
Yep, I wanted a "block of code", but ```...``` did not preserve newlines. I just tried the [101|010] edit button, but the result seems to be the same, right?
The code button is the way to go. Select the content and use the 101010 button for a code block. For inline code formatting, use back-ticks (`content`).
but ```...``` did not preserve newlines
This is a bug: the preview box that pops up while you're typing doesn't render content surrounded by triple back-ticks properly. Once you click Submit, though, there is virtually no difference between content surrounded by triple back-ticks and content where each line is prefixed with 4 spaces and separated by a blank line from the content before and after it (which is what the 101010 button does).
Why don't you want to merge? 'merge' (GATK/bcftools) is the only way to combine VCFs with different samples.
Because the merge tools do not like the duplicate sample names. And if you `--force-samples` then you get a second column of the sample name: `NA12878` becomes `2:NA12878`, and that's more of a headache. I'm looking for a quick fix out of laziness. This can be done with a quick script, but I have split the genome into 10Mb intervals for over 6000 whole genomes. So yeah, if someone else wrote a tool I would prefer that.
GATK has a tool, CombineVariants, that does it. It has an option, `genotypemergeoption`, for when sample names are repeated.