Entering edit mode
6.8 years ago
kirannbishwa01
★
1.6k
I have multiple single-sample VCF files, which I want to merge into a single multi-sample VCF file. When using bcftools merge
I am getting duplicate records.
$ bcftools merge ms01e_phased.vcf.gz ms02g_phased.vcf.gz ms03g_phased.vcf.gz ms04h_phased.vcf.gz MA605_phased.vcf.gz MA611_phased.vcf.gz -O v -o RBphased_variants.SixSamples.Final.vcf
# duplicate records at the same lines from the file "RBphased_variants.SixSamples.Final.vcf"
2 14691373 . A . 1153.31 PASS BaseQRankSum=2.02;ClippingRankSum=0;ExcessHet=3.0103;FS=1.098;InbreedingCoeff=-0.0861;MQ=58.74;MQRankSum=-2.459;QD=19.22;ReadPosRankSum=-0.466;SOR=0.96;DP=68;AN=8 GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM 0/0:4:4:0:0:0/0:.:.:0/0:.:. 0/0:7:7:18:0:0/0:.:.:0/0:.:. 0/0:4:4:12:0:0/0:.:.:0/0:.:. 0/0:2:2:3:0:0/0:.:.:0/0:.:. ./.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.
2 14691373 . A AAG 1153.31 PASS BaseQRankSum=2.02;ClippingRankSum=0;ExcessHet=3.0103;FS=1.098;InbreedingCoeff=-0.0861;MQ=58.74;MQRankSum=-2.459;QD=19.22;ReadPosRankSum=-0.466;SOR=0.96;set=InDels;DP=676;AF=0.042;AN=4;AC=0 GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC:PM ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. 0/0:12,0:12:9:.:.:0,9,135:0/0:.:.:0/0:.:. 0/0:22,0:22:12:.:.:0,12,180:0/0:.:.:0/0:.:.
2 14691374 . A . 1320.25 PASS BaseQRankSum=-1.049;ClippingRankSum=0;ExcessHet=0.2929;FS=0;InbreedingCoeff=0.4006;MQ=55.35;MQRankSum=0;QD=33.01;ReadPosRankSum=-0.671;SOR=0.892;DP=44;AN=2 GT:AD:DP:GQ:PL:PG:PB:PI:PW:PC:PM 0/0:4:4:0:0:0/0:.:.:0/0:.:. ./.:7:7:.:0:./.:.:.:./.:.:. ./.:0:0:.:0:./.:.:.:./.:.:. ./.:0:0:.:0:./.:.:.:./.:.:. ./.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.
2 14691374 . A G 1320.25 PASS BaseQRankSum=-1.049;ClippingRankSum=0;ExcessHet=0.2929;FS=0;InbreedingCoeff=0.4006;MQ=55.35;MQRankSum=0;QD=33.01;ReadPosRankSum=-0.671;SOR=0.892;set=HignConfSNPs;DP=710;AF=0.115;MLEAC=3;MLEAF=0.115;AN=4;AC=0 GT:AD:DP:GQ:PGT:PID:PL:PG:PB:PI:PW:PC:PM ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. ./.:.:.:.:.:.:.:.:.:.:.:.:. 0/0:12,0:12:9:.:.:0,9,135:0/0:.:.:0/0:.:. 0/0:22,0:22:12:.:.:0,12,180:0/0:.:.:0/0:.:.
I raised this issue in bcftools
thinking if it was a bug https://github.com/samtools/bcftools/issues/754 . But, is there any other solution to the problem.
Thank you Kevin for the answer. But, this didn't work - it just deleted all the records that were duplicates (by position).
Try without the pipedbcftools view --min-ac 1
.Hi Kevin,
I actually tried it without
bcftools view --min-ac 1
and it seemed to work. Thanks,Hi Kevin, Everything worked well with this tool. But, there came a little problem. Several lines in my VCFs are this way:
See those samples at the end. Instead of
./. ./. ./. ./.
they should be../.:.:.:.:.:.:. ./.:.:.:.:.:.:. ......
i.e filled with null values for equal number of fields inFORMAT
field. The reason I need this is because I am using other tools to manipulate the data downstream and this data structure is throwing a problem. I have altogther 16 samples in this VCF.Thanks,
Let me guess: these VCFs have not been produced in the same way?
Kevin, that's not the case. See my code:
merge variants
ValidateVariants
That line I put there is from the file "RBphased_variants.AllSamples.Final.vcf"
And, here is another problem. Several
FORMAT
level tags are missing. See the example below.In the single sample VCF's.
sample "MA605" - "PI" format field exits
sample "ms01e" - no record at position 2 180505
sample "Sp3" - "PI" field present
But, the merge VCFs has "PI" tag missing at all.
I think the reason the "PI" are missing is because all the samples at that record don't have meaningful "PI" value. But, regardless of that shouldn't it be possible to have it in "merged VCFs".
I see that you're using some GATK tools, a program suite for which hardly anything is compatible, but that's an issue for the Broad Institute to work on.
For the particular example that you mention just now (chr2:180505), the data that you say is missing is neither in the input VCF files going into the BCFtools command. So, it's an issue even prior to that... perhaps to do with phASER.
For the PL example that you mention, the correct PL values are indeed there for that particular sample - just look further along the line.
The VCFs coming out of phASER, generally, do not look like they are in good shape. I don't know the exact specifics of your experiment to comment much further. It looks messy because there are a lot of reference calls in the data and a lot of missing FORMAT values.
Sorry that I cannot help further for now.
Kevin, This merging has been a big pain for me. GATK
CombineVariants
is throwing an error which I can't fix. And,bcftools
now seem to be giving up too.I think I will need to the author of "phaser" to look into this problem.
Thanks,
Yes, that is a better bet. The author will undoubtedly have encountered these issues before. Had I known that you were eventually using GATK, I would not have suggested BCFtools. SAMtools / BCFtools don't interact very well with GATK functions because the GATK likes to add numerous different and specific (to the GATK) encodings.
You can email the author here: https://stephanecastel.wordpress.com/
Best of luck.
Yeah, GATK doesn't seem to be working well when data are piped in, if it came out of the GATK workflow.
I have been in touch with Stephane Castel over the issue several times. But, it seems he is busy and not able to look over the problem. Or, prolly he has moved on.
I know a little bit of python but it just kills most of my time; and I may be reinventing the wheel, when a fast and workable solution is already out there.
Thanks much though.
Yes, that happens: they get their publication and then move on.
What if you tried to merge the VCFs prior to phASER, and then ran phASER on the merged VCF? I presume that phASER supports that.
My vcf's are merged and validated before piping into "phaser". The issue is "phaser" exclusively outputs single sample VCFs, which needs remerging - and when I try to merge I run into problems. I have not been able to find a tools that can fix the VCF format issue (but has correct information). Let's see what I can work out.
Did you ever get a solution to this? I find that each situation is different and requires a different combinations of commands.
You may take ideas from the following, where I process / merge calls:
I have to add that it's not "PL" tag and values but "pi" tags and values. I think the character differences caused the issue.