Entering edit mode
10.1 years ago
Lee Katz
★
3.2k
Hey BioStars, sorry to flood you so much with my bcftools questions. Since I have merged my VCFs, it turns out that they are now in VCF version 4.2 format. I want to view the pooled VCF; however, IGV does not read v4.2. Is there a mechanism for viewing v4.2? In any pileup viewer?
Using vcf-convert to version 4.1 does not work because after I convert/bgzip/index to v4.1, I get an error
Error loading C:\Users\gzu2\Desktop\msa\out.4.1.vcf.gz: The provided VCF file is malformed at approximately line number 31: Duplicate allele added to VariantContext: G
Lines 30-32 are such:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT lambda_virus.fasta.wgsim.fastq.gz-reference sample1.fastq.gz-reference sample2.fastq.gz-reference sample3.fastq.gz-reference sample4.fastq.gz-reference
`NC001416 1 . G G 0 PASS AC=0;ADP=0;AN=0;HET=0;HOM=0;NC=1;WT=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR:AC ./. ./.:.:0:2:.:0:.:.:.:.:.:.:.:.:0 ./.:.:1:1:.:0:.:.:.:.:.:.:.:.:0 ./.:.:1:3:.:0:.:.:.:.:.:.:.:.:0 ./.:.:2:2:.:0:.:.:.:.:.:.:.:.:0`
NC001416 2 . G G 0 PASS AC=0;ADP=3;AN=0;HET=0;HOM=0;NC=1;WT=0 GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR:AC ./. ./.:.:3:4:.:0:.:.:.:.:.:.:.:.:0 ./.:.:1:1:.:0:.:.:.:.:.:.:.:.:0 ./.:.:2:3:.:0:.:.:.:.:.:.:.:.:0 ./.:.:2:3:.:0:.:.:.:.:.:.:.:.:0
Why don't you just filter out the rows where REF=ALT? That may well get rid of the problem.
Good idea! But now I am getting the same error pertaining to this line:
Most or all of my lines will have at least two choices in the ALT field because it is a pooled VCF from
bcftools merge
.So it complains whenever any of the ALT alleles = REF. That's pretty reasonable in my mind (if VCF4.2 allows that then I'd be curious what the reasoning was...it seems like a terrible idea). I'd be curious if the GT field ever uses the ALT=REF genotype (e.g., 0/2 for the line you showed). I would suspect not. In any case, removing examples of this shouldn't be too terrible to script.
If I remove any line with REF and ALT with something like
G C,G
, then it will remove every line. This is because it is a merged VCF file. So I guess a good question would be, can I view a merged VCF in any viewer?I wouldn't remove the line, just remove the reference ALT allele. You'll need to ensure that it's the last ALT allele (otherwise, you'll change the sample genotypes).
I'm still learning the VCF format but I'll give that a try. Is this the kind of format you are saying to make it into? I would change
C,G
toC,.
? Or to.,C
? Or remove it altogether so that it is justC
?I changed all the ALT fields so that they do not have commas and so that they are only a single nucleotide. I also made sure that the ALT nucleotide was different than the REF. And now I can import it to IGV! Woo!
However, all variants now are incorrectly attributed to the first genome. I think I see what you meant by having to change the genotype field which might be too much effort for what I want... is there any script out there to do what I want? Or is it easier than I think?
And the code, for posterity:
I don't know of a script/program to do it (I don't have time at the moment to quickly write it, or else I'd just do it). If you happen to end up writing one then please share it here. I imagine others would find it useful.
Fair enough! I am not sure when I'll be able to do it, but if anyone else does it before me though please post it here!
I think I script I wrote today should finally answer this question (and more!). I call it
fix vcf
and it is part of my Lyve-SET package. https://github.com/lskatz/lyve-SET/blob/master/scripts/set_fixVcf.plMaybe a better question is, can I load a pooled/merged VCF into any pileup viewer? What about VCF v4.2 files?