I am working with publicly available data sets of VCF files. One set is broken out by patient and by chromosome and contains only the 0/0 calls; unfortunately, the ALT column in these files holds the value <non_ref> on every line. I also have per-patient VCF files with the 1/1 and 0/1 calls across the entire genome; those do have real alleles in the ALT column, such as A, G, or CATGTT.
I merged all files by patient, but when I then run bcftools merge across patients, the single merged VCF file (with 5000 patients) treats <non_ref> literally as one of the potential ALT alleles.
Sadly, I cannot go back upstream in this public data set and re-run these files with GATK.
Does anyone have ideas on how to get the merge functions of vcftools, bcftools, or GATK to ignore the <non_ref> value that appears in the ALT column on some lines of each file?
P.S. I tried recoding the files manually with perl -pe "s/<non_ref>/./g", but bcftools complains about a missing value in the ALT column.
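A column-aware version of that manual recode could be sketched roughly as below. This is only an illustration under assumptions, not a known fix: the function name strip_non_ref is hypothetical, the allele is matched case-insensitively because GATK writes it as <NON_REF>, and only the ALT column (field 5) is touched. Per-allele fields such as AD or PL are left as they are, so the output may still be inconsistent or rejected by bcftools.

```python
import re

# Matches an ALT entry that is exactly the symbolic allele, in any letter case.
NON_REF = re.compile(r"^<non_ref>$", re.IGNORECASE)


def strip_non_ref(line: str) -> str:
    """Drop the <non_ref> symbolic allele from the ALT column of one VCF line.

    Assumption: tab-separated VCF with ALT as the fifth column.
    Per-allele INFO/FORMAT values (e.g. AD, PL) are NOT adjusted here.
    """
    if line.startswith("#"):
        return line  # header lines pass through unchanged

    body = line.rstrip("\n")
    ending = line[len(body):]  # preserve the original line ending, if any
    fields = body.split("\t")
    if len(fields) < 5:
        return line  # not a data line we understand; leave it alone

    # Keep every ALT allele except the symbolic one.
    alts = [a for a in fields[4].split(",") if not NON_REF.match(a)]
    # VCF uses "." in ALT when there is no alternate allele.
    fields[4] = ",".join(alts) if alts else "."
    return "\t".join(fields) + ending
```

It could be applied line by line, for example with `sys.stdout.writelines(strip_non_ref(l) for l in sys.stdin)` after `import sys`, on uncompressed VCF text.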
Jim
There is no need to SHOUT. I have adapted your title and simultaneously made it more specific.