vcf-merge is giving me a few SNPs in the merged vcf file
2
1
Entering edit mode
8.1 years ago
shinken123 ▴ 150

Hi

I am using vcf-merge to join to vcf files, one contain the data of 56 individuals and the other of only one. One of my files have 524754 SNPs and the other 105878 however when I join them I obtain just a few SNPs ~10000 SNPs and I obtain the next message error:

Wrong number of values in LANMLR17B064:250550525/PL at 1:21709201 .. nAlleles=4, nValues=3.
Expected 10 values for diploid genotypes or 4 for haploid genotypes.

at /LUSTRE/storage/data/software/vcftools/lib/perl5/site_perl/Vcf.pm line 172, <$__ANONIO__> line 10372.
Vcf::throw(Vcf4_0=HASH(0x7b12a8), "Wrong number of values in LANMLR17B064:250550525/PL at 1:2170"...) called at /LUSTRE/storage/data/software/vcftools/lib/perl5/site_perl/Vcf.pm line 1767
VcfReader::parse_AGtags(Vcf4_0=HASH(0x7b12a8), HASH(0xb4d080)) called at /LUSTRE/storage/data/software/vcftools/bin/vcf-merge line 464
main::merge_vcf_files(HASH(0xa4ea10)) called at /LUSTRE/storage/data/software/vcftools/bin/vcf-merge line 12

Do you know what is happening here? Why I obtain just a few SNPs in the merged file?

snp • 3.8k views
ADD COMMENT
0
Entering edit mode

Posting the lines containing '1:21709201' (for example, search for it using grep) could help - sounds like that SNP has some weird alleles in your two files

ADD REPLY
0
Entering edit mode

Ok thanks, I removed some individuals that I suspect have sequencing errors but the problem is the same, here are the first lines of my two files that I am merging:

 zcat PTxB73BC1S5RIL_0.1_0.9_56ind_V3_sorted.vcf.gz | grep -v "#" | head -n 2

1 10045 S1_10045 G C . PASS DP=36 GT:AD:DP:GQ:PL ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:79:0,6,72 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:2,0:2:79:0,6,72 0/0:2,0:2:79:0,6,72 ./.:0,0:0 0/0:2,0:2:79:0,6,72 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:79:0,6,72 0/0:3,0:3:88:0,9,108 0/0:1,0:1:66:0,3,36 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:1,0:1:66:0,3,36 1/1:0,1:1:66:36,3,0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:3,0:3:88:0,9,108 0/0:2,0:2:79:0,6,72 1/1:0,2:2:79:72,6,0 ./.:0,0:0 ./.:0,0:0 1/1:0,1:1:66:36,3,0 1/1:0,1:1:66:36,3,0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:2,0:2:79:0,6,72 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:1,0:1:66:0,3,36 ./.:0,0:0

1 157465 S1_157465 C T . PASS DP=42 GT:AD:DP:GQ:PL ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:2,0:2:79:0,6,72 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:1,0:1:66:0,3,36 1/1:0,1:1:66:36,3,0 ./.:0,0:0 ./.:0,0:0 1/1:0,1:1:66:36,3,0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 ./.:0,0:0 0/0:2,0:2:79:0,6,72 0/0:1,0:1:66:0,3,36 0/0:1,0:1:66:0,3,36 0/0:2,0:2:79:0,6,72 ./.:0,0:0 ./.:0,0:0 1/1:0,1:1:66:36,3,0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 0/0:2,0:2:79:0,6,72 0/0:1,0:1:66:0,3,36 ./.:0,0:0 0/0:3,0:3:88:0,9,108 0/0:1,0:1:66:0,3,36 1/1:0,3:3:88:108,9,0 ./.:0,0:0 ./.:0,0:0 1/1:0,1:1:66:36,3,0 1/1:0,1:1:66:36,3,0 0/0:3,0:3:88:0,9,108 0/0:1,0:1:66:0,3,36 0/0:1,0:1:66:0,3,36 0/0:1,0:1:66:0,3,36 ./.:0,0:0 1/1:0,3:3:88:108,9,0 0/0:1,0:1:66:0,3,36 ./.:0,0:0 ./.:0,0:0 0/0:1,0:1:66:0,3,36 ./.:0,0:0

 zcat PTxB73F1_filtrado_sorted.vcf_bgzip.gz | grep -v "#" | head -n 2

1 2064 . C CTTT 622.73 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-0.347;ClippingRankSum=-1.807;DP=94;ExcessHet=3.0103;FS=4.769;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.619;QD=10.04;ReadPosRankSum=1.882;SOR=0.450 GT:AD:DP:GQ:PL 0/1:35,27:62:99:660,0,1325

1 2179 . T C 2109.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=1.590;ClippingRankSum=-1.191;DP=119;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=-0.548;QD=17.73;ReadPosRankSum=3.280;SOR=0.633 GT:AD:DP:GQ:PL 0/1:49,70:119:99:2138,0,1187

ADD REPLY
0
Entering edit mode

Do you still have the same error involving that particular SNP at 21709201? What do the files print there?

ADD REPLY
0
Entering edit mode

Ok, I got it,

I have this:

zcat PTxB73RIL_0.1_0.9_el_bueno_V3_sorted.vcf.gz | grep -v "#" | grep "21709201" | head

1 21709201 S1_21710020 CA NT,NG,N . PASS DP=341 GT:AD:DP:GQ:PL 0/0:5,0,0,0:5:96:0,15,180 0/1:3,2,0,0:5:99:57,0,93 0/0:4,0,0,0:4:94:0,12,144 0/1:2,2,0,0:4:99:60,0,60 0/0:9,0,0,0:9:99:0,27,255 0/0:4,0,0,0:4:94:0,12,144 ./.:0,0,0,0:0 0/0:1,0,0,0:1:66:0,3,36 0/0:3,0,0,0:3:88:0,9,108 0/1:5,3,0,0:8:99:84,0,156 0/0:4,0,0,0:4:94:0,12,144 ./.:0,0,0,0:0 0/1:3,1,0,0:4:99:24,0,96 0/0:5,0,0,0:5:96:0,15,180 0/0:6,0,0,0:6:98:0,18,216 0/0:4,0,0,0:4:94:0,12,144 0/0:7,0,0,0:7:99:0,21,252 0/0:4,0,0,0:4:94:0,12,144 0/0:2,0,0,0:2:79:0,6,72 0/0:3,0,0,0:3:88:0,9,1080/0:2,0,0,0:2:79:0,6,72 0/0:2,0,0,0:2:79:0,6,72 0/0:6,0,0,1:7:98:0,18,216 0/0:4,0,0,0:4:94:0,12,144 0/1:3,1,0,0:4:99:24,0,960/0:3,0,0,0:3:88:0,9,108 0/0:4,0,0,0:4:94:0,12,144 0/3:3,0,0,1:4:88:0,9,108 ./.:0,0,0,0:0 1/1:0,4,0,0:4:94:144,12,0 0/0:1,0,0,0:1:66:0,3,36 0/0:3,0,0,0:3:88:0,9,108 0/0:2,0,0,0:2:79:0,6,72 0/0:4,0,0,0:4:94:0,12,144 0/0:3,0,0,0:3:88:0,9,108 1/0:3,6,0,0:9:99:189,0,81 0/3:1,0,0,1:2:66:0,3,36 1/0:1,2,0,0:3:99:63,0,27 ./.:0,0,0,0:0 0/0:5,0,0,0:5:96:0,15,180 0/0:3,0,0,0:3:88:0,9,108 0/1:1,1,0,0:2:99:30,0,30 0/0:1,0,0,0:1:66:0,3,36 0/0:5,0,0,0:5:96:0,15,180 0/1:3,1,0,0:4:99:24,0,96 1/1:1,6,0,0:7:96:195,0,15 ./.:0,0,0,0:0 1/0:1,2,0,0:3:99:63,0,27 ./.:0,0,0,0:0 2/2:0,0,1,0:1:33:0,0,0 0/0:6,0,0,0:6:98:0,18,216 0/1:1,1,0,0:2:99:30,0,30 0/0:4,0,0,0:4:94:0,12,144 0/0:3,0,0,0:3:88:0,9,108 1/1:0,1,0,0:1:66:36,3,0 0/0:3,0,0,0:3:88:0,9,108 1/2:0,3,2,0:5:88:108,9,0 0/0:2,0,0,0:2:79:0,6,72 0/0:2,0,0,0:2:79:0,6,720/0:6,0,0,0:6:98:0,18,216 0/0:2,0,0,0:2:79:0,6,72 0/1:8,4,0,0:12:99:108,0,252 0/0:3,0,0,0:3:88:0,9,108 0/0:5,0,0,0:5:96:0,15,180 0/0:2,0,0,0:2:79:0,6,72 1/0:2,3,0,0:5:99:93,0,57 0/0:2,0,0,0:2:79:0,6,72 0/3:2,0,0,2:4:79:0,6,72 0/0:9,0,0,0:9:99:0,27,255 0/0:3,0,0,0:3:88:0,9,108 0/0:9,0,0,0:9:99:0,27,255 0/0:5,0,0,0:5:96:0,15,180 1/0:1,5,0,0:6:98:162,0,18 1/0:1,2,0,0:3:99:63,0,27 0/0:8,0,0,0:8:99:0,24,255 0/0:4,0,0,0:4:94:0,12,144 0/0:3,0,0,0:3:88:0,9,108 0/0:10,0,0,0:10:99:0,30,255 0/0:3,0,0,0:3:88:0,9,108 0/0:7,0,0,1:8:99:0,21,252 0/0:4,0,0,0:4:94:0,12,144 ./.:0,0,0,0:0 0/1:2,2,0,0:4:99:60,0,60 0/0:5,0,0,0:5:96:0,15,180 0/0:5,0,0,0:5:96:0,15,180 0/0:3,0,0,0:3:88:0,9,108 0/0:3,0,0,0:3:88:0,9,108 0/0:3,0,0,0:3:88:0,9,108

and for LANMLR17B064:250550525 in this line I have

zcat PTxB73RIL_0.1_0.9_el_bueno_V3_sorted.vcf.gz | grep -v "#" | grep "21709201" | head | awk '{print $46}'

0/3:1,0,0,1:2:66:0,3,36

ADD REPLY
0
Entering edit mode

Also this problems could be because my vcf files have different versions one is VCFv4.0 and the other VCFv4.2. What do you think? I also know that in the GT filed we could have 0/0, 0/1 and 1/1 but in this case we have 0/3 that is weird right?

ADD REPLY
0
Entering edit mode

Is the 0/3 in the second file for a different alternative allele? In the first file, 0/3 means CA/N, what does it mean in the second file?

ADD REPLY
0
Entering edit mode

Ok, thanks.

In V4.0:

GT genotype, encoded as alleles values separated by either of ”/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, ”.” must be specified for each missing allele in the GT field (for example ./. for a diploid).

And in v4.2 the definition is the same:

GT : genotype, encoded as allele values separated by either of / or |. The allele values are 0 for the reference allele (what is in the REF field),  1 for the first allele listed in ALT, 2 for the second allele list in ALT and so  on.   For  diploid  calls  examples  could  be  0/1,  1|0,  or  1/2,  etc.   For  haploid  calls,  e.g.   on  Y,  male  non-pseudoautosomal  X,  or mitochondrion,  only  one  allele  value  should  be  given;  a  triploid  call  might  look  like 0/0/1.  If a call cannot be made for a sample at a given locus, '.'  should be specified for each missing allele in the GT field (for example './.' for a diploid genotype and '.'  for haploid genotype).

So I guess that I am not going to have problems with that.

At the end I use bcftools and I removed the PL and GQ fields. For the rest It looks that the data is the same and can be combined. What do you think?

ADD REPLY
1
Entering edit mode
8.1 years ago

Hi Shinken,

I just encountered this error as well with more than 2 alternate alleles for some variants. I found that bcftools merge was able to run fine, and gave the entire length of the vcf files, whereas the vcftools vcf-merge truncates the file. See: https://samtools.github.io/bcftools/bcftools.html#merge

ADD COMMENT
0
Entering edit mode

Thank you, yes bcftools is working fine.

ADD REPLY
0
Entering edit mode

This didn't work for me, with bcftools v1.6. I did find another solution, however, which I added an answer for.

ADD REPLY
0
Entering edit mode
7.0 years ago
olavur ▴ 150

One possible solution to this problem is to exclude multiallelic sites.

In my case, the problem was caused by a site that had 3 possible ALT alleles for a single individual, which means there are 10 (n (n + 1) / 2) possible genotypes, but the GL (genotype likelihood) field only contained 3 values. Such a site will probably always be filtered out, if nothing else then due to allele fraction, but I wanted to remove such sites more deliberately.

So my solution to this problem is this. Simply filter out multiallelic sites, which using bcftools looks like this:

bcftools view --max-alleles 2 sample.vcf.gz | bgzip -c > filtered.vcf.gz

Do this for all your samples, and then merge them.

ADD COMMENT

Login before adding your answer.

Traffic: 2577 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6