Entering edit mode
6.6 years ago
wrab425
▴
50
I have a vcf file containing the variants from 64 haploid strains. I need to be able to get the allele frequencies for further work using plink but when I run the vcftools command
vcftools --vcf merged_SNPs_only.vcf --freq --out freq_analysis
I get this, The problem is I have variants that are in more than one strain but the output suggests that the frequency of the allele is 0 or 1
What is wrong with my vcf or my command?
CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}
chr1 4446 2 1 C:0 T:1
chr1 4570 2 1 T:0 C:1
chr1 4575 2 1 T:0 C:1
chr1 4595 2 1 A:0 C:1
chr1 4676 2 1 A:0 G:1
chr1 4726 2 2 T:0 C:1
chr1 4865 2 7 A:0 G:1
chr1 4870 2 7 T:0 C:1
chr1 4883 2 7 A:0 T:1
chr1 4952 2 4 G:0 T:1
chr1 5015 2 4 G:0 A:1
chr1 5098 2 7 A:0 C:1
chr1 5168 2 6 G:0 A:1
chr1 5252 2 7 A:0 T:1
chr1 5253 2 4 T:0 A:1
chr1 5265 2 7 C:0 T:1
chr1 5270 2 7 G:0 T:1
chr1 5302 2 2 T:0 G:1
Does your VCF file have all 64 samples in it? Also, the
plink
tag is not relevant here as plink is not involved in your current situation.Thanks Ram, my file does have all 64 samples in it. I put the plink in there because I get the same result when I look at allele frequencies using plink. so the problem is illustrated with a snp like chr1 5265 2 7 C:0 T:1 which has seven chromosomes out of the 64 in the set but the T allele frequency is 1 when by my understanding should be 7/64.
What's your
vcftools
version?Hi Ram the version: 0.1.15
Yeah, the problem lies with your input file. Please follow my instructions on the gist excerpt so we can understand this better.
super thanks Ram, I have entered the command and put it into a gist but if it would help to see more of the file please ask. We seem to be about to make progress
It would help if we had a few lines where you expect to see a different output than what you see. For example, chr1:5265. I'm not sure if my
grep
was badly written (I didn't test it), but it's supposed to have produced 10 lines. Remove the-m1
and replace the-A11
with-C11
, and put that in the gist if you can please? The output of that along with the line forchr1:5265
should be really helpful.Actually, if plink is giving you the same output, the problem probably lies with the data being piped in to the tools. Are you working on the file directly or is any other tool involved? Can you paste an except of your VCF in a GitHub gist and paste the link to the gist in your post please? Run this to get the excerpt:
and
This agrees with the frequency, in that only one sample has a variant at this locus.
Hi Ram,
We do appreciate your advice as we are stuck with this. The thing is that there are snps that are present in 30 or so of the samples and they also give a MAF of 0 so this seems wrong. If you look at the second gitgist there are some snps in that sample which are present in seven samples and of course they have a MAF of 0. I think it must be because the vcf is haploid but I am not sure hw to fix it.
Thanks,
William
Your gist has only one row(
chr1:4446
) in it. I don't see a second gist.EDIT: OK, you haven't linked it here. The gist is
It's weird - I guess the haploid explanation fits - I've never seen VCF with a single value in GT instead of a
x/y
orx|y
.