VCFtools LD for haploids is nan
1
1
Entering edit mode
9.1 years ago
biobio ▴ 50

Hi, I'm trying to calculate LD for a population. According to this post: http://sourceforge.net/p/vcftools/mailman/message/33219601/, the author says that it should be fine for haploid individuals if their genotype is 1 or "1|.":

Apologies. The announcement was slightly misinformative. The change allowed one to calculate r^2 with genotypes of the form '1|.', but not actually for haploids like '1'. I've just committed a change to allow the second case, so please give it a go and drop me a line if there are any issues.

So I tried calculating LD on my vcf file like so:

vcftools --vcf HA.diploid.vcf --hap-r2 --out test

But I got only nan in my output for the value of r^2. I then converted my VCF to have 1|. for the genotype field instead of just 1 (looks something like this):

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  M-Ait14.22010_CAAGACTA_C5E9DANXX_6_20140930B_20140930.bam.mpileup.all.snp.vcfsamp.vcf   M-Ait9.22003_GTCTGTCA_C5E9DANXX_6_20140930B_20140930.bam.mpileup.all.snp.vcfsamp.vcf    M-ELH20.18511_ATCCTGTA_C41F2ACXX_6_20140423B_20140423.bam.mpileup.all.snp.vcfsamp.vcf   M-ELH27.18516_CCTCTATC_C41F2ACXX_6_20140423B_20140423.bam.mpileup.all.snp.vcfsamp.vcf   M-ARB0.18513_GCTAACGA_C41F2ACXX_6_20140423B_20140423.bam.mpileup.all.snp.vcfsamp.vcf    M-Set0.22009_CATACCAA_C5E9DANXX_6_20140930B_20140930.bam.mpileup.all.snp.vcfsamp.vcf
1       306     .       G       T       .       PASS    AC=1;AN=1;SF=9  GT      .|.     .|.     .|.     1|.     .|.     .|.
1       375     .       A       G       .       PASS    AC=5;AN=5;SF=1,2,3,9,15 GT      1|.     1|.     .|.     1|.     1|.     1|.
1       424     .       C       A,T     .       PASS    AC=3,0;AN=3;SF=2,6,9,14,15,16,19        GT      .|.     1|.     .|.     1|.     .|.     1|.
1       465     .       G       A       .       PASS    AC=1;AN=1;SF=5,15       GT      .|.     .|.     .|.     .|.     .|.     1|.
1       508     .       T       C       .       PASS    AC=5;AN=5;SF=1,2,3,5,6,9,10,12,14,15,16,17,19,20        GT      1|.     1|.     .|.     1|.     1|.     1|.​

But my output is still nan:

CHR     POS1    POS2    N_CHR   R^2     D       Dprime
1       306     375     1       nan     0       nan
1       306     465     0       nan     nan     nan
1       306     508     1       nan     0       nan
1       306     528     1       nan     0       nan
1       306     711     1       nan     0       nan
1       306     723     1       nan     0       nan
1       306     727     0       nan     nan     nan
1       306     856     0       nan     nan     nan
1       306     864     1       nan     0       nan​

Any ideas on what I'm doing wrong? Thank you very much!

vcftools ld • 5.0k views
ADD COMMENT
1
Entering edit mode
9.1 years ago

Probably a parsing issue. Just backfill the .s with a zero. It should yield the same results.

ADD COMMENT
0
Entering edit mode

Awesome, changing from .|. to 0|0 and 1|. to 1|0 works. Does this have any implications for the LD calculations though?

ADD REPLY
0
Entering edit mode

I wouldn't do .|. -> 0|0, that might cause errors. The LD calculation is based on frequencies. You are doubling the amount of haploids you are counting, but not the frequencies of the haplotypes. You should be fine.

ADD REPLY
0
Entering edit mode

Hmm, doing that I get all 1s for my r^2 value:

CHR     POS1    POS2    N_CHR   R^2     D       Dprime
1       306     375     2       1       0.25    1
1       306     465     0       nan     nan     nan
1       306     508     2       1       0.25    1
1       306     528     2       1       0.25    1
1       306     711     2       1       0.25    1
1       306     723     2       1       0.25    1
1       306     727     0       nan     nan     nan
1       306     856     0       nan     nan     nan
1       306     864     2       1       0.25    1
1       306     901     0       nan     nan     nan
1       306     1019    2       1       0.25    1
1       306     1414    2       1       0.25    1
1       306     2248    2       1       0.25    1
1       306     2666    2       1       0.25    1
1       306     4987    2       1       0.25    1
1       306     5972    2       1       0.25    1
1       306     6063    2       1       0.25    1
1       306     6514    2       1       0.25    1
1       306     6557    2       1       0.25    1​
ADD REPLY
0
Entering edit mode

Do you only have 2 chromosome?

ADD REPLY
0
Entering edit mode

No there's 5 (Arabidopsis). I think the issue may be that only polymorphisms are reported and the VCF doesn't distinguish between missing data and matches to the reference. I will try writing a script that replaces "." that correspond to matching to the reference and "." that correspond to missing data.

ADD REPLY

Login before adding your answer.

Traffic: 2079 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6