I have a vcf file (format VCFv4.0), generated by GATK pipeline starting from Illumina reads.
I need to convert it to the 23andMe file format. Example of the 23andMe format:
# rsid chromosome position genotype
rs4477212 1 82154 TT
rs3094315 1 752566 TC
rs3131972 1 752721 AA
rs12124819 1 776546 AC
I am having problems with plink2, which fails with the error "--recode 23 cannot be used with multi-char alleles". Plink was recommended earlier here: "C: Conerting vcf to 23andMe format".
I then tried to modify the VCF to remove multi-char alleles using VcfMultiToOneAllele. It did a great job, but the output file, even though it looks like a VCF, was not recognised as such by plink2 (error: "no genotype data in .vcf file"). Is there any other tool up to the task?
This command returns the error: Error: Only VCF, BCF, oxford, bgen-1.x, haps, hapslegend, A, AD, Av, ped, tped,
compound-genotypes, and ind-major-bed output have been implemented so far.
End time: Thu Jul 21 22:02:20 2022
It seems this may not have been implemented yet.
That doesn't look like a vcf file to me -
'multi-char alleles' appear when you have more than one alternative allele, which should be impossible if it's for a single human like 23andMe files are. Are you sure your example is your vcf file?
It should look like:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
The example that I posted was an example of the 23andMe format, to which I want to convert my VCF file.
This is a part of my vcf file that was recognised by plink2 as containing multi-char allele:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AM1
1 814264 . C A 1121 PASS BRF=0.27;FR=1.0000;HP=1;HapScore=1;MGOF=58;MMLQ=29;MQ=42.01;NF=19;NR=10;PP=1066
1 814297 . TGCT ACTA 256 alleleBias BRF=0.23;FR=0.5000;HP=1;HapScore=1;MGOF=71;MMLQ=25;MQ=45.1;NF=3;NR=0;P
1 814371 . GTGTT C 1093 PASS BRF=0.2;FR=0.4000;HP=2;HapScore=2;MGOF=47;MMLQ=28;MQ=41.43;NF=17;NR=7;PP=993;Q
Aah I see - have you tried removing all lines where there are indels, i.e., where the ALT field has more than one letter (here: ACTA)? I don't think 23andMe has those
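In case it helps, here is a minimal Python sketch of that filtering step. It assumes an uncompressed, tab-separated VCF, and the function names (is_snv, keep_snvs) are made up for illustration. It checks REF as well as ALT, since a multi-letter REF (e.g. GTGTT) is also a multi-char allele. Note that a "." or "*" in ALT is a single character and would pass this filter.

```python
def is_snv(vcf_line):
    """Return True if REF and every ALT allele on a VCF data line are one character long."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref = fields[3]
    alts = fields[4].split(",")
    return len(ref) == 1 and all(len(a) == 1 for a in alts)


def keep_snvs(lines):
    """Yield header lines unchanged and only those data lines that pass is_snv."""
    for line in lines:
        if line.startswith("#") or is_snv(line):
            yield line
```

One possible way to use it: open the input and output files, then call out.writelines(keep_snvs(inp)).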
There are a couple of ways to convert a 23andMe dataset to VCF:
Download the 23andMe dataset as a tab-delimited file with just these columns: the marker ID, chromosome name, position, and genotype. Then use bcftools to convert that TSV file to VCF:
Also, it seems you have multiallelic sites in the 23andMe dataset. Many programs don't work well with those, and one convention is to throw them away or to split them into biallelic sites. bcftools is a useful tool for resolving the multiallelic sites:
bcftools norm -m - input.vcf -o out.vcf
Finally, are you analyzing population level data? If not, why do you have multi-character alleles?
Thank you for your comment, but I actually need to convert it the other way around. I have a VCF file generated by a GATK pipeline starting from Illumina reads, and I need to convert it to the 23andMe file format shown above.
Hi, at Gencove we just launched an open and free API with tools that allows users to upload almost any type of DNA file (23andMe, Ancestry, FTDNA, etc). Feel free to test as user too. We give back a vcf too.
The 23andMe format does not support multi-character alleles; you must reorganize your data so that none of these remain. Split length-preserving multi-nucleotide variants into a bunch of single-nucleotide variants. (As for length-changing variants, 23andMe has historically represented some common insertions with "I", some common deletions with "D", and thrown out everything else. This requires you to write a script to postprocess the VCF file, and is unlikely to be worth the trouble.)
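As a rough illustration of the splitting step for length-preserving variants, something like the following Python sketch could work. The function name split_mnv and the tuple layout are hypothetical, and this only handles a single REF/ALT pair; the sample genotype of the original record would still have to be carried over to each resulting single-nucleotide record.

```python
def split_mnv(chrom, pos, ref, alt):
    """Split a length-preserving multi-nucleotide variant into per-base SNVs.

    Returns a list of (chrom, pos, ref_base, alt_base) tuples, skipping
    positions where the reference and alternate bases are identical.
    """
    if len(ref) != len(alt):
        raise ValueError("length-changing variant; cannot split into SNVs")
    return [
        (chrom, pos + offset, r, a)
        for offset, (r, a) in enumerate(zip(ref, alt))
        if r != a
    ]
```

For the TGCT to ACTA record at 1:814297 posted above, all four bases differ, so this would give four SNVs at positions 814297 through 814300.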
The example data you posted is missing the rightmost two columns ("FORMAT" and the actual sample data). Assuming they exist and just failed to be copy/pasted, the errors reported by plink and other programs imply that there is no "GT" field at the beginning of the "FORMAT" column; that's the standard way of representing the actual data you want to convert to 23andMe-format. You need to figure out how to add a sufficiently-accurate GT field to your VCF.
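To make the role of the GT field concrete, here is a hedged Python sketch of turning one VCF data line into one 23andMe-style row. Several things here are assumptions rather than anything confirmed in this thread: the function name vcf_line_to_23andme, tab-separated output, writing no-calls as "--", passing the VCF ID column through unchanged (so records without an rsID keep "."), and silently skipping any call that involves a multi-character allele.

```python
import re


def vcf_line_to_23andme(line, sample_index=0):
    """Convert one tab-separated VCF data line to 'rsid<TAB>chrom<TAB>pos<TAB>genotype'.

    Returns None when a called allele is not a single A/C/G/T base.
    Raises ValueError when the FORMAT/sample columns or the leading GT key are missing.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("no FORMAT/sample columns on this line")
    chrom, pos, rsid, ref, alt = fields[:5]
    if fields[8].split(":")[0] != "GT":
        raise ValueError("FORMAT column does not start with GT")
    # GT is the first colon-separated value of the sample column, e.g. "0|1" or "1/1"
    gt = fields[9 + sample_index].split(":")[0]
    calls = re.split(r"[/|]", gt)
    if "." in calls:
        genotype = "--"  # assumed no-call notation
    else:
        alleles = [ref] + alt.split(",")  # index 0 = REF, 1.. = ALT alleles
        bases = [alleles[int(c)] for c in calls]
        if any(b not in ("A", "C", "G", "T") for b in bases):
            return None  # indel, MNV, or symbolic allele: not representable here
        genotype = "".join(bases)
    return "\t".join([rsid, chrom, pos, genotype])
```

Rows returned as None would be dropped, which is where the multi-character alleles discussed above get thrown out.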
Yes, it was a copy-paste error, now fixed. Thank you for your suggestion. I would've preferred if there was a ready to use tool. If not, yes I am going to code it by myself.
Can someone please provide me with a tool/sourcecode to convert a VCF into GEDMatch uploadable format?