Hello, I extracted the first 5 columns from the VCF files of 1000 Genomes, Phase 3, for each chromosome:
http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/
This is the script I used, for instance, for chromosome 22:
awk 'BEGIN {OFS ="," ; FS = "\t"};{print $1, $2, $3, $4, $5}' chr22.vcf
However, for some SNPs I get the following info (this is one example):
CHROM,POS,ID,REF,ALT
1,886817,rs111748052;rs10465241,C,CATTTT,T
I am not sure which is the REF and which is the ALT allele in this case, since this line has 6 comma-separated fields and the last 3 contain allele info.
Thanks very much
How about using another delimiter?
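A minimal sketch of that idea, assuming tab as the output separator. In VCF, the ALT field lists multiple alternate alleles separated by commas (and ID can hold several identifiers separated by semicolons), so comma-separated output makes a multi-allelic record look like it has extra columns. The two-line sample below is a hypothetical stand-in built from the record in the question, not a real file:

```shell
# Hypothetical sample: a header line plus the record from the question,
# tab-separated as in a real VCF (only the first 5 columns are included).
sample=$(printf '#CHROM\tPOS\tID\tREF\tALT\n1\t886817\trs111748052;rs10465241\tC\tCATTTT,T\n')

# Keep tab as both input and output separator, and skip "##" meta lines,
# so the commas inside ALT cannot be mistaken for column breaks.
out=$(printf '%s\n' "$sample" | awk 'BEGIN {FS = OFS = "\t"} !/^##/ {print $1, $2, $3, $4, $5}')

printf '%s\n' "$out"
```

With tab-separated output the record keeps 5 fields: REF is C and ALT is the pair of alleles CATTTT,T.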
Thank you @Pierre Lindenbaum. I will try the tab delimiter.
It's weird. Why did ID show two SNPs? I think your script for extracting the 5 columns from the VCF files made those mistakes. You can try other scripts.
Hi @Joe, please see the solution I posted.