Reference & alternative alleles extracted from 1000G, Phase 3
6.8 years ago
Mr Locuace ▴ 180

Hello, I extracted the first 5 columns from the vcf files of 1000G, Phase 3, for each chromosome:

http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/

This is the script I used, for instance, for chromosome 22:

awk 'BEGIN {OFS ="," ; FS = "\t"};{print $1, $2, $3, $4, $5}' chr22.vcf

However, for some SNPs I get the following info (this is one example):

CHROM,POS,ID,REF,ALT
1,886817,rs111748052;rs10465241,C,CATTTT,T

I am not sure which are the REF and ALT alleles in this case, since this line has 6 comma-separated fields and the last 3 contain allele information.

Thanks very much

vcf 1000G ref allele • 2.5k views

I extracted the first 5 columns

How about using another delimiter?


Thank you @Pierre Lindenbaum. I will try the tab delimiter.


It's weird. Why did the ID column show two SNPs? I think the script you used to extract the 5 columns from the VCF files introduced those mistakes. You can try other scripts.


Hi @Joe, please see the solution I posted.

6.8 years ago
Mr Locuace ▴ 180

SOLUTION

The columns must be extracted using a tab as the output field separator (OFS), as one of the users suggested:

awk 'BEGIN {OFS ="\t" ; FS = "\t"};{print $1, $2, $3, $4, $5}' chr22.vcf

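With tab-separated output it becomes clear which columns hold the REF and ALT alleles, because the commas inside the ALT field are no longer mistaken for column separators. For instance, for the following SNP, A is the REF allele and G and T are the ALT alleles: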

22  18292633    rs430321    A   G,T
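
If one row per ALT allele is more convenient downstream, the same idea can be extended along these lines. This is only a sketch: sample.vcf and alleles.tsv are hypothetical file names, and the sample holds just the five columns discussed here (a real 1000G VCF has many more columns, which this command simply ignores). It skips the "##" meta-information lines, keeps the "#CHROM" header, and splits the comma-separated ALT field:

```shell
# Hypothetical small sample standing in for chr22.vcf (tab-separated, 5 columns only).
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n22\t18292633\trs430321\tA\tG,T\n1\t886817\trs111748052;rs10465241\tC\tCATTTT,T\n' > sample.vcf

# Tab in, tab out; drop "##" lines, keep the header, print one row per ALT allele.
awk 'BEGIN { FS = OFS = "\t" }
     /^##/ { next }                              # meta-information lines
     /^#/  { print $1, $2, $3, $4, $5; next }    # "#CHROM" column header
     {
         n = split($5, alt, ",")                 # ALT alleles are comma-separated
         for (i = 1; i <= n; i++) print $1, $2, $3, $4, alt[i]
     }' sample.vcf > alleles.tsv
```

With this layout the record from the question becomes two rows, both with REF C: one with ALT CATTTT and one with ALT T.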
