I need a script that takes a reference (say hg19v37 fasta) and generates a vcf file for all defined bases of all chromosomes of the reference (ignore Ns). The idea is to generate a whole genome vcf with all possible single nucleotide variants (not indels) so it can be used with different annotation tools.
The vcf needs to have all possible alternate bases in ALT column for each position (see example below)
#CHROM POS ID REF ALT QUAL FILTER INFO
1 10001 . T A,C,G 100 PASS DP=100
1 10002 . A C,G,T 100 PASS DP=100
1 10003 . A C,G,T 100 PASS DP=100
1 10004 . C A,G,T 100 PASS DP=100
Any ideas?
EDIT: Some benchmarks for the proposed solutions using hg19:chr20
User Code_version Language Total_chars Time (m:s)
Pierre 1 C 1490 1:04
Pierre 2 C 1559 1:07
matted 2 Python 543 1:45
Istvan 2 Python 369 2:37
Istvan 1 Python 506 3:03
JC 3 Perl 393 3:16
JC 1 Perl 735 7:31
JC 2 Perl 552 8:03
Rm 1 Awk 434 9:22
matted 1 Python 404 10:38
All tests were done in a Mac OS X 10.6.8, Intel Core 2 Duo 2.4 GHz, 2 GB 667 MHz DDR2 SDRAM
gcc => i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5666) (dot 3)
perl => v5.10.0 built for darwin-thread-multi-2level
python => v2.6.1 (r261:67515, GCC 4.2.1 Apple Inc. build 5646) on darwin
awk => v20070501
36 cumulative votes and 289 views in less than a day :-) this really tickled our fancy
I think we need one code-golf challenge per week
You guys are amazing. I didn't think the question would trigger this much activity.
what have you tried? Using the alignIO in biopython you can extract all the SNP information for each column of the alignment (you can ignore columns with a dash). At that point you can "easily" write the file...