Question

Programming Challenge - Synthetic Whole Genome Vcf

10

Entering edit mode

12.5 years ago

Mahdi Sarmady ▴ 310

I need a script that takes a reference (say hg19v37 fasta) and generates a vcf file for all defined bases of all chromosomes of the reference (ignore Ns). The idea is to generate a whole genome vcf with all possible single nucleotide variants (not indels) so it can be used with different annotation tools.

The vcf needs to have all possible alternate bases in ALT column for each position (see example below)

#CHROM POS     ID        REF    ALT     QUAL FILTER INFO
1     10001   . T      A,C,G       100   PASS   DP=100 
1     10002   . A      C,G,T       100   PASS   DP=100 
1     10003   . A      C,G,T       100   PASS   DP=100 
1     10004   . C      A,G,T       100   PASS   DP=100

Any ideas?

EDIT: Some benchmarks for the proposed solutions using hg19:chr20

 User      Code_version       Language       Total_chars         Time (m:s)
 Pierre         1             C                  1490               1:04
 Pierre         2             C                  1559               1:07
 matted         2             Python              543               1:45
 Istvan         2             Python              369               2:37
 Istvan         1             Python              506               3:03
 JC             3             Perl                393               3:16
 JC             1             Perl                735               7:31
 JC             2             Perl                552               8:03
 Rm             1             Awk                 434               9:22
 matted         1             Python              404              10:38

All tests were done in a Mac OS X 10.6.8, Intel Core 2 Duo 2.4 GHz, 2 GB 667 MHz DDR2 SDRAM

gcc => i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5666) (dot 3)

perl => v5.10.0 built for darwin-thread-multi-2level

python => v2.6.1 (r261:67515, GCC 4.2.1 Apple Inc. build 5646) on darwin

awk => v20070501

vcf python perl awk • 6.8k views

ADD COMMENT • link updated 12.5 years ago by Rm 8.3k • written 12.5 years ago by Mahdi Sarmady ▴ 310

1

Entering edit mode

36 cumulative votes and 289 views in less than a day :-) this really tickled our fancy

ADD REPLY • link 12.5 years ago by Istvan Albert 102k

0

Entering edit mode

I think we need one code-golf challenge per week

ADD REPLY • link 12.5 years ago by JC 13k

0

Entering edit mode

You guys are amazing. I didn't think the question would trigger this much activity.

ADD REPLY • link 12.5 years ago by Mahdi Sarmady ▴ 310

0

Entering edit mode

what have you tried? Using the alignIO in biopython you can extract all the SNP information for each column of the alignment (you can ignore columns with a dash). At that point you can "easily" write the file...

ADD REPLY • link 12.5 years ago by Whetting ★ 1.6k

score 9 · Answer 1 · 2012-09-05

Updated C version (only a switch/case):

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

#define SEQNAME_MAX BUFSIZ

#define SWITCH(letter1,letter2,code) case letter1: case letter2:\
                fputs(name,stdout);\
                fputc('\t',stdout);\
                printf("%d",(pos+1));\
                fputs("\t.\t",stdout);\
                fputc(c,stdout);\
                fputc('\t',stdout);\
                fputs(code,stdout);\
                fputs("\t100\tPASS\tDP=100\n",stdout);\
                ++pos;\
                break


int main(int argc,char** argv)
    {
    int c;
    int pos=0;
    char name[SEQNAME_MAX];
    name[0]=0;
    fputs("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",stdout);
    for(;;)
            {
            switch((c=fgetc(stdin)))
             {
         case EOF:  return EXIT_SUCCESS;
        case '>':
            {
            int space=0;
            int name_length=0;
            name[0]=0;
            pos=0;
            while((c=fgetc(stdin))!=EOF && c!='\n')
                {    
                if(space) continue;
                if(isspace(c)) { space=1; continue;}
                name[name_length++]=c;
                }
            name[name_length]=0;
            break;
            }
                SWITCH('c','C',"A,T,G");
                SWITCH('a','A',"C,T,G");
                SWITCH('t','T',"C,A,G");
                SWITCH('g','G',"C,A,T");
            case ' ':case '\n' : case '\r': break;/* space, ignore */
                default:++pos;break;
                }
        }
     return EXIT_SUCCESS;
    }

compile:

gcc -o biostar52398 -Wall -O3 file.c

test:

$ curl -s ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz | gunzip  -c | biostar52398  | head -n 20
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
1    10167    .    T    C,A,G    100    PASS    DP=100
1    10168    .    A    C,T,G    100    PASS    DP=100
1    10169    .    A    C,T,G    100    PASS    DP=100
1    10170    .    C    A,T,G    100    PASS    DP=100
1    10171    .    C    A,T,G    100    PASS    DP=100
1    10172    .    C    A,T,G    100    PASS    DP=100
1    10173    .    T    C,A,G    100    PASS    DP=100
1    10174    .    A    C,T,G    100    PASS    DP=100
1    10175    .    A    C,T,G    100    PASS    DP=100
1    10176    .    C    A,T,G    100    PASS    DP=100
1    10177    .    C    A,T,G    100    PASS    DP=100
1    10178    .    C    A,T,G    100    PASS    DP=100
1    10179    .    T    C,A,G    100    PASS    DP=100
1    10180    .    A    C,T,G    100    PASS    DP=100
1    10181    .    A    C,T,G    100    PASS    DP=100
1    10182    .    C    A,T,G    100    PASS    DP=100
1    10183    .    C    A,T,G    100    PASS    DP=100
1    10184    .    C    A,T,G    100    PASS    DP=100
1    10185    .    T    C,A,G    100    PASS    DP=100

score 8 · Answer 2 · 2012-09-06

I'm late to the party, but here's a Python version:

import sys, Bio.SeqIO

print "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"])

for s in Bio.SeqIO.parse(sys.stdin, "fasta"):
    for i,c in enumerate(s.seq.tostring()):
        if c != "N": print "\t".join([s.id, str(i+1), ".", c,
                                      ",".join(sorted(set("ACGT").difference(c))),
                                      "100", "PASS", "DP=100"])

It won't be the fastest, but the code is short...

score 7 · Answer 3 · 2012-09-05

7

Entering edit mode

12.5 years ago

JC 13k

I'm no sure the purpose of this big file, but here is my solution:

#!/usr/bin/perl

# genome2vcf.pl

use strict;
use warnings;

$/="\n>";

print join "\t", '#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO';
print "\n";
while (<>) {
    s/>//g;
    my ($chr, @seq) = split (/\n/, $_);
    $chr =~ s/chr//;
    my $seq = join "", @seq;
    my $len = length $seq;
    for (my $i = 0; $i <= $len; $i++) {
        my $b = uc(substr($seq, $i, 1));
        next if ($b =~ m/[^ACGT]/);
        my $alt = 'N';
        if    ($b eq 'A') { $alt = 'C,G,T'; }
        elsif ($b eq 'C') { $alt = 'A,G,T'; }
        elsif ($b eq 'G') { $alt = 'A,C,T'; }
        elsif ($b eq 'T') { $alt = 'A,C,G'; }
        print join "\t", $chr, $i + 1, '.', $b, $alt, 100, 'PASS', 'DP=100';
        print "\n";
    }
}

then you can run:

perl genome2vcf.pl < hg19.fa > hg19.vcf

ADD COMMENT • link 12.5 years ago by JC 13k

2

Entering edit mode

after seeing that the "code-golf" tag (why Pierre's code has more up-votes?), I create a shorter version:

#!/usr/bin/perl
print join "\t",'#CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO';
print "\n";
while (<>) {
    chomp;
    if (m/>(.+)/) { $chr = $1; $i = 0; next; }
    if (m/[ACGT]/i) {
        for (my $j = 0; $j <= length $_; $j++) {
            $b = uc(substr($_, $j, 1));
            next if ($b =~ m/[^ACGT]/);
            $alt = 'A,C,G,T,'; $alt =~ s/$b,//; $alt =~ s/,$//;
            print join "\t", $chr,$i+$j+1,'.',$b,$alt,100,'PASS','DP=100';
            print "\n";
        }
    }   
    $i += length $_;
}

*edit: some tricks for alternative alleles and a bug fixed thanks to matted

ADD REPLY • link 12.5 years ago by JC 13k

1

Entering edit mode

I think it's just a reward for daring to enter a code-golf competition while wielding the C programming language

ADD REPLY • link 12.5 years ago by Istvan Albert 102k

0

Entering edit mode

well at least isn't Java or C# ...

ADD REPLY • link 12.5 years ago by JC 13k

0

Entering edit mode

Since I already started checking for bugs: this won't handle output POS correctly either. Lines of all N won't correctly increase $i. If you run on Pierre's example, you'll get:

#CHROM POS ID REF ALT QUAL FILTER INFO
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 41 . T A,C,G 100 PASS DP=100
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 42 . A C,G,T 100 PASS DP=100
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 43 . A C,G,T 100 PASS DP=100
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 44 . C A,G,T 100 PASS DP=100
1 dna:chromosome chromosome:GRCh37:1:1:249250621:1 45 . C A,G,T 100 PASS DP=100

Pierre's positions are wrong too, though... the correct ones start at 10001. Verify with:

$ samtools faidx temp.fa 1:10001-10021
>1:10001-10019
TAACCCTAACCCTAACCCT

$ samtools faidx temp.fa 1:9991-10009
>1:9991-10009
NNNNNNNNNNTAACCCTAA

ADD REPLY • link 12.5 years ago by matted 7.8k

0

Entering edit mode

oooppps, I've fixed the bug :-)

ADD REPLY • link 12.5 years ago by Pierre Lindenbaum 165k

0

Entering edit mode

thanks for spotting the bug, I corrected it

ADD REPLY • link 12.5 years ago by JC 13k

1

Entering edit mode

I ran some bechmarks with hg19:chr20 for all the solutions:

 User      Code_version       Language       Total_chars         Time (m:s)
 Pierre         1             C                  1490               1:04
 Pierre         2             C                  1559               1:07
 matted         2             Python              543               1:45
 Istvan         2             Python              369               2:37
 Istvan         1             Python              506               3:03
 JC             3             Perl                393               3:16
 JC             1             Perl                735               7:31
 JC             2             Perl                552               8:03
 Rm             1             Awk                 434               9:22
 matted         1             Python              404              10:38

ADD REPLY • link 12.5 years ago by JC 13k

0

Entering edit mode

could you please add this benchmark to the top post as well and update it there, it is getting harder to find it

ADD REPLY • link 12.5 years ago by Istvan Albert 102k

0

Entering edit mode

sure, first I will run matted's new python code and I will copy/paste to the top post.

ADD REPLY • link 12.5 years ago by JC 13k

0

Entering edit mode

third Perl version after copying Istvar's optimizations:

 print join "\t",'#CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO';
 print "\n";
 %a = ('A'=>'C,G,T', 'C'=>'A,G,T', 'G'=>'A,C,T', 'T'=>'A,C,G');
 while (<>) {
     chomp;
     if (m/>(.+)/) { $chr = $1; $i = 0; }
     else {
         @a = split(//, uc $_);
         foreach $b (@a) {
             $i++;
             if ($a{$b}) {
                 print join "\t", $chr, $i, '.', $b, $a{$b}, 100, 'PASS', 'DP=100';
                 print "\n";
             }
         }
     }
 }

really close to Istvar's Python, but still far from Pierre's C

ADD REPLY • link 12.5 years ago by JC 13k

score 4 · Answer 4 · 2012-09-06

4

Entering edit mode

12.5 years ago

Istvan Albert 102k

Ok, I can't let the Python version be the slowest. This should be end up being around the second fastest after the C (at least in my calculations it is about three times faster than matted's).

import sys
from itertools import count, imap

print "\t".join("#CHROM POS ID REF ALT QUAL FILTER INFO".split())

patt = "\t".join("%s %s . %s %s 100 PASS DP=100".split())

change = dict(A="C,G,T", C="A,G,T", T="A,C,G", G="A,C,T")
for line in sys.stdin:
    if line[0]=='>': 
        chrom = line[1:-1].split()[0]
        pos = count(1)
    else:
        for base in line.strip().upper():
            index = pos.next()
            if base == "N": continue
            print patt % (chrom, index, base, change[base])

ADD COMMENT • link 12.5 years ago by Istvan Albert 102k

0

Entering edit mode

Great job! I ran the same test, now you're second in speed and third in size. I had to upper-case all the sequence, you aren't considering soft-masking bases.

ADD REPLY • link 12.5 years ago by JC 13k

0

Entering edit mode

ah yes, I had a test file with all uppercased bases - made the change in the code above as well

ADD REPLY • link 12.5 years ago by Istvan Albert 102k

score 3 · Answer 5 · 2012-09-06

tooooo late?: Awk one (long) liner solution: Genome sequence to VCF

cat genome.fasta | awk -v dna="ATGC" '{if (NR==1){printf "%s\n", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"}; split(dna, atgc, "") ; if (/^>/){sub(">","") ;chr=$1 } else { split($0, chars, "") ; for (i=1; i <= length(chars); i++) { count++ ; if (chars[i] ~/[ATGC]/){ printf("%s\t%s\t%s\t%s\t", chr, count, ".", chars[i]) ; for(j in atgc){if (chars[i]!=atgc[j]){printf("%s,", atgc[j])}} printf ("%s\t%s\t%s\t%s\n", "", 100,"PASS","DP=100") } } } }' > genome.vcf

Sample Output:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
1       10001   .       T       C,A,G,  100     PASS    DP=100
1       10002   .       A       C,T,G,  100     PASS    DP=100
1       10003   .       A       C,T,G,  100     PASS    DP=100
1       10004   .       C       A,T,G,  100     PASS    DP=100
1       10005   .       C       A,T,G,  100     PASS    DP=100
1       10006   .       C       A,T,G,  100     PASS    DP=100
1       10007   .       T       C,A,G,  100     PASS    DP=100
1       10008   .       A       C,T,G,  100     PASS    DP=100
1       10009   .       A       C,T,G,  100     PASS    DP=100

score 3 · Answer 6 · 2012-09-06

3

Entering edit mode

12.5 years ago

Istvan Albert 102k

I will respectfully resubmit a version that I don't condone but it will become the shortest at 368 characters, and probably the 2nd fastest (it seems to be faster than my previous python version):

import sys

print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
p ="%s\t%s\t.\t%s\t%s\t100\tPASS\tDP=100"

m = dict(A="C,G,T", C="A,G,T", T="A,C,G", G="A,C,T")
for s in sys.stdin:
    if s[0]=='>': 
        i, c = 0, s[1:-1].split()[0]
    else:
        for b in s.strip().upper():
            i += 1
            if b in m: 
                print p % (c, i, b, m[b])

ADD COMMENT • link 12.5 years ago by Istvan Albert 102k

2

Entering edit mode

I'm almost embarrassed to post this bad code, but I wanted to join you in your mission of fast(er) Python. I based it on your implementation, but made it much less readable. The upside is that it's a fair bit faster. On my machine, running on hg19 chr20, actually writing to disk, I get:

Pierre  C++  33 sec (1.00 normalized to C++)
matted  python  53 sec (1.58 normalized to C++)
Istvan python  95 sec (2.88 normalized to C++)

Code:

import sys

print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
p ="%s\t%s\t.\t%s\t%s\t100\tPASS\tDP=100\n"

v = ["C,G,T","A,G,T","A,C,G","A,C,T"]
m = dict(zip("ACTGactg",v+v))

def go(lines):
    i, c = 0, ""
    for s in lines:
        try:
            yield "".join([p % (c, i+j, b, m[b]) for j,b in enumerate(s[:-1])])
            i += j+1
        except KeyError:
            try:
                yield "".join([p % (c, i+j, b, m[b]) for j,b in enumerate(s[:-1]) if b != 'N'])
                i += j+1
            except KeyError:
                i, c = 1, s[1:-1].split()[0]

sys.stdout.writelines(go(sys.stdin.readlines()))

ADD REPLY • link 12.5 years ago by matted 7.8k

1

Entering edit mode

Wow, that is pretty nifty it is a about 30% faster than my version! I can't believe how close we are getting to pure C. Goes to show that the slowest parts of python are making a function call.

an improvement would be to replace the sys.stdin.readlines() function call with just sys.stdin, that will avoid reading the file into memory and should not impact the speed at all.

ADD REPLY • link 12.5 years ago by Istvan Albert 102k

0

Entering edit mode

yes, it's faster and compact. Benchmark added.

ADD REPLY • link 12.5 years ago by JC 13k

score 1 · Answer 7 · 2012-09-07

1

Entering edit mode

12.5 years ago

Rm 8.3k

My improved Awk oneliner:

awk -v s="100\tPASS\tDP=100" '{OFS="\t"; if (NR==1)print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"; if (/^>/){sub(">",""); chr=$1} else {split($0, chars, ""); for(i=1; i <= length(chars); i++) {count++ ; if(chars[i] ~/[ATGC]/){alt = "A,C,G,T"; sub(chars[i], "", alt); sub(/^,|,$/, "", alt); sub(/,,/, ",", alt); print chr, count, ".", chars[i], alt, s} } } }' hs_ref_GRCh37.p9_chr20.fa >chr20.vcf

ADD COMMENT • link 12.5 years ago by Rm 8.3k

0

Entering edit mode

I ran the test, it's slower than your previous version (28m 18s)

ADD REPLY • link 12.5 years ago by JC 13k

0

Entering edit mode

surprising: my previous one: real 8m0.254s this one: real 4m11.273s

ADD REPLY • link 12.5 years ago by Rm 8.3k

0

Entering edit mode

yes, sounds weird, I'm gonna run again and check if there are some hidden process in my mac

ADD REPLY • link 12.5 years ago by JC 13k

0

Entering edit mode

I found the problem, AWK is using all my memory (2GB) and the system is swapping

ADD REPLY • link 12.5 years ago by JC 13k