add coordinates to line of output extracted from vcf
1
0
Entering edit mode
7.0 years ago
bioguy24 ▴ 230

I am trying to extrapolate/convert variants from a vcf file to genotype. I am close but can not seem to add the chr and start of each line to the output. Maybe there is a better way? Thank you :).

#!/usr/bin/perl   # execute perl script

 use strict;
 use warnings;
 use Vcf;

 my $filename = $ARGV[0];

 open ( my $handle, "<", $filename);
 my $vcf = Vcf->new(fh=>$handle);
 $vcf->parse_header();
 vcf_iterate();

 sub vcf_iterate
 {
 while ( my $x=$vcf->next_data_hash() )
 {
 foreach my $sample ( keys $$x{gtypes} )
 {
 print "SAMPLE=$sample, genotype_raw=$$x{gtypes}{$sample}{GT}, ";
 my $decoded_genotype = decode_genotype($x, $sample);
 print "decoded_genotype=$decoded_genotype\n";
   }
  }
}

sub decode_genotype
{
my ( $x, $sample ) = ( $_[0], $_[1] );
my $gt = $vcf->decode_genotype($$x{REF}, $$x{ALT}, $$x{gtypes}{$sample}{GT}); # returns 'G/G'
return $gt;
 }

current output:

SAMPLE=MEC1, genotype_raw=0/1, decoded_genotype=G/A
SAMPLE=MEC1, genotype_raw=1/1, decoded_genotype=G/G
SAMPLE=MEC1, genotype_raw=0/1, decoded_genotype=T/C
SAMPLE=MEC1, genotype_raw=0/1, decoded_genotype=A/G
SAMPLE=MEC1, genotype_raw=0/1, decoded_genotype=T/C

desired output:

             $1  tab   $2,
SAMPLE=MEC1, chr1   949608, genotype_raw=0/1, decoded_genotype=G/A
SAMPLE=MEC1, chr1   949654, genotype_raw=1/1, decoded_genotype=G/G
SAMPLE=MEC1, chr1   977330, genotype_raw=0/1, decoded_genotype=T/C
SAMPLE=MEC1, chr1   981931, genotype_raw=0/1, decoded_genotype=A/G
SAMPLE=MEC1, chr1   982994, genotype_raw=0/1, decoded_genotype=T/C

vcf:

##fileformat=VCFv4.1
....
....
....
##INFO=<ID=OREF,Number=.,Type=String,Description="List of original reference bases">
##INFO=<ID=OALT,Number=.,Type=String,Description="List of original variant bases">
##INFO=<ID=OMAPALT,Number=.,Type=String,Description="Maps OID,OPOS,OREF,OALT entries to specific ALT alleles">
##deamination_metric=0.183561
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  MEC1
chr1    949608  .   G   A   852.727 PASS    .   .
chr1    949654  .   A   G   3815.58 PASS    .   .
chr1    977330  .   T   C   261.147 PASS    .   .
chr1    981931  .   A   G   352.821 PASS    .   .
chr1    982994  .   T   C   496.098 PASS    .   .
genotype vcf • 2.4k views
ADD COMMENT
1
Entering edit mode

Why are you doing this from scratch? You can use bcftools query to do this, and they have predefined format labels (%GT, %TGT, etc)

ADD REPLY
0
Entering edit mode

I was not aware bcftools could do this. I ran:

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' NA12878s5.vcf.gz | head 5

chr1    949608  G   A   MEC1=0/1
chr1    949654  A   G   MEC1=1/1
chr1    977330  T   C   MEC1=0/1
chr1    981931  A   G   MEC1=0/1
chr1    982994  T   C   MEC1=0/1

Can the raw GT be coverted to actual genotype? Thank you :).

chr1    949608  G   A   MEC1=G/A
chr1    949654  A   G   MEC1=G/G
chr1    977330  T   C   MEC1=T/C
chr1    981931  A   G   MEC1=A/G
chr1    982994  T   C   MEC1=T/C

Thank you :).

ADD REPLY
0
Entering edit mode

The %TGT I mentioned might help

ADD REPLY
0
Entering edit mode

Amazing, works great.... thank you, really nice tool :)

ADD REPLY
1
Entering edit mode
7.0 years ago

using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->{println(V.getContig()+"\t"+V.getStart()+"\t"+V.getReference().getDisplayString()+"\t"+V.getAlternateAlleles().stream().map(A->A.getDisplayString()).collect(Collectors.joining(","))+"\t"+V.getGenotypes().stream().map(G->G.getSampleName()+"="+G.getAlleles().stream().filter(A->A.isCalled()).map(A->A.getDisplayString()).collect(Collectors.joining("/"))).collect(Collectors.joining("\t")));});' src/test/resources/test_vcf01.vcf

1   956852  C   T   S1= S2=T/T  S3=C/T  S4=C/T  S5=T/T  S6=C/T
1   959155  G   A   S1= S2=G/A  S3=G/A  S4=G/A  S5=A/A  S6=G/A
1   959169  G   C   S1= S2=G/C  S3=G/C  S4=G/C  S5=C/C  S6=G/C
1   959231  G   A   S1= S2=G/A  S3=G/A  S4=G/A  S5=A/A  S6=G/A
1   960409  G   C   S1= S2=G/C  S3=G/C  S4=G/C  S5=C/C  S6=G/C
1   962210  A   G   S1=A/G  S2=A/G  S3=A/G  S4=A/G  S5= S6=A/G
1   962606  G   A   S1= S2=G/A  S3=G/A  S4=G/A  S5=A/A  S6=G/A
1   962891  C   T   S1= S2=C/T  S3=C/T  S4=C/T  S5=T/T  S6=C/T
ADD COMMENT
0
Entering edit mode

Times like these, I miss C#'s properties :-)

ADD REPLY

Login before adding your answer.

Traffic: 2594 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6