Entering edit mode
7.1 years ago
bioguy24
▴
230
I am trying to extrapolate/convert variants from a vcf file to genotype. I am close but can not seem to add the chr and start of each line to the output. Maybe there is a better way? Thank you :).
#!/usr/bin/perl # execute perl script
use strict;
use warnings;
use Vcf;
my $filename = $ARGV[0];
open ( my $handle, "<", $filename);
my $vcf = Vcf->new(fh=>$handle);
$vcf->parse_header();
vcf_iterate();
sub vcf_iterate
{
while ( my $x=$vcf->next_data_hash() )
{
foreach my $sample ( keys $$x{gtypes} )
{
print "SAMPLE=$sample, genotype_raw=$$x{gtypes}{$sample}{GT}, ";
my $decoded_genotype = decode_genotype($x, $sample);
print "decoded_genotype=$decoded_genotype\n";
}
}
}
sub decode_genotype
{
my ( $x, $sample ) = ( $_[0], $_[1] );
my $gt = $vcf->decode_genotype($$x{REF}, $$x{ALT}, $$x{gtypes}{$sample}{GT}); # returns 'G/G'
return $gt;
}
current output:
SAMPLE=MEC1, genotype_raw=0/1, decoded_genotype=G/A
SAMPLE=MEC1, genotype_raw=1/1, decoded_genotype=G/G
SAMPLE=MEC1, genotype_raw=0/1, decoded_genotype=T/C
SAMPLE=MEC1, genotype_raw=0/1, decoded_genotype=A/G
SAMPLE=MEC1, genotype_raw=0/1, decoded_genotype=T/C
desired output:
$1 tab $2,
SAMPLE=MEC1, chr1 949608, genotype_raw=0/1, decoded_genotype=G/A
SAMPLE=MEC1, chr1 949654, genotype_raw=1/1, decoded_genotype=G/G
SAMPLE=MEC1, chr1 977330, genotype_raw=0/1, decoded_genotype=T/C
SAMPLE=MEC1, chr1 981931, genotype_raw=0/1, decoded_genotype=A/G
SAMPLE=MEC1, chr1 982994, genotype_raw=0/1, decoded_genotype=T/C
vcf:
##fileformat=VCFv4.1
....
....
....
##INFO=<ID=OREF,Number=.,Type=String,Description="List of original reference bases">
##INFO=<ID=OALT,Number=.,Type=String,Description="List of original variant bases">
##INFO=<ID=OMAPALT,Number=.,Type=String,Description="Maps OID,OPOS,OREF,OALT entries to specific ALT alleles">
##deamination_metric=0.183561
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MEC1
chr1 949608 . G A 852.727 PASS . .
chr1 949654 . A G 3815.58 PASS . .
chr1 977330 . T C 261.147 PASS . .
chr1 981931 . A G 352.821 PASS . .
chr1 982994 . T C 496.098 PASS . .
Why are you doing this from scratch? You can use bcftools query to do this, and they have predefined format labels (
%GT
,%TGT
, etc)I was not aware bcftools could do this. I ran:
Can the raw GT be coverted to actual genotype? Thank you :).
Thank you :).
The
%TGT
I mentioned might helpAmazing, works great.... thank you, really nice tool :)