Question

Extract read count for specific allele from a GATK VCF file

0

Entering edit mode

9.7 years ago

Sandeep ▴ 260

Is there a easy method to extract the genotype information from the genotype field in a GATK generated vcf file?

I am looking at a output from a VCF file in the following format.

Chr  Coordinate  Allele1  Allele2  #Allele1  #Allele2

I have tried vcftools --vcf sample.vcf --extract-FORMAT-info <columnName> --out sample

This one just extracts column data similar to

vcf2bed < sample.vcf > sample_table.bed
cat sample_table.bed | awk '{print $10,$11}'|  cut -d':' -f5 |awk '{print $2}' > GT.txt

Extracting AD-->

cat sample_table.bed | awk '{print $10,$11}'| cut -d':' -f6 > AD.txt

Extracting DP-->

cat sample_table.bed | awk '{print $10,$11}'| cut -d':' -f7 > DP.txt

Extracting GQ-->

cat sample_table.bed | awk '{print $10,$11}'| cut -d':' -f8 > GQ.txt

Extracting PL-->

cat sample_table.bed | awk '{print $10,$11}'| cut -d':' -f9 > PL.txt

Also, I have looked into vcflib and unable to find a suitable function to do the same.

read-count VCF gatk allele • 3.4k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Sandeep ▴ 260

0

Entering edit mode

it isn't clear to me what output do you exactly expect. here are some ideas:

are you willing to get just raw genotypes? then vcftools' vcf-to-tab would do
are you willing to split each gt column's field into a column? then a simple cut -f1-4,10 file.vcf | sed 's/:/\t/g' would do
are you willing to split each gt column's field into a separate file? then an awk/perl one-liner that indexes each gt content and appends it to each appropriate file would do

ADD REPLY • link 9.7 years ago by Jorge Amigo 14k

0

Entering edit mode

I want to split each GT columns field into their respective header to extract the fields Ref Count (AD) Alt Count (AD) Second Alt (AD) Coverage (DP) QUAL Confidence (GQ) Homozygous Reference (PL) Heterozygous Reference (PL) Homozygous alternate (PL) in separate columns.

The script mentioned in the comment simply splits the GT columns field to extract the sub fields GT:AD:DP:GQ:PL into 5 columns. It does not give the separate columns for counts of reference, alternate alle etc. Hence, was looking for tool which can do the same as vcf-to-tab for GT column field.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by Sandeep ▴ 260

Ram · Accepted Answer · 2015-04-07

for quite a long time I used to pipe my variant files through this vcf2tab.pl perl script, it does quite a bit more than what you exactly need (deals with vcf and gff files, deals with up to 4 different alleles,...), but I hope it may be useful to you too.

#!/usr/bin/perl -w

#
# vcf2tab.pl
#
# transforms vcf like files into tabulated lists
#
# by Jorge Amigo Lechuga
#

# deal with input header
$header = <>;
if ($header !~ /^#/) {
    if ($header =~ /(.+)Otherinfo/) {
        $header = $1."Zygosity\tQual\tDepth\t";
    } else {
        die "Unknown header format\n$header\n";
    }
} else {
    $header = "";
}

# read the file to get info fields
$vcfLines = 0;
$gffLines = 0;
$maxAlleles = 0;
while (<>) {
    if ( not /^#/ ) {
        if ( /(.+\t)([^\t]+)\tGT:(AD:)?(DP:)?GQ:PL\t(.+)/ ) {
            $vcfLines++;
            $rawColumn = $1;
            $dBqualColumn = "";
            $infoColumn = $2;
            $gtad = $3;
            $gtdp = $4;
            $gtColumn = $5;
            if (not $gtad) { $gtColumn =~ s/:/::/; }
            if (not $gtdp) { $gtColumn =~ s/:([^:]*):/:$1::/; }
            @cols = split "\t", $rawColumn;
            $alleles = ($cols[4] =~ tr/,//);
            $alleles += 2;
            if ($alleles > $maxAlleles) { $maxAlleles = $alleles }
            if ($maxAlleles > 4) { print; die "Cannot deal with more than 4 alleles" }
        } elsif ( /(.+\t)([^\t]+)\tGT:[^\t]+\t([^:]+)/ ) {
            $vcfLines++;
            $rawColumn = $1;
            $dBqualColumn = "";
            $infoColumn = $2;
            $gtColumn = $3;
        } elsif ( /(.+\t(hom|het|unk)\t)([\d\.]+\t\d+\t)(.+)/ ) {
            $gffLines++;
            $rawColumn = $1;
            $dBqualColumn = $3;
            $infoColumn = $4;
            $gtColumn = "";
        } else {
            die "Unknown variant file format found:\n$_\n";
        }
        $rawColumn =~ s/\tName=(\d+)(,\d+)*\t/\t$1\t/;
        push(@rawColumns, $rawColumn);
        push(@dBqualColumns, $dBqualColumn);
        push(@infoColumns, $infoColumn);
        push(@gtColumns, $gtColumn);
        foreach ( split(";", $infoColumn) ) {
            if (/(.+)=.+/) {
                $infoFieldsStorage{$1} = "";
            }
        }
    }
}

# check for unique formats
die "Cannot work with mixed vcf and gff files\n" if ($vcfLines and $gffLines);

# deal with output header
print $header;
if ($vcfLines) { print "CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\t"; }
@infoFields = sort keys %infoFieldsStorage;
foreach $infoField (@infoFields) {
    print $infoField."\t";
}
if ($vcfLines) {
    if ($maxAlleles) {
        print "GT\tAD_REF\tAD_ALT";
        if ($maxAlleles > 2) { print "1\tAD_ALT2"; }
        if ($maxAlleles > 3) { print "\tAD_ALT3"; }
        print "\tDP\tGQ\tPL_AA\tPL_AB\tPL_BB";
        if ($maxAlleles > 2) { print "\tPL_AC\tPL_BC\tPL_CC"; }
        if ($maxAlleles > 3) { print "\tPL_AD\tPL_BD\tPL_CD\tPL_DD"; }
    } else {
        print "GT";
    }
}
print "\n";

# deal with genotype columns
if ($maxAlleles) {
    for (1..$maxAlleles) { $plMaxCols += $_ }
    foreach $gtColumn (@gtColumns) {
        @gtCols = split ":", $gtColumn; # GT:AD:DP:GQ:PL
        @adCols = split ",", $gtCols[1];
        while ($#adCols < $maxAlleles - 1) { push @adCols, "" }
        @plCols = split ",", $gtCols[4];
        while ($#plCols < $plMaxCols - 1) { push @plCols, "" }
        push @gtColumnsMod,
            "$gtCols[0]\t".
            ( join "\t", @adCols[0..$maxAlleles - 1] ).
            "\t$gtCols[2]\t$gtCols[3]\t".
            ( join "\t", @plCols[0..$plMaxCols - 1] );
    }
} else {
    @gtColumnsMod = @gtColumns;
}

# read the file to parse data
foreach $line (0..$#rawColumns) {
    $infoFieldsCol = $infoColumns[$line];
    $newInfoFieldsCol = "";
    foreach $infoField (@infoFields) {
        if ($infoFieldsCol =~ /$infoField=([^;]+)/) {
            $newInfoFieldsCol .= $1;
        }
        $newInfoFieldsCol .= "\t";
    }
    print "$rawColumns[$line]$dBqualColumns[$line]$newInfoFieldsCol$gtColumnsMod[$line]\n";
}