Question

Extracting Gene Names From Vcf File

3

Entering edit mode

12.7 years ago

User 1933 ▴ 360

I have an annotated VCF file, and I would like to extract gene names which are not necessarily appear in the same columns. The gene names are preceded by "txGN=" pattern.

I wonder if there is any flexible parser, preferably in R or using awk for such a purpose.

vcf parsing • 12k views

ADD COMMENT • link updated 12.7 years ago by Rm 8.3k • written 12.7 years ago by User 1933 ▴ 360

0

Entering edit mode

Use vcftools --get-INFO option. So your script would be:

./vcftools --vcf your_vcf_file.vcf --get-INFO txGn --out vcf_file_gene_name_info

ADD REPLY • link 9.5 years ago by das2000sidd ▴ 30

score 3 · Answer 1 · 2013-03-12

3

Entering edit mode

12.7 years ago

Jorge Amigo 14k

a perl oneliner would do:

perl -lne 'print $1 while /txGN=([^;]+)/g' < input.vcf | uniq

(assuming that ";" is the delimiter character for the gene names, and that you want a list with unique gene names in it)

ADD COMMENT • link 12.7 years ago by Jorge Amigo 14k

1

Entering edit mode

up vote: perl favoritism.

ADD REPLY • link 12.7 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Double up vote: favoritism of perl favoritism

ADD REPLY • link 12.7 years ago by Alex Paciorkowski 3.5k

score 2 · Answer 2 · 2013-03-12

2

Entering edit mode

12.7 years ago

brentp 24k

it's not particularly flexible (or pretty), but you can do this in awk quite simply regardless of the column:

awk '{ split($0, a, "txGn="); split(a[2], b, /;|\t|\s/); print b[1] }'

ADD COMMENT • link 12.7 years ago by brentp 24k

score 2 · Answer 3 · 2013-03-13

Here is the awk script i wrote to extract INFO TAGs for each line from VCF (like) files:

USAGE: awk -F"\t" -v InfoColumns="8,16,24" -v TAGS="txGN,DP,DP4,CLR" -f extract.vcf.info.Tag.awk union.7samples.tsv

#!/bin/awk -f
##extract.vcf.info.Tag.awk   
##INPUTS (pasted or) TSV or VCFs file with INFO field intact from VCF
## USAGE: awk -v InfoColumns="8,16,24" -v TAGS="txGN,DP,DP4,CLR" -f extract.vcf.info.Tag.awk union.7samples.tsv

#    BEGIN { FS = "\t" }  # if not using -F"\t" above

# excludes lines starting with # or ## 
(substr($1,1,1)!="#" && substr($1,2,1)!="#") {

printf $0 ; ##  Prints original Line

split(TAGS,key,",") ;
split(InfoColumns,col,",") ;

n = asorti(col,copy); # To preserve the original column order

for(i=1;i<=n;i++){
  split($col[copy[i]],info,";");

k = asorti(key,kapy) ; # To preserve the original key order
for(j=1;j<=k;j++){
           pat1=key[kapy[j]]"=";
     if ($col[copy[i]] ~ pat1){
        for (f in info){
                if (info[f] ~ pat1){
                sub(pat1,"",info[f]);
                sub(/"/,"",info[f]);
                printf "\t" info[f]; # Prints extracted info tag field
              }
         }
      }
     else
      printf "\t" "."; # Prints "dot" if not present 
  }
 }
  printf "\n";
}

score 1 · Answer 4 · 2013-03-12

1

Entering edit mode

12.7 years ago

jxchong ▴ 160

vcftools vcf-query http://vcftools.sourceforge.net/perl_module.html#vcf-query

Have it output %txGN or %INFO/txGN depending on where txGN is stored

ADD COMMENT • link 12.7 years ago by jxchong ▴ 160

score 0 · Answer 5 · 2013-03-12

I wrote a tool to extract a tag from a VCF: http://code.google.com/p/variationtoolkit/wiki/ExtractInfo

e.g:

$ gunzip -c data.vcf.gz |\
  extractinfo -t GN -i | \
  awk -F '       ' '($11 =="NOTCH2")' |\
  cut -d ' ' -f 3 | grep rs

rs6685892
rs2493392
rs2493420
rs7534585
rs7534586
rs2493409
rs2453040
rs2124109