Extracting Gene Names From Vcf File
5
3
Entering edit mode
11.8 years ago
User 1933 ▴ 360

I have an annotated VCF file, and I would like to extract gene names which are not necessarily appear in the same columns. The gene names are preceded by "txGN=" pattern.

I wonder if there is any flexible parser, preferably in R or using awk for such a purpose.

vcf parsing • 11k views
ADD COMMENT
0
Entering edit mode

Use vcftools --get-INFO option. So your script would be:

./vcftools --vcf your_vcf_file.vcf --get-INFO txGn --out vcf_file_gene_name_info

ADD REPLY
3
Entering edit mode
11.8 years ago

a perl oneliner would do:

perl -lne 'print $1 while /txGN=([^;]+)/g' < input.vcf | uniq

(assuming that ";" is the delimiter character for the gene names, and that you want a list with unique gene names in it)

ADD COMMENT
1
Entering edit mode

up vote: perl favoritism.

ADD REPLY
0
Entering edit mode

Double up vote: favoritism of perl favoritism

ADD REPLY
2
Entering edit mode
11.8 years ago
brentp 24k

it's not particularly flexible (or pretty), but you can do this in awk quite simply regardless of the column:

awk '{ split($0, a, "txGn="); split(a[2], b, /;|\t|\s/); print b[1] }'
ADD COMMENT
2
Entering edit mode
11.8 years ago
Rm 8.3k

Here is the awk script i wrote to extract INFO TAGs for each line from VCF (like) files:

USAGE: awk -F"\t" -v InfoColumns="8,16,24" -v TAGS="txGN,DP,DP4,CLR" -f extract.vcf.info.Tag.awk union.7samples.tsv

#!/bin/awk -f
##extract.vcf.info.Tag.awk   
##INPUTS (pasted or) TSV or VCFs file with INFO field intact from VCF
## USAGE: awk -v InfoColumns="8,16,24" -v TAGS="txGN,DP,DP4,CLR" -f extract.vcf.info.Tag.awk union.7samples.tsv

#    BEGIN { FS = "\t" }  # if not using -F"\t" above

# excludes lines starting with # or ## 
(substr($1,1,1)!="#" && substr($1,2,1)!="#") {

printf $0 ; ##  Prints original Line

split(TAGS,key,",") ;
split(InfoColumns,col,",") ;

n = asorti(col,copy); # To preserve the original column order

for(i=1;i<=n;i++){
  split($col[copy[i]],info,";");

k = asorti(key,kapy) ; # To preserve the original key order
for(j=1;j<=k;j++){
           pat1=key[kapy[j]]"=";
     if ($col[copy[i]] ~ pat1){
        for (f in info){
                if (info[f] ~ pat1){
                sub(pat1,"",info[f]);
                sub(/"/,"",info[f]);
                printf "\t" info[f]; # Prints extracted info tag field
              }
         }
      }
     else
      printf "\t" "."; # Prints "dot" if not present 
  }
 }
  printf "\n";
}
ADD COMMENT
0
Entering edit mode

+1 for a full-on multi-line awk script!

ADD REPLY
0
Entering edit mode

Two minor suggestions on awk script.

  1. Suggest adding BEGIN { FS = "\t" } for VCF files, since the delimiter is tab not the default whitespace
  2. Instead of (substr($1,1,1)!="#" && substr($1,2,1)!="#") I would suggest /^[^#]/
ADD REPLY
0
Entering edit mode

thanks; one of many ways to code.....

ADD REPLY
1
Entering edit mode
11.8 years ago
jxchong ▴ 160

vcftools vcf-query http://vcftools.sourceforge.net/perl_module.html#vcf-query

Have it output %txGN or %INFO/txGN depending on where txGN is stored

ADD COMMENT
0
Entering edit mode
11.8 years ago

I wrote a tool to extract a tag from a VCF: http://code.google.com/p/variationtoolkit/wiki/ExtractInfo

e.g:

$ gunzip -c data.vcf.gz |\
  extractinfo -t GN -i | \
  awk -F '       ' '($11 =="NOTCH2")' |\
  cut -d ' ' -f 3 | grep rs

rs6685892
rs2493392
rs2493420
rs7534585
rs7534586
rs2493409
rs2453040
rs2124109
ADD COMMENT

Login before adding your answer.

Traffic: 1945 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6