Extract gene names from annotated vcf file
6.8 years ago
Gene_MMP8 ▴ 240

I have an annotated VCF file and want to extract the gene name for each variant. How can I do this? This is the field I am interested in:

ANN=T|intron_variant|MODIFIER|Plekhg1|ENSMUSG00000040624|transcript|ENSMUST00000120274|protein_coding|1/16|c.-169+10295G>T||||||,
T|intron_variant|MODIFIER|Plekhg1|ENSMUSG00000040624|transcript|ENSMUST00000144543|retained_intron|1/7|n.163+10295G>T||||||,
T|intron_variant|MODIFIER|Plekhg1|ENSMUSG00000040624|transcript|ENSMUST00000137111|retained_intron|1/7|n.343+9828G>T||||||

I want to extract "Plekhg1" from the above entry.

sequencing next-gen • 7.0k views

I formatted the line to make it easier to read. I am not sure whether all of that is supposed to be on a single line.

Is the gene name always in the 4th field (separator |)?
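
For what it's worth, the ANN format described in the SnpEff documentation lists Gene_Name as the fourth `|`-separated subfield of each annotation, so the position should be stable for SnpEff output. A quick way to check this on your own data is to number the subfields of one annotation. The sketch below uses the first annotation from the question; the variable names are just for illustration.

```shell
# First of the three comma-separated annotations from the question.
ann='T|intron_variant|MODIFIER|Plekhg1|ENSMUSG00000040624|transcript|ENSMUST00000120274|protein_coding|1/16|c.-169+10295G>T||||||'

# Print every '|'-separated subfield together with its position.
numbered=$(printf '%s\n' "$ann" | awk -F'|' '{ for (i = 1; i <= NF; i++) print i "\t" $i }')
printf '%s\n' "$numbered"

# Subfield 4 holds the gene name in this record.
gene=$(printf '%s\n' "$ann" | awk -F'|' '{ print $4 }')
echo "gene: $gene"
```

If the fourth line of the numbered output is not a gene name in your file, the annotation was probably produced by a different tool or an older format, and the field position needs to be adjusted.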

6.8 years ago
NB ▴ 960

How did you get this annotation?

Was it generated using SnpEff? If so, you can extract gene names using SnpSift:

java -jar SnpSift.jar extractFields file.vcf CHROM POS REF ALT "ANN[*].GENE:"

Thanks for answering. Yes, it was generated using SnpEff. But I don't understand the output of your command: it gives me the entire VCF file as output. Can I get just the list of genes?


Please read the SnpSift documentation to understand how the commands are used.

Not sure if this would work, but you can try:

java -jar SnpSift.jar extractFields file.vcf  "ANN[*].GENE:"

Hi,

Was there an answer to this? I am having the same issue: "ANN[*].GENE:" just outputs all the ANN fields and not the gene name specifically.

Thanks, Anj


Try "ANN[*].GENE" (without the trailing colon). This worked for both GENE and GENEID on my VCF.


What if I want only a list of all unique genes?
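
One possible approach that does not need SnpSift is to pull the fourth subfield out of every annotation with awk and pass the result through `sort -u`. This is only a sketch: it assumes the annotations sit in an `ANN=` entry of the INFO column (column 8) and that the gene name is subfield 4. The function name `unique_genes`, the temporary file, and the second variant (gene `GeneB`) are made up for the example.

```shell
# Hypothetical two-variant VCF, built only for illustration (tab-separated columns).
vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$vcf"
printf '10\t100\t.\tG\tT\t.\t.\tDP=10;ANN=T|intron_variant|MODIFIER|Plekhg1|ENSMUSG00000040624|transcript|ENSMUST00000120274|protein_coding|1/16|c.-169+10295G>T||||||,T|intron_variant|MODIFIER|Plekhg1|ENSMUSG00000040624|transcript|ENSMUST00000144543|retained_intron|1/7|n.163+10295G>T||||||\n' >> "$vcf"
printf '10\t200\t.\tA\tC\t.\t.\tANN=C|intron_variant|MODIFIER|GeneB|GENEB_ID|transcript|TX1|protein_coding|2/5|c.10+5A>C||||||\n' >> "$vcf"

unique_genes() {
  # $1: path to a SnpEff-annotated VCF
  awk -F'\t' '
    /^#/ { next }                          # skip meta and header lines
    {
      n = split($8, info, ";")             # INFO entries are ";"-separated
      for (i = 1; i <= n; i++) {
        if (info[i] ~ /^ANN=/) {
          sub(/^ANN=/, "", info[i])
          m = split(info[i], anns, ",")    # one element per annotation
          for (j = 1; j <= m; j++) {
            split(anns[j], f, "|")         # subfields of a single annotation
            if (f[4] != "") print f[4]
          }
        }
      }
    }' "$1" | sort -u
}

unique_genes "$vcf"
```

On a real file you would call `unique_genes file.vcf` and redirect the output to a text file.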

5.6 years ago
Shicheng Guo ★ 9.5k

Usually, the fourth field will be the gene symbol. Try this one:

java -jar SnpSift.jar extractFields file.vcf  "ANN[*].GENE:" | awk -F"|" '{print $4}'

The best choice would be:

java -jar SnpSift.jar extractFields file.vcf CHROM POS REF ALT "ANN[*].GENE:" | awk -F'[\t|]' '{print $1,$2,$3,$4,$8}' OFS="\t"
5.6 years ago
bioguy24 ▴ 230

I have not used SnpEff, but if the gene name is always in the fourth field separated by |, this should work:

awk -F'|' '{print $4}'
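
Run directly on a VCF, that one-liner also processes the `#` header lines and only reaches the first annotation of each record. A slightly more defensive sketch is shown here, under the assumption that the annotations are stored as `ANN=` in the INFO column (column 8); the function name `first_gene_per_variant` and the sample lines are illustrative only.

```shell
# One meta line and one data line, modelled on the record in the question.
header='##fileformat=VCFv4.2'
line=$(printf '10\t100\t.\tG\tT\t.\t.\tDP=10;ANN=T|intron_variant|MODIFIER|Plekhg1|ENSMUSG00000040624|transcript|ENSMUST00000120274|protein_coding|1/16|c.-169+10295G>T||||||')

first_gene_per_variant() {
  # Reads VCF text on stdin; prints CHROM, POS and the gene of the first annotation.
  awk -F'\t' -v OFS='\t' '
    /^#/ { next }                        # skip meta and header lines
    $8 ~ /(^|;)ANN=/ {
      ann = $8
      sub(/^(.*;)?ANN=/, "", ann)        # drop everything up to and including ANN=
      split(ann, f, "|")                 # subfields of the first annotation
      print $1, $2, f[4]
    }'
}

printf '%s\n%s\n' "$header" "$line" | first_gene_per_variant
```

For a real file the call would be `first_gene_per_variant < file.vcf`. Records without an `ANN=` entry are skipped rather than printed with an empty gene column.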