VCF file help
1
0
Entering edit mode
6.8 years ago
aurora • 0

Hello Biostars,

I am doing project with vcf files and I need your help. After annotation in Annovar I used python to extract gene names, exonic function and aa change. I want to remove from output genes that do not have exonic function and aa change (ExonicFunc.refGene = . ; AAChange.refGene = . ) and also there are cases in which I have two or three genes i.e. refGene=HNRNPCL1,HNRNPCL3,HNRNPCL4 and multiple transcripts. I want to take just first gene and from its AAChange.refGene in which I have for example HNRNPCL3:NM_001146181:exon1:c.A764G:p.D255G I want just last part (p.D255G). How can I do it with python?

Thank you for your help :)

vcf python script • 2.5k views
ADD COMMENT
1
Entering edit mode
6.8 years ago

Seems to work as your ask for me, this wil create a new vcf file, deleting line with (ExonicFunc.refGene=. AND AAChange.refGene=.). If only ExonicFunc.refGene or AAChange.refGene is not "." line is printed.

Your thread is close to this one ( C: parsing vcf file ).

Managing vcf is a very popular question and the python codes, to process it, are very close. Try to get inspiration from similar thread to do what you want to do.

###Open your vcf file
new_vcf_file = open("your_new_vcf_file.vcf", "a")
with open("your_vcf_to_modif.vcf", 'r') as vcf_f:
    ###Read it line by line
    for line in vcf_f:
        check_up_line=True
        ###Skip metadata informations
        if line[0] != '#':
            ###Split the line by "tab" to keep info field (for me it's the 8th column, so choose the index = 7, I don't remember if it's always the 8th, change this if needed)
            info_field_line = line.split("\t")[7]
            ###Split the info line by ";"
            info_field_line_array = info_field_line.split(";")
            ###For each line of your VCF, create a dictionnary with this array key : info, value : value of this info
            dict_info={}
            for i in info_field_line_array:
                ###Just looking for line with "=" character (as key = value)
                if "=" in i:
                    ###Left from equal sign is key
                    key = i.split("=")[0]
                    ###Right from equal sign is value
                    value = i.split("=")[1]
                    ###Put them in a dictionnary
                    dict_info[key]=value

            ###After dictionnary creation, you can have an access to all your features (key), just by their name as follow

            ###Check first, if your line can be output (pass your filter (ExonicFunc.refGene AND AAChange.refGene not null))
            if dict_info['ExonicFunc.refGene'] == "." and dict_info['AAChange.refGene'] == ".":
                ###If it does not, put the variable at False, then it would not be print in the new vcf file
                check_up_line=False

            ###If AAChange.refGene item is not null (so you have something in it)
            if dict_info['AAChange.refGene'] != ".":
                ###Take the last info of the AAChange.refGene item and replace it in the actual line
                line = line.replace(dict_info['AAChange.refGene'], dict_info['AAChange.refGene'].split(":")[-1])

            ###If Gene.refGene item has many info (more than one)
           if len(dict_info['Gene.refGene'].split(","))>1:
                ###Take the first info of the Gene.refGene item and replace it in the actual line
                line = line.replace(dict_info['Gene.refGene'], dict_info['Gene.refGene'].split(",")[0])

            ###If the line pass your check up, print it
            if check_up_line:
                new_vcf_file.write(line)

        ###Write unchanged metadata
        else:
            new_vcf_file.write(line)
new_vcf_file.close()
ADD COMMENT
0
Entering edit mode

Oh I did't see this related post.. thank you, for helping :) So if I combine these two scripts I should obtain something like:

import sys
import re

def parse_vcf(vcf_file):
    with open(vcf_file, 'r+') as vcf_f:
        for line in vcf_f:
            if line[0] != '#':
                info_field_line = line.split("\t")[7]
                info_field_line_array = info_field_line.split(";")
                dict_info = {}
                for i in info_field_line_array:
                    if "=" in i:
                        key = i.split("=")[0]
                        value = i.split("=")[1]
                        dict_info[key]=value
                my_GeneRefGene = dict_info['Gene.refGene']
                my_ExonicFuncRefGene = dict_info['ExonicFunc.refGene']
                my_AAChangeRefGene = dict_info['AAChange.refGene']
                if dict_info['ExonicFunc.refGene'] == "." and dict_info['AAChange.refGene'] == ".":
                    check_up_line=False
                if dict_info['AAChange.refGene'] != ".":
                    line = line.replace(dict_info['AAChange.refGene'], dict_info['AAChange.refGene'].split(":")[-1])
                if len(dict_info['Gene.refGene'].split(","))>1:
                    line = line.replace(dict_info['Gene.refGene'], dict_info['Gene.refGene'].split(",")[0])
                if check_up_line:
                    vcf_f.write(line)
                else:
                    vcf_f.write(line)

                print(my_GeneRefGene),(my_ExonicFuncRefGene),(my_AAChangeRefGene)

if __name__  == '__main__':
    vcf=sys.argv[1]
    parse_vcf(vcf)

vcf_f.close()

right?

ADD REPLY
0
Entering edit mode

You do not need to use the script from the other thread, that was just an hint for you to create your own solution. The script I posted here do the job you want.

ADD REPLY
0
Entering edit mode

Sorry I thought that I need to combine them because it is showing me error info_field_line = line.split("\t")[7] IndexError: list index out of range and I tried with other indexes, still the same

ADD REPLY
0
Entering edit mode

Can you copy/paste some lines of your vcf below please Also add a

print(line)
break

after the line

if line[0] != '#':

and send me the result please

ADD REPLY
0
Entering edit mode

After I add that line the result is just one line:

1 10048 . C CCT . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:66:.,0:.:0:.:0 0/1:32:.,2:.:2:.:0

And part of my vcf looks like this (without metadata):

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL PRIMARY

1 10048 . C CCT . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:66:.,0:.:0:.:0 0/1:32:.,2:.:2:.:0 1 10078 . CT C . CA VT=DEL;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1795;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:25:.,0:.:0:.:0 0/1:13:.,2:.:2:.:0 1 10177 . A AC . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1697;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/1:7:.,3:.:1:.:0 0/1:10:.,6:.:1:.:0 1 10212 . A AC . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1662;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:57:.,0:.:0:.:0 0/1:22:.,2:.:2:.:0

ADD REPLY
0
Entering edit mode

From which tool did you get this vcf ? How did you open your vcf file ? By default vcf delimiter is "\t" (tabulation), yours seems to be a " " (space).

I tried in this line

info_field_line = line.split("\t")[7]

to split each line of your vcf by the default delimiter which is a "tab" ("\t"), so in your case just replace "\t" by " ".

Something like this :

info_field_line = line.split(" ")[7]

Also, I asked you how you opened your vcf file. Because for me, in your comment, all your results are on the same line (without any carriage return). Your vcf should look like this :

1 10048 . C CCT . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:66:.,0:.:0:.:0 0/1:32:.,2:.:2:.:0

1 10078 . CT C . CA VT=DEL;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1795;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:25:.,0:.:0:.:0 0/1:13:.,2:.:2:.:0

1 10177 . A AC . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1697;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/1:7:.,3:.:1:.:0 0/1:10:.,6:.:1:.:0

etc...

But since the result of print(line) is correct, my solution should work. Just replace "\t" by " " in this line :

info_field_line = line.split("\t")[7]

That should be enough

ADD REPLY

Login before adding your answer.

Traffic: 1953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6