Seems to work as your ask for me, this wil create a new vcf file, deleting line with (ExonicFunc.refGene=. AND AAChange.refGene=.). If only ExonicFunc.refGene or AAChange.refGene is not "." line is printed.
Your thread is close to this one ( C: parsing vcf file ).
Managing vcf is a very popular question and the python codes, to process it, are very close. Try to get inspiration from similar thread to do what you want to do.
###Open your vcf file
new_vcf_file = open("your_new_vcf_file.vcf", "a")
with open("your_vcf_to_modif.vcf", 'r') as vcf_f:
###Read it line by line
for line in vcf_f:
check_up_line=True
###Skip metadata informations
if line[0] != '#':
###Split the line by "tab" to keep info field (for me it's the 8th column, so choose the index = 7, I don't remember if it's always the 8th, change this if needed)
info_field_line = line.split("\t")[7]
###Split the info line by ";"
info_field_line_array = info_field_line.split(";")
###For each line of your VCF, create a dictionnary with this array key : info, value : value of this info
dict_info={}
for i in info_field_line_array:
###Just looking for line with "=" character (as key = value)
if "=" in i:
###Left from equal sign is key
key = i.split("=")[0]
###Right from equal sign is value
value = i.split("=")[1]
###Put them in a dictionnary
dict_info[key]=value
###After dictionnary creation, you can have an access to all your features (key), just by their name as follow
###Check first, if your line can be output (pass your filter (ExonicFunc.refGene AND AAChange.refGene not null))
if dict_info['ExonicFunc.refGene'] == "." and dict_info['AAChange.refGene'] == ".":
###If it does not, put the variable at False, then it would not be print in the new vcf file
check_up_line=False
###If AAChange.refGene item is not null (so you have something in it)
if dict_info['AAChange.refGene'] != ".":
###Take the last info of the AAChange.refGene item and replace it in the actual line
line = line.replace(dict_info['AAChange.refGene'], dict_info['AAChange.refGene'].split(":")[-1])
###If Gene.refGene item has many info (more than one)
if len(dict_info['Gene.refGene'].split(","))>1:
###Take the first info of the Gene.refGene item and replace it in the actual line
line = line.replace(dict_info['Gene.refGene'], dict_info['Gene.refGene'].split(",")[0])
###If the line pass your check up, print it
if check_up_line:
new_vcf_file.write(line)
###Write unchanged metadata
else:
new_vcf_file.write(line)
new_vcf_file.close()
Oh I did't see this related post.. thank you, for helping :) So if I combine these two scripts I should obtain something like:
right?
You do not need to use the script from the other thread, that was just an hint for you to create your own solution. The script I posted here do the job you want.
Sorry I thought that I need to combine them because it is showing me error info_field_line = line.split("\t")[7] IndexError: list index out of range and I tried with other indexes, still the same
Can you copy/paste some lines of your vcf below please Also add a
after the line
and send me the result please
After I add that line the result is just one line:
1 10048 . C CCT . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:66:.,0:.:0:.:0 0/1:32:.,2:.:2:.:0
And part of my vcf looks like this (without metadata):
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL PRIMARY
1 10048 . C CCT . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:66:.,0:.:0:.:0 0/1:32:.,2:.:2:.:0 1 10078 . CT C . CA VT=DEL;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1795;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:25:.,0:.:0:.:0 0/1:13:.,2:.:2:.:0 1 10177 . A AC . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1697;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/1:7:.,3:.:1:.:0 0/1:10:.,6:.:1:.:0 1 10212 . A AC . CA VT=INS;VLS=5;ANNOVAR_DATE=2016-02-01;Func.refGene=intergenic;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1662;ExonicFunc.refGene=.;AAChange.refGene=.;ALLELE_END GT:DP:AD:BQ:SS:SSC:MQ60 0/0:57:.,0:.:0:.:0 0/1:22:.,2:.:2:.:0
From which tool did you get this vcf ? How did you open your vcf file ? By default vcf delimiter is "\t" (tabulation), yours seems to be a " " (space).
I tried in this line
to split each line of your vcf by the default delimiter which is a "tab" ("\t"), so in your case just replace "\t" by " ".
Something like this :
Also, I asked you how you opened your vcf file. Because for me, in your comment, all your results are on the same line (without any carriage return). Your vcf should look like this :
But since the result of print(line) is correct, my solution should work. Just replace "\t" by " " in this line :
That should be enough