I have a vcf file with following data structure
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 2ms01e 2ms02g 2ms03g 2ms04h
2 1738 . A G 4693.24 PASS AC=2;AF=0.250;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW 0|1:389,92:481:99:.,.,.,.,.:1.0:0|1:1020:1748,0,12243:0|1 0/0:318,0:318:99:.:.:0/0:.:0,120,1800:0/0 0|1:270,53:323:99:.,.,.,.,.:1.0:0|1:1258:990,0,9096:0|1 0/0:473,0:473:99:.:.:0/0:.:0,120,1800:0/0
2 1764 . A C 51892.85 PASS AC=5;AF=0.625;AN=8;set=Intersection GT:AD:DP:GQ:PB:PC:PG:PI:PL:PW 1|0:102,415:517:99:.,.,.,.,.:1.0:1|0:1020:12332,0,2817:1|0 1/1:0,356:356:99:.:.:1/1:.:12587,1069,0:1/1 1|0:65,301:366:99:.,.,.,.,.:1.0:1|0:1258:9337,0,1279:1|0 0/1:281,353:634:99:.:.:0/1:.:10325,0,7548:0/1
2 1921 . T C 4465.03 PASS AC=0;AF=0.00;AN=6;set=Intersection GT:AD:DP:GQ:PG:PL:PW 0/0:1,0:1:3:0/0:0,3,35:0/0 ./.:0,0:0:.:./.:0,0,0:./. 0/0:1,0:1:3:0/0:0,3,39:0/0 0/0:2,0:2:6:0/0:0,6,80:0/0
Problem: Unfortunately the number of fields
in the FORMAT
column (9th column, 8th column python based) isn't the same for all the lines and pyVCF
is giving me error when trying to call certain values from the vcf file
.
I want to read this file and mine values from specific tags like GT
, PI
and PG
. But, all these tags (mainly the PI
tag) are not present in all the lines; in such cases I just want to the value output to be default period i.e '.'
So, the file output would have following structure:
contig pos ref alt_My freq_My GT PI PG
2 1764 A C 0.250 1|0 1020 1|0
2 1921 T C 0.00 0/0 . 0/0
Below is my script:
import vcf;
vcf1_data = vcf.Reader(open('MY.phased_variants.Final_sub.vcf', 'r'))
for record in vcf1_data:
contig1 = record.CHROM
pos1 = record.POS
ref_allele1 = record.REF
alt_alleles1 = ",".join(map(str, (record.ALT[::])))
alt_freq1 = ",".join(map(str, record.INFO['AF'])))
Now, I write these called values to an output
text file as:
output = open("My_allele_table.txt", "a")
output.write("{}\t{}\t{}\t{}\t{}"
.format(contig1, pos1, ref_allele1, all_alleles1, all_freq1))
Additionally, I append other values to the output file. But, when doing so I get AttributeError
since PI
field is not present in all the line.
Incomplete solution: I added exception
to the error
but then it just skips reading through the end of the line.
for sample in record.samples:
try:
output.write("\t{}\t{}\t{}".format(sample['GT'], sample['PI'], sample['PG']))
except AttributeError:
continue
output.write('\n')
Additionally, I tried:
if sample['PI'] not in sample:
PI = '.'
else: PI = sample['PI']
Still not working.
Any help appreciated !
Why do you keep deleting old threads and creating new? It's not likely that we suddenly now do know the answer.
It's because when there is no answer in first 18 hours, there will never be an answer. Either there are very few resources personnel volunteering for answering, or I am not asking questions that are interesting enough. Fact is most questions are not interesting !
Okay, let's try to troubleshoot this. But as you know, we are volunteers so you'll have to be patient. It's Friday night and I just poured a glass of red wine, so let's talk python.
What's not working in this case?
Have you tried the following?
HI @WouterDeCoster: Thanks much. The given code worked. I was more than frustrated and had been pulling my hair for 2 full days. I read through several examples and tutorials to mine the data, but this led me no-where. I reposted this question almost 3 times hoping that I had made the problem-explanation more simpler each time, and hoping someone would read it eventually.
I also couldn't move past this problem because the downstream analyses all depends upon this. Sometimes, I think why I even drown myself into this project/problem. And, there is a lots of assumption that the problem is not big at all - but for biologist it is. I can have a full fledged debate and analyses on evolutionary theories (which is my expertise) at quite decent level but the coding problem just drives me nuts sometimes - we all have our weaknesses, but it turns into frustration when the weakness becomes unmanageable.
The solution you proposed was a simple and elegant one which I now realize I was always around it, but I really didn't find it in the dark.
Thank again for being patient on my comments !
Happy to help. Good luck with the rest of the project, and feel free to ask more questions when necessary.
Another reason is I am working in a lab (and a department) where there is not IT support at all and I am all from a Bio background and my professor doesn't have a network to help. Any problem that I am facing is solely mine and not shared. Am I desperate for an answer - yes I am. Have I researched the problem - yes I have. Am I pyton or other IT expert - no I am not, but I have managed to learn to the extent I can.
Based on your code you already learned a lot. Don't lose hope, learning to code takes time. I also learned Python on my own, mostly.
How is your VCF generated? It seems strange that you would need to do this.
Just a simple tip, initialize a string to some value, do the try/except loop to check values and set the new value, and then write out the line after you build the string you hope to write out (or something similar).
This vcf was generate using
GATK
first and then later updated by another software calledpHASER
. All, was well, but the way the header info was parsed by this later software was a bit different - reason being that the author didn't usepyVCF
module to parse it, rather the author read the thevcf files
as text file and then parsed through it line-by-line. I confirm this because I have his python software and reading through it I sawpyVCF
was never used.All, was well until I took the
vcf file
generated by thispHASER
and then ran it again throughGATK
to combine the variants. This is when the values in theFORMAT
andSAMPLE
fields got updated and rearranged. Though this process didn't affect the values in the vcf and corresponding fields, but GATK removed certain fields that were not present in all the samples - which led to unequal number offields
inFORMAT
andSAMPLE
column.My problem: Now,
pyVCF
doesn't handle thecall
properly if the field is missing in even one line of the vcf.I think we need to put the programmers together in one room, so their final software is clean; and doesn't trouble the front end user to a greater extent. Lol !
I hope I am making sense with what I have explained. :)
Thanks for the explanation. You may want to participate on the GATK forums to see why this is happening. if it is a side-effect you may want to point it out to the authors of the software so they can fix it, or explain why it is behaving the way it is. Sorry, I don't have experience with GATK. I think most maintainers of open source software are happy to hear from users.
Sounds to me the problem was in pHASER, if I understood correctly.
I think it's partly both. I will need to collect more details and report to both the author.