I recently did a haplotype phasing and stitching of the haplotype blocks. I now want to write this information back on the vcf file. I have dealt with reading and mining vcf file, but not writing. Also, the documentation on PyVCF manual isn't very detailed.
So, this is my few datalines from my text file:
contig pos ref haplotype_block hap_X hap_Y log_odds_ratio My_PG Sp_PG My_hap Sp_hap
2 2969 C 3706 T C -1.902 0 1 C T
2 3222 G 3706 G C - 1 0 C G
2 3416 G 3706 A G - 0 1 G A
2 5207 T 1856 T A 2.225 0 1 T A
2 5238 G 1856 C G - 1 0 C G
2 5398 A 1856 A G - 0 1 A G
2 5403 A 1856 A G - 0 1 A G
2 5426 C 1856 C A - 0 1 C A
2 5434 A 1856 A G - 0 1 A G
2 5467 C 1856 A C - 1 0 A C
2 5483 A 1856 A C - 0 1 A C
2 6636 C 892 T C 2.238 1 0 T C
2 6740 A 892 A G - 0 1 A G
2 12138 T 892 T A - 0 1 T A
2 12160 G 892 G A - 0 1 G A
2 12237 G 892 G C - 0 1 G C
2 12298 T 892 T G - 0 1 T G
2 12301 T 892 T A - 0 1 T A
I want to > read a vcf file > match the contig and position > then write the new data into the FORMAT
and SAMPLE
fields.
Instead of replacing/updating the old tag values I want to create new tag in FORMAT field and insert the corresponding value in SAMPLE field:
- haplotype_block as HB
- log_of_odds_ratio as LOD
- phased_genotype as PG and insert the My_PG|Sp_PG
- phased_allele as PA and insert My_hap|Sp_hap
I read the template as:
import vcf
# reader template to burrow
vcf_read = vcf.Reader(open('RNA_seq.test.vcf'))
# file to write
vcf_write = vcf.Writer(open('/dev/null', 'w'), vcf_file)
# now read and update the record
for record in vcf_read:
sample_id = record.genotype('2ms04h')
vcf_write.write_record()
# How to proceed from here?
- I want to update the record for sample named
2ms04h
which is in thevcf_read
- How can I read the data from above text file (chr, pos) and insert it into vcf_write.
- Can I write a new file name for
vcf_write
.
The pyvcf manual isn't very succint on this matter and no example hits from google. Any help appreciated.
Thanks,
I haven't done this myself, as I don't typically ever write out to VCF files, just process existing and do something with them, or add them to a database to work with. I would think that you need to form the header information yourself for a valid VCF and output it. Any changes you want for the FORMAT fields I think you'll have to do by changing the VCF.model record itself while looping before you output it with VCF.Write().
Do you have to use pyvcf?
Could you please give a snippet of your input vcf file and a few lines of desired output.
Is it multisample vcf file? If so, are you planning to add the same type of information for other samples later?
Thank you
hi @Petr: It doesn't have to be pyvcf, but since I had been using this for mining the data I thought I better stick with it. I will shortly share the a short snippet of my vcf file and another snippet of the haplotype file (the information I want to write back on vcf).
Thanks much !
HI petr,
Here is a little information from my.vcf except the main header:
And, say this is my haplotype file:
So, I want to add a field
My|Sp
in theFORMAT
column and write the correspondingGenotype
info. While doing so, I will compare the alleles inMy|Sp
with all alleles (ref+alt) from the my.vcf file.If C is ref and T is alt (numbered 2) for the position 422. I want write My|Sp as 0|2 in the sample field. For the positions that have no Haplotype (My|Sp) information we can simply update genotype as ./. or N|N.
Thanks,
Thank you, kirannbishwa01. Could you please make manually desired output snipped that corresponds to your example as I do not understand it yet. In fact, it is even more confusing now. Based on your last comment it apears to me that you want to use the second file with local block phasing to update the first file with genotypes of different samples. In particular, you want to use My|Sp to convert it to proper GT:GT in the vcf and update it. So for the second line, you want to change GT:GT=./. for sample 2ms04h to GT:GT=0|1. Unfortunately, this somewhat contradicts your topic question where you mention My_hap|Sp_hap and other things. Could you please edit your questions and make them consistent and with good representative example of input and output in snippets? Thank you
Yeah, I'm less clear now on what they want to accomplish as well. Looking into this a bit more with the power of google, this has been an open discussion on PyVCF's github issue tracker since 2013.
https://github.com/jamescasbon/PyVCF/issues/82
There are some bits in there and links to forks that may help in addressing the issue, but it would be a bit of a hack job. Seems like there have been some open PRs that were never merged into the main project.
@Dan Gaston: I have read all the issues, Q/A under pyvcf module on Github. Actually, I also navigated through several forks to find the answers. I also put a question in there, but no response so far (2 weeks).
One of the forks had a pratial fix of adding a new field under
FORMAT
, but it isn't merged in the master module. I had to copy the code and paste in my ubuntu python repo (pyvcf). One thing, I am now able to fix is that, I can add a newFIELD
underFORMAT
and its corresponding value underSAMPLE
, but it only works if I an calculating something from the same line. If I want to read some values from another vcf or say a haplotype files like mine it won't be possible (if possible won't be easy). I am going to see how @Petr can help me.Thanks,
As I said, dear kirannbishwa01, the snippets you added contradict your main topic description and does not have desired output. It is hard to help you when I do not understand the task. Maybe other's here can help or you can clarify your questions. If you can make everything more coherent by using "edit" button, I would greatly appreciate it. Thank you. Petr
Why exactly is it not possible/easy? If you can add a field under format and its value under sample I don't see what you are now missing?