Entering edit mode
2.6 years ago
tea.vuki
▴
20
Hello,
How do I select only genotype (1/1, 0/1 etc..) data from VCF file using PySam?
Hello,
How do I select only genotype (1/1, 0/1 etc..) data from VCF file using PySam?
To process VCF file in Python I would recommend cyvcf2
makes parsing a VCF a joyous event:
from cyvcf2 import VCF
for variant in VCF('some.vcf.gz'): # or VCF('some.bcf')
variant.REF, variant.ALT # e.g. REF='A', ALT=['C', 'T']
variant.CHROM, variant.start, variant.end, variant.ID, \
variant.FILTER, variant.QUAL
# numpy arrays of specific things we pull from the sample fields.
# gt_types is array of 0,1,2,3==HOM_REF, HET, UNKNOWN, HOM_ALT
variant.gt_types, variant.gt_ref_depths, variant.gt_alt_depths # numpy arrays
variant.gt_phases, variant.gt_quals, variant.gt_bases # numpy array
## INFO Field.
## extract from the info field by it's name:
variant.INFO.get('DP') # int
variant.INFO.get('FS') # float
variant.INFO.get('AC') # float
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thanks! I will try cyvcf2 to learn it, but now I kind of have to go through this with pysam.. and this code:
doesn't work for me
doesn't work is not a good way to state a problem,
there could be a myriad of reasons why something does not work no one could possibly guess what does not work
You should always state the exact error message you get
Sorry.
The error message is: 'Unknown INFO field: DP'
When I try "GT" instead of DP I get the same.
And this is the code I found in the pysam documentation under "but also to complex attributes such as the contents to the info, format and genotype columns. These complex attributes are views on the underlying htslib data structures and provide dictionary-like access to the data:"
It means that your VCF file does not have that field. Check the VCF file, how is the
DP
tag reported?Sometimes you get it reported as
FORMAT/DP
Oh, now I see. I don't know why documentation put it under "info" when this ID is actually under format.
So when I do:
I get the output. However, I get it like this:
And what I want is to grab exact values
your output does not match the code, your code prints three lines, the output has five lines.
regardless, the solution here is to learn the ins and outs of PySam and the way the variants are represented, look at the API and docs for
VariantRecordInfo
but the real solution is to drop PySam and use cyvcf2 ... there is a reason that cfvcf2 exists, and the main reason is PySam is confusing has a counterintuitive interface that is best avoided
Thank you!