Question

Mapping genes from VCF by exclusively using python

0

Entering edit mode

2.9 years ago

ManuelDB ▴ 110

I want to map the gene of a list of CNVs in a VCF. I know I can do this (and also it is the most correct way to do that) via Via BEDOPS and bedtool. However, I want to do this exclusively with Python (not subprocess) because this is the beginning of a long project (the mapping is just the first step) and I intend to use this on Windows. This will work in a hospital and IT is quite strict installing VM or alternative approaches. But with a python workflow should be ok.

What I have done so far is to download Homo_sapiens.GRCh38.105.chr.gtf from ensemble and get only protein_coding genes doing this

df.loc[(df['feature'] == "gene") & (df['gene_biotype'] == "protein_coding")]

so now I have a data frame with 19964 rows (I guess one per different gene I need to check this). <--- By the way, is this approach correct? I am quite unfamiliar with this file

What could be the next logical/pythonic step?? Shall I convert start, end and gene_name into a bed file o directly match this against my VCF file.

More info: I expect that my WGS data will have no more than 200 CNVs so at this point I am not worried about computing-resource limitations.

vcf • 792 views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 2.9 years ago by ManuelDB ▴ 110

score 1 · Answer 1 · 2022-01-03

Hi. I am not sure I understand what exactly you want to do, but I suggest that you look into a few python libraries specifically designed to handle bioinformatic file formats. Biopython is a generally useful one, although for GFF/GTF files I recommend gffutils or gffpandas. For VCF files you could use pyvcf. Finally, if you are willing to take the time to learn a more general framework, then take a look at Hail.
Another advice: make sure you understand the formats you are working with before parsing them. It will save you a lot of bugs and issues down the road.

Good luck!