Mapping genes from a VCF using only Python
2.9 years ago
ManuelDB ▴ 110

I want to map a list of CNVs in a VCF to the genes they overlap. I know I can do this with BEDOPS and bedtools (and that this is probably the most correct way). However, I want to do it exclusively in Python (no subprocess calls), because this is the first step of a long project that I intend to run on Windows. It will be used in a hospital where IT is quite strict about installing VMs or similar workarounds, but a pure-Python workflow should be OK.

What I have done so far is download Homo_sapiens.GRCh38.105.chr.gtf from Ensembl and keep only the protein-coding genes with:

df.loc[(df['feature'] == "gene") & (df['gene_biotype'] == "protein_coding")]

So now I have a data frame with 19964 rows (I guess one per gene, but I need to check this). By the way, is this approach correct? I am quite unfamiliar with this file format.
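One way to check the "one row per gene" guess without extra dependencies is to parse the GTF directly. The sketch below uses only the standard library. The sample lines are invented for illustration, and the helper names (`parse_attributes`, `protein_coding_genes`) are hypothetical, not part of any library. It assumes the usual 9-column, tab-separated GTF layout with `key "value";` pairs in the last column.

```python
import io
import re

# Invented GTF excerpt for illustration only (not real annotation data).
SAMPLE_GTF = "\n".join([
    "#!genome-build GRCh38",
    '1\tensembl\tgene\t1000\t5000\t.\t+\t.\tgene_id "G1"; gene_name "AAA"; gene_biotype "protein_coding";',
    '1\tensembl\ttranscript\t1000\t5000\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding";',
    '1\tensembl\tgene\t6000\t7000\t.\t-\t.\tgene_id "G2"; gene_name "LNC1"; gene_biotype "lncRNA";',
    '1\tensembl\tgene\t8000\t9000\t.\t+\t.\tgene_id "G3"; gene_name "BBB"; gene_biotype "protein_coding";',
]) + "\n"

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the 9th GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))


def protein_coding_genes(handle):
    """Collect one record per 'gene' feature whose biotype is protein_coding."""
    genes = []
    for line in handle:
        if line.startswith("#"):  # header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = parse_attributes(cols[8])
        if attrs.get("gene_biotype") != "protein_coding":
            continue
        genes.append({
            "chrom": cols[0],
            "start": int(cols[3]),  # GTF coordinates are 1-based, inclusive
            "end": int(cols[4]),
            "gene_id": attrs.get("gene_id"),
            "gene_name": attrs.get("gene_name"),  # may be missing for some genes
        })
    return genes


genes = protein_coding_genes(io.StringIO(SAMPLE_GTF))
gene_ids = [g["gene_id"] for g in genes]
# If these two numbers match, there is exactly one row per gene_id.
print(len(genes), len(set(gene_ids)))
```

For the real file, the `io.StringIO(...)` handle would be replaced by `open("Homo_sapiens.GRCh38.105.chr.gtf")`. The same uniqueness check on `gene_id` can also be run on the pandas data frame.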

What would be the next logical/Pythonic step? Should I convert start, end and gene_name into a BED file, or match them directly against my VCF file?

More info: I expect my WGS data to contain no more than 200 CNVs, so at this point I am not worried about computing-resource limitations.
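With roughly 200 CNVs against roughly 20,000 genes, a direct comparison without any BED intermediate is likely fast enough. The sketch below is a minimal plain-Python example. The function names and the toy coordinates are made up for illustration. It assumes both genes and CNVs are given as 1-based inclusive intervals, and that the only naming difference between the files is a possible `chr` prefix (Ensembl GTFs use `1`, many VCFs use `chr1`).

```python
from collections import defaultdict


def normalize_chrom(chrom):
    """Strip a leading 'chr' so 'chr1' and '1' compare equal (assumed convention)."""
    return chrom[3:] if chrom.lower().startswith("chr") else chrom


def index_genes_by_chrom(genes):
    """Group (chrom, start, end, name) gene tuples by chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end, name in genes:
        by_chrom[normalize_chrom(chrom)].append((start, end, name))
    return by_chrom


def genes_overlapping(by_chrom, chrom, start, end):
    """Return the names of genes sharing at least one base with [start, end]."""
    hits = []
    for g_start, g_end, name in by_chrom.get(normalize_chrom(chrom), []):
        # Two closed intervals overlap when each one starts before the other ends.
        if g_start <= end and start <= g_end:
            hits.append(name)
    return hits


# Toy data for illustration only.
toy_genes = [("1", 1000, 5000, "AAA"), ("1", 8000, 9000, "BBB"), ("2", 100, 200, "CCC")]
toy_cnvs = [("chr1", 4500, 8200), ("2", 300, 400)]

gene_index = index_genes_by_chrom(toy_genes)
for chrom, start, end in toy_cnvs:
    print(chrom, start, end, genes_overlapping(gene_index, chrom, start, end))
```

This is a linear scan per CNV, which is fine at this scale. For much larger inputs, an interval tree or sorted-interval approach would be the usual replacement.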

2.9 years ago
liorglic ★ 1.4k

Hi. I am not sure I understand exactly what you want to do, but I suggest you look into a few Python libraries specifically designed to handle bioinformatics file formats. Biopython is a generally useful one, although for GFF/GTF files I recommend gffutils or gffpandas. For VCF files you could use PyVCF. Finally, if you are willing to take the time to learn a more general framework, take a look at Hail.
Another piece of advice: make sure you understand the formats you are working with before parsing them. It will save you a lot of bugs and issues down the road.
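As an illustration of why the format details matter, here is a minimal standard-library sketch of pulling a CNV interval out of one VCF data line. It assumes the caller writes the end coordinate in the `END` key of the INFO column and the type in `SVTYPE`, as the VCF specification describes for structural variants. Not every caller does this, so check your own file. The example line and the function names are invented.

```python
def parse_info(info_field):
    """Split a VCF INFO column into a dict; flag-style keys map to True."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def cnv_interval(vcf_line):
    """Return (chrom, pos, end, svtype) for one VCF data line, or None if END is absent."""
    cols = vcf_line.rstrip("\n").split("\t")
    chrom, pos = cols[0], int(cols[1])
    info = parse_info(cols[7])
    if "END" not in info:
        return None
    # POS is 1-based in VCF. For symbolic alleles such as <DEL>, the spec says
    # POS is the base before the event, so the affected bases start at pos + 1.
    return chrom, pos, int(info["END"]), info.get("SVTYPE")


# Invented example line, for illustration only.
example_line = "chr1\t4500\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=8200;IMPRECISE"
print(cnv_interval(example_line))
```

Keep in mind that GTF coordinates are 1-based and inclusive, while BED is 0-based and half-open, so converting to BED along the way is a common source of off-by-one errors.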

Good luck!
