I want to map the gene of a list of CNVs in a VCF. I know I can do this (and also it is the most correct way to do that) via Via BEDOPS and bedtool. However, I want to do this exclusively with Python (not subprocess) because this is the beginning of a long project (the mapping is just the first step) and I intend to use this on Windows. This will work in a hospital and IT is quite strict installing VM or alternative approaches. But with a python workflow should be ok.
What I have done so far is to download Homo_sapiens.GRCh38.105.chr.gtf from ensemble and get only protein_coding genes doing this
df.loc[(df['feature'] == "gene") & (df['gene_biotype'] == "protein_coding")]
so now I have a data frame with 19964 rows (I guess one per different gene I need to check this). <--- By the way, is this approach correct? I am quite unfamiliar with this file
What could be the next logical/pythonic step?? Shall I convert start, end and gene_name into a bed file o directly match this against my VCF file.
More info: I expect that my WGS data will have no more than 200 CNVs so at this point I am not worried about computing-resource limitations.