Hello! I haven't seen a similar question, so:
I have a list of genes and several VCF files. I would like to check every VCF file in a directory against the gene list and, whenever a gene matches, write the full line to a single table (e.g. Excel). The first column should hold the name of the file the match came from.
At the moment I have a filter script that handles each file separately, but I don't know how to walk a directory tree and return everything in a single table.
import os
import sys
from glob import glob

from pandas import DataFrame

# Read the gene names from the first column, skipping the header line.
with open("./genes_rp.txt", 'r') as gene_file:
    gene_list = gene_file.readlines()[1:]

final_list = list()
for gene in gene_list:
    gene = gene.strip('\n').split('\t')
    final_list.append(gene[0].strip())

# sys.argv[1] is the folder that holds the *prefiltered.txt files.
sample_folder = glob(os.path.join(sys.argv[1], '*prefiltered.txt'))

# Loop over every matching file (the earlier [1:] slice skipped the first one).
for sample_path in sample_folder:
    with open(sample_path, 'r') as sample_file:
        sample = sample_file.readlines()

    header = sample[0].strip('\n').split('\t')
    output = list()
    output.append(header)

    # Keep only the variants whose gene (first column) is in the list.
    for variant in sample[1:]:
        variant = variant.strip('\n').split('\t')
        variant_gene = variant[0]
        if variant_gene in final_list:
            output.append(variant)

    df = DataFrame(output)

    # One Excel file per sample; the header is already the first row of output.
    df.to_excel(sample_path + '_rp.xlsx', sheet_name='sheet1', header=False, index=False)
The script above is useful if you have a VCF with a lot of genes and you only want to see a few of them.
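For the single-table version, something along these lines might work. It is only a minimal sketch, assuming every sample file is tab-separated with a header row, that the gene name sits in the first column, and that all files share the same columns. The names `load_genes`, `collect_matches`, `source_file` and `all_samples_rp.xlsx` are made up for illustration and are not from the script above.

```python
from pathlib import Path

import pandas as pd


def load_genes(gene_file):
    """Return the gene names from the first tab-separated column, skipping the header line."""
    with open(gene_file) as handle:
        lines = handle.read().splitlines()[1:]
    return {line.split("\t")[0].strip() for line in lines if line.strip()}


def collect_matches(root_dir, genes, pattern="*prefiltered.txt"):
    """Search root_dir and all its subfolders, and stack the matching rows into one DataFrame."""
    frames = []
    for path in Path(root_dir).rglob(pattern):
        # Assumption: tab-separated text with a header row; read everything as strings.
        table = pd.read_csv(path, sep="\t", dtype=str)
        gene_col = table.columns[0]  # assumption: gene name is in the first column
        hits = table[table[gene_col].str.strip().isin(genes)].copy()
        # Hypothetical column name; it records which file each row came from.
        hits.insert(0, "source_file", path.name)
        frames.append(hits)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


def main(root_dir, gene_file="./genes_rp.txt", out_path="all_samples_rp.xlsx"):
    """Write one Excel table for the whole directory tree (needs an Excel writer such as openpyxl)."""
    combined = collect_matches(root_dir, load_genes(gene_file))
    combined.to_excel(out_path, sheet_name="sheet1", index=False)
```

Calling `main(sys.argv[1])` would then produce one table for the whole tree instead of one file per sample. `Path.rglob` does the recursive search, and `pd.concat` stacks the per-file matches, so the file name in the first column is what tells the rows apart.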