I have a list of 5000 genes with their start and end positions in BED format. I need to obtain the length of the coding regions of these genes (or, preferably, of the regions covered by the exome sequencing data in my VCF files). What is the best approach?
What is your overall aim with this? If you want to know which variants fall within coding regions then your easiest bet is to just use variant annotation software, like the Ensembl VEP. No need to mess about identifying the coding regions yourself.
Hi Emily, thank you for your reply. I am working on some mathematical approximations, so I need the average gene length (counting exonic sequence only) for the genes in my list, and then the average number of SNPs per gene from my VCF file for that same list.
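To make the calculation concrete, here is a minimal Python sketch of what I mean, not a definitive implementation. It assumes the exons are available as BED-style records (0-based, half-open) labelled with a gene name, and the SNPs as (chromosome, position) pairs with 1-based positions as in a VCF. The function names and the toy data are hypothetical, made up for illustration. Overlapping exons of the same gene are merged first so that no base is counted twice.

```python
from collections import defaultdict
from statistics import mean

def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def gene_stats(exons, snps):
    """Return {gene: (exonic_length, snp_count)}.

    exons: list of (chrom, start, end, gene), BED coordinates (0-based, half-open).
    snps:  list of (chrom, pos), 1-based positions as in a VCF.
    Assumes gene names are unique and each gene sits on one chromosome.
    """
    by_gene = defaultdict(list)
    for chrom, start, end, gene in exons:
        by_gene[(chrom, gene)].append((start, end))

    stats = {}
    for (chrom, gene), intervals in by_gene.items():
        merged = merge_intervals(intervals)
        length = sum(end - start for start, end in merged)
        # A 1-based position p lies in a 0-based half-open interval
        # [start, end) when start < p <= end.
        count = sum(
            1
            for snp_chrom, pos in snps
            if snp_chrom == chrom
            and any(start < pos <= end for start, end in merged)
        )
        stats[gene] = (length, count)
    return stats

# Toy data, invented for illustration only.
exons = [
    ("chr1", 100, 200, "GENE_A"),
    ("chr1", 150, 250, "GENE_A"),  # overlaps the previous exon
    ("chr1", 400, 500, "GENE_A"),
    ("chr2", 0, 50, "GENE_B"),
]
snps = [("chr1", 101), ("chr1", 250), ("chr1", 251),
        ("chr1", 450), ("chr2", 50), ("chr2", 51)]

stats = gene_stats(exons, snps)
mean_length = mean(length for length, _ in stats.values())
mean_snps = mean(count for _, count in stats.values())
```

The SNP counting here is a simple nested scan, which is fine for a toy example but slow for a whole exome; at that scale an interval tool such as bedtools (merge, then intersect) would probably be the more practical route.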