How to obtain the length of coding regions for the list of genes?
8.6 years ago
MAPK ★ 2.1k

I have a list of 5000 genes and their start and end positions in BED format. I need to obtain the length of the coding regions (preferably, the regions covered by the exome sequencing data in my VCF files) of these genes. What is the best approach to follow?

exome length coding region • 5.6k views

What is your overall aim with this? If you want to know which variants fall within coding regions then your easiest bet is to just use variant annotation software, like the Ensembl VEP. No need to mess about identifying the coding regions yourself.


Hi Emily, Thank you for your reply. I am working on some mathematical approximations, so I need to do this just to get the average gene length (i.e. exome length only) in my VCF file for the list of genes. I then need to get the average number of SNPs per gene from that list.
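For the second half of that aim (average number of SNPs per gene), a minimal awk sketch could look like the one below. It assumes an uncompressed VCF, a BED file with the gene name in column 4, and matching chromosome names in both files (for example both `chr1`, or both `1`). The function name `snps_per_gene` is made up for illustration. It counts every VCF record inside each BED interval, so to restrict the count to coding sequence you would pass a BED of coding intervals rather than whole-gene spans.

```shell
# Count VCF records per BED interval and report the mean count per interval.
# $1: BED file (chrom, start, end, name, ...), $2: uncompressed VCF file.
snps_per_gene() {
  awk -F'\t' -v OFS='\t' '
      # First file (BED): remember every interval.
      NR == FNR { n++; chrom[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; name[n] = $4; next }
      # Second file (VCF): skip header lines.
      /^#/ { next }
      {
        # BED is 0-based half-open and VCF POS is 1-based,
        # so a position is inside an interval when start < POS <= end.
        for (i = 1; i <= n; i++)
          if ($1 == chrom[i] && $2 > s[i] && $2 <= e[i]) count[i]++
      }
      END {
        for (i = 1; i <= n; i++) { print name[i], count[i] + 0; sum += count[i] }
        if (n > 0) print "MEAN", sum / n
      }' "$1" "$2"
}
```

Usage would be something like `snps_per_gene genes.bed variants.vcf`, where both file names are placeholders. The nested loop is slow for large inputs; `bedtools intersect -c` does the same counting far more efficiently.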

8.6 years ago
venu 7.1k

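If I understand it properly, you need to find the exon length of the 5000 genes you have. I would download the human GTF from Ensembl and calculate the length of exons (all exons combined) for each gene. Note that the file below is GRCh38 (release 84); if your coordinates are hg19, use a GRCh37 GTF instead. Something like: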

wget "ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz"

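Calculate exon lengths. GTF coordinates are 1-based and inclusive, so each exon spans end - start + 1 bases. This sums every exon record, so an exon shared by several transcripts is counted once per transcript.

zcat Homo_sapiens.GRCh38.84.gtf.gz | awk '$3=="exon" {print $10"\t"($5-$4+1)}' | sed -e 's/"//g' -e 's/;//' | bedtools groupby -i - -g 1 -c 2 -o sum > Exon_lengths.txt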

Then extract your list of genes from Exon_lengths.txt file with grep.
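Summing all exon records includes UTRs and counts bases shared between transcripts more than once. If the coding length is what matters, one possible variation is to keep only `CDS` features and merge overlapping intervals per gene before summing. The sketch below does that with plain awk and sort; the function name `cds_length_per_gene` is invented for this example, and it assumes an Ensembl-style GTF where column 9 contains `gene_id "..."`.

```shell
# Sum the non-overlapping CDS bases per gene from a GTF stream on stdin.
# Output: gene_id <TAB> merged CDS length
cds_length_per_gene() {
  awk -F'\t' '$3 == "CDS" {
        # Pull the gene_id value out of the attribute column (column 9).
        if (match($9, /gene_id "[^"]+"/)) {
          id = substr($9, RSTART + 9, RLENGTH - 10)
          print id "\t" $4 "\t" $5
        }
      }' |
  sort -k1,1 -k2,2n |
  awk -F'\t' -v OFS='\t' '
      # New gene: flush the previous one and start a fresh merged block.
      $1 != gene {
        if (gene != "") print gene, total + (end - start + 1)
        gene = $1; start = $2 + 0; end = $3 + 0; total = 0
        next
      }
      # Interval starts beyond the current block: close the block.
      ($2 + 0) > end {
        total += end - start + 1
        start = $2 + 0; end = $3 + 0
        next
      }
      # Overlapping interval: extend the current block if it reaches further.
      ($3 + 0) > end { end = $3 + 0 }
      END { if (gene != "") print gene, total + (end - start + 1) }'
}
```

It would be run as `zcat Homo_sapiens.GRCh38.84.gtf.gz | cds_length_per_gene > CDS_lengths.txt` (the output file name is arbitrary). Changing `"CDS"` to `"exon"` gives the merged exon length instead.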

Since a VCF file only contains variant positions, how would one get the exons covered from the VCF file? (Maybe I did not understand this part properly.)


Thanks Venu. How can I convert Ensembl gene IDs to HGNC symbols?
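One possible route, assuming the same Ensembl GTF is at hand: its `gene` records carry a `gene_name` attribute next to `gene_id`, and for human genes that name is generally the HGNC symbol where one exists. The sketch below builds a two-column lookup table from those records. The function name `ensembl_id_to_name` is hypothetical, and genes without a `gene_name` are written as `NA`.

```shell
# Build a gene_id -> gene_name table from the "gene" records of a GTF on stdin.
ensembl_id_to_name() {
  awk -F'\t' -v OFS='\t' '$3 == "gene" {
      id = ""; name = ""
      # Extract the quoted values of gene_id and gene_name from column 9.
      if (match($9, /gene_id "[^"]+"/))   id   = substr($9, RSTART + 9,  RLENGTH - 10)
      if (match($9, /gene_name "[^"]+"/)) name = substr($9, RSTART + 11, RLENGTH - 12)
      if (id != "") print id, (name != "" ? name : "NA")
  }'
}
```

For example, `zcat Homo_sapiens.GRCh38.84.gtf.gz | ensembl_id_to_name > id_to_symbol.txt` (output name arbitrary) gives a table that can be joined to the per-gene lengths. Ensembl BioMart is another common way to get the same mapping.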

8.6 years ago
cfarmeri ▴ 210

chr1 100000 101000 geneX . +

If your BED file looks like the line above, you can obtain the gene length with an awk command:

awk 'BEGIN{OFS="\t"}{print $4,$3-$2}' <your BED>

@cfarmeri Is it going to give me the exome (coding) length? Are you using any VCF files in this awk command? To get the length of the whole gene, I could simply subtract the start position from the end position, but I need the coding length only. Thanks
