Question

How to obtain the length of coding regions for the list of genes?

9.2 years ago

MAPK ★ 2.1k

I have a list of 5000 genes with their start and end positions in BED format. I need the length of the coding regions of these genes (preferably the regions covered by the exome sequencing data in my VCF files). What is the best approach?

exome length coding region • 6.2k views

updated 9.2 years ago by venu 7.1k • written 9.2 years ago by MAPK ★ 2.1k

What is your overall aim with this? If you want to know which variants fall within coding regions then your easiest bet is to just use variant annotation software, like the Ensembl VEP. No need to mess about identifying the coding regions yourself.

9.2 years ago by Emily 24k

Hi Emily, thank you for your reply. I am working on some mathematical approximations, so I need the average gene length (exonic length only) for the list of genes in my VCF file. I then need the average number of SNPs per gene for that list.

9.2 years ago by MAPK ★ 2.1k

9.2 years ago

cfarmeri ▴ 210

chr1 100000 101000 geneX . +

If your BED file looks like the line above, you can obtain the gene length with an awk command:

awk 'BEGIN{OFS="\t"}{print $4,$3-$2}' <your BED>

9.2 years ago by cfarmeri ▴ 210

@cfarmeri Will this give me the exome (coding) length? Does this awk command use any VCF files? To get the length of the whole gene I could simply subtract the start position from the end position, but I need the coding length only. Thanks

9.2 years ago by MAPK ★ 2.1k

score 5 · Accepted Answer · 2016-04-27

9.2 years ago

venu 7.1k

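If I understand correctly, you need the total exon length for each of your 5000 genes. I would download the GTF from Ensembl and calculate the combined length of all exons for each gene. Note that the release 84 file below is GRCh38 (hg38), not hg19, so pick the build that matches your BED coordinates. Something like: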

wget "ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/Homo_sapiens.GRCh38.84.gtf.gz"

Calculate exon lengths (GTF coordinates are 1-based and inclusive, so each exon spans end - start + 1 bases)

zcat Homo_sapiens.GRCh38.84.gtf.gz | awk '{if($3=="exon") print $10 "\t" ($5-$4+1)}' | sed -e 's/"//g' -e 's/;//' | bedtools groupby -i - -g 1 -c 2 -o sum > Exon_lengths.txt
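Summing every exon line counts an exon once for each transcript it belongs to, so genes with many isoforms come out longer than they are. The following is a minimal sketch of one way around that, assuming only POSIX awk and sort: it merges overlapping intervals within each gene before summing. The function name `merged_feature_length` is hypothetical, and the sketch assumes each gene_id sits on a single chromosome.

```shell
# Hypothetical helper: non-redundant length per gene for one GTF feature type.
# Usage: merged_feature_length FEATURE < annotation.gtf   (FEATURE defaults to exon)
merged_feature_length() {
    feature="${1:-exon}"
    awk -F'\t' -v feat="$feature" '
        $3 == feat {
            # pull the gene_id value out of the attribute column
            if (match($9, /gene_id "[^"]+"/)) {
                id = substr($9, RSTART + 9, RLENGTH - 10)
                print id "\t" $4 "\t" $5
            }
        }' |
    LC_ALL=C sort -k1,1 -k2,2n |
    awk -F'\t' -v OFS='\t' '
        # add the interval currently being extended to the running gene total
        function close_interval() { total += cur_end - cur_start + 1 }
        $1 != gene {                          # first interval of a new gene
            if (NR > 1) { close_interval(); print gene, total }
            gene = $1; total = 0; cur_start = $2 + 0; cur_end = $3 + 0
            next
        }
        $2 + 0 <= cur_end {                   # overlaps the open interval: extend it
            if ($3 + 0 > cur_end) cur_end = $3 + 0
            next
        }
        {                                     # gap: close the old interval, open a new one
            close_interval(); cur_start = $2 + 0; cur_end = $3 + 0
        }
        END { if (NR > 0) { close_interval(); print gene, total } }
    '
}
```

For example, `zcat Homo_sapiens.GRCh38.84.gtf.gz | merged_feature_length CDS > CDS_lengths.txt` would restrict the sum to coding sequence rather than all exons (the output file name is only an illustration).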

Then extract your list of genes from Exon_lengths.txt file with grep.
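A plain grep can also pick up partial matches (one ID that is a prefix of another), so an exact match on the first column may be safer. A minimal sketch, assuming a hypothetical file `gene_ids.txt` with one Ensembl gene ID per line; the function name `select_genes` is also made up for illustration:

```shell
# Hypothetical helper: keep rows of a lengths table whose first column is in an ID list.
# Usage: select_genes gene_ids.txt Exon_lengths.txt
select_genes() {
    # first file: remember every wanted ID; second file: print rows whose ID was seen
    awk 'NR == FNR { wanted[$1]; next } $1 in wanted' "$1" "$2"
}
```

The selected rows can then be piped into something like `awk '{ s += $2 } END { if (NR) print s / NR }'` to get the mean length over the list.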

Since a VCF file contains only variant positions, I am not sure how one would determine which exons are covered from the VCF file (maybe I have misunderstood the question).
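For the related goal of counting variants per gene, one option is to count the VCF records that fall inside each BED interval. The sketch below uses awk only and makes several assumptions: the function name `variants_per_gene` is hypothetical, chromosome names are spelled the same way in both files, and every VCF record is counted (indels included, not only SNPs). It compares each variant against every interval, so for large inputs `bedtools intersect -c` is likely a faster choice.

```shell
# Hypothetical helper: count VCF records falling inside each BED interval.
# Usage: variants_per_gene genes.bed variants.vcf
variants_per_gene() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR {                      # first file: BED (0-based start, end exclusive)
            n++
            chrom[n] = $1; start[n] = $2 + 0; stop[n] = $3 + 0
            name[n] = $4; count[n] = 0
            next
        }
        /^#/ { next }                    # skip VCF header lines
        {
            pos = $2 + 0                 # VCF POS is 1-based
            for (i = 1; i <= n; i++)
                if ($1 == chrom[i] && pos > start[i] && pos <= stop[i]) count[i]++
        }
        END { for (i = 1; i <= n; i++) print name[i], count[i] }
    ' "$1" "$2"
}
```

Averaging the second column of the output gives the mean number of variants per gene for the list.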

9.2 years ago by venu 7.1k

Thanks Venu. How can I convert Ensembl gene IDs to HGNC symbols?

9.2 years ago by MAPK ★ 2.1k

Ensembl ID to Gene Symbol

9.2 years ago by venu 7.1k
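Ensembl GTF files usually carry a gene_name attribute next to gene_id on their gene lines, so one possible route is to build the ID-to-symbol table from the same GTF. A minimal sketch, assuming POSIX awk; the function name `gene_id_to_symbol` is hypothetical, and genes without a gene_name attribute are skipped:

```shell
# Hypothetical helper: print "gene_id<TAB>gene_name" for every gene line of a GTF.
# Usage: zcat annotation.gtf.gz | gene_id_to_symbol
gene_id_to_symbol() {
    awk -F'\t' -v OFS='\t' '
        $3 == "gene" {
            id = ""; name = ""
            # extract the quoted values that follow gene_id and gene_name
            if (match($9, /gene_id "[^"]+"/))   id   = substr($9, RSTART + 9,  RLENGTH - 10)
            if (match($9, /gene_name "[^"]+"/)) name = substr($9, RSTART + 11, RLENGTH - 12)
            if (id != "" && name != "") print id, name
        }'
}
```

The symbols in gene_name mostly follow HGNC for human genes but are not guaranteed to match the current HGNC release, so a dedicated lookup may still be needed for strict HGNC symbols.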