Getting avg gene length from GTF (for all splice isoforms)
2
0
Entering edit mode
7.5 years ago
rbronste ▴ 420

Hi,

Just looking for a way to get avg gene length from a GTF file acquired from UCSC table browser. At this stage not looking at alternative transcripts. Thanks.

Rob.

RNA-Seq UCSC table browser GTF • 3.4k views
ADD COMMENT
1
Entering edit mode
7.5 years ago

With a mix of Kent tools, BEDOPS and Unix utilities:

$ fetchChromSizes hg19 | awk '{print $1"\t0\t"$2}' > hg19.bed
$ gtf2bed < annotations.gtf | grep 'gene' | bedmap --echo --echo-map-size --delim '\t' hg19.bed - | awk 'BEGIN{ OFS="\t" }{ s=0; n=split($4,a,";"); if (n != 0) { for(i in a) { s+=a[i]; } m=s/n; print $1,$2,$3,m; } } else { print $1,$2,$3,"NA"; } }' > answer.bed
ADD COMMENT
0
Entering edit mode

The first two steps work well after installing and running the fetchChromSizes bash script. However getting the following error when trying the gtf2bed:

* Error in `convert2bed': realloc(): invalid next size: 0x0000000000bcf390 * /sonas-hs/smith/hpc_norepl/home/software/bedops/bin/gtf2bed: line 122: 700 Aborted (core dumped) ${cmd} ${options} - 0<&0 awk: (FILENAME=- FNR=1) fatal: division by zero attempted

ADD REPLY
0
Entering edit mode

What version of gtf2bed/convert2bed are you using? (convert2bed --version should give this info to you)

ADD REPLY
0
Entering edit mode

Using version: 2.4.25

ADD REPLY
0
Entering edit mode

There is an update in 2.4.26 that deals with segmentation faults in GTF conversion: https://bedops.readthedocs.io/en/latest/content/revision-history.html

If you update your version of BEDOPS to 2.4.26 and still run into errors, can you post your GTF somewhere or point me to a download (UCSC goldenpath or table name), as I'd like to try out conversion on this end, in that case.

I'll update my awk statement to deal with divide-by-zero errors. On reflection, you'd still get that error with conversion with chromosomes that have no overlapping genes, unless that case is dealt with.

ADD REPLY
1
Entering edit mode
7.5 years ago
Rohit ★ 1.5k

Assuming that you have it in the original GTF format

awk -F'\t' '$3=="gene"' infile.gtf | awk -F'\t' '{sum+=$5-$4; count+=1} END {print sum/count}'
ADD COMMENT
0
Entering edit mode

I have it in the originally exported from UCSC format, refSeq GTF file of gene components. However get a similar error to before with the suggested awk command:

awk: fatal: division by zero attempted

ADD REPLY
0
Entering edit mode

This can happen if none of the values in the third column are of the category "gene". Would be easier if you provide a sample of your file.

ADD REPLY

Login before adding your answer.

Traffic: 2541 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6