Hello Biostars community!
Can you kindly suggest me a method to compute the total genic and intergenic length of a chromosome?
Thank you very much for your help!
Hello Biostars community!
Can you kindly suggest me a method to compute the total genic and intergenic length of a chromosome?
Thank you very much for your help!
Since you have a GTF file, you can do the following in R:
library(GenomicRanges)
library(rtracklayer)
gtf = import("genes.gtf") # or whatever it's called
foo = split(gtf, seqnames(gtf))
sapply(foo, function(x) sum(width(reduce(x))))
You now have a list printed to the screen of the number of bases in a gene in each chromosome. You can merge the last two lines together if you'd like.
It looks like you have a .gtf file. That means you can extract the exon lines from the .gtf file and count and sum up the exonic intervals.
You can generate a sorted .bed file of exon coordinates by:
grep -P '\texon\t' your.gtf | cut -f 1,4,5 | sort -k1,1 -k2,2n > exons.bed
You can merge this exons.bed using bedtools:
bedtools merge -i exons.bed > exons.merged.bed
You can count/sum the intervals in the merged bed file:
awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}' exons.merged.bed
This should give you number of genic bases, if you are defining genic by just exons. To get intergenic, just sum up your chromosome lengths and subtract the genic number.
You can do all this in one line also.
grep -P '\texon\t' your.gtf | cut -f 1,4,5 | sort -k1,1 -k2,2n | bedtools merge -i stdin | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'
Hello everyone,
You can't extract the exon lines directly from the .gtf file and work with it like if it was a bed file. Bed files are 0-based and gtf files are 1-based. First, we have to convert the gtf file in a bed file. We can do it with sortBed
and gff2bed
from Bedops
:
# Convert from gff to bed
sortBed -i your.gtf | gff2bed > your.bed
And then extract the columns of interest from your new bed file.
You can check an example in my repository: https://github.com/tmontserrat/proportion_exon_regions
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Do you have annotations available for this genome?
Hello genomax,
Yes, I have the coordinates of annotated genes in a bed file. and I also have a sizes.genome file with the chromosomes sizes.
I assume you are asking this question because you are not familiar with a scripting language? If you are, you should be able to add up intervals for your genes using bed file start/stops and keep track of regions in between.
There may be a way to do this using
bedops
,awk
and/or a combination of both. Can you post the output of 10-15 lines (head -15 your.bed
) from your bed file to give people an idea of what you have?I am posting here a 10-line snapshot of my genes.bed: thank you for your help!
I did the following but i am not sure its correct:
I merged overlapping genes with bedtools merge. then I created 100 bp bins of the merged intervals using bedtools window maker. then summed up the 100bp bins. I took this as the total genic length. is it correct? because when I sum up the differences of the merged intervals without binning, the value is different.
to get the intergenic length, I used bedtoools complement. and made 100bp bins of complement intervals and summed up these bins to get the "intergenic length". is this correct?
Is there a better way to do it?
Many thanks!
What does
gene
equate to,exons
?gene equates to TSS to TES. exon coordinates are within these intervals in a separate gtf file that I didn't show here.
does genic length mean excluding introns? because for my analyses I need to take intron coordinates into account. in that case, do I have to extract intron coordinates and use these in the intergenic calculation? thanks for your help
If you only want to include what gets translated then you would need to exclude introns/UTR's etc. Otherwise TSS to TES is the full gene sequence.
What does that mean?
I am not sure what you are doing above in terms of
binning
things. Why not keep a single interval from TSS to TES asgene
.Edit: Yeah, never mind.