I would like to get a table from ensembl that includes the number of exons for each transcript. I can't find any options that correspond to this on biomart, though.
I wouldn't use biomaRt, but try AWK instead. If you download the annotation GTF file from Ensembl, you can try something like this with AWK (note that it keys on $10, the gene_id value, so it counts exon lines per gene rather than per transcript):
awk '$3=="exon" {print $0}' Homo_sapiens.GRCh38.78.gtf | awk '{ count[$10]++ } END { for (word in count) print word, count[word]}' > numberOfExonsPerGene.txt
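For per-transcript counts, one option is to key on the transcript_id attribute instead of a fixed field position. This is only a minimal sketch, assuming a standard tab-separated GTF whose ninth column contains an attribute of the form transcript_id "..."; the function name count_exons_per_transcript is made up for illustration:

```shell
# Count exon records per transcript_id in a GTF file.
# Hypothetical helper; assumes tab-separated GTF with attributes in column 9.
count_exons_per_transcript() {
    awk -F '\t' '
        # Skip header and comment lines.
        /^#/ { next }
        # Keep only exon features (column 3).
        $3 == "exon" {
            # Locate transcript_id "..." inside the attribute column.
            if (match($9, /transcript_id "[^"]+"/)) {
                # Drop the 15-character prefix and the closing quote.
                id = substr($9, RSTART + 15, RLENGTH - 16)
                count[id]++
            }
        }
        # Print one line per transcript: ID, tab, number of exon records.
        END { for (id in count) print id "\t" count[id] }
    ' "$1"
}
```

It could be called as count_exons_per_transcript Homo_sapiens.GRCh38.78.gtf > numberOfExonsPerTranscript.txt. The output order is whatever awk's array iteration yields, so pipe through sort if you need it ordered.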
Using UCSC (not BioMart):
$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -P 3306 -D hg38 -e 'select chrom,name,exonCount from wgEncodeGencodeBasicV28;'
+-------+-------------------+-----------+
| chrom | name              | exonCount |
+-------+-------------------+-----------+
| chr1  | ENST00000619216.1 |         1 |
| chr1  | ENST00000473358.1 |         3 |
| chr1  | ENST00000469289.1 |         2 |
| chr1  | ENST00000607096.1 |         1 |
| chr1  | ENST00000417324.1 |         3 |
| chr1  | ENST00000641515.2 |         3 |
| chr1  | ENST00000335137.4 |         1 |
| chr1  | ENST00000466430.5 |         4 |
| chr1  | ENST00000495576.1 |         2 |
| chr1  | ENST00000610542.1 |         4 |
| chr1  | ENST00000493797.1 |         2 |
| chr1  | ENST00000484859.1 |         2 |
| chr1  | ENST00000466557.6 |         8 |
| chr1  | ENST00000410691.1 |         1 |
| chr1  | ENST00000496488.1 |         2 |
| chr1  | ENST00000612080.1 |         1 |
| chr1  | ENST00000635159.1 |         2 |
| chr1  | ENST00000426406.3 |         1 |
(...)
+-------+-------------------+-----------+
Or use R with a transcript database (you can make your own from any GTF using the makeTxDbFromGFF function from the GenomicFeatures package):
## setup transcriptDb (loading the TxDb package also attaches GenomicFeatures)
library(TxDb.Mmusculus.UCSC.mm10.ensGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.ensGene
## get exon locations for each gene (use by='tx' to group by transcript instead)
exons <- exonsBy(txdb,'gene')
## print number of exons for each gene - just look at the top 6 with 'head'
head(sapply(exons,length))
ENSMUSG00000000001 ENSMUSG00000000003 ENSMUSG00000000028 ENSMUSG00000000031
                 9                  9                 24                 15
ENSMUSG00000000037 ENSMUSG00000000049
                41                 16
I cannot see a way either to get the number of exons in each transcript directly via BioMart.
But you can choose Exon stable ID as an output attribute and then use a simple awk script to count the exons per transcript. Let's assume you have chosen the attributes Gene stable ID, Transcript stable ID and Exon stable ID for the output. Make sure you tick the option "Unique results only" and download as a TSV.
$ awk -v FS="\t" -v OFS="\t" 'NR>1 {transcript[$2]++;} END { for(t in transcript) print t, transcript[t] }' mart_export.txt
This builds an associative array keyed by the transcript ID given in column 2 and increments the count for every line carrying that ID (NR>1 skips the header line). At the end we iterate over the array and print each count. If the transcript ID is in a different column in your output, change the $2 to whatever it is.
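If you would rather not hard-code the column number, a variant can look the column up by its header name. This is a rough sketch, assuming the export is tab-separated and its first line contains a column labelled exactly "Transcript stable ID"; the function name count_exons_from_mart is hypothetical:

```shell
# Count exons per transcript in a BioMart TSV export, locating the
# transcript column by its header label instead of by position.
# Hypothetical helper; assumes a header line with "Transcript stable ID".
count_exons_from_mart() {
    awk -F '\t' -v OFS='\t' -v header="Transcript stable ID" '
        NR == 1 {
            # Find which column carries the transcript IDs.
            for (i = 1; i <= NF; i++) if ($i == header) col = i
            if (!col) {
                print "column not found: " header > "/dev/stderr"
                exit 1
            }
            next
        }
        # Every remaining line is one (unique) exon of that transcript.
        { count[$col]++ }
        END { for (t in count) print t, count[t] }
    ' "$1"
}
```

Usage would be count_exons_from_mart mart_export.txt. As with the one-liner, this relies on "Unique results only" being ticked so that each exon appears once per transcript.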
fin swimmer
A tool to find genes of user-specific exon number
I don't think BioMart stores such aggregate data. You should be able to write a custom query against the UCSC MySQL tables, if you are not set on using Ensembl. If you need Ensembl, you might need to get the CDS information and count exons yourself.
In addition to RamRS's suggestion, if you can convert the GTF to exons, for example using this, you can groupBy the transcript and count the number of exons per transcript.
If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.