Hey,
I urgently need to download the exon lengths of all genes, keyed by HUGO gene symbol. Can anyone help me with that? I was recommended BioMart or Ensembl, but I am new to both, so any advice or pointers on using these tools would help a lot.
Thanks
Using BioMart here: EnsEMBL BioMart
This gives you a list of all exons for all Ensembl genes that can be associated with a HUGO gene name. The chromosome start and end position of each exon can be used to calculate the exon size. Screenshots follow below and you can download the actual result here: HumanExons_Ensembl_v65.tsv.zip (contains 1,109,487 exon records)
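As a small illustration of the size calculation: Ensembl coordinates are 1-based and inclusive, so an exon's size is end - start + 1. The sketch below assumes a tab-separated BioMart export; the column names and the two example records are hypothetical, so adjust them to match the header of the file you actually download.

```python
import csv
import io

# Hypothetical column names; check the header of your own BioMart export.
START_COL = "Exon Chr Start (bp)"
END_COL = "Exon Chr End (bp)"


def exon_size(start, end):
    """Size of an exon given 1-based, inclusive start and end coordinates."""
    return end - start + 1


def exon_sizes_from_tsv(handle):
    """Yield (row, size) for each exon record in a tab-separated export."""
    for row in csv.DictReader(handle, delimiter="\t"):
        yield row, exon_size(int(row[START_COL]), int(row[END_COL]))


# Made-up two-exon example, not real annotation data.
example = io.StringIO(
    "HGNC symbol\tExon Chr Start (bp)\tExon Chr End (bp)\n"
    "GENE1\t100\t199\n"
    "GENE1\t300\t349\n"
)
sizes = [size for _, size in exon_sizes_from_tsv(example)]
print(sizes)
```

For a real file, pass an open file handle instead of the in-memory example.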
You can use the UCSC MySQL database and awk to do this:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -P 3306 \
-e "select chrom,txStart,txEnd,K.name,X.geneSymbol,strand,exonStarts,exonEnds \
from knownGene as K,kgXref as X where X.kgId=K.name limit 10;" \
| awk 'BEGIN { OFS = "\t"; FS = "\t" }
# Skip the header row printed by mysql
(NR > 1) {
# exonStarts and exonEnds are comma-separated lists with a trailing comma
n = split($7, astarts, /,/);
split($8, aends, /,/);
sizes = ""
for (i = 1; i <= n; i++) {
# Skip the empty field left by the trailing comma (a start of 0 is kept)
if (astarts[i] == "") continue
sizes = sizes (aends[i] - astarts[i]) ","
}
# Columns: chrom, txStart, txEnd, symbol and transcript, score, strand, exon sizes
print $1, $2, $3, $5","$4, 1, $6, substr(sizes, 1, length(sizes) - 1)
}' > hg19.exon.lengths.bed
The last column will be the exon lengths.
Note that I added "limit 10" for testing.
You can change hg19 to your organism of choice.
For APC, e.g., I get 5 rows like the following. Which one should I select to sum the 7th column and get the total exon length for APC? The longest? Thanks.
chr5 112071116 112209835 APC,uc010jby.1 1 + 362,153,85,202,109,114,84,105,99,379,96,140,78,117,215,8687
chr5 112101454 112209835 APC,uc010jbz.1 1 + 67,153,85,202,109,114,84,105,186,99,379,96,140,78,117,215,8687
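One way to compare the rows is to sum the 7th column for each transcript separately and then decide which transcript to use. The sketch below does that for the two APC rows quoted above; picking the longest transcript is only one common convention, assumed here for illustration, and the function name is made up.

```python
def total_exon_length(line):
    """Return (name, total exon length) for one line of the output file.

    Column 4 holds "symbol,transcript" and column 7 the exon sizes.
    """
    fields = line.split()
    sizes = [int(s) for s in fields[6].split(",") if s]
    return fields[3], sum(sizes)


rows = [
    "chr5 112071116 112209835 APC,uc010jby.1 1 + "
    "362,153,85,202,109,114,84,105,99,379,96,140,78,117,215,8687",
    "chr5 112101454 112209835 APC,uc010jbz.1 1 + "
    "67,153,85,202,109,114,84,105,186,99,379,96,140,78,117,215,8687",
]

totals = dict(total_exon_length(r) for r in rows)
for name, total in totals.items():
    print(name, total)

# Assumption: take the transcript with the largest summed exon length.
longest = max(totals, key=totals.get)
print("longest:", longest)
```

For these two rows the totals are 11025 and 10916 bases, so uc010jby.1 is the longer of the two.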
An easy way without having to code much is to use the UCSC Table Browser and download the BED file for RefSeq or UCSC genes. This has all the exon start and stop coordinates. If you want to get fancy then yes, use the R package biomaRt, query based on the gene names you want, and get the exon coordinates.
+1 for full example with screenshots!
Thanks so much! I just saw your reply; I was expecting an email from BioStar and thought no one had replied to my query.
What is the meaning of "Exon Rank"? When I ran an example query in R, I saw exon ranks repeated for the same gene. So to get the length of all exons for a gene, should I just sum over all exons belonging to the gene (meaning all rows), or do I have to worry about exon ranks as well? Thanks again.
You're welcome. The exon rank just refers to the exon's position within the transcript. Since many genes have multiple transcripts, you will often see redundant ranks when grouping exons at the gene level. You should sum all exons for a transcript to get the transcript size. Because of the redundancies, you should not do that at the gene level. A slightly different strategy is needed to get the exon base count across all transcripts of a gene: merge overlapping exons into what some call an 'exon content block' or a 'squashed transcriptome', then sum the sizes of the merged blocks.
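The merging strategy can be sketched as a standard interval union. This is a minimal illustration, assuming all exons of the gene are on the same chromosome and given as 1-based, inclusive (start, end) pairs; the function name and the example coordinates are made up.

```python
def merged_exon_length(exons):
    """Total number of bases covered by at least one exon.

    exons: iterable of (start, end) pairs, 1-based and inclusive,
    all on the same chromosome.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(exons):
        if cur_end is None or start > cur_end:
            # No overlap with the current block: close it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current block if this exon reaches further.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


# Made-up exons from two transcripts of one gene; the first two overlap.
exons = [(100, 200), (150, 250), (400, 450)]
print(merged_exon_length(exons))
```

Summing the three exons naively would give 253 bases, while the merged blocks (100-250 and 400-450) cover 202.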
There is a 'CDS Length' box in the GENE section in the first picture. What is the difference between CDS length and total exon length for a gene?
The total exon length will include those portions of the exons that are UTR (untranslated regions), CDS will include only the subset of the exon lengths that are predicted to be coding/translated (i.e. the 5' and 3' UTRs are not included).
Thanks. I downloaded the cDNA sequences of C. elegans protein-coding genes from BioMart. I checked two genes and found that the cDNA length equals the total length of the exons in each case. Can we get the total exon length by calculating the cDNA length?
I wonder if there is a method to obtain a mapping of RefSeq mRNA to CDS Length.
I understand that with BioMart it is possible to query Ensembl Transcript and CDS Length by filtering with a list of given HGNC IDs or RefSeq mRNAs. But in the Attributes section, it is not possible to select both RefSeq mRNA and CDS Length. If I craft an XML or Perl query with RefSeq mRNA and CDS Length as Attributes, the API throws an error:
Attributes from multiple attribute pages are not allowed
UCSC provides two datasets, ensGene and refGene, which contain the Ensembl Transcript ID/RefSeq mRNA/HGNC symbol mapping to exon ranges. CDS length can be calculated by summing the lengths of the exon portions bounded by cdsStart and cdsEnd. However, the UCSC tables lack version numbers for both Ensembl Transcript IDs and RefSeq mRNAs. The tables themselves are not versioned either, and thus cannot be reliably and flexibly mapped to versions in the Ensembl database.
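The cdsStart/cdsEnd calculation can be sketched as follows. This assumes the UCSC gene table layout, where exonStarts and exonEnds are comma-separated strings with a trailing comma and coordinates are 0-based, half-open; the function name and the example values are made up for illustration.

```python
def cds_length(exon_starts, exon_ends, cds_start, cds_end):
    """Sum the parts of each exon that fall within [cds_start, cds_end).

    exon_starts, exon_ends: comma-separated strings as stored in the
    exonStarts and exonEnds columns (0-based, half-open coordinates).
    """
    starts = [int(s) for s in exon_starts.split(",") if s]
    ends = [int(e) for e in exon_ends.split(",") if e]
    total = 0
    for start, end in zip(starts, ends):
        # Clip the exon to the coding region; non-positive means no overlap.
        overlap = min(end, cds_end) - max(start, cds_start)
        if overlap > 0:
            total += overlap
    return total


# Made-up transcript: three 100-base exons, coding region 150..550.
print(cds_length("100,300,500,", "200,400,600,", 150, 550))
```

In this example the first and last exons each contribute 50 coding bases and the middle exon contributes all 100, giving 200. A transcript whose cdsStart equals cdsEnd yields 0.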