Obtaining Exon Lengths:
12.9 years ago
Angel ▴ 220

Hey,

I urgently need to download the exon lengths of all genes using HUGO gene symbols. Can anyone help me with that? I was recommended BioMart or Ensembl, but I am new to both, so any advice or pointers on this software would help a lot.

Thanks

human exon biomart ucsc browser • 20k views
12.9 years ago

Using BioMart here: EnsEMBL BioMart

  1. Press the 'New' button in the top left corner.
  2. Choose Database -> Ensembl Genes 65
  3. Choose Dataset -> Homo sapiens genes (GRCh37.p5)
  4. Click the 'Attributes' section on the left. Then select 'Structures'
  5. Expand the 'GENE' and 'EXON' sections by pressing the '+' signs
  6. Check the following boxes: 'Ensembl Gene ID', 'Ensembl Transcript ID', 'Associated Gene Name', 'Ensembl Exon ID', 'Exon Chr Start (bp)', 'Exon Chr End (bp)', 'Exon Rank in Transcript'.
  7. Now select the 'Filters' section on the left.
  8. Select the 'Limit to genes...' option and select 'with HGNC ID(s)' from the pull down.
  9. Click the 'Results' button in the top left.
  10. Finally select the export options you want and hit 'Go'

This gives you a list of all exons for all Ensembl genes that can be associated with a HUGO gene name. The chromosome start and end position of each exon can be used to calculate the exon size. Screenshots follow below and you can download the actual result here: HumanExons_Ensembl_v65.tsv.zip (contains 1,109,487 exon records)
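
As a rough sketch of that last calculation: Ensembl coordinates are 1-based and inclusive, so an exon's size is end - start + 1. The helper name `add_exon_length` is hypothetical, and the column positions (start in column 5, end in column 6) assume the export keeps the attribute order listed in step 6; check the header of your own file before relying on them.

```shell
# Hypothetical helper: append an exon length column to a BioMart TSV export.
# Assumes a header row, with exon start in column 5 and exon end in column 6.
# Reads the file named as its argument, or stdin when no file is given.
add_exon_length() {
    awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 { print $0, "Exon Length (bp)"; next }   # extend the header row
    { print $0, $6 - $5 + 1 }                        # 1-based inclusive coordinates
    ' "$@"
}

# Made-up rows, only to show the shape of the output (lengths 100 and 50).
{
    printf 'Gene ID\tTranscript ID\tGene Name\tExon ID\tStart\tEnd\tRank\n'
    printf 'G1\tT1\tDEMO\tE1\t100\t199\t1\n'
    printf 'G1\tT1\tDEMO\tE2\t300\t349\t2\n'
} | add_exon_length
```

On the real export this would be called as `add_exon_length HumanExons_Ensembl_v65.tsv`, assuming the unzipped file carries that name.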


[Screenshots of the BioMart query pages: three images, not reproduced here]


+1 for full example with screenshots!


Thanks so much! I just saw your reply. I was expecting an email from BioStar and thought no one had replied to my query.

What is the meaning of "Exon Rank"? When I tried an example query in R, I saw exon ranks repeated for the same gene. So to get the length of all exons for a gene, should I just sum over all exons belonging to the gene (meaning all rows), or do I have to worry about exon ranks as well? Thanks again.


You're welcome. The exon rank just refers to the exon's position within the transcript. Since many genes have multiple transcripts, you will often see redundant ranks when grouping exons at the gene level. You should sum all exons for a transcript to get the transcript size. Because of the redundancies, you should not do that at the gene level. A slightly different strategy is needed to get the exon base count across all transcripts of a gene: merge overlapping exons into what some call an 'exon content block' or a 'squashed transcriptome', then sum their sizes.
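
A minimal sketch of that merge-then-sum idea, under stated assumptions: the input is a headerless, tab-separated file with gene name, exon start and exon end (1-based inclusive, as in the Ensembl export), and each gene sits on a single chromosome. The helper name `merged_exon_length` is hypothetical; dedicated tools exist for interval merging, and this only illustrates the logic.

```shell
# Hypothetical helper: non-redundant exon bases per gene.
# Input columns (tab-separated, no header): gene, exon start, exon end.
merged_exon_length() {
    # Sort by gene, then by start, so overlapping exons become adjacent rows.
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3n "$@" |
    awk 'BEGIN { FS = OFS = "\t" }
    # Print the finished gene, adding the block that is still open.
    function emit() { if (gene != "") print gene, total + (cur_end - cur_start + 1) }
    $1 != gene {                       # first exon of a new gene
        emit()
        gene = $1; total = 0; cur_start = $2 + 0; cur_end = $3 + 0
        next
    }
    $2 + 0 <= cur_end {                # overlaps the current block: extend it
        if ($3 + 0 > cur_end) cur_end = $3 + 0
        next
    }
    {                                  # gap: close the current block, open a new one
        total += cur_end - cur_start + 1
        cur_start = $2 + 0; cur_end = $3 + 0
    }
    END { emit() }'
}

# Made-up exons: two overlap (100-199 and 150-249), one stands alone (300-349).
printf 'A\t100\t199\nA\t300\t349\nA\t150\t249\n' | merged_exon_length
```

For the made-up gene A the overlapping pair collapses to 100-249 (150 bp), so the total is 200 bp, whereas a naive sum over all three rows would give 250.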


There is a 'CDS Length' box in the GENE section in the first picture. What is the difference between CDS length and the total exon length of a gene?


The total exon length includes those portions of the exons that are UTRs (untranslated regions). The CDS length includes only the subset of the exon sequence that is predicted to be coding/translated (i.e. the 5' and 3' UTRs are not included).


Thanks. I downloaded the cDNA sequences of C. elegans protein-coding genes from BioMart. I checked two genes and found that, for each, the cDNA length equals the total length of its exons. Can we get the total exon length by calculating the cDNA length?


I wonder if there is a method to obtain a mapping of RefSeq mRNA to CDS length.

I understand that with BioMart it is possible to query Ensembl Transcript and CDS Length by filtering on a list of HGNC IDs or RefSeq mRNAs. But in the Attributes section it is not possible to select both RefSeq mRNA and CDS Length. If I craft an XML or Perl query with RefSeq mRNA and CDS Length as attributes, the API throws an error: Attributes from multiple attribute pages are not allowed.

UCSC provides two datasets, ensGene and refGene, which map Ensembl Transcript ID/RefSeq mRNA/HGNC symbol to exon ranges. CDS length can be calculated by summing the lengths of the exon portions bounded by cdsStart and cdsEnd. However, the UCSC tables lack version numbers for both Ensembl Transcript IDs and RefSeq mRNAs. The tables themselves are not versioned either, and thus cannot be reliably and flexibly mapped to versions in the Ensembl database.
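
The cdsStart/cdsEnd calculation can be sketched as follows. This assumes a tab-separated input with five columns pulled from a UCSC gene table (name, cdsStart, cdsEnd, exonStarts, exonEnds), using UCSC's 0-based, half-open coordinates and comma-terminated exon lists. The helper name `cds_length` and the column order are assumptions for illustration, not part of any UCSC tool.

```shell
# Hypothetical helper: CDS length per transcript from UCSC-style columns.
# Input columns (tab-separated): name, cdsStart, cdsEnd, exonStarts, exonEnds.
cds_length() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        n = split($4, estarts, ",")
        split($5, eends, ",")
        total = 0
        for (i = 1; i <= n; i++) {
            if (estarts[i] == "") continue     # empty field after the trailing comma
            # Clip each exon to the [cdsStart, cdsEnd) window.
            s = (estarts[i] + 0 > $2 + 0) ? estarts[i] + 0 : $2 + 0
            e = (eends[i] + 0 < $3 + 0) ? eends[i] + 0 : $3 + 0
            if (e > s) total += e - s          # half-open, so no +1
        }
        print $1, total
    }' "$@"
}

# Made-up transcript: three 100 bp exons, CDS from 150 to 550.
printf 'NM_demo\t150\t550\t100,300,500,\t200,400,600,\n' | cds_length
```

In the made-up example the first and last exons contribute 50 bp each and the middle exon contributes all 100 bp, giving 200 bp of CDS out of 300 bp of exon.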

12.9 years ago
brentp 24k

You can use the UCSC MySQL database and awk to do this:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -P 3306 \
    -e "select chrom,txStart,txEnd,K.name,X.geneSymbol,strand,exonStarts,exonEnds \
     from knownGene as K,kgXref as X where  X.kgId=K.name limit 10;" \
| awk 'BEGIN { OFS = "\t"; FS = "\t"}
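# skip the header row printed by mysql
(NR > 1) {
    # exonStarts and exonEnds are comma-separated lists with a trailing comma;
    # split() clears the arrays itself and returns the number of fields
    n = split($7, astarts, /,/);
    split($8, aends, /,/);
    sizes=""
    for(i=1; i <= n; i++){
        # skip only the empty field after the trailing comma (a start of 0 is valid)
        if (astarts[i] == "") continue
        sizes=sizes""(aends[i] - astarts[i])","
    }
    # BED-like row: name is geneSymbol,transcriptId; last column is the exon lengths
    print $1,$2,$3,$5","$4,1,$6,substr(sizes, 1, length(sizes) - 1)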
}' > hg19.exon.lengths.bed

The last column will be the exon lengths.

Note that I added "limit 10" for testing.

You can change hg19 to your organism of choice.


Wonderful! Thanks so much! This also works beautifully. Sorry for the delayed response; I didn't know anyone had replied because no email was sent from BioStar.


For APC, for example, I get 5 rows like the following. Which one should I select to sum up the 7th column and get the total exon length for APC? The longest? Thanks.

chr5 112071116 112209835 APC,uc010jby.1 1 + 362,153,85,202,109,114,84,105,99,379,96,140,78,117,215,8687
chr5 112101454 112209835 APC,uc010jbz.1 1 + 67,153,85,202,109,114,84,105,186,99,379,96,140,78,117,215,8687


Those are the different transcripts/splicing variants for that gene. A name like uc010jby.1 identifies the transcript. It makes more sense to get exon lengths per transcript.
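
A small sketch of the per-transcript total, assuming the tab-separated layout written by the mysql/awk pipeline above (name in column 4, comma-separated exon lengths in column 7). The helper name `transcript_exon_total` is hypothetical.

```shell
# Hypothetical helper: total exon length per transcript row.
# Input: tab-separated rows with the name in column 4 and the
# comma-separated exon lengths in column 7.
transcript_exon_total() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        n = split($7, sizes, ",")
        total = 0
        for (i = 1; i <= n; i++) total += sizes[i]   # empty fields add 0
        print $4, total
    }' "$@"
}

# The first APC row quoted in the thread, re-typed with tabs.
printf 'chr5\t112071116\t112209835\tAPC,uc010jby.1\t1\t+\t362,153,85,202,109,114,84,105,99,379,96,140,78,117,215,8687\n' |
    transcript_exon_total
```

Run against the full output file (`transcript_exon_total hg19.exon.lengths.bed`), this gives one total per transcript, which leaves the choice of transcript per gene as a separate decision.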


OK, thanks. Then I also need to think about what makes sense with regard to what is being captured by Whole Exome for each gene.

12.9 years ago
Caddymob ★ 1.0k

An easy way without much coding would be to use the UCSC Table Browser and download the BED file for RefSeq or UCSC genes. This has all the exon start and stop coordinates. If you want to get fancy then yes, use the R package biomaRt, query on the gene names you want, and get the exon coordinates.


I realize I might have to just rely on UCSC as I need genes based on hg18. My initial thought is that exon lengths won't vary much if it's hg18 or hg19. But I don't know this for sure.

Any comments? Thanks again for your reply.


The totality of exon lengths should not vary much, but there are clear differences between hg18 and hg19. You can, however, still get hg18 with biomaRt if you use the archives. Check the docs, but listMarts(archive = TRUE) will show you the marts available in the archives.
