Easy way to fetch exons from with Exon stable ID / annotation question
1
1
Entering edit mode
6.5 years ago
zeelo ▴ 10

Hi BioStars,

I have a quick question. I was wondering about a straight-forward command-line way to fetch specific exon sequence data from published genomes, and not the entire gene. E.g., from the Honeybee genome, I want to fetch GB53581-RA-E5, GB52073-RA-E3, GB52625-RA-E6. I found a way for the entire genes but I only want the exons. Is there also a command-line way to identify (fetch?) the orthologous exons/genes from other organisms? Lastly, is there a way to programmatically associate the IDs from the honeybee annotation (say GB52073) with the NCBI Gene ID? (=410059).

Any help is greatly appreciated!

Zeelo

data mining genome exons • 1.9k views
ADD COMMENT
1
Entering edit mode

The first thing I would try is to find the gtf file and filter it based on the 3rd column for "exon" features:

zcat organism.gtf.gz | awk '$3=="exon"'

With this list of exon features you can grep for certain genes, then additionally grep for different exon numbers. The 3rd and 4th columns give coordinates for the exon which you can use to fetch the sequences.

One way to find orthologous genes would be to do a command line blast against the other organism.

ADD REPLY
0
Entering edit mode

Do you consider R biomaRt a straight-forward command-line way?

ADD REPLY
0
Entering edit mode

Hi b.nota,

Thanks for the hint. I explored the package and tried to figure out how it works. However, I couldn't really figure out how to add the ensembl metazoa mart to the package. I guess I must miss something obvious.

Z

ADD REPLY
1
Entering edit mode
6.5 years ago
Benn 8.3k

A small example with biomaRt in R.

library("biomaRt")

listMarts(host="metazoa.ensembl.org")
             biomart                       version
1       metazoa_mart      Ensembl Metazoa Genes 39
2 metazoa_variations Ensembl Metazoa Variations 39

ensembl <- useMart("metazoa_mart", dataset="amellifera_eg_gene", host="metazoa.ensembl.org")

filters <- listFilters(ensembl)
# type filters to see all

exon_seq <- getBM(attributes=c("ensembl_exon_id", "chromosome_name", "gene_exon"), filters = "ensembl_exon_id", values="GB53581-RA-E5", mart=ensembl, bmHeader=TRUE)
exon_seq
  Chromosome/scaffold name
1                        6
                                                                                                                                                                                                                                                                                                                                                              Exon sequences
1 GATTAATGTAAAATGTTACAAACAAATACATTCATTAACAAGTAAAACATCATTATAATTTGAATTTAATAATAAATACAAAATAATTTTATGAATTTTATGTATTTTACTAATTTTAATATTTTACAAATCCAGCCATTAACATGCTGAAGAGTACAAGTCTTTAATTGTGCTTTCTTTTCACCTAATGAACAACTTCCACCTATGCAATGTCTCCATCTTGTTAGTTTCCCTATACCACAAGTAACACTACAACTTGACCAGTTTCCCCATTGATCCCATATTTTTGGTCCTTGACGACTCATTCTATCTATTATTAAATTATTTGATTTCTAAATAAAAATATTATTACTGGAAGGGTT
  Exon stable ID
1  GB53581-RA-E5

You can of course remove the "chromosome_name", it was just to illustrate that you can add more attributes.

ADD COMMENT
0
Entering edit mode

Thanks a lot, this is great!

ADD REPLY
0
Entering edit mode

I am glad it can help you! If you want orthologs or other IDs check the attributes with listAttributes(ensembl), and make a new query with those attributes.

ADD REPLY

Login before adding your answer.

Traffic: 2558 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6