Hi BioStars,
I have a quick question. I was wondering whether there is a straightforward command-line way to fetch the sequences of specific exons from published genomes, rather than the entire gene. E.g., from the honeybee genome, I want to fetch GB53581-RA-E5, GB52073-RA-E3, GB52625-RA-E6. I found a way to do this for entire genes, but I only want the exons. Is there also a command-line way to identify (fetch?) the orthologous exons/genes from other organisms? Lastly, is there a way to programmatically associate the IDs from the honeybee annotation (say GB52073) with the NCBI Gene ID (here 410059)?
Any help is greatly appreciated!
Zeelo
The first thing I would try is to find the GTF file and filter it on the 3rd column, keeping only the "exon" features.
With this list of exon features you can grep for certain genes, then additionally grep for different exon numbers. The 4th and 5th columns give the start and end coordinates of the exon (the 3rd column is the feature type), which you can use to fetch the sequences.
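The filtering and lookup described above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the sample lines, scaffold name, and coordinates are made up, and it assumes the annotation stores quoted `transcript_id` and `exon_number` attributes in the 9th column, which may not match the actual honeybee GTF. The function names `find_exon` and `fetch_sequence` are hypothetical helpers, not part of any library.

```python
import re

def gtf_line(seqname, feature, start, end, strand, attrs):
    """Build one tab-separated GTF line (used only for the made-up sample data)."""
    return "\t".join([seqname, "example", feature, str(start), str(end),
                      ".", strand, ".", attrs])

# Made-up sample records; real coordinates and attribute names will differ.
SAMPLE_GTF = [
    gtf_line("Group1.1", "gene", 100, 900, "+", 'gene_id "GB52073";'),
    gtf_line("Group1.1", "exon", 100, 200, "+",
             'gene_id "GB52073"; transcript_id "GB52073-RA"; exon_number "1";'),
    gtf_line("Group1.1", "exon", 400, 520, "+",
             'gene_id "GB52073"; transcript_id "GB52073-RA"; exon_number "2";'),
    gtf_line("Group1.1", "exon", 700, 900, "+",
             'gene_id "GB52073"; transcript_id "GB52073-RA"; exon_number "3";'),
]

# Matches key "value"; pairs in the attribute column (assumes quoted values).
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

def find_exon(lines, transcript_id, exon_number):
    """Return (seqname, start, end, strand) for one exon, or None if not found."""
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Column 3 is the feature type; keep only exon records.
        if len(cols) < 9 or cols[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        if (attrs.get("transcript_id") == transcript_id
                and attrs.get("exon_number") == str(exon_number)):
            # Columns 4 and 5 are the 1-based, inclusive start and end.
            return cols[0], int(cols[3]), int(cols[4]), cols[6]
    return None

def fetch_sequence(genome, seqname, start, end, strand):
    """Slice a 1-based inclusive interval out of a {name: sequence} dict."""
    seq = genome[seqname][start - 1:end]
    if strand == "-":
        # Reverse-complement exons annotated on the minus strand.
        seq = seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]
    return seq

print(find_exon(SAMPLE_GTF, "GB52073-RA", 3))
```

For a real genome you would read the GTF from disk and load the FASTA into the `genome` dict (or use an indexed FASTA tool) instead of the toy data here.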
One way to find orthologous genes would be to run a command-line BLAST search with your sequences against the other organism.
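If you go the BLAST route, a small Python sketch like the one below could pick the top-scoring hit per query from tabular output. It assumes the 12-column tabular format that BLAST+ writes with `-outfmt 6` (query ID in column 1, subject ID in column 2, bit score in column 12). The IDs in the sample are made up, and `best_hits` is a hypothetical helper. Note that a best hit is only a candidate ortholog, not proof of orthology.

```python
def best_hits(tabular_lines):
    """Map each query ID to its (subject ID, bit score) with the highest bit score."""
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        # Keep the first hit seen, then replace it only with a strictly better one.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best

# Made-up tabular hits: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore
SAMPLE_HITS = [
    "\t".join(["GB52073-RA-E3", "subjA", "88.0", "120", "14", "0",
               "1", "120", "501", "620", "1e-30", "130.0"]),
    "\t".join(["GB52073-RA-E3", "subjB", "92.5", "120", "9", "0",
               "1", "120", "77", "196", "1e-40", "165.0"]),
    "\t".join(["GB52625-RA-E6", "subjC", "81.0", "90", "17", "0",
               "1", "90", "10", "99", "1e-12", "70.5"]),
]

for query, (subject, score) in best_hits(SAMPLE_HITS).items():
    print(query, subject, score)
```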
Do you consider the R package biomaRt a straightforward command-line way?
Hi b.nota,
Thanks for the hint. I explored the package and tried to figure out how it works. However, I couldn't work out how to add the Ensembl Metazoa mart to the package. I guess I must be missing something obvious.
Z