Hello everybody!!
I need to analyze the exon and intron sequences of many organisms.
My question is: what is the most efficient way to retrieve all of those sequences in FASTA format?
In other words: which database provides accessible exon and intron sequences?
I thought of downloading the GFF files for all the organisms, filtering for exons and introns, and then extracting their sequences with bedtools getfasta.
But this process requires downloading many genomes and does not seem efficient.
Any suggestions?
How many genomes are you working with? Will you be downloading all intron and exon sequences for each genome? If so, where are you getting the coordinates from? For RefSeq data, you can download the GFF3 files, parse them for intron and exon coordinates, and use EDirect to download the sequences in FASTA format. But EDirect would be an inefficient way to do this if you want the sequences for all of the introns and exons. In that case, downloading the entire genome sequence in FASTA format to disk first would be much more efficient.
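A rough sketch of the parsing step, in case it helps. It assumes the exon features in the GFF3 file carry a Parent= attribute naming their transcript, and it derives introns as the gaps between consecutive exons of the same transcript, so it does not need intron features to be present in the file. The function names here are made up for illustration, and I have not run this against real annotation files.

```python
from collections import defaultdict


def parse_parent(attributes):
    """Return the transcript IDs listed in the Parent= attribute (GFF3 column 9)."""
    for field in attributes.split(";"):
        if field.startswith("Parent="):
            return field[len("Parent="):].split(",")
    return []


def exons_and_introns(gff3_lines):
    """Yield BED-style tuples (chrom, start, end, name, score, strand).

    Exons are read from the file; introns are inferred as the gaps
    between consecutive exons of the same transcript (an assumption,
    not something the annotation states explicitly).
    """
    exons = defaultdict(list)
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        chrom, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        for transcript in parse_parent(cols[8]):
            exons[(transcript, chrom, strand)].append((start, end))

    for (transcript, chrom, strand), spans in exons.items():
        spans.sort()
        for i, (start, end) in enumerate(spans):
            # GFF3 is 1-based inclusive; BED is 0-based half-open.
            yield (chrom, start - 1, end, f"{transcript}|exon", ".", strand)
            if i + 1 < len(spans):
                next_start = spans[i + 1][0]
                if next_start - 1 > end:
                    # Intron covers 1-based end+1 .. next_start-1.
                    yield (chrom, end, next_start - 1,
                           f"{transcript}|intron", ".", strand)
```

Write each tuple as a tab-separated line to a BED file, then something like `bedtools getfasta -fi genome.fa -bed features.bed -s -name -fo features.fa` should give strand-aware FASTA records. Note that exons and introns shared by several transcripts will appear once per transcript, so you may want to deduplicate by coordinates first.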
I am working with 80 organisms from the Ensembl database, so downloading their full genomes is not practical. Besides, their GFF3/GTF files contain no intron features at all ):
Prokaryotes?
Of course not. I am looking only at mammals.