I currently have a database developed for comparative and functional genomics with expression microarray datasets, but I am moving to RNA-Seq. For the microarrays, obtaining all of the genes I needed was trivial: I just used Biopython to pull them from NCBI. The best place to start for RNA-Seq is less obvious to me.
The purpose of the database is to allow ortholog and functional annotation data to be mapped onto expression data. I plan on performing the assembly outside of the database and moving the assembled transcripts into it after completion. Last time I decided to keep only the array probes for genes that had RefSeq IDs, which kept the ambiguity down and reduced the amount of crap that was kept. With RNA-Seq the idea would be to maintain the same level of strictness but expand the coverage to include everything, not just what is on the arrays.
What would be the best source to obtain non-redundant RefSeqs for all of the available genes in a given organism, for both mRNA and protein? The database isn't for the assembly, so I'm not sure UCSC would be the best option. Again, I just want high-quality data that I can rBLAST my transcripts against. I've had issues in the past with using Biopython and similar tools to query NCBI directly for these: I've always ended up with a few hundred thousand entries, with problematic levels of redundancy and undesirable data. The other thing I have been considering is to just download the genomes of the animals and split them using Biopython, but this isn't always perfect either.
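To make the kind of strictness I mean concrete, here is a minimal sketch of a filter over an already-downloaded RefSeq FASTA file. It assumes that "high quality" can be approximated by keeping only curated accessions (`NM_` for mRNA, `NP_` for protein) and dropping model predictions such as `XM_`/`XP_`, and that redundancy means byte-identical sequences. The function names (`parse_fasta`, `accession`, `filter_curated`) are hypothetical helpers written for illustration, not part of Biopython or any NCBI tool.

```python
from typing import Iterable, Iterator, Tuple

# Assumption: curated RefSeq records are the ones worth keeping.
# NM_ = curated mRNA, NP_ = curated protein (XM_/XP_ are model predictions).
CURATED_PREFIXES = ("NM_", "NP_")


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession(header: str) -> str:
    """Extract the accession from a FASTA header.

    Handles plain headers ("NM_000001.1 description") and, as an
    assumption about older downloads, the legacy "gi|...|ref|ACC|" style.
    """
    tokens = header.split()
    if not tokens:
        return ""
    parts = tokens[0].split("|")
    if "ref" in parts:
        idx = parts.index("ref")
        if idx + 1 < len(parts):
            return parts[idx + 1]
    return tokens[0]


def filter_curated(
    records: Iterable[Tuple[str, str]],
    prefixes: Tuple[str, ...] = CURATED_PREFIXES,
) -> Iterator[Tuple[str, str]]:
    """Keep curated accessions only and drop exact duplicate sequences."""
    seen = set()
    for header, seq in records:
        acc = accession(header)
        if not acc.startswith(prefixes):
            continue  # predicted or non-RefSeq entry
        key = seq.upper()
        if key in seen:
            continue  # identical sequence already kept under another accession
        seen.add(key)
        yield acc, seq
```

Usage would be along the lines of `with open("refseq.fasta") as fh: kept = list(filter_curated(parse_fasta(fh)))`, where the file name is a placeholder. Note that this only removes exact duplicates; alternative isoforms of the same gene still come through as separate entries, which is part of the redundancy I am trying to avoid.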
Any suggestions?
So basically, if one wants to mine orthologs in a de novo transcriptome assembly, where should the data for the ortholog species come from, and which data would be best to obtain? At the moment I am debating between using BioMart and mining the data from chromosome GenBank (.gbk) files.