Hello All,
Most of the identified bacterial genes are hypothetical. I want to build a database of any and all bacterial genes so I can do an RNA-seq simulation of a microbiome.
However, I have yet to find a pre-built database like this. The RefSeq database does include several RNA.fna files, but because it is curated, those sequences are only a small subset of what is available.
Can anyone advise on the best way to query a repository like GenBank for something like "all prokaryote, nucleotide, protein-encoding" sequences?
Any solution is good, even an inelegant one, such as downloading and parsing all the database files myself. I could really use guidance on how to approach the problem.
Thanks! This is exactly what I needed. NCBI had it all along. I also found a suggestion: the .ffn files contain only gene coding sequences.
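For anyone landing here later, this is a rough sketch of how the per-genome .ffn files could be merged into one FASTA file to use as a gene database. It assumes the .ffn files are already downloaded and uncompressed under a local directory and that they are plain FASTA. The function names (`read_fasta`, `merge_ffn`) and the paths in the usage line are my own placeholders, not anything NCBI provides.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain-text FASTA file."""
    header = None
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts, so emit the previous one first.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None:
                chunks.append(line)
    # Emit the last record in the file.
    if header is not None:
        yield header, "".join(chunks)


def merge_ffn(input_dir, output_path):
    """Concatenate every *.ffn file under input_dir into one FASTA file.

    Returns the number of gene records written. Records with an empty
    sequence are skipped. The output file should not itself end in .ffn
    inside input_dir, or it could be picked up as input.
    """
    count = 0
    with open(output_path, "w") as out:
        for ffn in sorted(Path(input_dir).rglob("*.ffn")):
            for header, seq in read_fasta(ffn):
                if not seq:
                    continue
                out.write(f">{header}\n{seq}\n")
                count += 1
    return count
```

Usage would be something like `merge_ffn("bacteria_genomes", "all_bacterial_genes.fasta")`, with both paths being hypothetical. The merged file can then be handed to whatever read simulator you are using. Identical genes from different strains are not collapsed here, so de-duplication would be a separate step if you need it.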