7.5 years ago
dabid
I want to generate simulated reads from NCBI RefSeq using ART. Because RefSeq is so large and contains many similar genomes (even though it is non-redundant), I want to pick one representative genome for every species in RefSeq (viral, bacterial, archaeal, etc.). I would then use these species representatives to generate the simulated reads instead of using the whole database.
Any hints on how to find the species representatives in NCBI RefSeq?
Thanks.
And how would you select that one sequence (and have it represent a species)? What exactly are you trying to do by making this dataset?
I want to build a comprehensive set of simulated reads to benchmark a few metagenomic tools. But NCBI RefSeq is huge (for bacteria alone, more than 50,000 genomes), so I cannot use all of it. This is why I thought about taking only one genome from every species in RefSeq. That way, I reduce the number of genomes I use to simulate reads.
Ah, you are planning to use a genome to generate representative reads (not one read per species as I mistakenly thought).
There are assembly summary files on NCBI's genome FTP site (e.g. this one is for RefSeq bacteria). You can get that file and pull out one representative genome per species (and its accession number). From there you can use the idea here to get the sequence.
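If it helps, here is a rough sketch of the "pull out one genome per species" step in Python. It assumes the summary file is tab-separated, that its header is a `#` comment line containing `assembly_accession`, and that it has `species_taxid`, `refseq_category` and `assembly_level` columns with the values listed in the two rank tables; please check those against the file you actually download. The function name and the tie-breaking order (reference/representative first, then most complete assembly) are my own choices, not anything NCBI prescribes.

```python
# Lower rank = preferred. These strings are assumptions about the
# vocabulary used in the assembly summary file; verify them first.
CATEGORY_RANK = {"reference genome": 0, "representative genome": 1}
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}


def pick_one_per_species(lines):
    """Return {species_taxid: row_dict}, keeping the best-ranked assembly."""
    header = None
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # The column names are expected on a commented header line.
            fields = line.lstrip("#").strip().split("\t")
            if "assembly_accession" in fields:
                header = fields
            continue
        if header is None or not line:
            continue
        row = dict(zip(header, line.split("\t")))
        # Unknown categories/levels sort after every known one.
        key = (
            CATEGORY_RANK.get(row.get("refseq_category", ""), 2),
            LEVEL_RANK.get(row.get("assembly_level", ""), 4),
        )
        species = row["species_taxid"]
        # Keep the first row seen for a species unless a later one ranks better.
        if species not in best or key < best[species][0]:
            best[species] = (key, row)
    return {species: row for species, (_, row) in best.items()}
```

You would call it on an open file handle, e.g. `with open("assembly_summary.txt") as fh: picked = pick_one_per_species(fh)` (the local filename is just an example). Each kept row still carries all the other columns, so the accession, and the FTP path if the file has one, are available for downloading the sequence afterwards.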
Yeah, I got the idea. (Actually, I found another link that did almost what I want to do.) Thank you so much!