Species representation of the NCBI RefSeq for simulated reads

7.5 years ago

dabid • 0

I want to generate simulated reads from NCBI RefSeq using ART. Because the RefSeq database is so large and contains many similar genomes (even though it is non-redundant), I want to get one representative for every species in RefSeq (viral, bacteria, archaea, etc.). I will then use these species representatives to generate the simulated reads instead of using the whole NCBI database.

Any hints on how to find/get the species representatives in NCBI RefSeq?

Thanks.

dna ncbi genome simulated data • 1.6k views

ADD COMMENT • link updated 7.4 years ago by Biostar 20 • written 7.5 years ago by dabid • 0

And how would you select that one sequence (and have it represent a species)? What exactly are you trying to do by making this dataset?

ADD REPLY • link 7.5 years ago by GenoMax 147k

I want to generate a comprehensive set of simulated reads to benchmark a few metagenomic tools. But because NCBI RefSeq is very large (especially for bacteria, with more than 50000 genomes), I cannot use all of it. This is why I thought about taking only one genome from every species in RefSeq. That way, I reduce the number of genomes I use to simulate reads.

ADD REPLY • link 7.5 years ago by dabid • 0

Ah, you are planning to use a genome to generate representative reads (not one read per species as I mistakenly thought).

There are assembly summary files on NCBI's genome FTP site (e.g. this one is for RefSeq bacteria). You can get that file and pull out one representative genome (and its accession number). From there you can use the idea here to get the sequence.
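A rough sketch of the "pull out one representative genome" step, in case it helps. It assumes the assembly summary file is tab-separated, that its last `#` comment line holds the column names, and that columns named `assembly_accession`, `refseq_category`, `species_taxid`, `assembly_level` and `ftp_path` are present. The function names and the preference order (reference, then representative, then the most complete assembly level) are my own choices for illustration, not an NCBI rule.

```python
# Preference orders; the value spellings are assumptions about the file.
# A lower rank means the assembly is preferred.
CATEGORY_RANK = {"reference genome": 0, "representative genome": 1}
LEVEL_RANK = {"Complete Genome": 0, "Chromosome": 1, "Scaffold": 2, "Contig": 3}


def read_summary(lines):
    """Yield one dict per assembly from the lines of an assembly summary file."""
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # The column names are assumed to sit in a comment line.
            if "assembly_accession" in line:
                header = line.lstrip("#").strip().split("\t")
            continue
        if not line:
            continue
        if header is None:
            raise ValueError("no header line with column names found")
        yield dict(zip(header, line.split("\t")))


def rank(row):
    """Sort key: category first, then assembly level; unknown values go last."""
    return (
        CATEGORY_RANK.get(row.get("refseq_category", ""), 2),
        LEVEL_RANK.get(row.get("assembly_level", ""), 4),
    )


def pick_one_per_species(lines):
    """Return {species_taxid: best row}, keeping one assembly per species."""
    best = {}
    for row in read_summary(lines):
        key = row["species_taxid"]
        if key not in best or rank(row) < rank(best[key]):
            best[key] = row
    return best


# Example use with a locally downloaded file (file name is hypothetical):
# with open("assembly_summary.txt") as handle:
#     for taxid, row in pick_one_per_species(handle).items():
#         print(taxid, row["assembly_accession"], row["ftp_path"], sep="\t")
```

Ties are broken by file order (the first assembly seen wins), so if a species has several equally ranked assemblies you may want an extra criterion, such as the release date.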

ADD REPLY • link 7.5 years ago by GenoMax 147k

Yeah, I got the idea. (Actually, I found another link that did almost what I want to do.) Thank you so much!

ADD REPLY • link 7.5 years ago by dabid • 0
