Hi all,
I am constructing a rice pangenome using an iterative assembly approach. In brief, I mapped the reads of my sample to the Nipponbare reference, extracted the unmapped reads, and assembled them into novel contigs with MaSuRCA. Now I want to detect contamination (non-green-plant species and fungi) in these new contigs with a BLAST search against the NCBI nt database. I checked the NCBI nt database and found many nt files, including:
- nt_euk: eukaryotic sequences
- nt_prok: prokaryotic sequences
- nt_viruses: viral sequences
- nt_others: other sequences
- nt.000.tar.gz to nt.124.tar.gz (I don't know which species these files cover)
If I want to build a BLAST database of non-green-plant species and fungi to detect contamination, which nt files should I download (nt_euk and which others)?
Thank you.
Depending on how serious the OP is about removing contamination, an assembly-based approach may only go so far. While there are many genomes in NCBI, I don't know what percentage of them fall into the category of "practically complete" (i.e., finishing them any further would require so much work that it is not worth it).
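
For the screening step described in the question, here is a minimal Python sketch of one way to flag suspect contigs once the BLAST search is done. It assumes the search was run with a tabular output whose columns are qseqid, sseqid, pident, length, evalue, bitscore and staxids, in that order; adjust `COLUMNS` if your output differs. The `plant_taxids` set is a placeholder: in practice it would have to be built from the NCBI taxonomy (everything under Viridiplantae), which this sketch does not do. The function names are made up for illustration.

```python
import csv
import io

# Assumed column order; it must match the tabular format used for the search.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "evalue", "bitscore", "staxids"]


def best_hits(handle):
    """Return {contig: row}, keeping the highest-bitscore hit per contig."""
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not fields or fields[0].startswith("#"):
            continue
        row = dict(zip(COLUMNS, fields))
        row["bitscore"] = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


def flag_contaminants(best, plant_taxids):
    """Return sorted contig names whose best hit is outside plant_taxids."""
    flagged = []
    for contig, row in best.items():
        # The staxids field may hold several IDs separated by ';'.
        # Hits without a numeric taxid end up flagged as well.
        taxids = {int(t) for t in row["staxids"].split(";") if t.isdigit()}
        if not taxids & plant_taxids:
            flagged.append(contig)
    return sorted(flagged)


if __name__ == "__main__":
    # Made-up hits for illustration only, not real BLAST results.
    example = (
        "contig1\tsubjA\t98.5\t1200\t0.0\t2100\t39947\n"
        "contig1\tsubjB\t90.0\t800\t1e-50\t900\t562\n"
        "contig2\tsubjC\t99.0\t1500\t0.0\t2700\t562\n"
        "contig3\tsubjD\t97.0\t1000\t0.0\t1800\t4932\n"
    )
    plant_taxids = {39947, 4530}  # placeholder set, far from complete
    hits = best_hits(io.StringIO(example))
    print(flag_contaminants(hits, plant_taxids))
```

Note that contigs with no BLAST hit at all never appear in the table, so they are neither kept nor flagged here and need to be handled separately. Judging a contig by its single best hit is also a simplification; stricter filters on identity or aligned length may be worth adding.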