Question

Filter the non-redundant NCBI database based on taxonomy

1

Entering edit mode

6.0 years ago

dieunelderilus ▴ 10

Please is is there any way to extract all the eukaryotic reads from the non-redundant NCBI database ? What would be the exact command line that would help me to do that ? or where can I find the non-redundant Eukaryotic database ?

Thanks in advance for your support

DD

genebank database ncbi • 4.8k views

ADD COMMENT • link updated 6.0 years ago by vkkodali_ncbi ★ 3.8k • written 6.0 years ago by dieunelderilus ▴ 10

shenwei356 · Answer 1 · 2018-12-16

3

Entering edit mode

6.0 years ago

shenwei356 8.7k

Making nr blastdb for specific taxids

ADD COMMENT • link 6.0 years ago by shenwei356 8.7k

0

Entering edit mode

This tutorial recursively gets all taxID's which belong to a higher level taxID. These would be needed to get all sequences under the top level taxID (e.g. eukaryota).

ADD REPLY • link 6.0 years ago by GenoMax 147k

0

Entering edit mode

Thanks for sharing the tutorial , I am following the discussion and read the tutorial to see if could filter the the nr based on taxonomy

DD

ADD REPLY • link 6.0 years ago by dieunelderilus ▴ 10

0

Entering edit mode

I am following to this tutorial to filter the non-redundant database but I get stock in step 3. You suggest to NOT directly download nr.gz from ncbi ftp, in which the FASTA headers are not well formatted. However the link you provide to download the pre-formatted nr database is not working.

Please where can get the nr database that is well formatted to wok with this tutorial ?

Thanks in advance

ADD REPLY • link 6.0 years ago by dieunelderilus ▴ 10

0

Entering edit mode

ftp://ftp.ncbi.nlm.nih.gov/blast/db

ADD REPLY • link 6.0 years ago by shenwei356 8.7k

0

Entering edit mode

I am getting a problem with taxonkit tutorial. I get stock in the step 3 (option 1). After retrieving FASTA sequences from pre-formated blastdb, the perl one-liner that used to unfold records having mulitple accessions output an empty nr.$id.fa.gz.

Please any suggestions ?

ADD REPLY • link updated 5.9 years ago by shenwei356 8.7k • written 5.9 years ago by dieunelderilus ▴ 10

0

Entering edit mode

so, is it 6656 or 6665?

ADD REPLY • link 5.9 years ago by shenwei356 8.7k

0

Entering edit mode

Sorry, it was a problem in editing the message. the id is 6656

ADD REPLY • link 5.9 years ago by dieunelderilus ▴ 10

0

Entering edit mode

What's your taxid ??? 6656 is just the example.

ADD REPLY • link 5.9 years ago by shenwei356 8.7k

0

Entering edit mode

I need to retrieve get a new database for Chlorophyta with taxid 3041. the perl one-liner script outputs an empty file for watever taxid I use . I could not even reproduce the tutorial. It could be trivial but I don't know why I don't get it.

thank you for your assistance

DD

ADD REPLY • link 5.9 years ago by dieunelderilus ▴ 10

0

Entering edit mode

please paste results of

wc -l 3041.taxid.txt 
wc -l 3041.acc.txt

I'll check it on monday.

ADD REPLY • link 5.9 years ago by shenwei356 8.7k

0

Entering edit mode

Thanks they are copied here :

wc -l 3041.acc.txt 
430402 3041.acc.txt

wc -l 3041.taxid.txt 
9872 3041.taxid.txt

Note that I could not reproduce the tutorial, the option 2 of the step three works but , but could not pass the stp4 (make blasted). You did not provide how to deal with this error in the case a user would se the the option 2 of the step3. I try try to remove duplicate reads, and suspect this is not the good option.

Thanks again for your assistance

DD

ADD REPLY • link 5.9 years ago by dieunelderilus ▴ 10

0

Entering edit mode

Well, I tried 3041, and every step worked fine.

$ wc -l $id.taxid.txt
9703 3041.taxid.txt

$ wc -l $id.acc.txt
383332 3041.acc.txt

$ pigz -dc nr.$id.fa.gz | grep '^>' -c
383332

$ time makeblastdb -parse_seqids -in nr.$id.fa -dbtype prot -out nr.$id
Building a new DB, current time: 12/22/2018 23:33:14
New DB name:   /home/shenwei/data/data/blastdb/nr.3041
New DB title:  nr.3041.fa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 383332 sequences in 8.6429 seconds.

real    0m8.806s
user    0m8.320s
sys     0m0.466s

I also add a faster way to get nr.$taxid.fa.gz from nr.fa.gz, in step 3 option 1.

ADD REPLY • link 5.9 years ago by shenwei356 8.7k

0

Entering edit mode

I can send nr.3041.fa.gz (70M) to you via email, if you give my address or write email to me.

ADD REPLY • link 5.9 years ago by shenwei356 8.7k

0

Entering edit mode

Thanks a lot! Please I would be happy if you can send the nr.$id.fa forme to me at derilus.dieunel@upr.edu or dieunelderilus@gmail.com.

But should keep testing to see where I am wrong, because this is a very interesting tools.

ADD REPLY • link 5.9 years ago by dieunelderilus ▴ 10

score 0 · Answer 2 · 2018-12-15

0

Entering edit mode

6.0 years ago

vkkodali_ncbi ★ 3.8k

If you don't mind using the -remote option, you can consider using the -entrez_query with txid2759[Organism].

ADD COMMENT • link 6.0 years ago by vkkodali_ncbi ★ 3.8k

0

Entering edit mode

~~I am curious as to how NCBI separates txid2759[Organism] sequence in nr? If I look for entries in a local copy of nr I don't find anything matching txid2579.~~

ADD REPLY • link 6.0 years ago by GenoMax 147k

1

Entering edit mode

You probably mean txid2759 not txid2579. I have myself not used blast from command line this way but it is possible that the tax IDs int he local copy are the species level tax IDs where as 2759 includes all eukaryotes.

ADD REPLY • link 6.0 years ago by vkkodali_ncbi ★ 3.8k

0

Entering edit mode

Yes, he/she may just give an invalid taxid. I also checked, and 2579 is not found in NCBI taxonomy database, but 2759 is right.

$ echo 2579 | taxonkit lineage 
2579
$ echo 2759 | taxonkit lineage 
2759    cellular organisms;Eukaryota

ADD REPLY • link 6.0 years ago by shenwei356 8.7k

0

Entering edit mode

My bad. Even though I wrote 2579 I was checking the right taxID.

With a local copy of nr, I see results from (573, a random taxID example)

blastdbcmd -db nr -entry all -outfmt %T,%a | awk -F "," '{OFS="\t"} {if ($1 == 573) print $1,$2}'

but the same search with 2759 generate zero results.

blastdbcmd -db nr -entry all -outfmt %T,%a | awk -F "," '{OFS="\t"} {if ($1 == 2759) print $1,$2}'

ADD REPLY • link 6.0 years ago by GenoMax 147k

1

Entering edit mode

That's why I wrote the tutorial, because most of the protein sequences are linked to species not higher taxonomy rank. You have to get all taxids of species belonging to the Eukayota.