Removing duplicate seq_ids in database
2.5 years ago
K • 0

Hello,

I am trying to build a BLAST database from the genomes of different E. coli strains. I realized that if I remove the duplicate seq_ids, the sequence similarity search I want to run for my query sequences won't accurately tell me which strains a sequence is conserved in. Should I rename the duplicate seq_ids instead of removing them?

make database

nohup makeblastdb -in all.fasta -dbtype prot -out 20220606_DB -parse_seqids > nohup_out.txt &

--> BLAST Database creation error: Error: Duplicate seq_ids are found: LCL|WP_000002542.1

remove duplicates

seqkit rmdup all.fasta > clean_all.fasta

--> [INFO] 9143515 duplicated records removed

seqkit duplicate seq_ids makeblastdb • 1.6k views

The simplest fix is to run seqkit rename all.fasta -o all.rename.fasta to make the IDs unique. The IDs would then look like LCL|WP_000002542.1 and LCL|WP_000002542.1_2.
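To illustrate what this kind of renaming does, here is a minimal sketch in plain awk. It is not seqkit's implementation, only an imitation of the behaviour described above: the first copy of an ID is left alone and later copies get _2, _3, and so on. The function name rename_dup_ids and the three-record FASTA are made up for the demonstration.

```shell
# Hypothetical helper (not part of seqkit): append _2, _3, ... to repeated
# FASTA IDs, leaving the first occurrence of each ID unchanged.
rename_dup_ids() {
  awk '
    /^>/ {
      id = substr($1, 2)                  # ID = header text up to the first whitespace
      rest = substr($0, length($1) + 1)   # description after the ID, if any
      n = ++seen[id]                      # how many times this ID has appeared so far
      if (n > 1) id = id "_" n
      print ">" id rest
      next
    }
    { print }                             # sequence lines pass through untouched
  ' "$1"
}

# Tiny made-up input in which the same ID appears twice.
demo=$(mktemp)
printf '>WP_000002542.1 strain A\nMKT\n>WP_000002542.1 strain B\nMKT\n>WP_000000001.1\nMAA\n' > "$demo"
rename_dup_ids "$demo"
```

On the sample input, the second WP_000002542.1 record should come out as WP_000002542.1_2 while every sequence is kept. A file renamed this way (or by seqkit rename) could then be given to makeblastdb in place of all.fasta.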


Thank you!
