Removing duplicate seq_ids in database


2.5 years ago

K • 0

Hello,

I am trying to build a BLAST database from the genomes of different E. coli strains. I realized that if I remove the duplicate seq_ids, the sequence similarity search I want to run for my query sequences would no longer tell me accurately which strains a sequence is conserved in. Should I rename the duplicate seq_ids instead of removing them?

make database

nohup makeblastdb -in all.fasta -dbtype prot -out 20220606_DB -parse_seqids > nohup_out.txt &

--> BLAST Database creation error: Error: Duplicate seq_ids are found: LCL|WP_000002542.1

remove duplicates

seqkit rmdup all.fasta > clean_all.fasta

--> [INFO] 9143515 duplicated records removed

seqkit duplicate seq_ids makeblastdb • 1.6k views


The simplest fix is to run seqkit rename all.fasta -o all.rename.fasta, which makes the IDs unique by appending a numeric suffix to repeated ones. The IDs would then look like LCL|WP_000002542.1 and LCL|WP_000002542.1_2.
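
If seqkit is not at hand, the same suffix scheme can be roughly sketched in plain shell with awk. This is only an illustration and not how seqkit does it internally: rename_dup_ids is a made-up helper name, the ID is assumed to be the header text up to the first space or tab, and the sketch does not check whether a suffixed ID such as X_2 already exists in the file.

```shell
# Hypothetical helper (not part of seqkit): make FASTA IDs unique by
# appending _2, _3, ... to the second, third, ... occurrence of an ID.
# Assumption: the ID is the header text up to the first space or tab.
rename_dup_ids() {
    awk '
        /^>/ {
            header = substr($0, 2)      # drop the leading ">"
            id = header
            rest = ""
            if (match(header, /[ \t]/)) {
                id = substr(header, 1, RSTART - 1)   # first token
                rest = substr(header, RSTART)        # description, kept as-is
            }
            n = ++seen[id]              # how often this ID has appeared so far
            if (n > 1) id = id "_" n    # first occurrence stays unchanged
            print ">" id rest
            next
        }
        { print }                       # sequence lines pass through untouched
    ' "$1"
}
```

It would be called as rename_dup_ids all.fasta > all.rename.fasta, after which the makeblastdb command from the question can be rerun with all.rename.fasta as the input.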