Hello,
I am trying to create a database from the genomes of different E. coli strains. I realized that if I remove the duplicate seq_ids, the sequence similarity search I want to run for my query sequences won't accurately tell me which strains a sequence is conserved in. Instead of removing the duplicate seq_ids, should I rename them?
Make database:
nohup makeblastdb -in all.fasta -dbtype prot -out 20220606_DB -parse_seqids > nohup_out.txt &
--> BLAST Database creation error: Error: Duplicate seq_ids are found: LCL|WP_000002542.1
Remove duplicates:
seqkit rmdup all.fasta > clean_all.fasta
--> [INFO] 9143515 duplicated records removed
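Before removing or renaming anything, it may help to see which IDs are actually repeated. A minimal sketch using only standard tools, assuming the ID is the first word of each header line; the function name `list_dup_ids` is just an illustration, not part of seqkit or BLAST:

```shell
# Print every sequence ID that occurs more than once in a FASTA file.
# Assumption: the ID is the first whitespace-delimited word after ">".
list_dup_ids() {
    grep '^>' "$1" | sed 's/^>//' | awk '{print $1}' | sort | uniq -d
}

# Example: list_dup_ids all.fasta | head
```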
The simplest option is to use
seqkit rename all.fasta -o all.rename.fasta
to make the IDs unique. The IDs would then still be something like LCL|WP_000002542.1 and LCL|WP_000002542.1_2.
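To illustrate the renaming idea, here is a rough awk sketch that leaves the first occurrence of an ID unchanged and appends _2, _3, ... to later ones. It only mimics the behaviour described above and is not seqkit's implementation; the function name `rename_dup_ids` is hypothetical, and it assumes the ID is the first word of the header.

```shell
# Append _2, _3, ... to repeated FASTA IDs; the first occurrence is kept as-is.
# Assumption: the ID is the first whitespace-delimited word after ">".
rename_dup_ids() {
    awk '
        /^>/ {
            id = substr($1, 2)      # ID without the leading ">"
            n = ++seen[id]          # number of times this ID has been seen so far
            if (n > 1) {
                # Replace only the ID and keep the rest of the header (description).
                $0 = ">" id "_" n substr($0, length($1) + 1)
            }
        }
        { print }                   # sequence lines pass through unchanged
    ' "$1"
}

# Example: rename_dup_ids all.fasta > all.rename.fasta
```

Note that this sketch does not check whether a generated ID such as X_2 already exists in the file as an original ID.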
Thank you!