Hi, I’ve downloaded several assemblies from RefSeq and will be generating a custom database using the makeblastdb command. The file headers need to be in a specific format to subsequently fetch individual sequences from blastn_out search results using blastdbcmd.
The headers for all FASTA and multi-FASTA files need to be formatted something like this:
>gnl|uniqID|seq1
The assembly files downloaded from the NCBI Assembly site have headers like this:
>NZ_BP006234.1 Organism name strain AC2-110 genome
>NZ_BP006234.1 Organism name strain AC2-110 plasmid pxxx5, complete sequence
or like this:
>NZ_BB45345435.1 Organism name 73645 n_819_l_244_c_44.200821, whole genome shotgun sequence
>NZ_BB45345435.1 Organism name 73645 n_773_l_201_c_51.631840, whole genome shotgun sequence
They should look like this:
>NZ_BP006234.1|seq1
>NZ_BP006234.1|seq2
>NZ_BP006234.1|seq3
I think that the sed command can be used in this case but don't know what to provide so that it removes part and keeps the correct part of the string. Thank you for helping.
What have you tried so far?
What does your makeblastdb command line look like? And while you're at it, can you also post part of your FASTA file?
makeblastdb itself isn't the issue, as I can get that to run. It's that -parse_seqids isn't parsing the files correctly. After much review of the NCBI BLAST documentation, it appears that the issue is the headers: they're a mess when downloaded from the NCBI Assembly site. I need to get rid of the spaces and the long names so that the files are parsed correctly for use with blastdbcmd. As for the FASTA, it's a straightforward FASTA or multi-FASTA file (e.g. AAACCTCGGCCC), with lengths ranging from 200 bp to whole-genome assemblies of ~4 MB.
Which BLAST version are you running? I have never had any trouble building BLAST databases from FASTA files with headers like the ones in your post.
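For the header rewrite asked about in the original post, awk is probably easier than sed because it can keep a running counter. Below is a minimal sketch, not a tested recipe for your data: the function name rename_fasta_headers is made up for illustration, it assumes the accession is the first whitespace-delimited token of each header, and it assumes the seqN suffix should simply count records in the order they appear in the input. Whether makeblastdb -parse_seqids accepts IDs of this form is not checked here.

```shell
# Hypothetical helper: rewrite FASTA headers to ">ACCESSION|seqN".
# Assumption: the accession is the first whitespace-delimited token of the
# header, and N is a running counter over all records in the input.
rename_fasta_headers() {
    awk '
        /^>/ {
            n++                    # number of header lines seen so far
            print $1 "|seq" n      # $1 is ">ACCESSION"; the description is dropped
            next
        }
        { print }                  # sequence lines pass through unchanged
    ' "$@"
}
```

Usage would be something like `rename_fasta_headers assembly.fna > assembly.renamed.fna` (file names are placeholders). With no file argument the function reads from standard input. If several files are passed at once the counter keeps running across them, so run it once per file if you want numbering to restart at seq1.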