Question

Change NCBI fasta file headers to makeblastdb format

0

6.6 years ago

chland • 0

Hi, I’ve downloaded several assemblies from RefSeq and will be generating a custom database using the makeblastdb command. The file headers need to be in a specific format to subsequently fetch individual sequences from blastn_out search results using blastdbcmd.

The headers for all fasta and multifasta files need to be formatted something like this:

>gnl|uniqID|seq1

The assembly files downloaded from the NCBI assembly site have headers like this:

>NZ_BP006234.1 Organism name strain AC2-110 genome
>NZ_BP006234.1 Organism name strain AC2-110 plasmid pxxx5, complete sequence

or like this:

>NZ_BB45345435.1 Organism name 73645 n_819_l_244_c_44.200821, whole genome shotgun sequence
>NZ_BB45345435.1 Organism name 73645  n_773_l_201_c_51.631840, whole genome shotgun sequence

They should look like this:

>NZ_BP006234.1|seq1
>NZ_BP006234.1|seq2
>NZ_BP006234.1|seq3

I think sed could be used here, but I don't know what expression to give it so that it strips the unwanted part of each header and keeps the part I need. Thank you for helping.
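One possible way to produce headers of that shape is sketched below. It uses awk rather than sed, because appending a running counter is awkward in plain sed. The file names example.fna and renamed.fna are placeholders, and the two records reuse the illustrative headers from this post with made-up sequence lines. Whether makeblastdb -parse_seqids accepts an identifier of the form accession|seqN is not checked here.

```shell
# Write a tiny example FASTA using the header style shown in the question.
cat > example.fna <<'EOF'
>NZ_BP006234.1 Organism name strain AC2-110 genome
AAACCTCGGCCC
>NZ_BP006234.1 Organism name strain AC2-110 plasmid pxxx5, complete sequence
GGGTTTAAACCC
EOF

# On header lines, keep only the first whitespace-delimited token
# (">" plus the accession) and append "|seq" and a running counter.
# All other lines (the sequence) are printed unchanged.
awk '/^>/ { n++; print $1 "|seq" n; next } { print }' example.fna > renamed.fna

cat renamed.fna
```

The counter restarts for every awk invocation, so running this once per file gives seq1, seq2, ... within each file.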

assembly blast fasta headers format • 2.9k views

updated 6.6 years ago by h.mon 35k • written 6.6 years ago by chland • 0

1

What have you tried so far?

Reply • 6.6 years ago by Joe 21k

0

What does your makeblastdb command line look like? And while you're at it, can you also post part of your FASTA file?

Reply • 6.6 years ago by lieven.sterck 15k

0

makeblastdb itself isn't the issue, as I can get that to run. The problem is that -parse_seqids isn't parsing the files correctly. After much review of the NCBI BLAST documentation, it appears that the headers are the issue: they're a mess when downloaded from the NCBI assemblies site. I need to get rid of the spaces and the long names so that the files are parsed correctly for blastdbcmd. As for the FASTA, it's a straightforward fasta or multifasta file, e.g. AAACCTCGGCCC, with lengths ranging from 200 bp up to whole-genome assemblies of ~4 Mb.
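For the narrower goal of dropping the spaces and long names, a sed sketch could look like the following. It assumes the identifier is everything up to the first whitespace on a header line; messy.fna and clean.fna are placeholder file names, and the record reuses an example header from the question.

```shell
# Example record with a long, space-containing description.
cat > messy.fna <<'EOF'
>NZ_BB45345435.1 Organism name 73645 n_819_l_244_c_44.200821, whole genome shotgun sequence
AAACCTCGGCCC
EOF

# Only on header lines (starting with ">"), delete everything from the
# first whitespace character to the end of the line.
sed '/^>/ s/[[:space:]].*$//' messy.fna > clean.fna

cat clean.fna
```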

Reply • 6.6 years ago by chland • 0

0

Which BLAST version are you running? I have never had any trouble building BLAST databases from FASTA files with headers like the ones in your post.

Reply • 6.6 years ago by lieven.sterck 15k

score 0 · Answer 1 · 2018-04-24

I have no problem with the following:

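# Download a RefSeq assembly with its original NCBI headers
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/820/525/GCF_000820525.2_SMSRO_2016/GCF_000820525.2_SMSRO_2016_genomic.fna.gz
# Decompress the FASTA file
gunzip GCF_000820525.2_SMSRO_2016_genomic.fna.gz
# Build a nucleotide database, parsing the sequence IDs from the headers
makeblastdb -dbtype nucl -in GCF_000820525.2_SMSRO_2016_genomic.fna -out S.poulsonii -parse_seqids
# Retrieve a single sequence by its accession
blastdbcmd -db S.poulsonii -entry NZ_JTLV02000002.1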

Are the accessions you gave as examples real? I can't find them. If they are made up, they don't make a good reproducible example.
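If the real files do repeat an accession the way the example headers in the question do, that can be checked before building the database, since duplicate identifiers are one plausible reason for -parse_seqids to misbehave. A minimal sketch, assuming the identifier is the first whitespace-delimited token of each header; the file names, the sequence lines, and the third record's description are placeholders.

```shell
# Example input: two records share an accession, one is unique.
cat > dup_check.fna <<'EOF'
>NZ_BP006234.1 Organism name strain AC2-110 genome
AAACCTCGGCCC
>NZ_BP006234.1 Organism name strain AC2-110 plasmid pxxx5, complete sequence
GGGTTTAAACCC
>NZ_JTLV02000002.1 placeholder description
ACGT
EOF

# Take the first token of each header, drop the leading ">",
# then list every identifier that occurs more than once.
grep '^>' dup_check.fna \
  | awk '{ print substr($1, 2) }' \
  | sort \
  | uniq -d > duplicate_ids.txt

cat duplicate_ids.txt
```

An empty duplicate_ids.txt means every identifier in the file is unique.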