Question

How to remove space in headers of fasta files

0

9.9 years ago

Crystal ▴ 70

Hi,

I have a database with thousands of sequences.

The format of the header for each sequence is like:

VFG0676 lef - anthrax toxin, lef, bacteria name (VF0142)

Since there are several spaces in the header, when I use an alignment tool to BLAST my samples against this database, the only thing shown in my results is VFG0676.

Is there any way to remove all the spaces in the header, so that my results show the full description?

I also want to extract the sequence headers from this database, but the extracted result currently lists only the VFG IDs, with no other information.

Can anyone help me with this?

Thanks

Crystal

next-gen sequence • 16k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Crystal ▴ 70

Ram · Answer 1 · 2015-03-18

5

9.9 years ago

apelin20 ▴ 480

sed 's, ,_,g' -i FASTA_file

Let me know if you need any help.

ADD COMMENT • link 9.9 years ago by apelin20 ▴ 480

0

Excuse me, does this command produce an output file?

ADD REPLY • link 9.9 years ago by Crystal ▴ 70

2

The -i option edits the file in place, so there is no separate output file. If you remove -i, sed prints to standard output and you can just redirect that to a file.
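A minimal sketch of the redirect variant, assuming a small made-up file (demo.fa and demo_fixed.fa are placeholder names, not from the thread). Limiting the substitution to lines that start with > is an extra precaution that is not part of the answer above; it keeps sed from touching sequence lines:

```shell
# Build a tiny example FASTA file (demo.fa is a placeholder name).
printf '>VFG0676 lef - anthrax toxin, lef, bacteria name (VF0142)\nATGAATATAAAA\n' > demo.fa

# No -i: sed writes to stdout, so redirect into a new file.
# The /^>/ address limits the substitution to header lines.
sed '/^>/ s/ /_/g' demo.fa > demo_fixed.fa

head -n 1 demo_fixed.fa
# prints: >VFG0676_lef_-_anthrax_toxin,_lef,_bacteria_name_(VF0142)
```

The original file is left unchanged, which makes it easy to compare the two before replacing anything.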

ADD REPLY • link 9.9 years ago by Devon Ryan 105k

0

Thank you. Then I tried to extract all the headers in that file, but now the format is

VFG0676_lef_

and still didn't show the rest of the information.

I did go back and check the edited file, and the format of the headers is

VFG0676_lef_-_anthrax_toxin,_lef,_bacteria_name_(VF0142)

So I don't know if the problem is due to the code I used to extract headers from the file.

Thanks
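The extraction command is not shown in the thread, so the cause of the truncation cannot be pinned down here. As a point of comparison, one common way to list complete header lines regardless of their content is to match on the leading >. A sketch with placeholder file names (demo_headers.fa, headers.txt):

```shell
# Tiny example file; demo_headers.fa is a placeholder name.
printf '>VFG0676_lef_-_anthrax_toxin,_lef,_bacteria_name_(VF0142)\nATGAATATAAAA\n' > demo_headers.fa

# Keep only header lines, then drop the leading '>'.
grep '^>' demo_headers.fa | sed 's/^>//' > headers.txt

cat headers.txt
# prints: VFG0676_lef_-_anthrax_toxin,_lef,_bacteria_name_(VF0142)
```

If this prints the full header while another tool truncates it, the truncation probably comes from that tool's header parsing rather than from the file.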

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Crystal ▴ 70

1

It seems like the "-" is somehow problematic. After doing

sed 's, ,_,g' -i FASTA_file

Try

sed 's,-,,g' -i FASTA_file

This should remove the - from the header. Normally, hyphens aren't a problem, but you can still remove them.
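One caveat: without an address, the command applies to every line, so in an aligned FASTA file it would also strip gap characters from the sequences. A hedged sketch that restricts the removal to header lines, using a made-up file (gap_demo.fa and gap_demo_fixed.fa are placeholder names):

```shell
# Example with gap characters in the sequence line.
printf '>VFG0676_lef_-_anthrax_toxin\nAT--GC\n' > gap_demo.fa

# /^>/ restricts the substitution to header lines,
# so the '-' gaps in the sequence are preserved.
sed '/^>/ s/-//g' gap_demo.fa > gap_demo_fixed.fa

cat gap_demo_fixed.fa
```

Note that removing the hyphen leaves two adjacent underscores in the header.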

ADD REPLY • link 9.9 years ago by apelin20 ▴ 480

0

Well, now the extracted headers are longer, but still not the full descriptions. :(

The format is VFG0676_lef_-_anthrax_toxin

Also, I forgot to mention that the bacteria name is enclosed in square brackets, like [bacteria_name].

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Crystal ▴ 70

1

Great... looks like the commas are a problem too... we shall slay them as well!

sed 's/,//g' -i FASTA_file

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by apelin20 ▴ 480

1

I think I also need to remove [] and () from the headers.

Should I use code like:

sed 's,(),,g' -i FASTA_file
sed 's,[],,g' -i FASTA_file

OR

sed 's,[,,g' -i FASTA_file
sed 's,],,g' -i FASTA_file
sed 's,(,,g' -i FASTA_file
sed 's,),,g' -i FASTA_file

PS: As a noob to this forum, I can only post five messages/day.

Thanks

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Crystal ▴ 70

1

sed 's,(),,g' will remove the literal string (), not ( and ) individually. You could do sed 's/[()\[]//g;s/\]//g' to remove [, ], (, and ) in a single go.

BTW, there's probably a shorter way of doing that, but I can't get sed to allow [ and ] together in a list...
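One shorter form that should work: in a POSIX bracket expression, a ] placed immediately after the opening [ is taken literally, so all four characters fit in a single list. A sketch on a made-up header string (the variable name cleaned is only for illustration):

```shell
# ']' directly after the opening '[' is literal, so the list is: ] [ ( )
cleaned=$(printf '>VFG0676_lef_[bacteria_name]_(VF0142)\n' | sed 's/[][()]//g')

echo "$cleaned"
# prints: >VFG0676_lef_bacteria_name_VF0142
```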

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Devon Ryan 105k

0

If you are on OS X and want to edit in place, it is slightly different: sed -i '' 's/ /_/g' foo.fa
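If the same command has to run on both Linux and OS X, one option is to attach a backup suffix directly to -i, a form that both GNU sed and BSD sed accept. A sketch with a placeholder file name (foo_demo.fa); the /^>/ address is an added precaution so only header lines are edited:

```shell
# Small example file; foo_demo.fa is a placeholder name.
printf '>seq1 some description\nACGT\n' > foo_demo.fa

# -i.bak (suffix attached, no space) edits in place and
# keeps the original as foo_demo.fa.bak.
sed -i.bak '/^>/ s/ /_/g' foo_demo.fa

head -n 1 foo_demo.fa
# prints: >seq1_some_description
```

The .bak copy can be deleted once the edited file has been checked.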

ADD REPLY • link 9.9 years ago by Alex Reynolds 36k

Ram · Answer 2 · 2015-03-18

4

9.9 years ago

Brian Bushnell 20k

BBTools has a read reformatter which will replace all of the whitespace in headers with underscores:

reformat.sh in=reads.fasta out=fixed.fasta addunderscore

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Brian Bushnell 20k