Hi,
I have a database with thousands of sequences.
The format of the header for each sequence is like:
VFG0676 lef - anthrax toxin, lef, bacteria name (VF0142)
Since there are several spaces in the header, when I use an alignment tool to blast my samples against this database, the only thing shown in my results is VFG0676.
Is there any way I can remove all the spaces in the header, so that my results show the full description?
I also want to extract the sequence headers from this database, but right now the extracted result only displays a list of VFG IDs, with no other information.
Can anyone help me with this?
Thanks
Crystal
Excuse me, does the code produce an output file?
The -i option edits in place. If you remove it then just redirect to a file.
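For example, a minimal sketch of the redirect version, assuming a FASTA file where header lines start with ">" (the file names demo.fa and demo_underscored.fa are made up for illustration):

```shell
# Hypothetical two-line FASTA file for demonstration.
printf '>VFG0676 lef - anthrax toxin\nATGC\n' > demo.fa

# Without -i, sed prints the edited text to stdout, so redirect it to a new file.
# The /^>/ address restricts the substitution to header lines.
sed '/^>/ s/ /_/g' demo.fa > demo_underscored.fa

# The original file is untouched; the new file has underscores in the header.
cat demo_underscored.fa
```

Writing to a new file also leaves the original database around in case the result is not what you wanted.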
Thank you. Then I tried to extract all the headers in that file, but now the format is
and it still doesn't show the rest of the information.
I did go back and check the edited file, and the format of the headers is
So I don't know if the problem is due to the code I used to extract headers from the file.
Thanks
Seems like the "-" is somehow problematic. After doing
Try
This should remove the - from the header. Normally, a - isn't a problem, but you can still remove it.
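One way to do that, as a sketch only and not necessarily the exact command meant above, assuming the spaces have already been turned into underscores and that headers start with ">" (demo_dash.fa and demo_nodash.fa are made-up names):

```shell
# Hypothetical demo file; the sequence line contains a "-" that must survive.
printf '>VFG0676_lef_-_anthrax_toxin\nAT-GC\n' > demo_dash.fa

# Only touch header lines (those starting with ">"):
#   1) delete every "-"
#   2) collapse the runs of "_" left behind into a single "_"
sed -e '/^>/ s/-//g' -e '/^>/ s/__*/_/g' demo_dash.fa > demo_nodash.fa

cat demo_nodash.fa
```

Keep in mind this deletes every "-" in a header, including ones inside hyphenated names, so check a few headers afterwards.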
Well, now the extracted headers are longer, but still not the full descriptions. :(
The format is
VFG0676_lef_-_anthrax_toxin
Also, I forgot to mention that there are [] around the bacteria name, like [bacteria_name].
Great... looks like the commas are a problem too... we shall slay them as well!
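A possible way to do it, sketched on a made-up header that follows the format described in this thread (demo_commas.fa and demo_nocommas.fa are hypothetical file names), with a grep at the end to list the headers and check the result:

```shell
# Hypothetical demo file with commas still present in the header.
printf '>VFG0676_lef_-_anthrax_toxin,_lef,_[bacteria_name]_(VF0142)\nATGC\n' > demo_commas.fa

# Delete every comma, but only on header lines.
sed '/^>/ s/,//g' demo_commas.fa > demo_nocommas.fa

# List the headers to confirm nothing is cut off any more.
grep '^>' demo_nocommas.fa
```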
I think I also need to remove [] and () in the headers, too. Should I use code like:
OR
PS: As a noob to this forum, I can only post five messages/day.
Thanks
sed 's,(),,g' will remove the literal string (), not ( and ) individually. You could do sed 's/[()\[]//g;s/\]//g' to remove [, ], (, and ) in a single go. BTW, there's probably a shorter way of doing that, but I can't get sed to allow [ and ] together in a list...

If you are on OS X and want to edit in place, it is slightly different:
sed -i '' 's/ /_/g' foo.fa
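On getting [ and ] into the same list: in a POSIX bracket expression, a ] placed first in the list is taken literally, so the two substitutions can be merged into one. A minimal sketch, assuming headers start with ">" (demo_brackets.fa and demo_clean.fa are made-up names):

```shell
# Hypothetical demo file with both kinds of brackets in the header.
printf '>VFG0676_lef_[bacteria_name]_(VF0142)\nATGC\n' > demo_brackets.fa

# In the list [][()], the leading ] is a literal, followed by [, ( and ).
# All four characters are deleted in a single pass, on header lines only.
sed '/^>/ s/[][()]//g' demo_brackets.fa > demo_clean.fa

cat demo_clean.fa
```

Redirecting to a new file like this behaves the same on Linux and OS X, which sidesteps the difference in how -i is written.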