How to remove fasta headers in a multifasta file and write file name as a fasta header?
4.2 years ago
Kumar ▴ 120

I have a FASTA file named 119XCA.fasta, shown below:

>cellulase
ATGCTA
>gyrase
TGATGCT
>16s
TAGTATG

I need to remove all the FASTA headers, keep the sequences one after another, and write the file name as the single FASTA header. The expected output is shown below:

>119XCA
ATGCTA
TGATGCT
TAGTATG

I have used sed '/^>/d' foo.fa > out.fa, which removes the FASTA headers, but I do not know how to write the file name as the header. Please help me do this.

4.2 years ago
Joe 21k

Not the prettiest code in the world, but this will work.

Run it like so: bash scriptname.sh /path/to/files/*.fasta

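# Requires GNU sed (gsed on macOS).
for file in "$@" ; do                 # "$@" covers every file the glob expanded to
    name=$(basename "$file" .fasta)   # file name without path or extension
    # 1) drop every header except the first; 2) join the sequence lines;
    # 3) re-wrap at 80 characters; 4) replace the header with the file name
    sed -e '1!{/^>.*/d;}' "$file" | \
        sed ':a;N;$!ba;s/\n//2g' | \
        sed '1!s/.\{80\}/&\n/g' | \
        sed "1s|^>.*$|>${name}|" > "${name}.fa"
done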

You can also do it as a one-liner for a single file if needed (set file to your input first):

file=filename.fasta; sed -e '1!{/^>.*/d;}' "$file" | sed ':a;N;$!ba;s/\n//2g' | sed '1!s/.\{80\}/&\n/g' | sed "1s|^>.*$|>$(basename "$file" .fasta)|" > "$(basename "$file" .fasta).fa"
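
For comparison, here is a minimal sketch of the same four steps run on the sample data from the question. It swaps the sed calls for grep, tr and fold, so it is not the code above, just one way to reach the same result. The scratch directory and the variable names (workdir, name, out) are assumptions made for the demo.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)                 # scratch directory for the demo
file="$workdir/119XCA.fasta"

# Sample input copied from the question.
printf '>cellulase\nATGCTA\n>gyrase\nTGATGCT\n>16s\nTAGTATG\n' > "$file"

name=$(basename "$file" .fasta)      # 119XCA: no directory, no extension
out="$workdir/$name.fa"

{
    printf '>%s\n' "$name"           # single header built from the file name
    grep -v '^>' "$file" |           # drop every original header
        tr -d '\n' |                 # join the sequence lines into one string
        fold -w 80                   # wrap at 80 characters
    echo                             # terminate the last sequence line
} > "$out"

cat "$out"
```

On the sample input this prints >119XCA followed by ATGCTATGATGCTTAGTATG. Note that it concatenates the sequences into one record, as the loop above does, rather than keeping one sequence per line.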

(Note the first 3 sed calls are useful for concatenating any fasta)

I know this is super old, dunno if anyone will see but I'll give it a try.

I liked this one-liner and tried it. It works in the sense that it deletes all headers in the multifasta and concatenates the sequences into one big sequence with a single header at the beginning. It's just that, for some reason, the header and file name are authorized_keys.fa instead of the original file name. Does anyone know why?

This is what I work on: a multifasta file of 8 genes. Every sequence has the same header (the species name), which is also the name of the multifasta file. So the file name is Lhookeri.fasta and it looks like:

>Lhookeri
RRKVN...
>Lhookeri
STLGKLLP...
>Lhookeri
VKEFG...
>Lhookeri
LIRMDACIA...
>Lhookeri
RRKVN...
>Lhookeri
STLGKLLP...
>Lhookeri
VKEFG...
>Lhookeri
LIRMDACIA...

How are you calling the oneliner?

There's no part of the command which could create the string authorized_keys.fa de novo, so it must be coming from files in your local environment (authorized_keys is part of the SSH config).
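
One possible explanation, offered as a guess rather than a diagnosis: the header and output name are built from the shell variable $file, so if that variable held something other than the FASTA path when the command ran, the name would follow it. The sketch below uses a hypothetical leftover value to show the effect.

```shell
# Hypothetical value left in $file from earlier in the session.
file="$HOME/.ssh/authorized_keys"

name=$(basename "$file" .fasta)   # no .fasta suffix to strip, so the name is kept whole
header=">$name"
outname="$name.fa"

echo "$header"
echo "$outname"
```

With that value, the header becomes >authorized_keys and the output file authorized_keys.fa, whatever FASTA file was actually piped in.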

thank you for getting back to me :) I ran the command from the terminal exactly as you wrote it, while in the same folder as my multifasta. I checked the number of amino acid residues in my new mono-fasta file (called authorized keys) and it is the same as in multifasta so it works, and I changed the name of the header and file name manually, it's just bugging me what is wrong. :) I am not a Linux expert so can't figure this out on my own but at least it's working.

Did you save the code as a file and then run it like bash scriptname.sh /path/to/files/*.fasta?

$ more te.fa
>Lhookeri
RRKVN...
>Lhookeri
STLGKLLP...
>Lhookeri
VKEFG...
>Lhookeri
LIRMDACIA...
>Lhookeri
RRKVN...
>Lhookeri
STLGKLLP...

You only need the first sed command from @Joe's example to get the result. Save the output to a new file by adding > new.fa at the end of the command below.

$ cat te.fa | sed -e '1!{/^>.*/d;}' 
>Lhookeri
RRKVN...
STLGKLLP...
VKEFG...
LIRMDACIA...
RRKVN...
STLGKLLP...
VKEFG...
LIRMDACIA...

Hi, thank you for the suggestion. It does look easier. I tried your command and yes, it concatenates all the records in the file into one big sequence and leaves just one header at the top. The difference is that there are spaces left on the lines where the end of each sequence used to be in the original file, and I'm not sure whether that will interfere with my downstream analysis (aligning with other sequences). Ignore the asterisks in the screenshot.

Yeah this is what the other elements of the subsequent sed commands deal with (linearising, and then wrapping back to 80 chars).

Just FYI, I don't think that command will deal with the stop codons, so they may persist in the final sequence.
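
If the asterisks do need to go before alignment, one possible extra step is a sed filter that deletes them from sequence lines only. This is an addition to the pipeline above, not part of it, and the toy sequences below are made up for illustration.

```shell
# Toy protein record; the sequences are invented for this example.
cleaned=$(printf '>Lhookeri\nMKV*\nLIR*\n' | sed '/^>/!s/\*//g')   # skip header lines, strip every '*'

echo "$cleaned"
```

The /^>/! address restricts the substitution to non-header lines, so an asterisk in a header would be left untouched.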

Just to belabour the point: I rechecked this and the code definitely works as intended. For piping, it can be simplified so that it does not create the file, like so:

$ cat test.fa
>Header_1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>Header_2
CCCCCCDDDDDD
>Header_3
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

$ cat test.fa | gsed -e '1!{/^>.*/d;}' | gsed ':a;N;$!ba;s/\n//2g' | gsed '1!s/.\{80\}/&\n/g'  

>Header_1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCDDDDDDEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Note that the command above uses gsed (GNU sed), which is what you will need on systems whose default sed is not GNU sed, such as macOS. On Linux, plain sed is normally GNU sed already.
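
Where GNU sed is not available at all, the linearise-and-wrap steps can be sketched in plain awk instead. This is an alternative technique, not the command above. It assumes the first line of the input is a header, and the rep helper and the test data are only there to make the example self-contained.

```shell
#!/usr/bin/env bash
set -euo pipefail

# rep CHAR N: print CHAR repeated N times (helper for building test data)
rep() { printf "%${2}s" '' | tr ' ' "$1"; }

input=$(mktemp)
{
    echo '>Header_1'; rep A 50; echo
    echo '>Header_2'; rep C 50; echo
} > "$input"

result=$(awk '
    /^>/ { if (NR == 1) print; next }     # keep the first header, skip the rest
         { seq = seq $0 }                 # append each sequence line
    END  {
        for (i = 1; i <= length(seq); i += 80)
            print substr(seq, i, 80)      # emit 80-character chunks
    }
' "$input")

echo "$result"
```

For the 100 residues in this test input, the result is the first header, one 80-character line and one 20-character line.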

4.2 years ago
Shred ★ 1.6k

Assuming you're using Bash, use basename to get the file name without the path, like:

filename=$(basename /path/to/your.fasta | cut -d'.' -f1)

Then you can write it into the header line using sed:

sed -i "s/^>.*$/>$filename/" your.fasta

Remember to use double quotes so that the shell expands the variable inside the sed expression.
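
One thing to watch with cut -d'.' -f1 is that it keeps only the text before the first dot. A small sketch of the difference, using a hypothetical path and Bash parameter expansion as the alternative:

```shell
path="/data/run1/sample.v2.fasta"              # hypothetical path with two dots in the name

base=${path##*/}                               # strip the directory: sample.v2.fasta
stem=${base%.*}                                # strip only the last extension: sample.v2
cut_stem=$(basename "$path" | cut -d'.' -f1)   # everything before the first dot: sample

echo "$stem"
echo "$cut_stem"
```

For names with a single dot, such as 119XCA.fasta, both approaches give the same result.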

I don't think this will concatenate the sequence?

He said he's already got the concatenated file.

4.2 years ago

try this:

$ cat test.fa
>cellulase
ATGCTA
>gyrase
TGATGCT
>16s
TAGTATG

$  awk 'BEGIN {print ">"ARGV[1]};!/^>/{print}' test.fa

>test.fa
ATGCTA
TGATGCT
TAGTATG

$ cat <(echo ">$(basename test.fa)") <(grep -v ">" test.fa)
>test.fa
ATGCTA
TGATGCT
TAGTATG
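
To get the header without the path or extension, and to handle several files in one call, the awk approach could be extended roughly as follows. This is a sketch under assumptions: the second file name (120XCB.fasta) is hypothetical, and each input file is assumed to start with a header line.

```shell
#!/usr/bin/env bash
set -euo pipefail

dir=$(mktemp -d)
printf '>cellulase\nATGCTA\n>gyrase\nTGATGCT\n' > "$dir/119XCA.fasta"
printf '>16s\nTAGTATG\n' > "$dir/120XCB.fasta"   # hypothetical second file

merged=$(awk '
    FNR == 1 {                          # first line of each input file
        name = FILENAME
        sub(/.*\//, "", name)           # strip the directory part
        sub(/\.[^.]*$/, "", name)       # strip the last extension
        print ">" name
    }
    !/^>/                               # print non-header lines unchanged
' "$dir"/*.fasta)

echo "$merged"
```

Each input file yields one record headed by its own name, with the sequences kept one per line as in the expected output from the question.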