Hello All,
I have a multi-FASTA file with millions of sequences. For each header, I want to keep only the sequence ID (the part before the first space), duplicate it, and join the two copies with a pipe. The rest of the header should be deleted.
Let's say I have a FASTA file, "input.fasta", which looks like this:
>Gene1 wbdfwbf
ATGCCGATGCAGTGACG
>Gene2 wbdwe
ATGCAGTGACGTAGCAG
>Gene3 wdbwd
TGACGTAGCGTAGCAG
I want to convert it to:
>Gene1|Gene1
ATGCCGATGCAGTGACG
>Gene2|Gene2
ATGCAGTGACGTAGCAG
>Gene3|Gene3
TGACGTAGCGTAGCAG
First, I deleted the space and everything after it from each header with
cut -d ' ' -f 1 < input.fasta > out1.fasta
and then appended a pipe to each header with
perl -p -e 's/^(>.*)$/$1\|/g' out1.fasta > out2.fasta
out2.fasta looks like this:
>Gene1|
ATGCCGATGCAGTGACG
>Gene2|
ATGCAGTGACGTAGCAG
>Gene3|
TGACGTAGCGTAGCAG
Now I am stuck. I have come across many posts on the forum about deleting duplicates, but none about duplicating a FASTA header. Could you please help me with this, or point me to a solution if it has already been discussed?
Many Thanks, PSP
Thank you very much!