I have a multifasta file that looks like this:
>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]
GGCGCGCGGAGAGAGA
And I have a tab-delimited text file (txt.file) that looks like this:
TRINITY_DN10231_c0_g1_i1 UBQ5_TOBAC
TRINITY_DN10231_c0_g1_i2 UBQ5_TOBAC
The text file has abbreviated transcript names that I would like to use to rename the sequences in my fasta file. I would also like to remove the len= and path= sections from the new fasta. The header I would like to get is shown first below, followed by the code I ran to rename the sequences:
>TRINITY_DN10231_c0_g1_i1_UBQ5_TOBAC
awk '
FNR==NR{
a[$1]=$1 $2
next
}
($2 in a) && /^>/{
print ">"a[$2]
next
}
1
' txt.file FS="[> ]" fasta.fa > newfasta.fasta
What I get however, is this:
TRINITY_DN10231_c0_g1_i1UBQ5_TOBAC
I've tried tweaking the code in the initial argument defining the array but that removes all of the headers. Not sure where to go next. Any help would be appreciated.
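One likely cause, offered as a sketch rather than a definitive fix: in awk, writing `$1 $2` concatenates the two fields with nothing between them, which would explain the missing underscore. Assuming that is the only problem, adding an explicit "_" in the array assignment should give the desired header. The snippet below recreates the two sample inputs from the question (file names as in the original command) so it can be run as-is:

```shell
# Recreate the sample inputs from the question in the current directory.
cat > fasta.fa <<'EOF'
>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]
GGCGCGCGGAGAGAGA
EOF
printf 'TRINITY_DN10231_c0_g1_i1\tUBQ5_TOBAC\nTRINITY_DN10231_c0_g1_i2\tUBQ5_TOBAC\n' > txt.file

awk '
FNR == NR {                 # first file: the ID-to-name table
    a[$1] = $1 "_" $2       # explicit "_" between the two fields
    next
}
/^>/ && ($2 in a) {         # header line whose ID is in the table
    print ">" a[$2]         # drops the len= and path= parts
    next
}
1                           # every other line is printed unchanged
' txt.file FS="[> ]" fasta.fa > newfasta.fasta

cat newfasta.fasta
```

Headers whose ID is not found in the text file fall through to the final `1` rule and are printed unchanged, including their len= and path= sections.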
Maybe
a[$1]=$2
only?
That only gives me this:
With seqkit and awk:
input: