Question

Edit a Multifasta File Through awk

5.3 years ago

pthom010 ▴ 40

I have a multifasta file that looks like this

>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]
GGCGCGCGGAGAGAGA

And I have a txt.file (tab delimited) that looks like this:

TRINITY_DN10231_c0_g1_i1        UBQ5_TOBAC
TRINITY_DN10231_c0_g1_i2        UBQ5_TOBAC

The text file contains abbreviated transcript names that I would like to append to the headers in my FASTA file. I would also like to drop the len= and path= sections from the new headers. The header I want is shown first, followed by the code I ran:

>TRINITY_DN10231_c0_g1_i1_UBQ5_TOBAC

awk '
FNR==NR{
a[$1]=$1 $2
next
}
($2 in a) && /^>/{
print ">"a[$2]
next
}
1
' txt.file FS="[> ]" fasta.fa > newfasta.fasta

What I get, however, is this:

TRINITY_DN10231_c0_g1_i1UBQ5_TOBAC

I've tried tweaking the first block, which defines the array, but that removes all of the headers. I'm not sure where to go next. Any help would be appreciated.
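The missing underscore most likely comes from `a[$1]=$1 $2`: awk concatenates adjacent strings with no separator. Below is a minimal sketch of the same script with an explicit "_" added. The sample files are recreated from the snippets above, and headers with no entry in txt.file are assumed to be left unchanged.

```shell
# Sample inputs recreated from the question (for illustration only).
printf 'TRINITY_DN10231_c0_g1_i1\tUBQ5_TOBAC\nTRINITY_DN10231_c0_g1_i2\tUBQ5_TOBAC\n' > txt.file
cat > fasta.fa <<'EOF'
>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]
GGCGCGCGGAGAGAGA
EOF

awk '
FNR==NR {                 # first file: the ID-to-name table
    a[$1] = $1 "_" $2     # explicit "_" between ID and short name
    next
}
/^>/ && ($2 in a) {       # FASTA header whose ID is in the table
    print ">" a[$2]       # with FS="[> ]", $1 is empty and $2 is the ID
    next
}
1                         # print every other line unchanged
' txt.file FS="[> ]" fasta.fa > newfasta.fasta

cat newfasta.fasta
```

Because `FS="[> ]"` is given between the two file names, it only takes effect when awk starts reading fasta.fa, so txt.file is still split on default whitespace.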

fasta unix awk • 1.7k views

ADD COMMENT • link 5.3 years ago by pthom010 ▴ 40

Maybe a[$1]=$2 only?

ADD REPLY • link 5.3 years ago by Asaf 10k

That only gives me this:

TRINITY_DN10231_c0_g

ADD REPLY • link 5.3 years ago by pthom010 ▴ 40

With seqkit and awk:

$ awk '{print $1}' file.fa                                                                                                     
>TRINITY_DN10231_c0_g1_i1
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2
GGCGCGCGGAGAGAGA

$ awk '{print $1}' file.fa | seqkit replace  --quiet  -p "(.+)" -r '{kv}' -k file.txt    

>UBQ5_TOBAC
ATATATATATAT
>UBQ5_TOBAC
GGCGCGCGGAGAGAGA

Input:

$ cat file.fa                                                                                                                  
>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]
GGCGCGCGGAGAGAGA
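Note that both transcripts map to the same short name, so replacing the whole ID leaves two records with the identical header `>UBQ5_TOBAC`, which some downstream tools may not accept. A minimal sketch for spotting repeated headers with plain awk follows; the file names `renamed.fa` and `duplicates.txt` are hypothetical.

```shell
# Hypothetical renamed FASTA containing the duplicate headers shown above.
cat > renamed.fa <<'EOF'
>UBQ5_TOBAC
ATATATATATAT
>UBQ5_TOBAC
GGCGCGCGGAGAGAGA
EOF

# Count each header ID (first field without the leading ">")
# and report every ID that occurs more than once, with its count.
awk '/^>/ { n[substr($1, 2)]++ }
     END  { for (id in n) if (n[id] > 1) print id "\t" n[id] }' renamed.fa > duplicates.txt

cat duplicates.txt
```

An empty duplicates.txt would mean every header ID is unique.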

ADD REPLY • link 5.3 years ago by cpad0112 21k

score 0 · Answer 1 · 2020-05-06

5.3 years ago

Zhilong Jia ★ 2.2k

cat 1.txt

TRINITY_DN10231_c0_g1_i1        UBQ5_TOBAC
TRINITY_DN10231_c0_g1_i2        UBQ5_TOBAC

cat 2.txt

>TRINITY_DN10231_c0_g1_i1 len=1399 path=[0:0-908 1:909-912 2:913-1398]
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2 len=1399 path=[0:0-908 1:909-912 2:913-1398]
GGCGCGCGGAGAGAGA

awk 'FNR==NR{data[$1]=$2; next} /^>/{id=substr($1,2); if (id in data) $0=">" id "_" data[id]} {print}' 1.txt 2.txt

>TRINITY_DN10231_c0_g1_i1_UBQ5_TOBAC
ATATATATATAT
>TRINITY_DN10231_c0_g1_i2_UBQ5_TOBAC
GGCGCGCGGAGAGAGA

ADD COMMENT • link 5.3 years ago by Zhilong Jia ★ 2.2k

Should I cat both the fasta and the txt file? I'm a bit confused.

ADD REPLY • link 5.3 years ago by pthom010 ▴ 40

No, the cat commands are only there to show the contents of 1.txt and 2.txt.

ADD REPLY • link 5.3 years ago by Zhilong Jia ★ 2.2k