renaming fasta file with part of fasta header
4.4 years ago
KG ▴ 10

I have about 100 multi-FASTA files (e.g., file.faa) that I need to rename using the species name given in the FASTA header. The headers in these files have the following format:

>XP_003072227.1 aminopeptidase N [Encephalitozoon intestinalis ATCC 50506]

How do I rename 'file.faa' to 'Encephalitozoon intestinalis.faa'?

I saw that people have used awk and sed for a similar purpose but could not figure out what I have to do. Any help is appreciated.

Thank you.

awk sed • 1.4k views

See answers here for inspiration: Rename FASTA files according to FASTA file header

You will need to make some changes to the solutions. Please do not use spaces in file names even though your OS may allow them.


I have tried doing

for i in *.faa; do 
 mv $i $(head -1 $i | cut -f1 -d ' ' | tr -d '>' ).faa
done

This changed 'file.faa' to 'XP_003072227.1.faa' but I need 'Encephalitozoon intestinalis.faa'.

How should I modify the script?

4.4 years ago
GenoMax 147k

Create a backup copy of your files before trying the following. In your command above, replace this part:

head -1 $i | cut -f1 -d ' ' | tr -d '>'

with the following (reading from $i, the loop variable, rather than a fixed file name):

head -1 "$i" | cut -f2 -d [ | cut -f1,2 -d " " | awk -F " " '{OFS="_"}{print $1,$2}' /dev/stdin

With your example header saved as test.fa, this produces:

$ head -1 test.fa | cut -f2 -d [ | cut -f1,2 -d " " | awk -F " " '{OFS="_"}{print $1,$2}' /dev/stdin
Encephalitozoon_intestinalis
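
Putting the pieces together, the whole loop might look something like the sketch below. It is not a tested solution for your data: the function names species_name and rename_by_species are made up for illustration, and it assumes every header carries the species in square brackets with at least a genus and a species word. Files without a bracketed name, or whose target name already exists (several files from the same species), are skipped rather than overwritten.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print Genus_species taken from the first header of a file.
species_name() {
    # first line -> text after '[' (-s drops lines that have no '[')
    # -> first two words -> joined with an underscore
    head -1 "$1" | cut -s -f2 -d '[' | cut -f1,2 -d ' ' | awk -v OFS='_' '{print $1, $2}'
}

# Hypothetical helper: rename every .faa file in the current directory.
rename_by_species() {
    local i name
    for i in *.faa; do
        [ -e "$i" ] || continue                 # glob matched nothing
        name=$(species_name "$i")
        if [ -z "$name" ]; then
            echo "skipping $i: no [species] found in header" >&2
            continue
        fi
        if [ -e "${name}.faa" ]; then
            echo "skipping $i: ${name}.faa already exists" >&2
            continue
        fi
        mv "$i" "${name}.faa"
    done
}
```

Save it to a file, source it, and call rename_by_species inside a copy of the directory holding the .faa files. The underscore in the new name keeps spaces out of file names, as suggested in the comment above.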