Question

Append fasta header to corresponding fasta filename

0

4.7 years ago

genomes_and_MGEs ▴ 10

Hey everyone, when you download a given assembly from NCBI RefSeq, the filename will be, for example, GCF_006351845.1_ASM635184v1_genomic.fna, and the corresponding fasta header is:

>NZ_CP040904.1 Enterococcus faecium strain N56454 chromosome, complete genome

After some formatting, all my fasta headers are like this, for example:

>NZ_CP040904.1_Ef

I would like to rename the file to Ef_GCF_006351845.1_ASM635184v1_genomic.fna, i.e. take the text after the last underscore in the fasta header and put it at the beginning of the filename.

Could you guys help me out?

Thanks!

sequence • 1.1k views

ADD COMMENT • link updated 4.7 years ago by cpad0112 21k • written 4.7 years ago by genomes_and_MGEs ▴ 10

0

Here's some logic to approach the problem:

For each of these files, take the first line, split it on _, and store the last part (Ef in the example header above) in a variable. Then rename the file so that this variable precedes the original file name. This can be done in a loop that contains two commands. bash is enough for this; you won't need a programming language.
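The steps above might look roughly like this in bash. This is a minimal sketch under a few assumptions: the header of interest is the first line of each file, the tag is whatever follows the last underscore on that line, and the files sit in the current directory. The names demo_dir, header, and tag are illustrative, and the first three lines only build a throwaway test file in a scratch directory so the loop can be tried safely.

```bash
#!/usr/bin/env bash
# Demo setup (hypothetical): one small FASTA file in a scratch directory.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
printf '>NZ_CP040904.1_Ef\nACGT\n' > GCF_006351845.1_ASM635184v1_genomic.fna

for f in *.fna; do
  header=$(head -n 1 "$f")      # first line, e.g. >NZ_CP040904.1_Ef
  tag=${header##*_}             # drop everything up to the last "_", leaving Ef
  mv -v -- "$f" "${tag}_${f}"   # put the tag in front of the file name
done
```

If a header contains no underscore, ${header##*_} returns the whole line, so it is probably worth checking the proposed names (for example by putting echo in front of mv) before renaming real data.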

ADD REPLY • link 4.7 years ago by Ram 45k

0

I'm no expert in this, but I wrote this:

for F in *.fna ; do N=$(awk -F '>|_' '/^>/ {print $4}' $F) ; echo mv -v $F $N_$F ; done

I understand it should be something similar to this, but I'm making some mistakes. Could you help me out?
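One likely culprit in the loop above, offered as a guess rather than a full diagnosis: in $N_$F the shell reads $N_ as a variable named N_ (underscores are valid in variable names). That variable is unset, so the prefix silently disappears. Braces mark where the name ends. The short demo below uses made-up values for N and F to show the difference.

```bash
set +u      # default behaviour: unset variables expand to an empty string
unset N_    # make sure no variable called N_ exists for this demo

N="Ef"
F="GCF_006351845.1_ASM635184v1_genomic.fna"

wrong="$N_$F"      # expands the unset variable N_, then F
right="${N}_$F"    # expands N, then a literal underscore, then F

echo "without braces: $wrong"
echo "with braces:    $right"
```

Two other things to keep in mind: the echo in front of mv makes the loop a dry run that only prints the commands, and the awk pattern prints a tag for every header line, so a file with several records would yield several values in N.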

ADD REPLY • link updated 4.7 years ago by Ram 45k • written 4.7 years ago by genomes_and_MGEs ▴ 10

1

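This copies each file under its new name; change cp to mv to rename instead of copying.

# Prefix each .fna file in the current directory with the third
# "_"-separated field of its first header line (Ef in >NZ_CP040904.1_Ef).
for F in *.fna; do
  N="$(head -n1 "$F" | cut -d"_" -f3)_$F"   # new name, e.g. Ef_GCF_..._genomic.fna
  cp "$F" "$N"                               # quoted so unusual file names survive
done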

ADD REPLY • link 4.7 years ago by rpolicastro 13k