Renaming FASTA files with their headers (gene name)?
2.6 years ago
sunnykevin97 ▴ 990

Hi

I have around 10,0000 gene sequences in individual FASTA files. I'd like to rename each file after the gene name contained in its header.

Original file

head 1.fasta

==> 1.fasta <==

> Gloriosasuperba; 8324-9004; -; atp6 

ATGACAGTAAGCCTTTTTGACCAATTTATGAGCCCCACACTACTAGGCATCCCCCTGCTC

Modified file

head atp6.fasta

==> atp6.fasta <==

> atp6 
ATGACAGTAAGCCTTTTTGACCAATTTATGAGCCCCACACTACTAGGCATCCCCCTGCTC
gene genome • 1.4k views
2.6 years ago

I would not rename the files, but write the contents into a new file, which is trivial with awk:

cat *.fasta | tr "\n" ";" | tr ">" "\n" | awk -F ";" 'NF >= 5 {gsub(/ /, "", $4); a = "../outdir/" $4 ".fasta"; print ">" $4 "\n" $5 > a}'

This quick n' dirty solution assumes that there are no extra line breaks in the sequences of your FASTA files and that there is an "outdir" folder in the parent directory to write the files into. You can also write your output to the current directory, but then your input and output files might get mixed up, so I would avoid that.


I executed the awk command in the directory containing all the files.

I created a directory named "out".

cat *.fasta | tr "\n" ";" | tr ">" "\n>" | awk -F ";" '{print ">"$4"\n"$5 > "../out/$5.fasta"}' 

awk: cmd. line:1: (FILENAME=- FNR=1) fatal: cannot redirect to `../out/$5.fasta': No such file or directory


I slightly edited my initial reply since the direct redirect to a mix of string and field variable doesn't seem to work. So it is now assigned to a variable a for being written out.

But this doesn't explain the "No such file or directory" error, because it would have created a file with the name $5.fasta in the out directory instead. Did you create the out folder in the parent directory or as a subfolder in the current directory? I am almost sure that this was the problem. Try again after running mkdir -p ../out.
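
The difference between a field reference written inside an awk string and one concatenated with string literals can be checked in isolation. This is a minimal sketch with a made-up input line; it does not touch any files:

```shell
# Inside double quotes awk does not expand fields: "$1" stays the two characters '$' and '1'.
literal=$(echo "atp6 ACGT" | awk '{print "../out/$1.fasta"}')

# Writing the field next to the string literals concatenates them.
joined=$(echo "atp6 ACGT" | awk '{print "../out/" $1 ".fasta"}')

echo "$literal"   # ../out/$1.fasta
echo "$joined"    # ../out/atp6.fasta
```

This is why the earlier command tried to write everything to one file literally named $5.fasta.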


Thanks Matthias,

It does generate a single concatenated file ($5.fasta), that is true, but every sequence in it has the same fixed length:

head -n 6 $5.fasta

> trnW(tga)
AGAGACTTAGGCTAATATAAAACCAAGAGCCTTCAAAGCCCTAAATGAAAGTGAAAATCC
> trnA(gca)
AAAGTTTTAGCTTAATTAAAGTGTCTGTTTTGCGTACAGAAGATGTGGGTTAGTGTCCTG
> trnN(aac)
TAGATGGAGGCTCCTTGGTTTGAGCGTTTAGCTGTTAACTAAGAGTTTGTAGGATCGAAG
> trnC(tgc)
AGTCCCATGGTGTAACATATAAGATTGCAAATCTTAAGACGCAGATTAATATTTGCTGGG

> atp6
 GCTCTAGCTATTTCTCTTCCTTGATTAATATTCCCTGCCCCTTCAACTCGATGATTAAAT

> atp8
CTAATTATTCTTCCCCCTAAAGTGATTGCTCATACTTTCCCAAATGAACCAACCCTACAA

I need the whole sequence. How do I get it?

Furthermore, the concatenated file needs to be split into individual FASTA files.


That is weird, because tr should simply replace the respective characters:

echo "DNA" | tr "D" "R"
RNA 

But then let's go step by step and without using tr.

Does

cat *.fasta | paste - - | sed 's/;/\t/g'

give you

Gloriosasuperba  8324-9004   -   atp6   ATGACAGTAAGCCTTTTTGACCAATTTATGAGCCCCACACTACTAGGCATCCCCCTGCTC

and

cat *.fasta | paste - - | sed 's/;/\t/g' | awk '{fields[NF] = NF};END{for(i in fields) print fields[i]}'

returns only 5?
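
Note that paste - - joins every two consecutive input lines, so this check is only meaningful if each file consists of exactly one header line and one sequence line. A rough way to find files that deviate from this, sketched here on two made-up files in a scratch directory (file names and sequences are assumptions for illustration):

```shell
workdir=$(mktemp -d)
printf '> a; 1-2; -; atp6\nACGT\n' > "$workdir/1.fasta"
printf '> b; 3-4; -; atp8\nACGT\nTTGA\n' > "$workdir/2.fasta"   # wrapped sequence: 3 lines

# Report every file that does not have exactly two non-empty lines.
odd=$(for f in "$workdir"/*.fasta; do
  n=$(grep -c . "$f")                      # number of non-empty lines
  [ "$n" -eq 2 ] || echo "$(basename "$f") $n"
done)
echo "$odd"   # 2.fasta 3
```

Any file listed here would be split across several rows by paste - -, which would explain truncated sequences.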


Yes, you're right.


If you only have entries with 5 fields and the contents of each FASTA file are now on one line, this should then give you the desired output:

 cat *.fasta | paste - - | sed 's/;/\t/g' | awk -F "\t" '{gsub(/ /, "", $4); a = "../out/" $4 ".fasta"; print ">" $4 "\n" $5 > a}'

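If you have multiple sequences for the same gene (e.g. transcripts), they end up concatenated in the same file. Within a single awk run, > a only empties the file when it is first opened; use >> a if you also want to keep whatever an earlier run already wrote to it.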
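
How awk treats > and >> can be checked with a small experiment. This is only a sketch; the temporary file comes from mktemp and the input lines are made up:

```shell
tmp=$(mktemp)
echo "old content" > "$tmp"

# ">" truncates the file once, when awk first opens it;
# later prints in the same run are appended to it.
printf 'a\nb\n' | awk -v f="$tmp" '{print $0 > f}'
first=$(cat "$tmp")

# ">>" keeps whatever the file held before this awk run started.
printf 'c\n' | awk -v f="$tmp" '{print $0 >> f}'
second=$(cat "$tmp")

printf '%s\n---\n%s\n' "$first" "$second"
```

After the first run the file holds the lines a and b (the old content is gone); after the second run it holds a, b and c.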


In the command, what do $4 and $5 stand for?

'{print ">"$4"\n"$5 > "../out/$5.fasta"}'

$4 and $5 are the respective columns.

echo "aaaa bbbb cccc dddd" | awk '{print $1 > $2; print $3 > $4}'

should give you the file bbbb with the content aaaa and the file dddd with the content cccc.

So what I am attempting is:

  • get all of the FASTA header plus the actual DNA sequence in one line.
  • combine the columns accordingly and reintroduce the newline "\n" characters at the appropriate spot.
  • print that to a file, using the column with the gene name as output file name.
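
The same idea can also be sketched as a single awk program that switches the output file at each header and copies the sequence lines unchanged, so sequences wrapped over several lines stay complete. This is a sketch under assumptions, not a tested solution for the real data: the scratch directory, file names and sequences below are made up, and the gene name is assumed to be the fourth ";"-separated field of every header.

```shell
# Demo data in a scratch directory (hypothetical files).
workdir=$(mktemp -d)
mkdir "$workdir/in" "$workdir/out"
printf '> Gloriosasuperba; 8324-9004; -; atp6 \nATGACAGTAA\nGCCTTTTTGA\n' > "$workdir/in/1.fasta"
printf '> Gloriosasuperba; 7900-8100; -; atp8 \nCTAATTATTC\n' > "$workdir/in/2.fasta"

awk -F ';' -v outdir="$workdir/out" '
  /^>/ {                               # header line: pick the output file
    if (out != "") close(out)          # keep only one output file open at a time
    gene = $4
    gsub(/[ \t\r]/, "", gene)          # strip the padding around the gene name
    out = outdir "/" gene ".fasta"
    print ">" gene >> out              # ">>" so a gene seen again is appended
    next
  }
  out != "" && NF { print >> out }     # copy every non-empty sequence line
' "$workdir"/in/*.fasta

head "$workdir"/out/*.fasta
```

Because the program appends, the output directory should be empty before each run. Passing the files as arguments to awk instead of piping them through cat keeps the pipeline short, but both work.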
