Here's another Perl option:
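use strict;
use warnings;

# Loop over every entry in the current directory except this script
for my $file ( grep $_ ne $0, <*> ) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (<$fh>) {
        chomp;
        # On header lines, keep the matched text (\K) and append "-<file name>"
        s/^>.+\K/-$file/;
        print "$_\n";
    }
}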
Usage: perl script.pl [>outFile]
The last, optional parameter directs output to a file.
Drop the script into a directory that contains only the fasta files you want to combine. The script reads all the file names in that directory (filtering out its own name), iterates through them, and appends the name of the file it's currently reading to the end of each fasta header line in that file.
It's problematic just sending file names on the command line to such a script, since you'd need to enclose each name within quotes, as Pierre Lindenbaum did with his example, because of the spaces. This solution bypasses the need to send those names to the script.
How does this work? The script uses a glob (<*>) to get a listing of all files in the current directory, then greps each against the script's name ($0), so it's not going to read in the script. As each file line is read, the input record separator (usually \n) is chomped off, and a substitution that looks for a > at the start of a line is applied. The substitution matches the complete header line, but the \K says "Keep all to the left," so nothing is replaced and a hyphen and the file name are appended to that line. Finally, that line is printed. (Note there is no close $fh;. Perl automatically closes the opened file when the lexical $fh goes out of scope at the end of each loop iteration.)
The following Python script produces the same results:
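#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import re

scriptName = os.path.basename(__file__)
for fileName in os.listdir('.'):
    # Skip this script and any subdirectories; 'continue' moves on to the next entry
    if fileName == scriptName or os.path.isdir(fileName):
        continue
    with open(fileName, 'r') as inFile:
        for line in inFile:
            # Append "-<file name>" to header lines; other lines pass through unchanged
            print(re.sub(r'^(>.+)', r'\1-%s' % fileName, line.rstrip('\n')))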
Usage: python script.py [>outFile]
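If you want to check the header substitution on its own before running it over a whole directory, here is a minimal sketch. The function name tag_header and the sample sequence line are made up for illustration, and it passes a function as the replacement (rather than the r'\1-%s' string used above) so that a backslash in a file name isn't treated as a regex escape.

```python
import re

def tag_header(line, file_name):
    """Append '-<file_name>' to a FASTA header line; return other lines unchanged."""
    line = line.rstrip('\n')
    # A function replacement inserts file_name literally, with no escape processing
    return re.sub(r'^(>.+)', lambda m: m.group(1) + '-' + file_name, line)

print(tag_header('>Act1\n', 'Homo sapiens'))      # header line gets the suffix
print(tag_header('ATGGATGAC\n', 'Homo sapiens'))  # sequence line is left alone
```

The first call prints >Act1-Homo sapiens and the second prints ATGGATGAC.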
Hope this helps!
Be aware that if you keep the space in the species name, it will generate a break in the FASTA header. So for example in the first line, Act1-Homo will be interpreted as the sequence ID and sapiens as a sequence description.
Yes, it should be considered.
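One possible way to handle that, as a rough sketch rather than anything from the scripts above: replace whitespace in the file name before appending it. The underscore and the function names safe_suffix and tag_header are just assumptions for illustration; pick whatever separator your downstream tools expect.

```python
import re

def safe_suffix(file_name):
    """Replace each run of whitespace in file_name with a single underscore."""
    return re.sub(r'\s+', '_', file_name.strip())

def tag_header(line, file_name):
    """Append '-<whitespace-free file name>' to a FASTA header line."""
    line = line.rstrip('\n')
    suffix = safe_suffix(file_name)
    # Function replacement, so the suffix is inserted literally
    return re.sub(r'^(>.+)', lambda m: m.group(1) + '-' + suffix, line)

print(tag_header('>Act1\n', 'Homo sapiens'))
```

This prints >Act1-Homo_sapiens, so the whole string stays in the sequence ID instead of being split at the space.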