Merge Files (Adding File Names To Sequence Names)
4
2
Entering edit mode
11.0 years ago
biolab ★ 1.4k

Dear all,

I have many sequence files. An example is shown below.

FILE1:  Homo sapiens
>Act1
tgtgcatgcacatgactgaccaca
>Tub1
gttgcatgcatgcatgacgtacactg
...............


FILE2: C elegans
>Act1
tgcatgtgactgactgac
>Tub1
catgcatgactgcatgactgactg
.................


FILE3: A thaliana
>Act1
actgactgtacgtgactgactactgac
>Tub1
tgtctgatgcatgcatgactgactgtgacactg
.................

I need to merge all files and add species name to sequence name, the output is like below.

>Act1-Homo sapiens
tgtgcatgcacatgactgaccaca
>Tub1-Homo sapiens
gttgcatgcatgcatgacgtacactg
>Act1- C elegans
tgcatgtgactgactgac
>Tub1- C elegans
catgcatgactgcatgactgactg
>Act1-A thaliana
actgactgtacgtgactgactactgac
>Tub1-A thaliana
tgtctgatgcatgcatgactgactgtgacactg

Would anyone please inform me a perl, awk or shell code to finish this? Thank you very much!!

perl awk • 8.9k views
ADD COMMENT
0
Entering edit mode

Be aware that if you keep the space in the species name, it will generate a break in the FASTA header. So for example in the first line, Act1-Homo will be interpreted as the sequence ID and sapiens as a sequence description.

ADD REPLY
0
Entering edit mode

Yes, it should be considered.

ADD REPLY
4
Entering edit mode
11.0 years ago
awk '/^>/ {printf("%s-%s\n",$0,FILENAME);next;} {print;}' "Homo sapiens" "C elegans" "A thaliana"
ADD COMMENT
4
Entering edit mode

Hi Pierre, here is a shorter version (I like to code golf): awk '{print /^>/ ? $0"-"FILENAME : $0}' *. If the line starts with >, print the line $0 + the FILENAME, else print the line. I replaced the list of files by a star, to be sure to adapt to the situation where many files have to be treated.

ADD REPLY
1
Entering edit mode

Thank you Frederic :-)

ADD REPLY
0
Entering edit mode

Thank you both. The awk commands are really cool.

ADD REPLY
0
Entering edit mode

Hi Pierre, Thank you greatly for your answer. However, the above is an example only. I have many files. I wrote a perl script (I am a true beginner) to do this job.

#!/usr/bin/perl
use strict;
use warnings;

my @files = @ARGV;

foreach my $filename (@files){
    $a = shift $filename;
    while (my $line = <>){
    chomp $line;
    if ($line =~/>/){
        print "$line" . '-' . "$a\n";
        } else {
        print "$line\n";}
    }
}

My command is $ perl merge.pl *.fa
The error messag is Not an ARRAY reference at merge.pl line 8. Could you please check the above script and correct it? Thank you very much!

ADD REPLY
1
Entering edit mode

Correcting your code:

#!/usr/bin/perl
use strict;
use warnings;

my @files = @ARGV;

foreach my $filename (@files){
    open FH, "$filename" or die;
    while (my $line = <FH>){
    chomp $line;
    if ($line =~/>/) {
        print "$line" . '-' . "$filename\n";
    } else {
        print "$line\n";}
    }
    close FH;
}

Check that $filename contains your file name and need to be used to read the file content as FH. After each cycle, you need to close it.

ADD REPLY
1
Entering edit mode

Well done! In case you might be interested, you could do the following:

#!/usr/bin/perl
use strict;
use warnings;

my @files = @ARGV;

foreach my $filename (@files) {
    open my $fh, '<', $filename or die $!;
    while (<$fh>) {
        chomp;
        print /^>/ ? "$_-$filename\n" : "$_\n";
    }
}

This uses the three-argument form of open, including lexical file handles. Note that close $fh; is missing. Perl will automatically close the file when a new handle is assignned to $fh. Also, the above takes advantage of Perl's default scalar $_ when chompping and matching. Finally, the if-then-else is replaced by the ternary operator.

ADD REPLY
3
Entering edit mode
11.0 years ago
Kenosis ★ 1.3k

Here's another Perl option:

use strict;
use warnings;

for my $file ( grep $_ ne $0, <*> ) {
    open my $fh, '<', $file or die $!;

    while (<$fh>) {
        chomp;
        s/^>.+\K/-$file/;
        print "$_\n";
    }
}

Usage: perl script.pl [>outFile]

The last, optional parameter directs output to a file.

Drop the script into the directory where only your fasta files live that you want to combine. The script will read all the file names in that directory--filtering out its own name--iterate through all, and append the current file name it's reading to the end of each fasta header line of the file.

It's problematic just sending file names on the command line to such a script, since you'd need to enclose each name within quotes, as Pierre Lindenbaum did with his example, because of the spaces. This solution bypasses the need to send those names to the script.

How does this work? The script uses a glob (<*>) to get a listing of all files in the current directory, then greps each against the script's name, so it's not going to read in the script. As each file line is read, the input record separator (usually \n) is chompped off, and a substitution--which looks for a > at the start of a line--is used. The substitution matches the complete header line, but the \K says "\Keep all to the left," so a hyphen and the file name is added to that line. Finally, that line is printed. (Note there is no close $fh;. Perl will automatically close the opened file when a new file handle is assigned to $fh.)

The following Python script produces the same results:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import re

for fileName in os.listdir('.'):
    if fileName == os.path.basename(__file__) or os.path.isdir(fileName):
        break

    inFile = open(fileName, 'r')
    for line in inFile:
        print re.sub(r'^(>.+)', r'\1-%s' % fileName, line.rstrip('\n'))

Usage: python script.py [>outFile]

Hope this helps!

ADD COMMENT
0
Entering edit mode

Hi Kenosis and JC, your solutions are really helpful, and make me better understand perl. Thank you very much!

ADD REPLY
0
Entering edit mode

You're most welcome, biolab!

ADD REPLY
2
Entering edit mode
11.0 years ago
Malcolm.Cook ★ 1.5k

here's a perl one-liner to call from the command line that you can use with your multiple fasta files:

perl -lpe '$_.="-$ARGV" if m/^>/' *.fa
ADD COMMENT
0
Entering edit mode

Hi Malcolm, thank you very much!

ADD REPLY
0
Entering edit mode

Simple and very useful!! Thank you!!

ADD REPLY
1
Entering edit mode
11.0 years ago
commonsnook ▴ 40

Hello biolab, I don't know if you are a beginner in Perl or in programming. If you are starting at programming too, I would suggest you to try python, as it is closer to english. Here is a longer version that is easier to read and understand:

#!/usr/bin/python
import os

def merge(file_name):
    data = []
    with open(file_name) as fh:
        for line in fh.readlines():
            line = line.replace("\n", '')
            if line.startswith('>'):
                new_header = (line+" -"+file_name)
                data.append(new_header+"\n")
            else:
                data.append(line+"\n")

    with open('output','a') as fh:
        fh.write(''.join(data))

for file_name in os.listdir('.'):
    if file_name.endswith(".fa"):
        merge(file_name)
ADD COMMENT
1
Entering edit mode

Thanks, tarzvf, I am new in programming. Recently I spend much time on perl. I can learn Python in the meantime, which i believe is very useful. Actually I mostly focus on lab work, but the increasing sequencing technology urges me to begin leraning some R, shell and perl programming. I really find these bioinformatics tools powerful and useful. Thank you for your suggestions.

ADD REPLY

Login before adding your answer.

Traffic: 1505 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6