In Perl, read the sequences and headers of one file into a hash table. Use the sequence as the key and the header as the value in the key-value pair stored in the hash table.
Once you have the "first-file" hash table, read through the second file in order. At each sequence, test if the first-file hash table has an entry with a matching key (sequence) with some header.
Given files A.fa
and B.fa
:
$ cat A.fa
>seq1
ACTGACTGACTG
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
AAAAA
$ cat B.fa
>seq1
AAAAAA
>seq2
AAAAAA
>seq3
AAAAAA
>seq4
ACTGACTGACTG
You could have a Perl script called intersection.pl
that keeps track of matching sequence/header combinations:
#!/usr/bin/env perl
use strict;
use warnings;
# read sequences into A hash table
my $aHt;
my $aCnt = 0;
my ($header, $sequence);
open(my $aFh, "<", "A.fa") or die "could not open A.fa";
while (<$aFh>) {
chomp;
if ($aCnt % 2 == 0) {
$header = $_;
}
else {
$sequence = $_;
if (! defined $aHt->{$sequence}) {
$aHt->{$sequence} = ();
}
push(@{$aHt->{$sequence}}, $header);
}
$aCnt++;
}
close($aFh);
# test sequences from B and report matches with A hash table
my $bCnt = 0;
open(my $bFh, "<", "B.fa") or die "could not open B.fa";
while (<$bFh>) {
chomp;
if ($bCnt % 2 == 0) {
$header = $_;
}
else {
$sequence = $_;
if (defined $aHt->{$sequence}) {
print STDOUT "matches on sequence [$sequence] from file A [@{$aHt->{$sequence}}] to file B [$header]\n";
}
}
$bCnt++;
}
close($bFh);
Note that we are storing a list of headers in the hash table for a given key/sequence. This is because your test input has duplicate sequences, so we have to keep track of multiple headers.
We also assume that your FASTA files are single-line. In other words, a sequence follows a header on a single line. These may not be correct assumptions, so reformat your input, if necessary, or adjust the code to deal with multiline FASTA.
When we run this on your test input:
$ ./intersection.pl
matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq1]
matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq2]
matches on sequence [AAAAAA] from file A [>seq2 >seq3] to file B [>seq3]
matches on sequence [ACTGACTGACTG] from file A [>seq1] to file B [>seq4]
Hello angelo.armijos.iv,
We prefer that you first try to solve this on your own and if it doesn't work show us what you tried. We are more eager to point out your mistakes and put you back on track, rather than giving a full solution which will learn you nothing.
In addition, I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:
Hello Angelo,
I do not know how to do it in Perl. Nevertheless, we are currently developing SEDA (http://www.sing-group.org/seda/), a Java application to easily process FASTA files.
Among its functions, the "Remove redundant sequences" option (see section 3.3 of the manual: http://www.sing-group.org/seda/downloads/manuals/seda-user-manual-1.0.0.pdf) has been precisely created for doing what you are looking for. With the "Save merged headers into a file" you can obtain a similar report to the one that you are looking for.
With best regards,
Hugo.