You might work with the following as a start:
#!/usr/bin/perl
use strict;
use warnings;
my $header = "";
my $seq = "";
my $fastaRef;
# read FASTA in from standard input
while(<>) {
chomp;
if (/>/) {
if (length $seq > 0) {
push(@{$fastaRef->{$seq}}, $header);
}
$header = substr $_, 1;
}
else {
$seq .= $_;
}
}
push (@{$fastaRef->{$seq}}, $header);
# print out merged-header sequences
foreach my $seqKey (keys %{$fastaRef}) {
my @headers = @{$fastaRef->{$seqKey}};
print STDOUT ">".join("_", @headers)."\n".$seqKey."\n";
}
One thought is to pre-process the FASTA input to "explode" each N-sequence to four (or power-of-four) ACTG-sequences. Each exploded sequence is given a modified header name. When merging, you can filter out any unmerged N-sequences based on the modified header name; the matched sequence will merge its header with another matched sequence header.
>seq1
AATGC
>seq2
ANTGC
>seq3
AGATC
This could become something like:
>seq1
AATGC
>seq2-modA
AATGC
>seq2-modC
ACTGC
>seq2-modG
AGTGC
>seq2-modT
ATTGC
>seq3
AGATC
When you merge this modified FASTA input with the above script, you get something like this:
>seq1_seq2-modA
AATGC
>seq2-modC
ACTGC
>seq2-modG
AGTGC
>seq2-modT
ATTGC
>seq3
AGATC
Your last step is to remove *-modN
sequences that only have one match (to themselves) to get a final answer:
>seq1_seq2-modA
AATGC
>seq3
AGATC
This should be an easy modification of the script above, looking for any sequences with a one-element array and then a *-modN
suffix.
Thank you for answer.