Merging of sequence chunks
2
0
Entering edit mode
7.0 years ago
jan • 0

Hello!

I would like to merge all amino acid sequences with identical names in a given fasta file (trimmed alignment). I know that this can be done for manually selected sequences with Aliview, however, I am looking for a tool or script (bash, perl, python) that allows me to do that for hundreds of files without manual selection. Non-matching amino acids should be replaced by "X". If the script/tool allows to set up a sequence similarity and overlapping length threshold, even better :) Any help would be much appreciated. Thank you!

My file:

>Blue
MADKLTRIAIVNHDKCKPKKCRQECKKSCPVVRMGKLCIEVTPQSKIAWISETLCIGCGI
>Red
CIKKCPFGALSIVNLPSNLEKETTHRYCAI------------------------------
>Red
-----------------------THRYCANAFKLHRLPIPRPGEVL--------------
>Red
--------------------------------------------------TNGIGKSTAL
>Yellow
KILAGKQKPNLGKYDDPPDWQEILTYFRGSELQNYFTKILEDDLKAIIKPQYVDQIPKAA
>Green
KGTVGSILDRKDETKTQAI-----------------------------------------
>Green
-------------TKKQAIVCQQLDLTHLKERNVEDLSGGELQRFACAVVCIQDQICKKI

Desired output:

>Blue
MADKLTRIAIVNHDKCKPKKCRQECKKSCPVVRMGKLCIEVTPQSKIAWISETLCIGCGI
>Red
CIKKCPFGALSIVNLPSNLEKETTHRYCAXAFKLHRLPIPRPGEVL----TNGIGKSTAL                                                          
>Yellow
KILAGKQKPNLGKYDDPPDWQEILTYFRGSELQNYFTKILEDDLKAIIKPQYVDQIPKAA
>Green
KGTVGSILDRKDETKXQAIVCQQLDLTHLKERNVEDLSGGELQRFACAVVCIQDQICKKI
alignment sequence • 2.5k views
ADD COMMENT
0
Entering edit mode

jan : Please test the perl solution offered by @JC. Then accept it as well (you can accept multiple answers) if all looks well.

ADD REPLY
3
Entering edit mode
7.0 years ago

using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

and the following script:

final Map<String,List<FastaSequence>> id2seqs= new HashMap<>();
stream().forEach(S->{
    String n = S.getName();
    if(!id2seqs.containsKey(n)) id2seqs.put(n,new ArrayList<>());
    id2seqs.get(n).add(S);
    });

for(final List<FastaSequence> seqs: id2seqs.values())
{
println(">"+seqs.get(0).getName());
int len = seqs.stream().mapToInt(S->S.length()).max().getAsInt();
for(int i=0;i< len;i++)
    {
    char c='\0';
    for(FastaSequence s:seqs)
        {
        if(i>=s.length() || s.charAt(i)=='-'  ||  c==s.charAt(i)) continue;
        if(c=='\0')
            {
            c=s.charAt(i);
            }
        else
            {   
            c='X';
            }
        }
    print(c=='\0'?'-':c);
    }
println();
}

usage:

$ java -jar  dist/bioalcidaejdk.jar -f script.txt in.fa

output:

>Red
CIKKCPFGALSIVNLPSNLEKETTHRYCAXAFKLHRLPIPRPGEVL----TNGIGKSTAL
>Blue
MADKLTRIAIVNHDKCKPKKCRQECKKSCPVVRMGKLCIEVTPQSKIAWISETLCIGCGI
>Yellow
KILAGKQKPNLGKYDDPPDWQEILTYFRGSELQNYFTKILEDDLKAIIKPQYVDQIPKAA
>Green
KGTVGSILDRKDETKXQAIVCQQLDLTHLKERNVEDLSGGELQRFACAVVCIQDQICKKI
ADD COMMENT
0
Entering edit mode

Thanks Pierre! Unfortunately, however, I don't seem to be able to install BioAlcidaejdk (there is no bioalcidaejdk.jar in dist/). Can you maybe provide some guidelines how to install this? Is it also running on a Mac?

ADD REPLY
0
Entering edit mode

what happened after you typed :

make bioalcidaejdk
ADD REPLY
0
Entering edit mode

Sorry, my fault. After running sudo apt install default-jdk, it worked. The tool is just great! Thank you very much :-) Could the script also be modified in a way that it accepts headers that are identical only before the "@" sign (e.g., >red@abcd and >red@efgh)?

ADD REPLY
1
Entering edit mode

change String n = S.getName();

to

String n = S.getName();
int a=n.indexOf('@');
if(a!=-1) n=n.substring(0,a);
ADD REPLY
2
Entering edit mode
7.0 years ago
JC 13k

Perl option:

#!/usr/bin/perl

use strict;
use warnings;

my %seqs = ();

# load sequences
$/ = "\n>"; # ingest all fasta sequences
    while (<>) {
    s/>//g;
    my ($name, @seq) = split (/\n/, $_);
    #$name =~ s/@.+//; # uncomment to edit names like red@abc, red@edf
    my $seq = join ("", @seq);
    push @{ $seqs{$name} }, $seq;
}

# process all sequences
foreach my $seq (keys %seqs) {
    print ">$seq\n";
    my $nseqs = scalar @{ $seqs{$seq} };
    if ($nseqs == 1) { # for just one sequence
        print "$seqs{$seq}[0]\n";
    }
    else {
        my $len = length $seqs{$seq}[0];
        for (my $i=0; $i<=$len-1; $i++) {
            my %b = ();
            for (my $j=0; $j<=$nseqs-1; $j++) {
                $b{ substr ($seqs{$seq}[$j], $i, 1) }++;
            }
            my $obs = join ("", keys %b);
            $obs =~ s/-//;
            if    (length $obs == 1) { print "$obs"; }
            elsif (length $obs == 0) { print    "-"; }
            else                     { print    "X"; }
        }
        print "\n";
    }
}

Usage:

$ perl mergeSeqs.pl < test_seqs.fa
>Red
CIKKCPFGALSIVNLPSNLEKETTHRYCAXAFKLHRLPIPRPGEVL----TNGIGKSTAL
>Green
KGTVGSILDRKDETKXQAIVCQQLDLTHLKERNVEDLSGGELQRFACAVVCIQDQICKKI
>Blue
MADKLTRIAIVNHDKCKPKKCRQECKKSCPVVRMGKLCIEVTPQSKIAWISETLCIGCGI
>Yellow
KILAGKQKPNLGKYDDPPDWQEILTYFRGSELQNYFTKILEDDLKAIIKPQYVDQIPKAA
ADD COMMENT
0
Entering edit mode

Thanks JC, the script is doing the job!

ADD REPLY

Login before adding your answer.

Traffic: 1324 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6