Taking Out Specific Columns From An Alignment
3
1
Entering edit mode
12.0 years ago
figo ▴ 220

Hi All,

I wanted to write a script to remove certain columns which are in array from an alignment fasta file. I wrote a perl script using bioperl which is like this

#!/usr/bin/perl
use Bio::Seq;
use Bio::SeqIO;
use Bio::AlignIO;
use Bio::Align::AlignI;
use Bio::SimpleAlign;
my $str = Bio::AlignIO->new('-file' => 'aligned.fasta');
my $aln = $str->next_aln();
$out = $aln->remove_columns([4,5]);
my $aio = Bio::AlignIO->new(-fh=>\*STDOUT,-format=>"fasta");
$aio->write_aln($out);

Input data (example file)

>a_1
MEKFGMNFGGGPSKKDLL
>a_2
MEKFGMNFGGGPSKKDLL
>a_3
MEKFGMNFGGGPSKKDLL

But I am having problem that the remove _columns function only works for 2 columns only when I try to put say (e.g [4,5,6]) it doesn't work. So any one can tell me to come up with a solution may be in perl not using bioperl or something else.

perl alignment • 5.6k views
ADD COMMENT
1
Entering edit mode

Will you reformat your code so it can be read? Also can you provide a small piece of data to test?

ADD REPLY
1
Entering edit mode

formatting fixed.

ADD REPLY
0
Entering edit mode

looks like remove_columns function is used to remove certain types of alignments. From the API docs:

"Creates an aligment with columns removed corresponding to the specified criteria: 'match'|'weak'|'strong'|'mismatch'|'gaps'"

It doesn't remove column by position. There is a slice function to get subcolumns of the alignment. You can probably use that to come up with something.

ADD REPLY
0
Entering edit mode

The perl solutions below will work if you want to remove the same character by position from each sequence. But you should consider if you want to account for gaps that potentially could be inserted in the sequences after alignment.

ADD REPLY
0
Entering edit mode

this post was cited in

https://www.sciencedirect.com/science/article/pii/S1055790319301903

Multiple auto- and allopolyploidisations marked the Pleistocene history of the widespread Eurasian steppe plant Astragalus onobrychis (Fabaceae)

ADD REPLY
2
Entering edit mode
12.0 years ago
JC 13k

Another solution in Perl considering multi-line fasta:

#!/usr/bin/perl
use strict;
use warnings;

$ARGV[1] or die "use deleteAlignColumn.pl COLUMNS FASTA\n";
my @cols = split (/,/, $ARGV[0]);
@cols = sort {$b <=> $a} @cols;
open FA, "$ARGV[1]" or die;
$/ = "\n>";
while (<FA>) {
    s/>//g;
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq;
    foreach my $pos (@cols) {
        substr ($seq, $pos - 1, 1) = '';
    }
    print ">$id\n";
    while ($seq) {
         print substr($seq, 0, 80);
         print "\n";
         substr($seq, 0, 80) = '';
    }
}
close FA;

Testing with your data:

$ perl deleteAlignColumn.pl 4,5,6 test.fa 
>a_1
MEKNFGGGPSKKDLL
>a_2
MEKNFGGGPSKKDLL
>a_3
MEKNFGGGPSKKDLL
ADD COMMENT
0
Entering edit mode

I like your use of substring.

ADD REPLY
0
Entering edit mode

How can I send these removed columns into another file. I tried to modify this script, but it is printing up to the column number I have given.

ADD REPLY
1
Entering edit mode
12.0 years ago

If your sequences don't span multiple lines this will work. If they do you can change the input record separator '$/ '

The script:

#!/usr/bin/perl                                                                                                                              
use strict;
use warnings;
use Getopt::Long;



#-----------------------------------------------------------------------------                                                                      
#----------------------------------- MAIN ------------------------------------                                                                      
#-----------------------------------------------------------------------------                                                                      
my $usage = "                                                                                                                                       

Synopsis:                                                                                                                                           

REMOVE_COLUMNS_MULTI_FASTA -c 1,2,3,4 my.multi.fasta                                                                                                

Description:                                                                                                                                        

-c is not zero based!                                                                                                                               

";


my ($help);
my $columns;
my $opt_success = GetOptions('help'      => \$help,
                             'columns=s' => \$columns );

die $usage if $help || ! $opt_success;

my $file = shift;
die $usage unless $file;

my @columns = map {$_ -= 1} split /,/, $columns;

open (my $IN, '<', $file) or die "Can't open $file for reading\n$!\n";

while (my $line = <$IN>) {
    chomp $line;
    if ($line =~ /^>/){
        print "$line\n"
    }
    else{
        my @seq = split //, $line;
        foreach my $c (@columns){
            delete $seq[$c];
        }
        my @p_seq = grep { defined } @seq;
        print join '', @p_seq;
        print "\n";
    }
}

Usage:

perl REMOVE_COLUMNS_MULTI_FASTA -c 1,5,18 example.fasta

Output:

>a_1
EKFMNFGGGPSKKDL
>a_2
EKFMNFGGGPSKKDL
>a_3
EKFMNFGGGPSKKDL
ADD COMMENT
1
Entering edit mode
12.0 years ago

Just for fun, a solution in R. Ofcourse, Perl/Python is faster but another solution in R.

This is slow, because of for loop and can be improved a bit but will be stil slow.

# Usage : Rscript removeColFas.R file.fa 2,3,5,9

# reading file in
fas=read.csv(commandArgs(TRUE)[1],header=F,stringsAsFactors=FALSE)

# reading the fas columns to be removed
rem=as.numeric(c(strsplit((commandArgs(TRUE)[2]),',')[[1]]))

# splitting the characters and removing the non-desired columns
d=unlist(lapply(as.character(fas[seq(2,nrow(fas),by=2),]),function(x){paste(strsplit(x,'')[[1]][-rem],collapse='')}))

# reordering  the data back
count=0;for(i in seq(2,nrow(fas),by=2)){count=count+1;fas[i,]<-d[count]}

# writing the file out
write.table(fas,paste(commandArgs(TRUE)[1],"_edited.fa",sep=''),quote=FALSE,col.names=FALSE,row.names=FALSE)

Cheers

ADD COMMENT

Login before adding your answer.

Traffic: 1259 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6