Question

Taking Out Specific Columns From An Alignment

1

12.8 years ago

figo ▴ 220

Hi All,

I want to write a script that removes certain columns, listed in an array, from an alignment in FASTA format. I wrote this Perl script using BioPerl:

#!/usr/bin/perl
use Bio::Seq;
use Bio::SeqIO;
use Bio::AlignIO;
use Bio::Align::AlignI;
use Bio::SimpleAlign;
my $str = Bio::AlignIO->new('-file' => 'aligned.fasta');
my $aln = $str->next_aln();
$out = $aln->remove_columns([4,5]);
my $aio = Bio::AlignIO->new(-fh=>\*STDOUT,-format=>"fasta");
$aio->write_aln($out);

Input data (example file)

>a_1
MEKFGMNFGGGPSKKDLL
>a_2
MEKFGMNFGGGPSKKDLL
>a_3
MEKFGMNFGGGPSKKDLL

My problem is that the remove_columns function only works for two columns; when I pass more (e.g. [4,5,6]) it doesn't work. Can anyone suggest a solution, perhaps in Perl without BioPerl, or in something else?

perl alignment • 6.2k views

ADD COMMENT • link updated 12.8 years ago by Sukhi Singh 11k • written 12.8 years ago by figo ▴ 220

1

Will you reformat your code so it can be read? Also can you provide a small piece of data to test?

ADD REPLY • link 12.8 years ago by Zev.Kronenberg 12k

1

formatting fixed.

ADD REPLY • link 12.8 years ago by brentp 24k

0

It looks like the remove_columns function is used to remove certain types of alignment columns. From the API docs:

"Creates an aligment with columns removed corresponding to the specified criteria: 'match'|'weak'|'strong'|'mismatch'|'gaps'"

It doesn't remove column by position. There is a slice function to get subcolumns of the alignment. You can probably use that to come up with something.

ADD REPLY • link 12.8 years ago by Damian Kao 16k
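
A possible explanation for the [4,5] versus [4,5,6] behaviour, offered as an assumption rather than something confirmed in this thread: some BioPerl releases document remove_columns as also accepting array refs that each hold a pair of integers, a zero-based [start, end] range. If the installed version supports that form, [4,5] would be read as one range and a three-element array would not be valid. The sketch below is plain Perl; columns_to_ranges is a hypothetical helper (not part of BioPerl) that turns 1-based column numbers into such pairs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): collapse 1-based column
# numbers into zero-based [start, end] pairs.
sub columns_to_ranges {
    my @cols = @_;
    my %seen;
    # Deduplicate and sort numerically
    my @sorted = sort { $a <=> $b } grep { !$seen{$_}++ } @cols;
    my @ranges;
    for my $col (@sorted) {
        my $idx = $col - 1;                  # shift to zero-based numbering
        if (@ranges && $ranges[-1][1] == $idx - 1) {
            $ranges[-1][1] = $idx;           # extend the current run
        }
        else {
            push @ranges, [$idx, $idx];      # start a new run
        }
    }
    return @ranges;
}

my @ranges = columns_to_ranges(4, 5, 6, 10);
print join(' ', map { sprintf('[%d,%d]', @$_) } @ranges), "\n";

# Assumed usage with the $aln object from the question (not tested here,
# and only valid if the installed BioPerl accepts range pairs):
#   my $out = $aln->remove_columns(@ranges);
```

Check the remove_columns documentation of the installed BioPerl version before relying on the range form.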

0

The Perl solutions below will work if you want to remove the same position from every sequence. But consider whether you need to account for gaps that may have been inserted into the sequences during alignment.

ADD REPLY • link 12.8 years ago by Damian Kao 16k
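
To illustrate the point about gaps: if the positions to remove are residue numbers in one ungapped sequence, they first have to be translated into alignment columns. A minimal sketch, assuming '-' and '.' are the gap characters; residue_to_alignment_columns is a hypothetical helper name, not something from the scripts in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map 1-based residue positions of one gapped, aligned
# sequence to 1-based alignment columns. '-' and '.' are assumed to be gaps.
sub residue_to_alignment_columns {
    my ($aligned_seq, @positions) = @_;
    my %wanted = map { $_ => 1 } @positions;
    my @columns;
    my $residue = 0;
    for my $col (1 .. length $aligned_seq) {
        my $char = substr($aligned_seq, $col - 1, 1);
        next if $char eq '-' || $char eq '.';   # gaps do not count as residues
        $residue++;
        push @columns, $col if $wanted{$residue};
    }
    return @columns;
}

# Residues 4, 5 and 6 of this gapped sequence sit in columns 5, 6 and 8
my @cols = residue_to_alignment_columns('ME-KFG-MNF', 4, 5, 6);
print join(',', @cols), "\n";
```

The resulting column list can then be passed to either Perl script below as the columns argument.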

0

this post was cited in

https://www.sciencedirect.com/science/article/pii/S1055790319301903

Multiple auto- and allopolyploidisations marked the Pleistocene history of the widespread Eurasian steppe plant Astragalus onobrychis (Fabaceae)

ADD REPLY • link 6.1 years ago by Pierre Lindenbaum 166k

score 2 · Answer 1 · 2012-10-24

2

12.8 years ago

JC 13k

Another solution in Perl considering multi-line fasta:

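#!/usr/bin/perl
use strict;
use warnings;

# COLUMNS is a comma-separated list of 1-based column numbers
$ARGV[1] or die "use deleteAlignColumn.pl COLUMNS FASTA\n";
my @cols = split (/,/, $ARGV[0]);
# Delete from the highest column down so earlier deletions don't shift later positions
@cols = sort {$b <=> $a} @cols;
open (my $fa, '<', $ARGV[1]) or die "Can't open $ARGV[1]: $!\n";
$/ = "\n>";    # read one whole FASTA record per iteration
while (<$fa>) {
    s/>//g;
    my ($id, @seq) = split (/\n/, $_);
    my $seq = join "", @seq;
    foreach my $pos (@cols) {
        substr ($seq, $pos - 1, 1) = '';
    }
    print ">$id\n";
    # Wrap the output at 80 characters; test the length so a remaining "0" is not dropped
    while (length $seq) {
         print substr($seq, 0, 80);
         print "\n";
         substr($seq, 0, 80) = '';
    }
}
close $fa;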

Testing with your data:

$ perl deleteAlignColumn.pl 4,5,6 test.fa 
>a_1
MEKNFGGGPSKKDLL
>a_2
MEKNFGGGPSKKDLL
>a_3
MEKNFGGGPSKKDLL

ADD COMMENT • link 12.8 years ago by JC 13k

0

I like your use of substring.

ADD REPLY • link 12.8 years ago by Zev.Kronenberg 12k

0

How can I send the removed columns to another file? I tried to modify this script, but it only prints up to the column number I have given.

ADD REPLY • link 10.4 years ago by venu 7.1k
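
One way to approach the question above is to collect the removed residues alongside the kept ones instead of discarding them. A minimal sketch with 1-based column numbers; split_columns is a hypothetical helper, not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split one sequence into the residues that are kept
# and the residues that are removed, given 1-based column numbers.
sub split_columns {
    my ($seq, @cols) = @_;
    my %drop = map { $_ => 1 } @cols;
    my ($kept, $removed) = ('', '');
    for my $pos (1 .. length $seq) {
        my $char = substr($seq, $pos - 1, 1);
        if ($drop{$pos}) { $removed .= $char }
        else             { $kept    .= $char }
    }
    return ($kept, $removed);
}

my ($kept, $removed) = split_columns('MEKFGMNFGGGPSKKDLL', 4, 5, 6);
print "kept:    $kept\n";
print "removed: $removed\n";
```

To get two FASTA files, open two output filehandles and print ">$id" followed by the kept string to one and the removed string to the other for every record.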

score 1 · Answer 2 · 2012-10-24

If your sequences don't span multiple lines, this will work. If they do, you can change the input record separator ($/) so that each read returns a whole record.

The script:

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

#-----------------------------------------------------------------------------
#----------------------------------- MAIN ------------------------------------
#-----------------------------------------------------------------------------
my $usage = "

Synopsis:

REMOVE_COLUMNS_MULTI_FASTA -c 1,2,3,4 my.multi.fasta

Description:

-c is not zero based!

";

my ($help);
my $columns;
my $opt_success = GetOptions('help'      => \$help,
                             'columns=s' => \$columns );

die $usage if $help || ! $opt_success;

my $file = shift;
# Both the FASTA file and the -c column list are required
die $usage unless $file && defined $columns;

# Convert the 1-based column numbers to 0-based array indices
my @columns = map { $_ - 1 } split /,/, $columns;

open (my $IN, '<', $file) or die "Can't open $file for reading\n$!\n";

while (my $line = <$IN>) {
    chomp $line;
    if ($line =~ /^>/){
        print "$line\n"
    }
    else{
        my @seq = split //, $line;
        # Mark unwanted positions as undef (delete on array elements is discouraged)
        foreach my $c (@columns){
            $seq[$c] = undef if $c >= 0 && $c < @seq;
        }
        # Keep only the residues that were not marked
        my @p_seq = grep { defined } @seq;
        print join '', @p_seq;
        print "\n";
    }
}
close $IN;

Usage:

perl REMOVE_COLUMNS_MULTI_FASTA -c 1,5,18 example.fasta

Output:

>a_1
EKFMNFGGGPSKKDL
>a_2
EKFMNFGGGPSKKDL
>a_3
EKFMNFGGGPSKKDL
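
For the multi-line case mentioned at the top of this answer, changing the input record separator could look roughly like the sketch below. read_fasta_records is a hypothetical helper name, and the demo reads from an in-memory string so that it runs without an input file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle and return a list
# of [header, sequence] pairs, joining sequences that span several lines.
sub read_fasta_records {
    my ($fh) = @_;
    local $/ = "\n>";                    # one whole record per read
    my @records;
    while (my $record = <$fh>) {
        chomp $record;                   # strips a trailing "\n>" if present
        $record =~ s/^>//;               # the first record still has its '>'
        my ($header, @lines) = split /\n/, $record;
        next unless defined $header;
        push @records, [$header, join('', @lines)];
    }
    return @records;
}

# Demo on an in-memory file
my $fasta = ">a_1\nMEKFGMNFG\nGGPSKKDLL\n>a_2\nMEKFGMNFGGGPSKKDLL\n";
open(my $fh, '<', \$fasta) or die "Can't open in-memory FASTA: $!\n";
for my $rec (read_fasta_records($fh)) {
    print ">$rec->[0]\n$rec->[1]\n";
}
close $fh;
```

Each joined sequence can then go through the same column-removal step as the single-line version.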

score 1 · Answer 3 · 2012-10-24

Just for fun, a solution in R. Of course, Perl/Python is faster, but here is another option.

It assumes each sequence sits on a single line after its header. It is slow because of the for loop; that can be improved a bit, but it will still be slow.

# Usage : Rscript removeColFas.R file.fa 2,3,5,9

# reading file in
fas=read.csv(commandArgs(TRUE)[1],header=F,stringsAsFactors=FALSE)

# reading the fas columns to be removed
rem=as.numeric(c(strsplit((commandArgs(TRUE)[2]),',')[[1]]))

# splitting the characters and removing the non-desired columns
d=unlist(lapply(as.character(fas[seq(2,nrow(fas),by=2),]),function(x){paste(strsplit(x,'')[[1]][-rem],collapse='')}))

# reordering  the data back
count=0;for(i in seq(2,nrow(fas),by=2)){count=count+1;fas[i,]<-d[count]}

# writing the file out
write.table(fas,paste(commandArgs(TRUE)[1],"_edited.fa",sep=''),quote=FALSE,col.names=FALSE,row.names=FALSE)

Cheers