Hello, I'm a beginner of bioPerl, I have two sequence files in FASTA format, in which there are a lot of similar sequence. I want to extract only different sequences and eventually to know if they come from file1 or file2. Thanks fo help. Marco
Hello, I'm a beginner of bioPerl, I have two sequence files in FASTA format, in which there are a lot of similar sequence. I want to extract only different sequences and eventually to know if they come from file1 or file2. Thanks fo help. Marco
I am assuming that by similar sequence you mean same ID.
Bioperl or not, you need to go through both files, take a note of IDs, and then test if the ID was in one or two sequences and act accordingly.
If you can fit all the data in memory, the easiest and most perlish thing is to read everything, filename, ID and sequence, into a hash and then query the hash. If you expect to be dealing with huge data sets, it is best to leave sequences out and deal with IDs and filenames only. When you know the subset you are interested in, you can go through sequence files again. Alternatively, you can create a database out of sequences that allows random access to the data.
Show us some code and we'll guide you on.
diff <(cat file1.fasta | grep ">" | sort) <(cat file2.fasta | grep ">" | sort)
To the diff command add:
| grep "^<" | awk -F\> '{print $2}'
To the diff command add:
| grep "^>" | awk -F\> '{print $3}'
diff <(cat file1.fasta | grep ">" | sort) <(cat file2.fasta | grep ">" | sort) | grep "^>" | awk -F\> '{print $3}'
CPS1_PENJA, 623 bases, 9643C925 checksum.
Y5SS2_CAEEL, 623 bases, E99BFB checksum.
YUA6_CAEEL, 623 bases, A389AA05 checksum.
diff <(cat file1.fasta | grep ">" | sort) <(cat file2.fasta | grep ">" | sort) | grep "^<" | awk -F\> '{print $2}'
YSS2_CAEEL, 623 bases, E99BFB checksum.
The powerful command line tools are brought to you by GNU - the brilliant software written overwhelmingly in C an ported to all imaginable kernels: Solaris, OpenSolaris, OpenIndiana, FreeBSD, OpenBSD, NetBSD, Linux, Cygwin (Windows XP, Server, Vista, and 7).
Hm, nice use of process substitution, a bash trick that doesn't get nearly enough airtime. Some nits:
grep ^>
comm
is a cool tool for this sort of check:
comm -1 -2 <(grep '^>' file1.fasta | sort) <(grep '^>' file1.fasta | sort)
I'm not 100% clear on whether this is really what Marco wants to do.
Hm, nice use of process substitution, a bash trick that doesn't get nearly enough airtime. Some nits: you can just grep the files directly and skip the Useless Use Of Cat. Patterns that can be anchored should be: grep ^>. And finally, comm is a cool tool for this sort of check: comm -1 -2 <(grep '^>' file1.fasta | sort) <(grep '^>' file1.fasta | sort).
@adb, Some people may prefer cat file | grep
. It's more like LEGO blocks - you can replace the left part or the right part. Diff's <
and >
notation is pretty intuitive while comm's -1 -2 -3
arguments are confusing. On that point, I think you wanted to say comm -2 -3
(-3
suppressed lines that appear in both files).
Try putting sequences in one file as keys in a %hash
and then checking the second file with a:
if (!exists $hash{$sequence})
or something like that?
I am also a beginner and wanted to compare two fasta files, and keep ONLY the sequences that are not common. So I wrote that...:
#!/usr/bin/perl -w
#######################################################
# Author : Aurelie K
# date : 07 Apr 2011
# email : 4urelie.k@gmail.com
# Pupose : compare to fasta files and keep only sequences not in common (header name comparison, i.e. the entire line)
#####################################################
use warnings;
use Bio::Perl;
use Bio::Seq;
use Bio::SeqIO;
my $usage = "perl scriptname.pl inputfile1 inputfile2 outputfile\n";
if (@ARGV != 3) {
print "$usage\n";
exit;
}
my $file1 = shift or die "$usage\n";
my $file2 = shift or die "$usage\n";
my $outputfile = shift or die "$usage\n";
my $tempoutputfile = "$outputfile.temp" or die "$usage\n";
# Reading the first file and store it into a hash
####################################################
my $FastaFile1 = Bio::SeqIO->new(-file => $file1, -format => 'fasta', -alphabet => 'dna') or die "Failed to create SeqIO object from $file1 $!\n";
my %fastaH1 =();
while( my $seqFile1 = $FastaFile1->next_seq() ) {
unless (exists $fastaH1{$seqFile1->display_id."\t".$seqFile1->desc}) {
$fastaH1{$seqFile1->display_id."\t".$seqFile1->desc} = $seqFile1->seq; #key of the hash is fasta header (all line) and value is sequence.
}
}
# Opening output file
####################################################
my $output1 = Bio::SeqIO->new(-file => ">$tempoutputfile", -format => 'fasta') or die "Failed to create temp outputfile $tempoutputfile $!\n";
# Reading second file
# -> store it into a hash
# -> and write sequences of file2 that are not in file1
####################################################
my $FastaFile2 = Bio::SeqIO->new(-file => $file2, -format => 'fasta', -alphabet => 'dna') or die "Failed to create SeqIO object from $file2 $!\n";
my %fastaH2 =();
while( my $seqFile2 = $FastaFile2->next_seq() ) {
unless (exists $fastaH2{$seqFile2->display_id."\t".$seqFile2->desc}) {
$fastaH2{$seqFile2->display_id."\t".$seqFile2->desc} = $seqFile2->seq; #key of the hash is fasta header (all line) and value is sequence.
unless (exists $fastaH1{$seqFile2->display_id."\t".$seqFile2->desc}) {
$output1->write_seq($seqFile2);
}
}
}
# Write sequences of file1 that are not in file2
####################################################
open OUT, ">>$tempoutputfile" or die "Failed to open FH from $tempoutputfile $!\n";
foreach my $key (keys %fastaH1){
unless (exists $fastaH2{$key}) {
print OUT ">$key\n$fastaH1{$key}\n";
}
}
close OUT;
my $input = Bio::SeqIO->new(-file => $tempoutputfile, -format => 'fasta') or die "Failed to create inputfile from $tempoutputfile $!\n";
my $output2 = Bio::SeqIO->new(-file => ">$outputfile", -format => 'fasta') or die "Failed to create outputfile $outputfile $!\n";
# Rewrite sequences in fasta format...
while( my $seq = $input->next_seq() ) {
$output2->write_seq($seq);
}
unlink "$tempoutputfile";
print "\n--- Unique sequences are written in $outputfile ---\n\n";
exit;
I don't need to know where they are from, but you can probably modify the header to include this information...
I also needed the whole lines and not only the basic header (display_id would write only the first word)
Hope that can help you
Please, if you find a way to make this better, simpler, more efficient, tell me!
A friend found that on internet (very much simpler - you just need to concatenate your two fasta files before, and it will remove sequences which have same ID and same sequence):
use strict;
use Bio::SeqIO;
my %unique;
my $file = shift or die;
my $seqio = Bio::SeqIO->new(-file => $file, -format => "fasta");
my $outseq = Bio::SeqIO->new(-file => ">$file.uniq", -format => "fasta");
while(my $seqs = $seqio->next_seq) {
my $id = $seqs->display_id;
my $seq = $seqs->seq;
unless(exists($unique{$seq})) {
$outseq->write_seq($seqs);
$unique{$seq} +=1;
}
}
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Hi Marco! You're welcome to ask beginner questions here. If you show us what you've tried so far in code and/or thought, I'm sure someone will be willing and able to answer your question.
Hi Marco! You're welcome to ask beginner questions here. If you show us what you've tried so far in code and/or thought, I'm sure someone will be willing and able to answer your question. For instance, you might want to align your sequences first.
Can you clarify whether, for example, seqA in file1 is similar only to sequences in file1, or file2, or both?
Sai Liu asks: What's the meaning of similar sequence? You can give us some extract examples.
I have 2 file in which there are a lot of sequence, but most of them are the same. I want to exclude the identical sequence and extract only different sequence ant to know from which file (1 or 2) they came. for similar sequence I mean identical sequence with identical name, ID etc.