IDs that are in file1 but not if file2

Question

Print Different Id From Sequence Comparison Of Two Fasta Files

4

Entering edit mode

13.9 years ago

User 3566 ▴ 40

Hello, I'm a beginner of bioPerl, I have two sequence files in FASTA format, in which there are a lot of similar sequence. I want to extract only different sequences and eventually to know if they come from file1 or file2. Thanks fo help. Marco

sequence comparison fasta bioperl perl • 8.2k views

ADD COMMENT • link updated 13.9 years ago by Aleksandr Levchuk 3.2k • written 13.9 years ago by User 3566 ▴ 40

2

Entering edit mode

Hi Marco! You're welcome to ask beginner questions here. If you show us what you've tried so far in code and/or thought, I'm sure someone will be willing and able to answer your question.

ADD REPLY • link 13.9 years ago by Michael Schubert ★ 7.1k

0

Entering edit mode

Hi Marco! You're welcome to ask beginner questions here. If you show us what you've tried so far in code and/or thought, I'm sure someone will be willing and able to answer your question. For instance, you might want to align your sequences first.

ADD REPLY • link 13.9 years ago by Michael Schubert ★ 7.1k

0

Entering edit mode

Can you clarify whether, for example, seqA in file1 is similar only to sequences in file1, or file2, or both?

ADD REPLY • link 13.9 years ago by Neilfws 49k

0

Entering edit mode

Sai Liu asks: What's the meaning of similar sequence? You can give us some extract examples.

ADD REPLY • link 13.9 years ago by Neilfws 49k

0

Entering edit mode

I have 2 file in which there are a lot of sequence, but most of them are the same. I want to exclude the identical sequence and extract only different sequence ant to know from which file (1 or 2) they came. for similar sequence I mean identical sequence with identical name, ID etc.

ADD REPLY • link 13.9 years ago by User 3566 ▴ 40

score 4 · Answer 1 · 2011-03-14

I am assuming that by similar sequence you mean same ID.

Bioperl or not, you need to go through both files, take a note of IDs, and then test if the ID was in one or two sequences and act accordingly.

If you can fit all the data in memory, the easiest and most perlish thing is to read everything, filename, ID and sequence, into a hash and then query the hash. If you expect to be dealing with huge data sets, it is best to leave sequences out and deal with IDs and filenames only. When you know the subset you are interested in, you can go through sequence files again. Alternatively, you can create a database out of sequences that allows random access to the data.

Show us some code and we'll guide you on.

Ram · Answer 2 · 2011-04-10

2

Entering edit mode

13.9 years ago

Aleksandr Levchuk 3.2k

The Shell Solution

diff <(cat file1.fasta | grep ">" | sort)  <(cat file2.fasta | grep ">" | sort)

IDs that are in file1 but not if file2

To the diff command add:

| grep "^<" | awk -F\> '{print $2}'

IDs that are in file2 but not if file1

To the diff command add:

| grep "^>" | awk -F\> '{print $3}'

Examples<

diff <(cat file1.fasta | grep ">" | sort)  <(cat file2.fasta | grep ">" | sort) | grep "^>" | awk -F\> '{print $3}'

CPS1_PENJA, 623 bases, 9643C925 checksum.
Y5SS2_CAEEL, 623 bases, E99BFB checksum.
YUA6_CAEEL, 623 bases, A389AA05 checksum.

diff <(cat file1.fasta | grep ">" | sort)  <(cat file2.fasta | grep ">" | sort)  | grep "^<" | awk -F\> '{print $2}'

YSS2_CAEEL, 623 bases, E99BFB checksum.

Credits

The powerful command line tools are brought to you by GNU - the brilliant software written overwhelmingly in C an ported to all imaginable kernels: Solaris, OpenSolaris, OpenIndiana, FreeBSD, OpenBSD, NetBSD, Linux, Cygwin (Windows XP, Server, Vista, and 7).

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by Aleksandr Levchuk 3.2k

0

Entering edit mode

Hm, nice use of process substitution, a bash trick that doesn't get nearly enough airtime. Some nits:

"Useless Use of Cat Award": you can just grep the files directly.
patterns that can be anchored should be: grep ^>
comm is a cool tool for this sort of check: comm -1 -2 <(grep '^>' file1.fasta | sort) <(grep '^>' file1.fasta | sort)

I'm not 100% clear on whether this is really what Marco wants to do.

ADD REPLY • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by Adb • 0

0

Entering edit mode

Hm, nice use of process substitution, a bash trick that doesn't get nearly enough airtime. Some nits: you can just grep the files directly and skip the Useless Use Of Cat. Patterns that can be anchored should be: grep ^>. And finally, comm is a cool tool for this sort of check: comm -1 -2 <(grep '^>' file1.fasta | sort) <(grep '^>' file1.fasta | sort).

ADD REPLY • link 13.9 years ago by Adb • 0

0

Entering edit mode

@adb, Some people may prefer cat file | grep. It's more like LEGO blocks - you can replace the left part or the right part. Diff's < and > notation is pretty intuitive while comm's -1 -2 -3 arguments are confusing. On that point, I think you wanted to say comm -2 -3 (-3 suppressed lines that appear in both files).

ADD REPLY • link 13.9 years ago by Aleksandr Levchuk 3.2k

Ram · Answer 3 · 2011-03-18

0

Entering edit mode

13.9 years ago

Panagiotis Alexiou ▴ 220

Try putting sequences in one file as keys in a %hash and then checking the second file with a:

if (!exists $hash{$sequence})

or something like that?

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 13.9 years ago by Panagiotis Alexiou ▴ 220

Ram · Answer 4 · 2011-04-07

I am also a beginner and wanted to compare two fasta files, and keep ONLY the sequences that are not common. So I wrote that...:

#!/usr/bin/perl -w

#######################################################
# Author :  Aurelie K
# date   :  07 Apr 2011    
# email  :  4urelie.k@gmail.com
# Pupose :  compare to fasta files and keep only sequences not in common (header name comparison, i.e. the entire line)
#####################################################

use warnings;
use Bio::Perl;
use Bio::Seq;
use Bio::SeqIO;

my $usage = "perl scriptname.pl inputfile1 inputfile2 outputfile\n";

if (@ARGV != 3) {
    print "$usage\n";
    exit;
}

my $file1 = shift or die "$usage\n";
my $file2 = shift or die "$usage\n";
my $outputfile = shift or die "$usage\n";
my $tempoutputfile = "$outputfile.temp" or die "$usage\n";

# Reading the first file and store it into a hash
####################################################

my $FastaFile1 = Bio::SeqIO->new(-file => $file1, -format => 'fasta', -alphabet => 'dna') or die "Failed to create SeqIO object from $file1 $!\n";

my %fastaH1 =();
while( my $seqFile1 = $FastaFile1->next_seq() ) {
    unless (exists $fastaH1{$seqFile1->display_id."\t".$seqFile1->desc}) {
        $fastaH1{$seqFile1->display_id."\t".$seqFile1->desc} = $seqFile1->seq; #key of the hash is fasta header (all line) and value is sequence.
    }
}

# Opening output file
####################################################

my $output1 = Bio::SeqIO->new(-file => ">$tempoutputfile", -format => 'fasta') or die "Failed to create temp outputfile $tempoutputfile $!\n";

# Reading second file
# -> store it into a hash 
# -> and write sequences of file2 that are not in file1
####################################################

my $FastaFile2 = Bio::SeqIO->new(-file => $file2, -format => 'fasta', -alphabet => 'dna') or die "Failed to create SeqIO object from $file2 $!\n";

my %fastaH2 =();
while( my $seqFile2 = $FastaFile2->next_seq() ) {
    unless (exists $fastaH2{$seqFile2->display_id."\t".$seqFile2->desc}) {
        $fastaH2{$seqFile2->display_id."\t".$seqFile2->desc} = $seqFile2->seq; #key of the hash is fasta header (all line) and value is sequence.
        unless (exists $fastaH1{$seqFile2->display_id."\t".$seqFile2->desc}) {
            $output1->write_seq($seqFile2);
        }
    }
}

# Write sequences of file1 that are not in file2 
####################################################

open OUT, ">>$tempoutputfile" or die "Failed to open FH from $tempoutputfile $!\n";
foreach my $key (keys %fastaH1){
    unless (exists $fastaH2{$key}) {
        print OUT ">$key\n$fastaH1{$key}\n";
    }
}
close OUT;

my $input = Bio::SeqIO->new(-file => $tempoutputfile, -format => 'fasta') or die "Failed to create inputfile from $tempoutputfile $!\n";
my $output2 = Bio::SeqIO->new(-file => ">$outputfile", -format => 'fasta') or die "Failed to create outputfile $outputfile $!\n";

# Rewrite sequences in fasta format...
while( my $seq = $input->next_seq() ) {
    $output2->write_seq($seq);
}

unlink "$tempoutputfile";

print "\n--- Unique sequences are written in $outputfile ---\n\n";

exit;

I don't need to know where they are from, but you can probably modify the header to include this information...

I also needed the whole lines and not only the basic header (display_id would write only the first word)

Hope that can help you

Please, if you find a way to make this better, simpler, more efficient, tell me!

Ram · Answer 5 · 2011-04-08

A friend found that on internet (very much simpler - you just need to concatenate your two fasta files before, and it will remove sequences which have same ID and same sequence):

use strict;
use Bio::SeqIO;
my %unique;

my $file = shift or die;
my $seqio  = Bio::SeqIO->new(-file => $file, -format => "fasta");
my $outseq = Bio::SeqIO->new(-file => ">$file.uniq", -format => "fasta");

while(my $seqs = $seqio->next_seq) {
    my $id  = $seqs->display_id;
    my $seq = $seqs->seq;
    unless(exists($unique{$seq})) {
        $outseq->write_seq($seqs);
        $unique{$seq} +=1;
    }
}