Identify Duplicates In Two Fasta Files But Not Merge Them
2
1
Entering edit mode
10.7 years ago
gtho123 ▴ 260

I have two relatively large FASTA files of ESTs that have similar sequences in them, but they have different IDs. I wish to cycle through every sequence in file A and find the corresponding sequence in file B if it exists.

However I do not wish to merge the files, merely identify corresponding sequences and extract the respective IDs for each file. This way I could analyse each file separately and then compare them easily.

I realize I could do reciprocal BLASTs and take the top hits that agree to create a table, but not wanting to reinvent the wheel I wish to know whether there is an existing program or script that would do this. Any ideas?

Any help would be greatly appreciated.

EDIT: I should clarify. My two files of ESTs come from different sources. I know one is really a whole lot of ESTs assembled together into "putative transcripts". Therefore I do not feel comfortable assuming that the similar sequences are the same length and start and end in the same points in the sequence. I therefore need a method that can identify the similar sequences given these constraints.

duplicates fasta • 4.7k views
ADD COMMENT
2
Entering edit mode
10.7 years ago

I just wrote these one-liners for you:

When exact match is required

When inexact match is required

In such case, probably the only solution I can think of is to cluster your sequences together.

alt text

I have a python package that has an embarassingly-parallel (multithreaded + utilises streaming SIMD extensions) implementation for qgram-based edit distance measurement and is useful for hierarchical-clustering of DNA sequences assuming lengths are SOMEWHAT similar

You use hclust.py script as follows:

$ ./hclust.py -f test.fasta -t 32
Clust_1 C8_Fusobacteriaceae;C28_Fusobacteriaceae;C106_Fusobacteriaceae;C121_Fusobacteriaceae;C146_Fusobacteriaceae;C180_Fusobacteriaceae
Clust_2 C3_Verrucomicrobiaceae;C47_Verrucomicrobiaceae;C48_Verrucomicrobiaceae;C126_Verrucomicrobiaceae;C134_Verrucomicrobiaceae;C162_Verrucomicrobiaceae
Clust_3 C31_Planctomycetaceae;C137_Planctomycetaceae;C142_Planctomycetaceae
Clust_4 C43_Porphyromonadaceae;C55_Porphyromonadaceae;C159_Porphyromonadaceae;C160_Porphyromonadaceae
Clust_5 C4_Bacteroidaceae;C11_Bacteroidaceae;C51_Bacteroidaceae;C54_Bacteroidaceae;C64_Bacteroidaceae;C85_Bacteroidaceae;C94_Bacteroidaceae;C98_Bacteroidaceae;C103_Bacteroidaceae;C109_Bacteroidaceae;C131_Bacteroidaceae;C132_Bacteroidaceae;C154_Bacteroidaceae;C175_Bacteroidaceae;C185_Bacteroidaceae;C191_Bacteroidaceae
Clust_6 C13_Chlorobiaceae;C22_Chlorobiaceae;C26_Chlorobiaceae;C30_Chlorobiaceae;C46_Chlorobiaceae;C59_Chlorobiaceae;C66_Chlorobiaceae;C70_Chlorobiaceae;C104_Chlorobiaceae;C111_Chlorobiaceae;C135_Chlorobiaceae;C143_Chlorobiaceae;C188_Chlorobiaceae
Clust_7 C124_Spirochaetaceae;C169_Spirochaetaceae

It takes a -c cutoff switch so if you use -c 1 then it will cluster all the sequences with an edit distance of 1. You may adjust -c parameter to fulfill your needs

You can also use Robert Edgar's usearch.

./usearch7.0.1001_i86linux32 -usearch_global f1.fasta -db f2.fasta -strand plus -id 0.97 -uc map.uc

It will cluster sequences from f1.fasta around sequences in f2.fasta based on given identity (0.97) i.e. 97% similarity. You can adjust the parameter -id accordingly. The following clustering will be generated in map.uc file. Check my tutorial (UPARSE section) on how to manipulate this file

Best Wishes, Umer

ADD COMMENT
0
Entering edit mode

This seems a very elegant way to identify duplicates. However I can't be certain my two files contain exact duplicates in terms of length.

ADD REPLY
0
Entering edit mode

I have updated my original post to answer this issue!

ADD REPLY
0
Entering edit mode
10.7 years ago
Prakki Rama ★ 2.7k

I tried the following perl script,

used How To Remove The Same Sequences In The Fasta Files?'s script according to my needs.

   ####-----Reformatted the fasta file into one line fasta format first---###
   sed -e '/^>/s/$/@/' -e 's/^>/#/' a1.fa | tr -d '\n'|tr "#" "\n"| tr "@" "\t" |sort -u -t '     ' -f -k 2,2 |sed '/^$/d'|sed -e 's/^/>/' -e 's/\t/\n/' >re_a1.fasta
   sed -e '/^>/s/$/@/' -e 's/^>/#/' a2.fa | tr -d '\n'|tr "#" "\n"| tr "@" "\t" |sort -u -t '     ' -f -k 2,2 |sed '/^$/d'|sed -e 's/^/>/' -e 's/\t/\n/' >re_a2.fasta

SCRIPT

open FH,"re_a1.fasta";
%Hash=();
while(<FH>)
{
    $header1=$_;
    $seq1=<FH>;
    $Hash{$header1}="$seq1";
}    

    open HF,"re_a2.fasta";
    while(<HF>)
    {
        $header2=$_;
        $seq2=<HF>;
        while(($key,$value) = each (%Hash))
        {
            if($value eq $seq2)
        {
        chomp($key);chomp($header2);
        $key =~ s/>//;  ##remove >
        $header2 =~ s/>//; ##remove >
        print "\"$key\" in FILE1 and \"$header2\" in FILE2 are exactly the same\n";
        }
        }
    }
    close(FH);
    close(HF);



 ___FILE1_____
    >1
    ATGCTCGTATATGCGATCGA
    >2
    GATGCTAGTGCATGTATATA
    >3
    TATCTAGCTAGATAGCTAGTA

     ___FILE2_____
    >seq1
   GATGCTAGTGCATGTATATA
   >seq3
   GCTGATCCACAGATCGGCT
   >seq2
   GCTGCATGCTGTATGCGCCGTAGTC

OUTPUT

 "2" in FILE1 and "seq1" in FILE2 are exactly the same
ADD COMMENT
0
Entering edit mode

Interesting. While sequences in my two files are similar, I cannot assume that they are the same length. Consequently they may not start and end in the same place. Perhaps I was not clear enough, if so I apologize. However I need something a little more sensitive than the "eq" operator. If I could come up with something that could do this and swap it wit the "eq" operator then this script would be great.

ADD REPLY
0
Entering edit mode

In the case you do not have sequences of same length and you are not expecting 100% match, then i suggest trying CD-HIT-EST from weizhong lab, where you can vary the parameters such as length of shorter sequence and its identity when matched to longer sequence. And as Umer has suggested Usearch is also helpful.

ADD REPLY

Login before adding your answer.

Traffic: 2065 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6