I have two relatively large FASTA files of ESTs that contain similar sequences under different IDs. I want to go through every sequence in file A and find the corresponding sequence in file B, if one exists.
However, I do not wish to merge the files, merely to identify corresponding sequences and extract their respective IDs from each file. This way I could analyse each file separately and then compare the results easily.
I realize I could run reciprocal BLASTs and build a table from the top hits that agree, but rather than reinvent the wheel, I would like to know whether an existing program or script already does this. Any ideas?
Any help would be greatly appreciated.
EDIT: I should clarify. My two files of ESTs come from different sources, and I know that one of them is really a large number of ESTs assembled into "putative transcripts". I am therefore not comfortable assuming that similar sequences are the same length or that they start and end at the same positions. I need a method that can identify the similar sequences given these constraints.
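For reference, the reciprocal-BLAST table I have in mind could be built roughly as follows. This is only a minimal sketch, assuming both searches (A against B, and B against A) were saved in BLAST+ tabular format (`-outfmt 6`) with the default column order, so that the bit score is the 12th column. The function names and the file paths are hypothetical, and "top hit" here simply means the highest bit score per query.

```python
from typing import Dict, Iterable, List, Tuple


def best_hits(lines: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Map each query ID to its highest-bit-score subject ID.

    Assumes default BLAST+ tabular columns:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    best: Dict[str, Tuple[str, float]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # Keep only the strongest hit seen so far for this query.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def reciprocal_best_hits(
    a_vs_b: Iterable[str], b_vs_a: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return (A_id, B_id) pairs where each is the other's top hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    pairs = [
        (a_id, b_id)
        for a_id, (b_id, _) in best_ab.items()
        if b_id in best_ba and best_ba[b_id][0] == a_id
    ]
    return sorted(pairs)


def rbh_from_files(path_ab: str, path_ba: str) -> List[Tuple[str, str]]:
    """Convenience wrapper; the two paths are hypothetical result files."""
    with open(path_ab) as ab, open(path_ba) as ba:
        return reciprocal_best_hits(ab, ba)
```

Calling `rbh_from_files("a_vs_b.tsv", "b_vs_a.tsv")` would then give a list of ID pairs that can be written out as a two-column table. Note that this strict filter keeps at most one partner per sequence, so if several ESTs in one file correspond to a single assembled transcript in the other, only one of them would survive.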
This seems a very elegant way to identify duplicates. However, I can't be certain that my two files contain exact duplicates, particularly in terms of length.
I have updated my original post to address this issue!