Hallo folks.
I am trying to compare fasta files of predicted miRNAs with one another. This problem does not seem to be very hard for someone more skilled in bioinformatics than me but after hours of forum searches I could not solve the problem.
As an example I have 3 fasta files like:
File_1:
>miRNA1
AGTCGTCA
>miRNA2
AGTTCGTT
File_2:
>miRNA1
TTGACT
>miRNA2
AGTCGTCA
File_3:
>miRNA3
TTACATT
>miRNA5
TTGACT
From these files I would like to get a table that shows the presence or absence of each sequence. As you can see the same sequence could have a different ID. In the best case even if there is one mismatch between the sequences they should be counted as equal.
I am sorry to ask such a simple question but i couldnt figure it out by myself. So every help is welcome.
The closest solution i found so far is: Count Sequences In Multi Fasta File .
The script Kenosis provided does a part of the job:
"
use strict;
use warnings;
use File::Basename;
use Sort::Naturally;
local $/ = '>';
my ( %files, %query, %omit, @F );
$omit{$_}++ for ( $ARGV[0], basename $0);
while (<>) {
chomp;
$query{ $F[0] } = $F[1] if @F = split;
}
local $/ = "\n";
my @files = nsort @ARGV = grep !$omit{$_}, <*>;
while (<>) {
next if /^>/;
chomp;
$files{$_}{$ARGV}++;
}
print join( "\t", 'tag', 'seq', @files ), "\n";
for my $key ( nsort keys %query ) {
my @seqCount;
push @seqCount, $files{ $query{$key} }->{$_} || 0 for @files;
print join( "\t", $key, $query{$key}, @seqCount ), "\n";
}
"
But unfortunately I am unable to adept it to my problem. Because the script compares the sequences by ID and does not allow mismatches it produces many duplicate sequences in the output.
Are these long sequences? How many sequences per file? Are they in single-line format, or multi-line?
The sequences are short <30bp and the files are in single line format.
When applying the 1 mismatch between sequences, does this count for length too?
Yes, this would be perfect.
I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:
I usually like blastclust which is now a legacy software but this post recommending cd-hit might be useful How to cluster sequences present in multiple fasta files
Thx for the help. The cd-hit webserver seems to take forever for a small job and can only handle two files. So I dont think this will solve my problem