Comparison between multiple fasta files in terms of presence or absence of certain sequences
2
0
Entering edit mode
7.4 years ago

Hallo folks.

I am trying to compare fasta files of predicted miRNAs with one another. This problem does not seem to be very hard for someone more skilled in bioinformatics than me but after hours of forum searches I could not solve the problem.

As an example I have 3 fasta files like:

File_1:

>miRNA1 
AGTCGTCA
>miRNA2 
AGTTCGTT

File_2:

>miRNA1
TTGACT
>miRNA2
AGTCGTCA

File_3:

>miRNA3
TTACATT
>miRNA5
TTGACT

From these files I would like to get a table that shows the presence or absence of each sequence. As you can see the same sequence could have a different ID. In the best case even if there is one mismatch between the sequences they should be counted as equal.

I am sorry to ask such a simple question but i couldnt figure it out by myself. So every help is welcome.

The closest solution i found so far is: Count Sequences In Multi Fasta File .

The script Kenosis provided does a part of the job:

"

use strict;
use warnings;
use File::Basename;
use Sort::Naturally;

local $/ = '>';
my ( %files, %query, %omit, @F );
$omit{$_}++ for ( $ARGV[0], basename $0);

while (<>) {
    chomp;
    $query{ $F[0] } = $F[1] if @F = split;
}

local $/ = "\n";
my @files = nsort @ARGV = grep !$omit{$_}, <*>;

while (<>) {
    next if /^>/;
    chomp;
    $files{$_}{$ARGV}++;
}

print join( "\t", 'tag', 'seq', @files ), "\n";

for my $key ( nsort keys %query ) {
    my @seqCount;
    push @seqCount, $files{ $query{$key} }->{$_} || 0 for @files;
    print join( "\t", $key, $query{$key}, @seqCount ), "\n";
}

"

But unfortunately I am unable to adept it to my problem. Because the script compares the sequences by ID and does not allow mismatches it produces many duplicate sequences in the output.

sequence RNA-Seq alignment • 4.8k views
ADD COMMENT
1
Entering edit mode

Are these long sequences? How many sequences per file? Are they in single-line format, or multi-line?

ADD REPLY
0
Entering edit mode

The sequences are short <30bp and the files are in single line format.

ADD REPLY
1
Entering edit mode

When applying the 1 mismatch between sequences, does this count for length too?

ADD REPLY
0
Entering edit mode

Yes, this would be perfect.

ADD REPLY
1
Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY
0
Entering edit mode

I usually like blastclust which is now a legacy software but this post recommending cd-hit might be useful How to cluster sequences present in multiple fasta files

ADD REPLY
0
Entering edit mode

Thx for the help. The cd-hit webserver seems to take forever for a small job and can only handle two files. So I dont think this will solve my problem

ADD REPLY
4
Entering edit mode
7.4 years ago
st.ph.n ★ 2.7k

Others may have command-line solutions. Here's one in python, untested, that I think should do what you need.

Fasta files:

one.fasta

>miRNA1
AGTCGTCA
>miRNA2
AGTTCGTT
>miRNA3
AGTCGTGA

two.fasta

>miRNA1
TTGACT
>miRNA2
AGTCGTCA

three.fasta

>miRNA3
TTACATT
>miRNA5
TTGACT
>miRNA3
AGTCGTGA

First, make a file with all the unique sequences from each of your fasta files:

cat file1.fasta file2.fasta file3.fasta | grep -v '>' | sort | uniq > all.txt

Save this code as comp_seqs.py , and run as python comp_seqs.py > output.xls in the location that has the three fasta files, and the new file with all the sequences:

#!/usr/bin/env python
import regex
from collections import defaultdict
from itertools import izip

def open_fasta(file):
        """ Open fasta file, and store in dictionary, as {'sequence:header'} """
        s = {}
        with open(file, 'r') as fasta:
                for line in fasta:
                        if line.startswith('>'):
                                h = line.strip().split('>')[1]
                                n = next(fasta).strip()
                                s[n] = h
        return s

def hamming_distance(s1, s2):
    """ Count number of mismatched characters in equal length strings. """
    if len(s1) != len(s2): raise ValueError('string lengths do not match')
    return sum(a != b for a, b in izip(s1, s2))


def find_seqs(fh, p):
        """ Find sequences from 'all sequence list' that either match as identical sequence, or mismatch of 1, store as defaultdict [{sequence:[id1, id2, id3...]}] """
        seqs = open_fasta(fh)
        for i in seqs:
                if i in x:
                        match[i][p] = seqs[i]
                else:
                        for j in x:
                                if len(i) == len(j):
                                        if hamming_distance(i, j) <= 1:
                                                match[i][p] = seqs[i] + '+1'

# Open concated fasta file, containing unique sequences
with open('all.txt', 'r') as a:
        x = [line.strip() for line in a]

# empty defaultdict for reulsts
match = defaultdict(list)

for i in x:
        for n in range(3):
                match[i].append('')


# Open each fasta, and search for sequences in 'all sequence list'
find_seqs('one.fasta', 0)
find_seqs('two.fasta', 1)
find_seqs('three.fasta', 2)

# Print results in tab delimited format
print 'Sequence\tFile1_ID\tFile2_ID\tFile3_ID'
for m in match:
        print m, '\t', '\t'.join(match[m])

Output from example above:

Sequence    File1_ID    File2_ID    File3_ID
AGTCGTCA    miRNA1  miRNA2  
TTGACT      miRNA1  miRNA5
AGTTCGTT    miRNA2      
AGTCGTGA    miRNA3      miRNA3
TTACATT             miRNA3
ADD COMMENT
0
Entering edit mode

Thank you for your help. The problem is it is not working properly. The output it produces is:

defaultdict(<type 'list'>, {'AGTCGTCA': ['miRNA1', 'miRNA2'], 'AGTTCGTT': ['miRNA2'], 'TTACATT': ['miRNA3'], 'TTGACT': ['miRNA1', 'miRNA5']})
Sequence    File1 ID    File2 ID    File3 ID
AGTCGTCA    miRNA1  miRNA2
AGTTCGTT    miRNA2
TTACATT miRNA3
TTGACT  miRNA1  miRNA5

So all the ID are in the left most column.

I tried to change the script so it produces a csv-file with this result:

defaultdict(<type 'list'>, {'AGTCGTCA': ['miRNA1', 'miRNA2'], 'AGTTCGTT': ['miRNA2'], 'TTACATT': ['miRNA3'], 'TTGACT': ['miRNA1', 'miRNA5']})
Sequence,File1 ID,File2 ID,File3 ID
AGTCGTCA,miRNA1,miRNA2
AGTTCGTT,miRNA2
TTACATT,miRNA3
TTGACT,miRNA1,miRNA5

Thanks again for your efforts. But the script does not sort the IDs to the right column, or am I doing something wrong?

But dont get me wrong, this script is already helping a lot by filtering out nearly duplicate sequences. I only wish the IDs would be in a column for each file so I could figure out how many and which sequences are present in which file.

ADD REPLY
1
Entering edit mode

@Bernhard I edited the code above, and added the example files I'm using. I am also now piping the output to a output.xls, which by the extension will open in excel. Let me know if this is in working order, and if so, please accept the answer.

ADD REPLY
0
Entering edit mode

Thank you so much! This is perfect^^

ADD REPLY
1
Entering edit mode
7.4 years ago
TriS ★ 4.7k

two approaches you could try are in perl and unix perl: from previous question in biostar UNIX: AWK solution

ADD COMMENT
0
Entering edit mode

These solutions are a bit over my head. I am not good at programing. I am able to use a script or to combine several scripts and programs. But unfortunately I only understand the simplest scripts. Thx for helping.

ADD REPLY
0
Entering edit mode

You can try using Sort in MS Excel.

ADD REPLY
0
Entering edit mode

Yes, I could do this. But I have files with hundreds of sequences and to compare them by hand takes ages. That is why I would like to find an automated solution.

ADD REPLY

Login before adding your answer.

Traffic: 2209 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6