compare two fasta files
5
0
Entering edit mode
7.1 years ago

hello everyone, I want to compare 2 fasta files and discard matching sequences and print out the rest . for example file 1 :

>3-23985
UUCCACAGCUUUCUUGAACU
>9-10946
UGAAGCUGCCAGCAUGAUCU
>15-8830
UUAGAUUCACGCACAAACUCG
>17-8260
UUCCACAGCUUUCUUGAACUU
>35-5014
UUCCACAGCUUUCUUGAACC

file 2

>3-23985
UUCCACAGCUUUCUUGAACU
>15-8830
UUAGAUUCACGCACAAACUCG
>35-5014
UUCCACAGCUUUCUUGAACC

Desired output :

>9-10946
UGAAGCUGCCAGCAUGAUCU
>17-8260
UUCCACAGCUUUCUUGAACUU

Please do help me on this regard

Thank you

biopython awk bioawk fasta • 2.2k views
ADD COMMENT
2
Entering edit mode
7.1 years ago

Assuming there are only two lines per sequence and . (else use this).

cat   in1.fa  in2.fa | paste - - | sort | uniq -u | tr "\t" "\n"
ADD COMMENT
1
Entering edit mode
$ cat test1.fa test2.fa | paste --  | sort | uniq -u | tr "\t" "\n"
>17-8260
>9-10946
UGAAGCUGCCAGCAUGAUCU
UUCCACAGCUUUCUUGAACUU

.

$ cat test1.fa test2.fa | paste - - | sort | uniq -u | tr  '\t' '\n'
>17-8260
UUCCACAGCUUUCUUGAACUU
>9-10946
UGAAGCUGCCAGCAUGAUCU

.

 $ uname -a
Linux name 4.8.0-53-generic #56~16.04.1-Ubuntu SMP Tue May 16 01:18:56 UTC 2017 x86_64 GNU/Linux
 $ tr --version
tr (GNU coreutils) 8.28
Copyright (C) 2017 Free Software Foundation, Inc.
ADD REPLY
0
Entering edit mode

this is not

paste --

but

paste - -
ADD REPLY
1
Entering edit mode
7.1 years ago
venu 7.1k

From the example, I'm assuming, you have single line fasta (if not linearise) and you would like to extract unique fasta entries from file1 compared to file2

grep -F -x -v -f <(grep '^>' f2.txt) <(grep '^>' f1.txt) | while read -r ID; do grep "$ID" -A 1 f1.txt ; done

You can change f1.txt and f2.txt places if you want the results in otherway around.

ADD COMMENT
1
Entering edit mode
7.1 years ago
$ grep ">" -h test2.fa test1.fa | sort | uniq -u | grep  --no-group-separator -A 1 -f - test1.fa

output:

>9-10946
UGAAGCUGCCAGCAUGAUCU
>17-8260
UUCCACAGCUUUCUUGAACUU

input:test1.fa and test2.fa

test1.fa

$ cat test1.fa 
>3-23985
UUCCACAGCUUUCUUGAACU
>9-10946
UGAAGCUGCCAGCAUGAUCU
>15-8830
UUAGAUUCACGCACAAACUCG
>17-8260
UUCCACAGCUUUCUUGAACUU
>35-5014
UUCCACAGCUUUCUUGAACC

test2.fa

$ cat test2.fa 
>3-23985
UUCCACAGCUUUCUUGAACU
>15-8830
UUAGAUUCACGCACAAACUCG
>35-5014
UUCCACAGCUUUCUUGAACC
ADD COMMENT
1
Entering edit mode
7.1 years ago

Here's a Python-based way to do this, that does not use sorting:

#!/usr/bin/env python

import sys

if len(sys.argv) != 3:
    raise SystemError("Usage: ./filter_fasta.py target.fa query.fa")

target = sys.argv[1]
query = sys.argv[2]

m = {}
k = None
v = ''

with open(query, "r") as qfh:
    for line in qfh:
        line = line.strip()
        if line.startswith('>'):
            if k:
                m[k] = v
                v = ''
                k = line
            else:
                v += line
    m[k] = v

k = None
v = ''

with open(target, "r") as rfh:
    for line in rfh:
        line = line.strip()
        if line.startswith('>'):
            if k and (k not in m):
                sys.stdout.write("%s\n%s\n" % (k, v))
            k = line
            v = ''
        else:
            v += line
    if k and (k not in m):
        sys.stdout.write("%s\n%s\n" % (k, v))

Output:

$ ./filter_fasta.py f1.fa f2.fa 
>9-10946
UGAAGCUGCCAGCAUGAUCU
>17-8260
UUCCACAGCUUUCUUGAACUU

You could probably use the other approaches if your inputs are small or don't need much preprocessing.

Here are a couple advantages of my approach:

  1. Other approaches require sorting. For large datasets, sorting can get expensive in time. My approach reads through each file once to make hash tables ("dictionaries" in Python-speak), and as O(n+m) < O(nlogn + mlogm) you'll end up spending a lot less time making hash tables than on sorting, if your inputs are very large.

  2. You don't need to linearize the FASTA file inputs. This script takes in multiline FASTA.

ADD COMMENT
0
Entering edit mode
7.1 years ago

Dear all.. Thank you so much for your prompt reply ... It really helped me in my work...

ADD COMMENT

Login before adding your answer.

Traffic: 1896 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6