$ uname -a
Linux name 4.8.0-53-generic #56~16.04.1-Ubuntu SMP Tue May 16 01:18:56 UTC 2017 x86_64 GNU/Linux
$ tr --version
tr (GNU coreutils) 8.28
Copyright (C) 2017 Free Software Foundation, Inc.
From the example, I'm assuming you have single-line FASTA (if not, linearise it first) and that you would like to extract the FASTA entries that are unique to file1 compared to file2:
grep -F -x -v -f <(grep '^>' f2.txt) <(grep '^>' f1.txt) | while read -r ID; do grep -F -x -A 1 "$ID" f1.txt ; done
You can swap f1.txt and f2.txt if you want the results the other way around.
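If your FASTA files are not already single-line, here is a minimal linearisation sketch using awk (multi.fa and single_line.fa are placeholder names, not files from the question):

awk '/^>/ { if (seq != "") print seq; print; seq = "" } /^[^>]/ { seq = seq $0 } END { if (seq != "") print seq }' multi.fa > single_line.fa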
Here's a Python-based way to do this that does not use sorting:
#!/usr/bin/env python
import sys

if len(sys.argv) != 3:
    raise SystemExit("Usage: ./filter_fasta.py target.fa query.fa")

target = sys.argv[1]
query = sys.argv[2]

# Build a dictionary mapping header line -> concatenated sequence
# for every record in the query file
m = {}
k = None
v = ''
with open(query, "r") as qfh:
    for line in qfh:
        line = line.strip()
        if line.startswith('>'):
            if k:
                m[k] = v
            v = ''
            k = line
        else:
            v += line
# Flush the last query record
if k:
    m[k] = v
k = None
v = ''

# Stream through the target file and print any record whose header
# is not found in the query dictionary
with open(target, "r") as rfh:
    for line in rfh:
        line = line.strip()
        if line.startswith('>'):
            if k and (k not in m):
                sys.stdout.write("%s\n%s\n" % (k, v))
            k = line
            v = ''
        else:
            v += line
# Flush the last target record
if k and (k not in m):
    sys.stdout.write("%s\n%s\n" % (k, v))
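Assuming the script is saved as filter_fasta.py and made executable, a usage sketch with the file names from the grep answer (the output name is a placeholder) would be:

./filter_fasta.py f1.txt f2.txt > f1_unique.fa

This prints the records of f1.txt whose header lines do not appear in f2.txt.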
You could probably use the other approaches if your inputs are small or don't need much preprocessing.
Here are a couple of advantages of my approach:
Other approaches require sorting, and for large datasets sorting can get expensive in time. My approach reads through each file once, building a hash table ("dictionary" in Python-speak) of the query headers, and since O(n + m) < O(n log n + m log m), you'll end up spending a lot less time building and probing the hash table than you would on sorting, if your inputs are very large.
You don't need to linearize the FASTA file inputs. This script takes in multiline FASTA.
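For example, with two small hypothetical files (not taken from the question), a multi-line record is matched on its header line and printed with its sequence joined onto one line:

$ cat target.fa
>seq1
ACGT
ACGT
>seq2
TTTT
$ cat query.fa
>seq2
TTTT
$ ./filter_fasta.py target.fa query.fa
>seq1
ACGTACGT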