Hi everyone; this is my first question on the forum.
How can I compare if two fasta files contain the same sequence headers?
Does any BioPython module exist for doing this?
Thanks in advance, peixe
Hi everyone; this is my first question on the forum.
How can I compare if two fasta files contain the same sequence headers?
Does any BioPython module exist for doing this?
Thanks in advance, peixe
In BioPython you could do the following:
from Bio import SeqIO
def get_ids(fname):
reader = SeqIO.parse(fname, 'fasta')
ids = map(lambda x: x.id, reader)
return set(ids)
s1 = get_ids('f1.fasta')
s2 = get_ids('f2.fasta')
print 'Common:', s1 & s2
print 'In f1 and not f2:', s1 - s2
print 'In f2 and not f1:', s2 - s1
prints for my testcase:
Common: set(['a', 'c', 'b'])
In f1 and not f2: set(['x'])
In f2 and not f1: set(['y'])
This works if you add the 'fasta' format information to the Seq.IO.parse call:
s1 = setrec.id for rec in SeqIO.parse('f1.fasta', 'fasta'))
s2 = setrec.id for rec in SeqIO.parse('f2.fasta', 'fasta'))
However, an advantage of using the function that Istvan mentioned above is that you can create a function to extract part of the fast description (such as the accession number) with his method that you cannot with this shorter form. For example:
from Bio import SeqIO
def parse_accession(record_id):
data = record_id.split('|')
return data[3]
def get_ids(fname):
reader = SeqIO.parse(fname, 'fasta')
ids = map(lambda x: parse_accessionx.id), reader)
return set(ids)
s1 = get_ids('f1.fasta')
s2 = get_ids('f2.fasta')
print 'Common:', s1 & s2
print 'In f1 and not f2:', s1 - s2
print 'In f2 and not f1:', s2 - s1
Assuming you have two files (file1.fa and file2.fa) with multiple sequences, you can do something like this:
cat file1.fa | tr "[a-z]" "[A-Z]" | sed "s/^>.*/;/" | tr -d "\n" | tr ";" "\n" | sort > file1.txt
cat file2.fa | tr "[a-z]" "[A-Z]" | sed "s/^>.*/;/" | tr -d "\n" | tr ";" "\n" | sort > file2.txt
diff file1.txt file2.txt
A brief explanation:
- tr "[a-z]" "[A-Z]" : Transform letters to upper case
- sed "s/^>.*/;/" : Transform fasta annotations to a semicolon
- tr -d "\n" : delete all newline characters
- tr ";" "\n" : transform semicolons to newlines
At this point you get a file with one fasta sequence per line. Then just sort and compare.
linearize the fasta as a tab delimited file, remove the empty (first) line, sort on the sequence column. Then join on the second column and cut to only keep the name having the same sequence.
cat file1.fasta |\
awk '/^>/ {printf("\n%s\t",$0); next;} {printf("%s",toupper($0));} END{ printf("\n");}' |\
egrep -v '^$' |\
sort -t ' ' -k2,2 > file1.tsv
cat file2.fasta |\
awk '/^>/ {printf("\n%s\t",$0); next;} {printf("%s",toupper($0));} END{ printf("\n");}' |\
egrep -v '^$' |\
sort -t ' ' -k2,2 > file2.tsv
join -t ' ' -1 2 -2 2 file1.tsv file2.tsv | cut -d ' ' -f 2,3
It can be more shorter if the header is case sensitive:
awk '/^>/{if (a[$1]>=1){print $1}a[$1]++}' file1 file2
cmp <( grep '^>' file1 ) <( grep '^>' file2 )
This requires bash (not tcsh), and only tells you if the sequence headers are exactly the same in exactly the same order. Substitute "diff" for "cmp" to get a list of differences, but that will eat memory if there are lots of sequences (say, 100s of thousands). If the order shouldn't matter, sort the intermediate streams:
cmp <( grep '^>' file1 | sort ) <( grep '^>' file2 | sort )
(Also note that Fasta is a remarkably cumbersome format to use. You might be served better with BAM/SAM.)
from Bio import SeqIO
def get_ids(fname):
reader = SeqIO.parse(fname, 'fasta')
ids = map(lambda x: x.id, reader)
return set(ids)
s1 = get_ids('f1.fasta')
s2 = get_ids('f2.fasta')
print 'Common:', s1 & s2
print 'In f1 and not f2:', s1 - s2
print 'In f2 and not f1:', s2 - s1
Hi, there, regarding this answer, I am trying to do the same, except to output the results to different output files for each set of results, but I am new to Bioinformatics and I can't seem to get the output format right? Instead of Print I I try SeqIO.write("output_common.tab" 'w') and variants of this but can't seem to get it to output the information to a file instead of the terminal window? Sorry for such a newbie question but I can't seem to get it right.
I think you must give your sequences a format... 'fasta'; or whatever. Remeber to close the filehandle after this function call.
This is from the manual of BioPython...
SeqIO.write(sequences, handle, format)
where: sequences - A list (or iterator) of SeqRecord objects, or (if using Biopython 1.54 or later) a single SeqRecord.
handle - File handle object to write to, or filename as string
format - lower case string describing the file format to write.
Next time, ask this on the answer you are refering to. Hope this could help! (:
from Bio import SeqIO
def get_ids(fname):
reader = SeqIO.parse(fname, 'fasta')
ids = map(lambda x: x.id, reader)
return set(ids)
s1 = get_ids('f1.fasta')
s2 = get_ids('f2.fasta')
print 'Common:', s1 & s2
print 'In f1 and not f2:', s1 - s2
print 'In f2 and not f1:', s2 - s1
Hi there, I can't seem to get this to work with trying to compare the following 2 FASTA files: f1.fasta f2.fasta Command$: python FASTA_diff_test.py f1.fasta f2.fasta
This is the error message I get back: Traceback (most recent call last): File "FASTA_diff_test.py", line 8, in [?] s1 = get_ids('f1.fasta') File "FASTA_diff_test.py", line 4, in get_ids reader = SeqIO.parse(fname, 'fasta') File "/usr/lib/pymodules/python2.6/Bio/SeqIO/init.py", line 424, in parse raise TypeError("Need a file handle, not a string (i.e. not a filename)") TypeError: Need a file handle, not a string (i.e. not a filename)
I can't figure out where the problem is, does anyone have any ideas?
I'm new to python and programming in general. I got the exact same thing (or at least I thought it was!) to work on Monday but for some reason it won't work now! Best Wishes,
Laura
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
So you should edit your question to read "contain the same sequence headers", rather than "the same sequences" which is a different check.
welcome ! Does each fasta file contain more than one sequence ?
Yes! They have exactly 8 sequences each, and I need to check if sequences names match between the two fasta files, and return the differences, if exist. Thank u, @pierre! ;)
if it's a comparison of sequence IDs, then it'd be as easy as 'grepping' all the IDs, and then comparing the two lists in Excel or with other online tools (e.g. http://jura.wi.mit.edu/bioc/tools/compare.php)
Ok, thank you all guys!