Question

Compare Two Fasta Files By Headers

8

Entering edit mode

14.1 years ago

Peixe ▴ 660

Hi everyone; this is my first question on the forum.

How can I compare if two fasta files contain the same sequence headers?

Does any BioPython module exist for doing this?

Thanks in advance, peixe

fasta comparison sequence biopython • 14k views

ADD COMMENT • link updated 14.1 years ago by Marvin ▴ 900 • written 14.1 years ago by Peixe ▴ 660

1

Entering edit mode

So you should edit your question to read "contain the same sequence headers", rather than "the same sequences" which is a different check.

ADD REPLY • link 14.1 years ago by Neilfws 49k

0

Entering edit mode

welcome ! Does each fasta file contain more than one sequence ?

ADD REPLY • link 14.1 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Yes! They have exactly 8 sequences each, and I need to check if sequences names match between the two fasta files, and return the differences, if exist. Thank u, @pierre! ;)

ADD REPLY • link 14.1 years ago by Peixe ▴ 660

0

Entering edit mode

if it's a comparison of sequence IDs, then it'd be as easy as 'grepping' all the IDs, and then comparing the two lists in Excel or with other online tools (e.g. http://jura.wi.mit.edu/bioc/tools/compare.php)

ADD REPLY • link 14.1 years ago by Israel Barrantes ▴ 790

0

Entering edit mode

Ok, thank you all guys!

ADD REPLY • link 14.1 years ago by Peixe ▴ 660

12

Entering edit mode

14.1 years ago

Pablo ★ 1.9k

Assuming you have two files (file1.fa and file2.fa) with multiple sequences, you can do something like this:

cat file1.fa | tr "[a-z]" "[A-Z]" | sed "s/^>.*/;/"  | tr -d "\n" | tr ";" "\n" | sort > file1.txt

cat file2.fa | tr "[a-z]" "[A-Z]" | sed "s/^>.*/;/"  | tr -d "\n" | tr ";" "\n" | sort > file2.txt

diff file1.txt file2.txt

A brief explanation:

  - tr "[a-z]" "[A-Z]" : Transform letters to upper case
  - sed "s/^>.*/;/" : Transform fasta annotations to a semicolon
  - tr -d "\n" : delete all newline characters
  - tr ";" "\n" : transform semicolons to newlines

At this point you get a file with one fasta sequence per line. Then just sort and compare.

ADD COMMENT • link 14.1 years ago by Pablo ★ 1.9k

0

Entering edit mode

If you just want to compare the headers, then the commands are: cat file.fa | grep "^>" | sort > header.txt Do that for both files and then compare the TXT files using "diff"

ADD REPLY • link 14.1 years ago by Pablo ★ 1.9k

0

Entering edit mode

Ok, nice answer..! It did the job.! Thank u @Pablo. Its always nice to see the explanation..

ADD REPLY • link 14.1 years ago by Peixe ▴ 660

4

Entering edit mode

14.1 years ago

Pierre Lindenbaum 166k

linearize the fasta as a tab delimited file, remove the empty (first) line, sort on the sequence column. Then join on the second column and cut to only keep the name having the same sequence.

cat file1.fasta  |\
awk '/^>/ {printf("\n%s\t",$0); next;} {printf("%s",toupper($0));} END{ printf("\n");}' |\
egrep -v '^$' |\
sort -t '   ' -k2,2 > file1.tsv

cat file2.fasta  |\
awk '/^>/ {printf("\n%s\t",$0); next;} {printf("%s",toupper($0));} END{ printf("\n");}' |\
egrep -v '^$' |\
sort -t '   ' -k2,2 > file2.tsv

join -t '  ' -1 2 -2 2 file1.tsv file2.tsv | cut -d '      ' -f 2,3

ADD COMMENT • link 14.1 years ago by Pierre Lindenbaum 166k

2

Entering edit mode

14.1 years ago

Benm ▴ 710

It can be more shorter if the header is case sensitive:

awk '/^>/{if (a[$1]>=1){print $1}a[$1]++}' file1 file2

ADD COMMENT • link 14.1 years ago by Benm ▴ 710

1

Entering edit mode

13.9 years ago

Marvin ▴ 900

cmp <( grep '^>' file1 ) <( grep '^>' file2 )

This requires bash (not tcsh), and only tells you if the sequence headers are exactly the same in exactly the same order. Substitute "diff" for "cmp" to get a list of differences, but that will eat memory if there are lots of sequences (say, 100s of thousands). If the order shouldn't matter, sort the intermediate streams:

cmp <( grep '^>' file1 | sort ) <( grep '^>' file2 | sort )

(Also note that Fasta is a remarkably cumbersome format to use. You might be served better with BAM/SAM.)

ADD COMMENT • link 13.9 years ago by Marvin ▴ 900

0

Entering edit mode

13.9 years ago

User 8087 ▴ 30

from Bio import SeqIO

def get_ids(fname):
    reader = SeqIO.parse(fname, 'fasta')
    ids = map(lambda x: x.id, reader)
    return set(ids)

s1 = get_ids('f1.fasta')
s2 = get_ids('f2.fasta')

print 'Common:', s1 & s2
print 'In f1 and not f2:', s1 - s2
print 'In f2 and not f1:', s2 - s1

Hi, there, regarding this answer, I am trying to do the same, except to output the results to different output files for each set of results, but I am new to Bioinformatics and I can't seem to get the output format right? Instead of Print I I try SeqIO.write("output_common.tab" 'w') and variants of this but can't seem to get it to output the information to a file instead of the terminal window? Sorry for such a newbie question but I can't seem to get it right.

ADD COMMENT • link updated 13.9 years ago by lh3 33k • written 13.9 years ago by User 8087 ▴ 30

0

Entering edit mode

I think you must give your sequences a format... 'fasta'; or whatever. Remeber to close the filehandle after this function call.

This is from the manual of BioPython...

    SeqIO.write(sequences, handle, format)

where: sequences - A list (or iterator) of SeqRecord objects, or (if using Biopython 1.54 or later) a single SeqRecord.

handle - File handle object to write to, or filename as string

format - lower case string describing the file format to write.

Next time, ask this on the answer you are refering to. Hope this could help! (:

ADD REPLY • link 13.9 years ago by Peixe ▴ 660

0

Entering edit mode

13.9 years ago

User 8087 ▴ 30

from Bio import SeqIO

def get_ids(fname):
    reader = SeqIO.parse(fname, 'fasta')
    ids = map(lambda x: x.id, reader)
    return set(ids)

s1 = get_ids('f1.fasta')
s2 = get_ids('f2.fasta')

print 'Common:', s1 & s2
print 'In f1 and not f2:', s1 - s2
print 'In f2 and not f1:', s2 - s1

Hi there, I can't seem to get this to work with trying to compare the following 2 FASTA files: f1.fasta f2.fasta Command$: python FASTA_diff_test.py f1.fasta f2.fasta

This is the error message I get back: Traceback (most recent call last): File "FASTA_diff_test.py", line 8, in [?] s1 = get_ids('f1.fasta') File "FASTA_diff_test.py", line 4, in get_ids reader = SeqIO.parse(fname, 'fasta') File "/usr/lib/pymodules/python2.6/Bio/SeqIO/init.py", line 424, in parse raise TypeError("Need a file handle, not a string (i.e. not a filename)") TypeError: Need a file handle, not a string (i.e. not a filename)

I can't figure out where the problem is, does anyone have any ideas?

I'm new to python and programming in general. I got the exact same thing (or at least I thought it was!) to work on Monday but for some reason it won't work now! Best Wishes,

Laura

ADD COMMENT • link updated 13.9 years ago by lh3 33k • written 13.9 years ago by User 8087 ▴ 30

score 7 · Accepted Answer · 2011-07-12

7

Entering edit mode

14.1 years ago

Istvan Albert 103k

In BioPython you could do the following:

from Bio import SeqIO

def get_ids(fname):
    reader = SeqIO.parse(fname, 'fasta')
    ids = map(lambda x: x.id, reader)
    return set(ids)

s1 = get_ids('f1.fasta')
s2 = get_ids('f2.fasta')

print 'Common:', s1 & s2
print 'In f1 and not f2:', s1 - s2
print 'In f2 and not f1:', s2 - s1

prints for my testcase:

Common: set(['a', 'c', 'b'])
In f1 and not f2: set(['x'])
In f2 and not f1: set(['y'])

ADD COMMENT • link 14.1 years ago by Istvan Albert 103k

0

Entering edit mode

You could shorten that with a generator expression,

s1 = setrec.id for rec in SeqIO.parse('f1.fasta'))
s2 = setrec.id for rec in SeqIO.parse('f2.fasta'))

I find this much simpler than introducing a function to do it.

ADD REPLY • link 13.0 years ago by Peter 6.0k

0

Entering edit mode

This works if you add the 'fasta' format information to the Seq.IO.parse call:

s1 = setrec.id for rec in SeqIO.parse('f1.fasta', 'fasta'))
s2 = setrec.id for rec in SeqIO.parse('f2.fasta', 'fasta'))

However, an advantage of using the function that Istvan mentioned above is that you can create a function to extract part of the fast description (such as the accession number) with his method that you cannot with this shorter form. For example:

from Bio import SeqIO

def parse_accession(record_id):
    data = record_id.split('|')
    return data[3]

def get_ids(fname):
    reader = SeqIO.parse(fname, 'fasta')
    ids = map(lambda x: parse_accessionx.id), reader)
    return set(ids)

s1 = get_ids('f1.fasta')
s2 = get_ids('f2.fasta')

print 'Common:', s1 & s2
print 'In f1 and not f2:', s1 - s2
print 'In f2 and not f1:', s2 - s1

ADD REPLY • link 11.4 years ago by Burke ▴ 290