Question

How to find the longest common sequence for a cluster of sequences in a FASTA file using Python?

8.8 years ago

grayapply2009 ▴ 300

I have a FASTA file in which the sequences are clustered and sorted by ID. I want to find the longest sequence in each cluster and write those sequences to a new file. How can I do this with Python?

Here is the format of my fasta file:

>abc var1
kdfafaljflasjfalsjfaljfs
>abc var2
lasuowiejwaljflaj
>abc var3
lajflasjfowijflasjfopiefjjkfldfjqop
>dce var1
owiepqfpufaplddfpqoiwejlkdf
>dce var2
qopwelsmdfljfaldjfaopif
>red var1
alsdfowejfsladfjojflsdfjsdfjaslfjk
>red var2
lsdfjjqowjelsaflasflfnkdaflasfj
>red var3
kahfiqwuefkasdnkashdfiqfkasjdfh
>red var4
akhqioweadhauisydklsdfksdyiofjasldfhihladfni

common fasta python longest • 3.3k views

Ram · Answer 1 · 2016-02-03

8.8 years ago

dbrowne.up ▴ 80

Check out the Python module pyfaidx: https://github.com/mdshw5/pyfaidx

It makes this sort of task much easier. You may have to experiment a bit to work out exactly how to do what you want, but pyfaidx gives you a convenient interface for accessing each sequence in your file and getting information about it, e.g. its length and name.

It looks like a lot of work. I'm trying it. Thank you for your advice.

Reply • 8.8 years ago by grayapply2009 ▴ 300

pyfaidx will not work on this type of FASTA because the indexing process splits each sequence name on whitespace, so you'd end up with non-unique identifiers. This was a design decision to match the samtools behavior.
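To make the collision concrete, here is a minimal sketch in plain Python. It does not call pyfaidx. It only assumes the behavior described above, namely that the key for each record is the header text before the first whitespace:

```python
# Headers taken from the example FASTA in the question.
headers = ["abc var1", "abc var2", "abc var3", "dce var1", "dce var2"]

# Keep only the text before the first whitespace. This mimics the
# splitting behavior described above; it is an assumption for
# illustration, not pyfaidx's actual code.
keys = [h.split()[0] for h in headers]

print(keys)            # ['abc', 'abc', 'abc', 'dce', 'dce']
print(len(set(keys)))  # 2 unique keys for 5 records
```

Five records collapse to two keys, so the records can no longer be addressed individually by name.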

Reply • 8.8 years ago by Matt Shirley 10k

Thanks for pointing it out, Matt. I noticed that too. However, the integrated faidx command-line tool is really handy for doing other things with your FASTA file.

Reply • updated 4.9 years ago by Ram 44k • written 8.8 years ago by grayapply2009 ▴ 300

score 1 · Answer 2 · 2016-02-04

8.8 years ago

grayapply2009 ▴ 300

Hey folks,

I found a solution in another post. Here is the link for those who are in the same boat as me.

How to extract the longest isoform from multi fasta file
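The linked post is not reproduced here. As one possible approach, below is a minimal standard-library sketch. It assumes the cluster ID is the first whitespace-separated token of each header, as in the example file in the question. The function names `read_fasta`, `longest_per_cluster`, and `write_fasta` are hypothetical helpers made up for this illustration, and the sample data is an excerpt of the question's file.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def longest_per_cluster(records):
    """Map each cluster ID to its longest (header, sequence) record.

    Assumption: the cluster ID is the first token of the header.
    On ties, the record seen first is kept.
    """
    best = {}
    for header, seq in records:
        cluster = header.split()[0]
        if cluster not in best or len(seq) > len(best[cluster][1]):
            best[cluster] = (header, seq)
    return best


def write_fasta(records, handle):
    """Write (header, sequence) pairs in FASTA format, one line per sequence."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")


# Excerpt of the example file from the question.
example = """>abc var1
kdfafaljflasjfalsjfaljfs
>abc var2
lasuowiejwaljflaj
>abc var3
lajflasjfowijflasjfopiefjjkfldfjqop
>dce var1
owiepqfpufaplddfpqoiwejlkdf
>dce var2
qopwelsmdfljfaldjfaopif
"""

best = longest_per_cluster(read_fasta(io.StringIO(example)))
out = io.StringIO()
write_fasta(best.values(), out)
print(out.getvalue())
```

For this excerpt the records kept are `abc var3` and `dce var1`. To run it on real files, replace the two `io.StringIO` objects with `open("clusters.fasta")` and `open("longest.fasta", "w")`, where both file names are placeholders.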
