I have a FASTA file in which the sequences are grouped into clusters and sorted by ID. I want to find the longest sequence in each cluster and write those sequences to a new file. How can I do this with Python?
Here is the format of my fasta file:
>abc var1
kdfafaljflasjfalsjfaljfs
>abc var2
lasuowiejwaljflaj
>abc var3
lajflasjfowijflasjfopiefjjkfldfjqop
>dce var1
owiepqfpufaplddfpqoiwejlkdf
>dce var2
qopwelsmdfljfaldjfaopif
>red var1
alsdfowejfsladfjojflsdfjsdfjaslfjk
>red var2
lsdfjjqowjelsaflasflfnkdaflasfj
>red var3
kahfiqwuefkasdnkashdfiqfkasjdfh
>red var4
akhqioweadhauisydklsdfksdyiofjasldfhihladfni
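One possible approach, as a minimal sketch using only the standard library: read the records one at a time, treat the first whitespace-delimited word of each header as the cluster ID (an assumption based on the sample above), and keep the longest sequence seen so far for each cluster. The function names `read_fasta`, `longest_per_cluster`, and `write_longest` are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs; the header is returned without '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            # Sequences may span several lines, so collect the pieces.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_cluster(lines):
    """Map cluster ID -> (header, sequence) of the longest record.

    Assumption: the cluster ID is the first word of the header.
    On a tie, the record that appears first is kept.
    """
    best = {}
    for header, seq in read_fasta(lines):
        cluster = (header.split() or [""])[0]
        if cluster not in best or len(seq) > len(best[cluster][1]):
            best[cluster] = (header, seq)
    return best


def write_longest(in_path, out_path):
    """Write the longest record of each cluster to out_path."""
    with open(in_path) as src:
        best = longest_per_cluster(src)
    with open(out_path, "w") as dst:
        # Dicts preserve insertion order, so clusters keep their input order.
        for header, seq in best.values():
            dst.write(f">{header}\n{seq}\n")
```

Calling `write_longest("input.fasta", "longest.fasta")` (both file names are placeholders) should write one record per cluster. For the sample above that would be `abc var3`, `dce var1`, and `red var4`.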
It looks like a lot of work. I'm trying it. Thank you for your advice.
pyfaidx will not work on this type of FASTA because the indexing process splits each sequence name on whitespace, so you'd end up with non-unique identifiers. This was a design decision to match the samtools behavior.
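To illustrate the collision in plain Python (this is not pyfaidx's actual code, only a sketch of the effect described above; `duplicate_keys` is a made-up helper name): keeping only the first whitespace-delimited word of each header leaves several records sharing the same key.

```python
from collections import Counter


def duplicate_keys(headers):
    """Return {key: count} for keys shared by more than one header.

    The key is the first whitespace-delimited word of the header,
    mimicking the splitting behavior described above.
    """
    keys = [(h.split() or [""])[0] for h in headers]
    return {k: n for k, n in Counter(keys).items() if n > 1}


headers = ["abc var1", "abc var2", "abc var3", "dce var1", "dce var2"]
print(duplicate_keys(headers))  # {'abc': 3, 'dce': 2}
```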
Thanks for pointing it out, Matt. I noticed that too. However, the integrated faidx command-line tool is really handy for doing other things with your FASTA file.