Question

How to filter "productive" amino acid sequences

0

15 months ago

sil_bioinfo ▴ 50

Hello,

I have a fasta file with different amino acid sequences, for example:

>abc
HSTSDSAQTMFPVALLLLAAGSCVKGEQLTQPTSVTVQPGQRLTITCQVSYSLGTYFTAW
IRQPAGKGLEWIGMRSTGASYYKDSLKNKFSIDLDTSSKTVTLNGQNVQPEDTAVYYCAR
APSRGFDYWGKGTMVTITSATPKGPTVFPL

>def
TARQIQHKPCFL*LCCCWQLDHV*RVNS*HSRPL*LCSQVNV*PSPVRSLILLVPTSQLG
SDSLQEKDWSGLE*DLLELHTTKIH*RTSSVST*TLPAKL*L*MDRMCSLKTLLCITVPE
RPVGVLTTGGKAPWSPSPRPPQRDQLCFL*

>ghi
GSQHVRFSTNHVSCSSAAVGSWIMCEG*TVDTADLCDCAARSTSDHHLSGLLFSW*LLHS
LDQTACRKRTGVDWEQIYWSCILQRFIKEQVQYRLRHFQQNCDSKWTECAA*RHCCVLLC
QTTGSGSWLLGERHHGHHHLGHPKGTNCVSS

and I want to separate the "productive" sequences from the "non-productive" ones.

Additional info: I translated every DNA sequence into amino acid sequences in all 6 reading frames.

By "non-productive" I mean sequences that don't translate into proteins (they lack the amino acid M and/or contain too many stop codons). I would like to write these non-productive sequences to their own fasta file.

As for the "productive" ones, I would also like to save each of them, keeping only the complete frame, in another fasta file.

Is there a software tool that can do this? If not, I'm trying to do it in Python, but I'm stuck. Any ideas are welcome.

Thank you in advance

protein fasta • 768 views

ADD COMMENT • link updated 14 months ago by GenoMax 148k • written 15 months ago by sil_bioinfo ▴ 50

1

Please do not delete posts once they have at least one comment or an answer.

ADD REPLY • link 14 months ago by GenoMax 148k

0

don't have methionine

At beginning of sequence?

ADD REPLY • link 15 months ago by GenoMax 148k

0

In general, anywhere in the sequence.

ADD REPLY • link 15 months ago by sil_bioinfo ▴ 50

score 0 · Answer 1 · 2023-09-28

0

15 months ago

GenoMax 148k

If you are simply looking to filter out sequences that contain a stop (*) then you can do the following:

Code to linearize fasta courtesy of @Pierre.

$ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < test.fasta | grep -vF '*' | tr "\t" "\n" | fold -w 60
>abc
HSTSDSAQTMFPVALLLLAAGSCVKGEQLTQPTSVTVQPGQRLTITCQVSYSLGTYFTAW
IRQPAGKGLEWIGMRSTGASYYKDSLKNKFSIDLDTSSKTVTLNGQNVQPEDTAVYYCAR
APSRGFDYWGKGTMVTITSATPKGPTVFPL

ADD COMMENT • link 15 months ago by GenoMax 148k

0

Hi, I would like to filter out sequences that, for example, don't have a methionine (M) and/or have a lot of stop codons (*) in the middle, not just one. These sequences would be the "non-productive" ones, and I would like to create a fasta file with these sequences too.

ADD REPLY • link 15 months ago by sil_bioinfo ▴ 50
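
For the criteria described in the reply above (no M anywhere, and/or too many stop codons), a minimal Python sketch might look like the following. It is only one possible reading of "productive": the function names, the output file names, and the `max_internal_stops` threshold are all hypothetical, and treating trailing `*` characters as a normal terminator rather than an internal stop is an assumption, not something stated in the thread.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; blank lines are skipped."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_productive(seq: str, max_internal_stops: int = 0) -> bool:
    """Assumed rule: at least one M and few enough internal stops.

    Trailing '*' characters are stripped first, so a stop at the
    very end of the sequence does not count as internal (assumption).
    """
    internal = seq.rstrip("*")
    return "M" in internal and internal.count("*") <= max_internal_stops


def format_record(header: str, seq: str, width: int = 60) -> str:
    """Format one record with the sequence wrapped at `width` characters."""
    wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(wrapped) + "\n"


def split_fasta(in_path: str, productive_path: str,
                nonproductive_path: str, max_internal_stops: int = 0) -> None:
    """Write productive and non-productive records to separate files."""
    with open(in_path) as src, \
            open(productive_path, "w") as good, \
            open(nonproductive_path, "w") as bad:
        for header, seq in read_fasta(src):
            target = good if is_productive(seq, max_internal_stops) else bad
            target.write(format_record(header, seq))
```

With the example file from the question, a call such as `split_fasta("test.fasta", "productive.fasta", "nonproductive.fasta")` would be expected to place only `abc` in the productive file, since `def` and `ghi` both contain internal stops. Raising `max_internal_stops` loosens the filter if a few stops are acceptable.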