Question

Simple metric to remove outliers from MSA

3.5 years ago

danilobritorocha • 0

I did an MSA for a group of proteins and some sequences are really wrong. The average length of these sequences is about 76 aa, but a few unique sequences have 226 aa, and others 31 or 54 aa; that is, they are outliers. I want to avoid adding another tool to the pipeline just to remove those sequences. Is there a simple metric I can use to cut them? Probably something that removes sequences that deviate from the average sequence length. But I need something more rigorous than this to justify the choice. Could someone help with this?

Outlier MSA Aligment • 909 views

ADD COMMENT • link updated 3.5 years ago by Mensur Dlakic ★ 28k • written 3.5 years ago by danilobritorocha • 0

Since they are outliers (as per OP), use a different substitution matrix that reflects sequence conservation among sequences of interest.

ADD REPLY • link 3.5 years ago by cpad0112 21k

The solution I was thinking about is something like: "Unique sequences that deviate from the average length by more than X% were removed" [article to justify this decision]. These proteins are essential for the functioning of the virus, so I decided to remove the unique sequences with anomalous length; they are probably annotation errors. Total number of sequences: 251.253. Am I wrong to think that way?
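A minimal sketch of what such a rule could look like, assuming the sequence lengths are already available as a list of integers. The function name `flag_by_percent_deviation`, the example lengths, and the 25% cutoff are all hypothetical and chosen only for illustration. Note that the mean itself is pulled upward by a long outlier such as the 226 aa sequence, so the median may be a steadier reference point.

```python
from statistics import mean


def flag_by_percent_deviation(lengths, max_percent):
    """Return the indices of sequences whose length deviates from the
    mean length by more than max_percent percent."""
    avg = mean(lengths)
    return [
        i
        for i, n in enumerate(lengths)
        if abs(n - avg) / avg * 100 > max_percent
    ]


# Hypothetical lengths: most near 76 aa, plus the outliers mentioned above.
lengths = [76, 75, 77, 76, 226, 31, 54, 76]

# Hypothetical cutoff of 25%; the value of X is a choice, not a standard.
flagged = flag_by_percent_deviation(lengths, 25)
print(flagged)
```

The flagged indices can then be inspected by hand before anything is dropped, which keeps the decision reviewable.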

ADD REPLY • link 3.5 years ago by danilobritorocha • 0

Any justification for such removal must come from statistical, evolutionary (sequence conservation/taxonomy) or functional significance, IMO. Unless length is related to such significance, removal is not advisable.

ADD REPLY • link 3.5 years ago by cpad0112 21k

score 0 · Answer 1 · 2021-05-20

I did an MSA for a group of proteins and some sequences are really wrong.

In the immortal words of The Big Lebowski: Well, that's just, like, your opinion, man. I don't think sequences are right or wrong because their length is not what we expect. They can be incomplete (truncated) or have an additional domain that makes them longer. The way I think about it, the aligned sequences are either related or not. If they are related, the length is irrelevant, because even fragmented sequences may carry some evolutionary signal that is worth preserving.

On the other hand, for alignment visualization purposes it may be desirable to remove sequences that are too long or too short. I don't think you need to justify that other than to say that sequences larger than size X or smaller than size Y were removed for the purpose of cleaner visualization.
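For the visualization case, a fixed-bounds filter is about as simple as it gets. The sketch below is only an illustration under a few assumptions: the alignment is in FASTA format, gaps are written as `-` or `.`, and length means the number of non-gap residues. The helper names (`read_fasta`, `ungapped_length`, `filter_by_length`) and the thresholds are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def ungapped_length(seq, gap_chars="-."):
    """Count residues, ignoring gap characters (assumed to be '-' and '.')."""
    return sum(1 for c in seq if c not in gap_chars)


def filter_by_length(records, min_len, max_len):
    """Split records into (kept, removed) by ungapped length, bounds inclusive."""
    kept, removed = [], []
    for header, seq in records:
        if min_len <= ungapped_length(seq) <= max_len:
            kept.append((header, seq))
        else:
            removed.append((header, seq))
    return kept, removed


# Tiny made-up alignment; real bounds would be the X and Y you report.
example = """>seqA
MKT-LLV
>seqB
MK-----
>seqC
MKTALLV
"""

kept, removed = filter_by_length(read_fasta(example), min_len=5, max_len=7)
print([h for h, _ in kept], [h for h, _ in removed])
```

Keeping the removed records around makes it easy to state exactly how many sequences were dropped and which ones. Dropping rows from an existing alignment can leave columns that contain only gaps, so those may need to be trimmed or the remaining sequences realigned.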