I ran an MSA on a group of proteins and some sequences are clearly wrong. The average length of these sequences is about 76 aa, but some unique sequences have 226 aa, others 31 or 54; that is, they are outliers. I want to avoid adding another tool to the pipeline just to remove those sequences. Is there a simple metric I can use to cut these sequences? Probably something that removes sequences that deviate from the average sequence length. But I need something more rigorous than this to justify the choice. Could someone help with this?
Since they are outliers (as per OP), use a different substitution matrix that reflects sequence conservation among sequences of interest.
The solution I was thinking about is something like: "Unique sequences that deviate from the average length by X% were removed" [article to justify this decision]. These proteins are essential for the functioning of the virus, so I decided to remove the unique sequences with anomalous length; they are probably annotation errors. Total number of sequences: 251.253. Am I wrong to think that way?
Any justification for such removal must come from statistical, evolutionary (sequence conservation/taxonomy) or functional significance, IMO. Unless length is related to such significance, removing sequences on length alone is not advisable (IMO).
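If the removal is to be argued on statistical grounds, one option is a robust outlier score on sequence length instead of an arbitrary X% around the mean. The sketch below uses the modified z-score, which is based on the median and the median absolute deviation (MAD), so the extreme lengths do not distort the estimate of the centre and spread. The function name `length_outliers` and the cutoff of 3.5 (a conventional choice for this score) are assumptions for illustration, not something taken from this thread, and the score only flags unusual lengths; it says nothing about evolutionary or functional significance.

```python
from statistics import median


def length_outliers(lengths, cutoff=3.5):
    """Return the indices of lengths whose modified z-score exceeds `cutoff`.

    `length_outliers` and the default cutoff are illustrative assumptions.
    """
    lengths = list(lengths)
    med = median(lengths)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(x - med) for x in lengths)
    if mad == 0:
        # More than half of the lengths equal the median, so the score is
        # undefined; this sketch then flags nothing rather than guessing.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data are roughly normal.
    return [
        i for i, x in enumerate(lengths)
        if 0.6745 * abs(x - med) / mad > cutoff
    ]


# Hypothetical lengths resembling the case described in the question.
example = [74, 75, 76, 76, 77, 78, 76, 75, 226, 31, 54]
print(length_outliers(example))  # prints [8, 9, 10]
```

In a pipeline, the lengths would be computed on the ungapped sequences (gap characters removed), and the flagged indices dropped before realigning. The cutoff should be reported together with the number of sequences removed, so the criterion can be reproduced.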