Hello everyone!
I'm working with FASTA files containing sequences from different organisms, and for some of them I have more than one sequence. I would like to have only one representative sequence per organism, and I'd like it to be the longest one in each case. I've spent some time looking for an answer and learning to use some command-line tools, but I couldn't get it right. My file looks roughly like this:
>Mouse01
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTGTGGTGATGATG
>Mouse02
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTG
>Chimpanzee
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human01
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human02
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATGCACGTGAGA
In this case, I'd like to keep Mouse01, Chimpanzee, and Human02.
The workflow, I think, would be:
1) Identify sequences of the same species by regex (e.g. Mouse, Human)
2) Count sequence length for species with more than one match
3) Keep only the longest sequence in species with more than one match, leave the rest (e.g. Chimpanzee) untouched.
I bet there must be some magical recipe or one-liner to do this on the command line, but what would it look like?
Thanks from a very, very rookie learner of bioinformatics tools.
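The three-step workflow described in the question can be sketched in Python rather than as a shell one-liner. This is only a minimal sketch under stated assumptions: the species is taken to be the first word of the header with any trailing digits removed (so Mouse01 and Mouse02 both become Mouse), and when two sequences of the same species have equal length, the first one in the file is kept. The function names (read_fasta, species_of, longest_per_species) are made up for this illustration and do not come from any library.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequences may be wrapped over several lines; join them later.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def species_of(header):
    """Assumption: species = first word of the header minus trailing digits."""
    words = header.split()
    name = words[0] if words else header
    return re.sub(r"\d+$", "", name)


def longest_per_species(records):
    """Keep one (header, sequence) pair per species: the longest one.

    Ties are resolved in favour of the record seen first (strict '>').
    Species keep the order in which they first appear in the input.
    """
    best = {}
    for header, seq in records:
        species = species_of(header)
        if species not in best or len(seq) > len(best[species][1]):
            best[species] = (header, seq)
    return list(best.values())


# Example data copied from the question.
demo = """>Mouse01
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTGTGGTGATGATG
>Mouse02
ATGGGTGTGGAGAGAGAGAGAGAGTGATGATGGAAGTGTG
>Chimpanzee
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human01
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATG
>Human02
ATGGGTGTGGAGAGAGAGAGATATTGATGATGGAAGTGTGTGGAGATGCACGTGAGA
"""

kept = longest_per_species(read_fasta(demo.splitlines()))
for header, seq in kept:
    print(f">{header}\n{seq}")
```

On the example data this keeps Mouse01, Chimpanzee, and Human02. For a real file, the lines of an open file handle can be passed to read_fasta instead of demo.splitlines(). If the headers follow a different naming scheme, species_of is the only part that should need changing.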
What about sequences from the same species that have the same length?
What will be your downstream analyses? Selecting the longest sequence may not be the best approach, depending on how your data has been generated and what you want to do.
Well, I intend to perform tests for positive selection among a set of orthologous genes. In some cases I know that the orthology is not 1-to-1, so I thought of keeping the longest sequence in species with no 1-to-1 orthology. Perhaps this is not the best approach, in which case I'd appreciate your suggestions.
Probably PhyloTreePruner is a better option then.