Hello!
I am using BLAST+ on a Linux command line. I have concatenated my input so that it contains multiple protein sequences from the same protein family of one organism, and I am using blastp to align these against the predicted protein sequences of another organism in order to find proteins of the same family. This gives the top 50 hits for each input sequence, with the subject sequence ID and subject sequence. I have tried sorting the output (sort -u) and used sed to remove gaps (sed 's/-//g'). I have also added > to each sequence ID (sed 's/^/>/') for FASTA format, as I intend to pipe this into a multiple sequence alignment.
However, because the matched region differs between hits to the same ID, the lines are not identical and I can't use uniq -u to remove repeats.
What I want to do is keep the longest matching sequence for each ID and remove all of the shorter ones, but I'm very new to this kind of computing and don't know which tool to use. I need something that will group the hits by sequence ID and then select within each group based on the length of the associated sequence.
Any advice will be appreciated.
Please give us a sample of your input/output.
The longest sequence by actual length, or alignment length to query?
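Until a sample is posted, here is a minimal sketch of the grouping described in the question, written in Python rather than sort/sed. It assumes the BLAST output has two whitespace-separated columns per line, subject ID then subject sequence (for example from -outfmt "6 sseqid sseq"), and it takes "longest" to mean the length of the hit sequence after gaps are removed. The function names and the file names in the usage note are made up for illustration.

```python
import sys


def longest_per_id(lines):
    """Map each subject ID to its longest ungapped hit sequence.

    Assumes each line is '<subject id><whitespace><subject sequence>'.
    """
    best = {}
    for line in lines:
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        seq_id, seq = parts
        seq_id = seq_id.lstrip(">")   # tolerate IDs that already carry '>'
        seq = seq.replace("-", "")    # drop alignment gap characters
        # Keep this hit only if it is strictly longer than the stored one.
        if len(seq) > len(best.get(seq_id, "")):
            best[seq_id] = seq
    return best


def to_fasta(best):
    """Format the {id: sequence} mapping as FASTA text."""
    return "".join(f">{seq_id}\n{seq}\n" for seq_id, seq in best.items())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python longest_hits.py hits.tsv > longest.fasta
    with open(sys.argv[1]) as handle:
        sys.stdout.write(to_fasta(longest_per_id(handle)))
```

Saved as, say, longest_hits.py, this could be run as python longest_hits.py hits.tsv > longest.fasta and the result passed to the alignment program. It does the gap removal and the > prefix itself, so the sed steps would not be needed beforehand. If alignment length to the query is what matters instead, the selection would need the alignment length as an extra column (the length field in BLAST tabular output) and would compare that number rather than len(seq).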