I have a large FASTA file from a new species, and I want to find and extract a particular protein sequence from it. I also know the sequence of that protein in a similar species, which could potentially be used to find it in my data. How could I do this? Thank you very much!
You didn't mention what kind of organism you are working with or what kind of protein you are looking for.
If it happens to be a microbial genome and the protein falls within the categories of antimicrobial peptides, antibiotic resistance genes or biosynthetic gene clusters, you could try running funcscan or genomeannotator (no stable release yet, though) on the data. Bactopia and Anvi'o also feature appropriate tools for annotation. Finding your protein of interest in the annotated assembly is probably the more straightforward route.
If you wish to start from the known protein sequence of the similar species, I would first run it through a search for protein domains. Domains are usually better conserved than the remainder of the protein. Take the nucleotide sequences that encode those domains from that species' reference and try seqkit fish or seqkit locate on your scaffolds. Maybe you are lucky and can locate a site in your assembly where similar domains sit next to each other.
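To illustrate the idea, here is a minimal pure-Python sketch of locating a domain's nucleotide sequence on both strands of a set of scaffolds, with an optional mismatch allowance. It is not how seqkit is implemented, just the same kind of search written out. The function names (read_fasta, reverse_complement, locate), the toy scaffolds and the query are all made up for the example, and it assumes plain A/C/G/T/N sequences.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)

# Assumption: sequences contain only A, C, G, T and N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def locate(query, scaffold, max_mismatches=0):
    """Find the query on both strands of a scaffold.

    Returns (start, end, strand, mismatches) tuples. Coordinates are
    1-based, inclusive, and always given on the forward strand.
    """
    query = query.upper()
    n = len(query)
    hits = []
    for strand, pattern in (("+", query), ("-", reverse_complement(query))):
        for i in range(len(scaffold) - n + 1):
            mismatches = 0
            for a, b in zip(pattern, scaffold[i:i + n]):
                if a != b:
                    mismatches += 1
                    if mismatches > max_mismatches:
                        break
            else:
                # Loop finished without exceeding the mismatch budget.
                hits.append((i + 1, i + n, strand, mismatches))
    return hits

# Toy data standing in for real scaffolds and a real domain sequence.
toy_scaffolds = StringIO(
    ">scaffold_1\nTTGACCATGGCAAA\n>scaffold_2\nGGTTTGCCATGGTC\n"
)
for name, seq in read_fasta(toy_scaffolds):
    for hit in locate("ACCATGGCA", seq):
        print(name, *hit)
```

For a real assembly you would open the FASTA file instead of the StringIO object and pass the domain's nucleotide sequence as the query. The scan is a simple sliding window, so it is fine for a handful of queries but slow on large genomes, which is where seqkit is the better choice.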
Thank you! I need this for a eukaryotic organism and known protein sequences, but I guess seqkit would work well!