Hi, I'm trying to make a list of amino acid sequences that are present in all of a selection of FASTA files I have. To make things confusing, they all have different feature IDs. Is there a script I can run that would be capable of doing this?
Thanks!
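One possibility is to add the file name to each sequence header and then cluster all the sequences together with CD-HIT. For example, if a header in file1.fasta is >protein1, you can change it to >file1_protein1.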
CD-HIT will then generate a list of clusters, showing which sequences are similar to each other and which one is the representative sequence of each cluster. Because the sequence headers already carry the file information, you can easily see which proteins are present in multiple FASTA files.
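If you only care about sequences that are exactly identical across files, the same idea can be sketched without CD-HIT. Note this swaps similarity clustering for plain string matching, so near-identical proteins will not be grouped. The function names `read_fasta` and `shared_sequences` are made up for this example, and it assumes the file name without its extension is a usable tag.

```python
from collections import defaultdict
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shared_sequences(paths):
    """Return {sequence: [tagged headers]} for sequences found in every file.

    Matching is exact (case-insensitive, trailing '*' stop symbol ignored).
    If a file holds the same sequence more than once, only its first header
    is reported.
    """
    seen = defaultdict(dict)  # sequence -> {file tag: tagged header}
    for path in paths:
        tag = Path(path).stem  # e.g. "file1" for "file1.fasta"
        for header, seq in read_fasta(path):
            seq = seq.upper().rstrip("*")
            if seq:
                seen[seq].setdefault(tag, f"{tag}_{header}")
    all_tags = {Path(p).stem for p in paths}
    return {
        seq: sorted(hits.values())
        for seq, hits in seen.items()
        if set(hits) == all_tags
    }
```

Calling `shared_sequences(["file1.fasta", "file2.fasta"])` gives a dictionary whose keys are the shared sequences and whose values are the tagged headers, so you can see what each file calls the same protein.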
~Prakki Rama.
Yes. You can automate this using sed. E.g.:
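The renaming can also be scripted in Python if you prefer. This is only a minimal sketch: `tag_headers` is a hypothetical helper name, and it assumes the part of each file name before the extension is what you want as the prefix.

```python
from pathlib import Path


def tag_headers(fasta_paths, combined_path):
    """Copy all records into combined_path, rewriting '>name' as '>filestem_name'.

    Returns the number of headers that were rewritten.
    """
    count = 0
    with open(combined_path, "w") as out:
        for path in fasta_paths:
            tag = Path(path).stem  # "file1.fasta" -> "file1"
            last = "\n"
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        line = f">{tag}_{line[1:]}"
                        count += 1
                    out.write(line)
                    last = line
            # Keep records from different files on separate lines.
            if not last.endswith("\n"):
                out.write("\n")
    return count
```

The combined file can then be given to CD-HIT as its input, for example with `cd-hit -i combined.faa -o clusters`; the output names here are placeholders. Write the combined file somewhere your input file list will not pick it up on a second run.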