Create list of sequences present in multiple FASTA files
10.4 years ago
biostars • 0

Hi, I'm trying to make a list of amino acid sequences that are present in all of a selection of FASTA files I have. To make things confusing, the files all use different feature IDs for the same sequences. Is there a script I can run that would be capable of doing this?

Thanks!

fasta • 3.1k views

Please clarify your specific problem or add additional details to highlight exactly what you need.


Your comments are supposed to be pasted in these boxes based on the forum rules.

Yes, you can automate the renaming with sed.

For example:

sed 's/^>/>file1_/' file1.fasta > file1NamesChanged.fasta
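
If there are many files, the same substitution can be wrapped in a loop. Below is a minimal sketch, not a tested pipeline: `prefix_and_merge` is a hypothetical helper name, and it assumes that file names contain only letters, digits, dots and dashes (so they are safe inside the sed replacement) and that every file ends with a newline. The pattern is anchored with `^` so that only header lines are changed, even if a description contains another `>`.

```shell
# Prefix every FASTA header with the name of the file it came from and
# print all renamed records to stdout, so the result is already merged.
# prefix_and_merge is a hypothetical helper name, not an existing tool.
prefix_and_merge() {
    for f in "$@"; do
        # /path/to/file1.fasta -> file1
        name=$(basename "$f")
        name=${name%.*}
        # Only lines starting with '>' (header lines) are modified.
        sed "s/^>/>${name}_/" "$f"
    done
}

# Usage (assumes the inputs end in .fasta; the output uses another
# extension so that it is not picked up by the glob on a second run):
# prefix_and_merge *.fasta > merged.fa
```

This covers the renaming and the merging in one pass, since the renamed records of all files are written to the same output.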
10.4 years ago
Prakki Rama ★ 2.7k

One possibility:

  1. Change the headers in each FASTA file according to the file name. For example, if a sequence in file1.fasta is >protein1, change it to >file1_protein1.
  2. Then merge all the FASTA files into one file.
  3. Run CD-HIT on the merged file (setting parameters such as the sequence identity threshold).

CD-HIT will then produce a list of clusters showing which sequences are similar to each other, along with the representative sequence of each cluster. Because the sequence headers already carry the file name, you can easily see which proteins are present in multiple FASTA files.
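
Reading the cluster list by eye gets tedious with many sequences, so here is a rough sketch of how it could be filtered. It assumes the usual CD-HIT `.clstr` layout (a `>Cluster N` line followed by member lines such as `0	120aa, >file1_protein1... *`), and it assumes that file names contain no underscore, so the text before the first `_` in each header identifies the source file. `shared_clusters` is a hypothetical helper name.

```shell
# shared_clusters CLSTR N
# Print every cluster whose members come from at least N different
# source files. Hypothetical helper; assumes headers look like
# <file>_<original id> and that <file> contains no underscore.
shared_clusters() {
    awk -v n="$2" '
        function report() {
            if (cluster != "" && nfiles >= n) print cluster ": " members
        }
        /^>Cluster/ {
            report()                       # finish the previous cluster
            cluster = substr($0, 2); members = ""; nfiles = 0
            split("", seen)                # reset the set of file prefixes
            next
        }
        {
            start = index($0, ">")         # member id starts after ">"
            if (start == 0) next
            id = substr($0, start + 1)
            sub(/\.\.\..*$/, "", id)       # drop the trailing "... *" part
            prefix = id
            sub(/_.*$/, "", prefix)        # file1_protein1 -> file1
            if (!(prefix in seen)) { seen[prefix] = 1; nfiles++ }
            members = members (members == "" ? "" : " ") id
        }
        END { report() }
    ' "$1"
}

# Usage, assuming 5 input files merged into merged.fa:
# cd-hit -i merged.fa -o clustered -c 1.0
# shared_clusters clustered.clstr 5
```

As far as I know, CD-HIT shortens the headers in the `.clstr` file by default, so if the prefixed names are long it may be worth checking the `-d` option in the CD-HIT documentation. With `-c 1.0` only identical sequences should cluster together; a lower threshold would also group near-identical ones.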

~Prakki Rama.


Thanks Prakki, is there a way to automate the renaming? There are quite a few sequences and it would take a long time doing it manually.
