Hello,
I have about 100 fasta files and need to find the 10 that share the highest number of identical headers. That is, I am looking for the 10 fasta files for which the number of species (headers) present in all of them is the highest. Any idea how to do that? A script (Perl, Python, bash) would be great, but software recommendations are also welcome.
Thank you and happy holidays!
Assuming the post is only about identifying files that share the same fasta headers, here are minor changes to the code:
output from OP code (edited .fasta to .fa in OP code):
output after the minor modification:
To print files with same headers (with header, count and file names):
test input files with .fa extension:
This is great advice. Thank you very much! I had to replace "grep -wf - *.fa" with "grep -wf *.fa". I assume the dash was added accidentally, is that right? With the dash it wasn't working. Now it works, but for all OTUs (headers), the number of files in which they are represented is reduced by 1 (the first occurrence is missing). Also, I don't understand the reason for the modifications to Pierre's command, but it seems that, with the exception of this command (cat *.fa | grep "^>" | sort | uniq | while read H; do grep -F "${H}" -l *.fa; done | sort | uniq -c | sort -n), some of the numbers for the headers that occur in all files are slightly different now.
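A note on the dash: with `grep -f FILE`, grep reads its patterns from FILE, and in GNU grep `-f -` means "read the patterns from standard input". Without the dash, the shell expands `*.fa` first, so the first matching file becomes the pattern file and is no longer searched, which would be consistent with every count dropping by 1. As a cross-check that avoids grep's pattern matching entirely, here is a minimal Python sketch that counts, for each header, the number of files containing it. It assumes the files end in `.fa`, sit in the current directory, and that headers are compared as whole lines after the `>`; the function names `read_headers` and `header_file_counts` are made up for this illustration.

```python
from collections import Counter
from pathlib import Path


def read_headers(path):
    """Return the set of headers (text after '>') found in one fasta file."""
    with open(path) as handle:
        return {line[1:].strip() for line in handle if line.startswith(">")}


def header_file_counts(paths):
    """Map each header to the number of files in which it occurs."""
    counts = Counter()
    for path in paths:
        # A set per file, so a header is counted once per file at most.
        counts.update(read_headers(path))
    return counts


if __name__ == "__main__":
    # Assumption: input files are *.fa in the current directory.
    fasta_files = sorted(Path(".").glob("*.fa"))
    counts = header_file_counts(fasta_files)
    # Print headers from least to most widely shared.
    for header, n_files in sorted(counts.items(), key=lambda kv: kv[1]):
        print(n_files, header)
```

Because headers are compared as exact strings, a header such as `sp1` is not confused with `sp10`, which may explain small differences from substring-based `grep -F` counts.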
One-liners always rock!
Hi Pierre! Thanks for the fast reply. This command, however, would give me the 10 fasta files that have the highest number of redundant headers within the same file. There are no redundant headers in each of my files. What I need is to find the 10 fasta files with the highest number of identical/redundant headers among all the fasta files.
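For the question as restated here (10 files whose common set of headers is as large as possible), checking every group of 10 out of 100 files is not practical, since there are roughly 1.7 × 10^13 such groups. One possible approach is a greedy heuristic: start from one file, repeatedly add the file that keeps the shared header set largest, and try every file as the starting point. The sketch below is only that, a heuristic that is not guaranteed to find the best group; the `.fa` extension and the function names `read_headers` and `greedy_shared_headers` are assumptions made for this example.

```python
from pathlib import Path


def read_headers(path):
    """Return the set of headers (text after '>') found in one fasta file."""
    with open(path) as handle:
        return {line[1:].strip() for line in handle if line.startswith(">")}


def greedy_shared_headers(headers_by_file, k=10):
    """Greedy heuristic: grow a group of k files from every possible seed file.

    headers_by_file maps a file name to its set of headers.
    Returns (sorted file names, headers shared by all of them).
    """
    target = min(k, len(headers_by_file))
    best_group, best_shared = [], set()
    for seed in headers_by_file:
        group = [seed]
        shared = set(headers_by_file[seed])
        while len(group) < target:
            # Pick the file that preserves the most shared headers.
            candidate = max(
                (name for name in headers_by_file if name not in group),
                key=lambda name: len(shared & headers_by_file[name]),
            )
            group.append(candidate)
            shared &= headers_by_file[candidate]
        if not best_group or len(shared) > len(best_shared):
            best_group, best_shared = group, shared
    return sorted(best_group), best_shared


if __name__ == "__main__":
    # Assumption: input files are *.fa in the current directory.
    headers_by_file = {p.name: read_headers(p) for p in Path(".").glob("*.fa")}
    if headers_by_file:
        files, shared = greedy_shared_headers(headers_by_file, k=10)
        print(len(shared), "headers shared by:", ", ".join(files))
```

If an exact answer is needed, the result of this heuristic can at least serve as a lower bound to compare other candidate groups against.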
my bad, I've fixed with
cat
@Pierre: Your command seems to output the top 10 most redundant headers, not the top 10 fasta file names, no?
@Pierre and Nicolas: That's true. It gives the headers not the file names.
Great, the edited version is doing the job :) Thank you, Pierre!