Finding fasta files that share the highest number of identical headers
7.0 years ago • jan

Hello,

I have about 100 fasta files and need to find those 10 that share the highest number of identical headers. I.e., I am looking for the 10 fasta files for which the number of species that are present in all of them is the highest. Any idea how to do that? A script (perl, python, bash) would be great, but software recommendations are also welcome.

Thank you and happy holidays!

Tags: sequence, script, perl, python, fasta
Answer • 7.0 years ago
cat  *.fasta | grep "^>" | sort | uniq -c | sort -n | tail

EDIT

(...) That's true. It gives the headers, not the file names.

cat *.fasta | grep "^>" | sort | uniq | while read H; do grep -F "${H}" -l *.fasta; done | sort | uniq -c | sort -n

Assuming the post is only about identifying the files that share fasta headers, here is a minor change to the code:

$ cat *.fasta | grep "^>" | sort | uniq  -d | while read H; do grep -Fw "${H}" -l *.fasta; done | sort | uniq -c | sort -n
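The key change is `uniq -d`, which keeps only lines that appear more than once in the sorted input, so headers unique to a single file never reach the per-header grep. A tiny illustration on throwaway input:

```shell
# uniq collapses adjacent duplicates; uniq -d prints ONLY the duplicated lines.
printf '>a\n>a\n>b\n' | uniq       # prints ">a" and ">b"
printf '>a\n>a\n>b\n' | uniq -d    # prints only ">a"
```

Both forms require sorted input, which the preceding `sort` guarantees.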

Output from the OP's code (with .fasta changed to .fa):

$ cat *.fa | grep "^>" | sort | uniq | while read H; do grep -F "${H}" -l *.fa; done | sort | uniq -c | sort -n
      1 test1.fa
      1 test2.fa
      1 test3.fa
      1 test4.fa
      1 test6.fa
      2 test5.fa
      2 test7.fa

After the minor modification:

$ cat *.fa | grep "^>" | sort | uniq -d | while read H; do grep -wF "${H}" -l *.fa; done | sort | uniq -c | sort -n
      1 test1.fa
      1 test2.fa
      1 test3.fa
      1 test4.fa
      1 test6.fa

To print the files that share headers (with header, count, and file names):

$ grep -h "^>" *.fa | sort | uniq -d | grep  -wf - *.fa | sort -k2 -t  ":" | datamash -t : -g2 count 2 collapse 1  | sed 's/:/\t/g'
>seq1   3   test1.fa,test2.fa,test6.fa
>seq2   2   test3.fa,test4.fa
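If datamash is not installed, a rough awk stand-in for the grouping step is possible (a sketch; it recreates the thread's test files in a temporary directory so it can run on its own, and assumes GNU grep/awk):

```shell
# awk alternative to the datamash grouping above: header, count, file list.
set -eu
dir=$(mktemp -d); cd "$dir"
printf '>seq1\natgc\n'      > test1.fa
printf '>seq1\natgc\n'      > test2.fa
printf '>seq2\ngcta\n'      > test3.fa
printf '>seq2\ngcta\n'      > test4.fa
printf '>seq11\natgc\n'     > test5.fa
printf '>seq1\natgc\n'      > test6.fa
printf '>seq12\ncgtatata\n' > test7.fa

# Collect duplicated headers, find the files containing each ("file:>header"
# lines), then group per header: header, count, comma-joined file list.
grep -h "^>" *.fa | sort | uniq -d \
    | grep -wFf - *.fa \
    | awk -F: '{files[$2] = (files[$2] ? files[$2] "," : "") $1; n[$2]++}
               END {for (h in n) printf "%s\t%d\t%s\n", h, n[h], files[h]}' \
    | sort
```

On these files it prints `>seq1  3  test1.fa,test2.fa,test6.fa` and `>seq2  2  test3.fa,test4.fa`, matching the datamash output above.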

Test input files (.fa extension):

$ cat test1.fa
>seq1
atgc
$ cat test2.fa
>seq1
atgc
$ cat test3.fa
>seq2
gcta
$ cat test4.fa
>seq2
gcta
$ cat test5.fa
>seq11
atgc
$ cat test6.fa
>seq1
atgc
$ cat test7.fa
>seq12
cgtatata

This is great advice, thank you very much! I had to replace "grep -wf - *.fa" with "grep -wf *.fa"; I assume the dash was added accidentally, is that right? With the dash it wasn't working. It works now, but for every OTU (header) the number of files in which it is represented is reduced by one (the first occurrence is missing). Also, I don't understand the reason for the modifications to Pierre's command, but it seems that, except for this command (cat *.fa | grep "^>" | sort | uniq | while read H; do grep -F "${H}" -l *.fa; done | sort | uniq -c | sort -n), some of the numbers for the headers that occur in all files are slightly different now.
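For reference, with GNU grep the dash after -f is intentional: "-f -" tells grep to read the pattern list from standard input (here, the uniq -d output piped in). A minimal demo with hypothetical throwaway files:

```shell
# "-f -" makes grep take its patterns from stdin rather than from a file.
set -eu
dir=$(mktemp -d); cd "$dir"
printf 'apple\nbanana\n' > fruit.txt
echo 'apple' | grep -f - fruit.txt    # pattern "apple" arrives via the pipe
```

Dropping the dash makes grep treat the first *.fa file as the pattern file instead, which changes which matches (and which files) are reported.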


One-liners always rock!


Hi Pierre! Thanks for the fast reply. This command, however, would give me the 10 fasta files with the highest number of redundant headers within the same file. There are no redundant headers within any single one of my files. What I need is to find the 10 fasta files with the highest number of identical/redundant headers across all the fasta files.


My bad, I've fixed it with cat.


@Pierre: Your command seems to output the top 10 most redundant headers, not the top 10 fasta file names, no?


@Pierre and Nicolas: That's true. It gives the headers, not the file names.


Great, the edited version is doing the job :) Thank you, Pierre!

Answer • 7.0 years ago • 5heikki

This is definitely not the most efficient way...

#!/bin/bash
# For every pair of files, print the pair and the number of headers they share.
FILES=($(find . -maxdepth 1 -name "*.fa"))
for ((i=0; i<${#FILES[@]}; i++)); do
    for ((j=0; j<${#FILES[@]}; j++)); do
        echo -e "${FILES[$i]}\t${FILES[$j]}"
        comm -1 -2 \
            <(grep "^>" "${FILES[$i]}" | sort) \
            <(grep "^>" "${FILES[$j]}" | sort) \
            | wc -l
    done
done

I am getting a long list showing the number of headers shared between each pair of files. Getting the names of the 10 files with the most shared headers remains unresolved. Anyway, thanks for trying!

#!/bin/bash
FILES=($(find . -maxdepth 1 -name "*.fa"))
for ((i=0; i<${#FILES[@]}; i++)); do
    for ((j=0; j<${#FILES[@]}; j++)); do
        SCOUNT=$(comm -1 -2 <(grep "^>" "${FILES[$i]}" | sort) <(grep "^>" "${FILES[$j]}" | sort) | wc -l)
        echo -e "${FILES[$i]}\t${FILES[$j]}\t$SCOUNT"
    done
done | sort -t $'\t' -k3,3g

To skip comparing a file with itself, the body of the inner loop can be wrapped in

if [[ "${FILES[$i]}" != "${FILES[$j]}" ]]; then
  ..
fi

Then you probably also want to filter out duplicate pairs (file1 vs file2 / file2 vs file1) and finally pipe to tail -n 10 to get just the top 10, but I leave these to you.
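Putting those suggestions together, one possible final sketch (starting the inner loop at i+1 removes both self-pairs and duplicate pairs; a sort after find is added for deterministic ordering; not tested against the OP's real data):

```shell
#!/bin/bash
# Shared-header count for each unordered pair of *.fa files in the current
# directory, sorted by count, keeping only the top 10 pairs.
FILES=($(find . -maxdepth 1 -name "*.fa" | sort))
for ((i=0; i<${#FILES[@]}; i++)); do
    for ((j=i+1; j<${#FILES[@]}; j++)); do
        SCOUNT=$(comm -12 \
            <(grep "^>" "${FILES[$i]}" | sort) \
            <(grep "^>" "${FILES[$j]}" | sort) | wc -l)
        echo -e "${FILES[$i]}\t${FILES[$j]}\t$SCOUNT"
    done
done | sort -t $'\t' -k3,3g | tail -n 10
```

Note this still ranks pairs of files; picking the 10 files whose common header set is largest is a harder combinatorial problem that the pairwise counts can only approximate.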
