common sequence between multi fasta file
1
0
Entering edit mode
6.8 years ago
Sam ▴ 150

Dear All

how I can retrieve common seqs between multiple fasta (12 ) files?

Thanks

bash • 3.7k views
ADD COMMENT
1
Entering edit mode

I already obtain common seqs with this command , is it a correct way ?

comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(comm -12 <(sort A1_novel_mirnas.fa) <(sort A2_novel_mirnas.fa)) <(sort B1_novel_mirnas.fa)) <(sort B2_novel_mirnas.fa)) <( sort C1_novel_mirnas.fa)) <(sort C2_novel_mirnas.fa)) <(sort D1_novel_mirnas.fa)) <(sort D2_novel_mirnas.fa)) <(sort E1_novel_mirnas.fa)) <(sort E2_novel_mirnas.fa)) <(sort F1_novel_mirnas.fa)) <(sort F2_novel_mirnas.fa)
ADD REPLY
0
Entering edit mode

So you have RNA sequence. Can you try replacing U with T and running @5heikki's script again?

ADD REPLY
0
Entering edit mode

Hello Sam,

what do you mean by "common seqs". Please describe more clearly what you have and what your goal is.

fin swimmer

ADD REPLY
0
Entering edit mode

Hi

Common mean sequence which is available in all fasta file. I want to find common sequences between all 12 fasta files.

Thanks in advance

ADD REPLY
0
Entering edit mode

Are there many sequences in each file or just one per file?

ADD REPLY
0
Entering edit mode

about 50 in each file

ADD REPLY
1
Entering edit mode

Doing quick blat searches is one option that should be pretty fast. Other option may be CD-HIT to do pair-wise comparisons.

ADD REPLY
0
Entering edit mode

The unix commandline utility comm will be useful here, so long as you can sort your files first. (Assuming ‘common’ in this question means the sequences or headers are identical)

ADD REPLY
0
Entering edit mode
  1. make a list of the unique fasta sequences

    cat *.mfa | grep -v ">" | sort -u > unique_sequences

  2. loop each line in all .mfa file

    cat unique_sequences | while read line; do grep -w $line *.mfa; done

  3. you could easily go further and check which sequence is present 12 times

ADD REPLY
2
Entering edit mode
6.8 years ago
5heikki 11k

This assumes that 1) blast is installed and that 2) the script is run in the dir where .fasta named files reside

#!/bin/bash
FILES=($(find . -maxdepth 1 -type f -name "*.fasta"))
for((i=0;i<${#FILES[@]}-1;i++)); do
    blastn -query ${FILES[$i]} -subject ${FILES[$i+1]} -outfmt '6 qseqid qlen length slen qseq' \
        | awk 'BEGIN{OFS=FS="\t"}{if($2==$3 && $2==$4){print ">"$1"\n"$5}}' > tmp.fasta
    cp tmp.fasta ${FILES[$i+1]}
done

So we blast first file against second file and put into tmp.fasta only the identical sequences between the files. Then we replace the second file with tmp.fasta. Then we blast that against the third file. Etc. Files will be deleted. If there are common sequences, in the end they will be in the tmp.fasta file. Here we assume that if query length = alignment length = subject length the sequences are the same. Of course, this might not be the case. I leave it for you how to make it more precise blastn -help

ADD COMMENT
0
Entering edit mode

Did you mean to use blast2seq? I don't see blast indexes being made anywhere.

Edit: That is the correct syntax for blast2seq (not having used it on the command line made me check again).

@sam: If you are getting an error then post the exact error or explain what "is not working".

ADD REPLY
0
Entering edit mode

unfortunately it's not working and as Genomax mentioned for blastn we need index , which is missing

ADD REPLY
0
Entering edit mode

If your file names don't end in .fasta then you need to change the code above to account for that.

ADD REPLY
0
Entering edit mode

Warning: [blastn] Query is Empty! and it convert all fasta file to empty file!

ADD REPLY
0
Entering edit mode

I just tested @5heikki's script and it worked flawlessly.

Do all your files have consistent names/format?

ADD REPLY
0
Entering edit mode

If the sequences are really short add -task blastn-short to the command. Also, is it possible that you don't have any "universal" seqs?

ADD REPLY
0
Entering edit mode

the length of seqs are about 20 bp , how ever I tested with blastn-short and it produced a huge temp file over than 600 Mb from just 600 seqs (total amount of seqs from all files is about 600 ) . when I lunched it with just 4 lib it was fine but for all 12 libs it's not work correctly

ADD REPLY
0
Entering edit mode

Check from blastn -help output how to add a field to outfmt for sequence similarity and filter also based on that in the awk command. In case the "temp" file is still huge, you must have a lot of redundancy in your input.

ADD REPLY

Login before adding your answer.

Traffic: 1478 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6