Dear All
how I can retrieve common seqs between multiple fasta (12 ) files?
Thanks
Dear All
how I can retrieve common seqs between multiple fasta (12 ) files?
Thanks
This assumes that 1) blast is installed and that 2) the script is run in the dir where .fasta named files reside
#!/bin/bash
FILES=($(find . -maxdepth 1 -type f -name "*.fasta"))
for((i=0;i<${#FILES[@]}-1;i++)); do
blastn -query ${FILES[$i]} -subject ${FILES[$i+1]} -outfmt '6 qseqid qlen length slen qseq' \
| awk 'BEGIN{OFS=FS="\t"}{if($2==$3 && $2==$4){print ">"$1"\n"$5}}' > tmp.fasta
cp tmp.fasta ${FILES[$i+1]}
done
So we blast first file against second file and put into tmp.fasta only the identical sequences between the files. Then we replace the second file with tmp.fasta. Then we blast that against the third file. Etc. Files will be deleted. If there are common sequences, in the end they will be in the tmp.fasta file. Here we assume that if query length = alignment length = subject length the sequences are the same. Of course, this might not be the case. I leave it for you how to make it more precise blastn -help
Did you mean to use blast2seq? I don't see blast indexes being made anywhere.
Edit: That is the correct syntax for blast2seq (not having used it on the command line made me check again).
@sam: If you are getting an error then post the exact error or explain what "is not working".
the length of seqs are about 20 bp , how ever I tested with blastn-short and it produced a huge temp file over than 600 Mb from just 600 seqs (total amount of seqs from all files is about 600 ) . when I lunched it with just 4 lib it was fine but for all 12 libs it's not work correctly
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I already obtain common seqs with this command , is it a correct way ?
So you have RNA sequence. Can you try replacing
U
withT
and running @5heikki's script again?Hello Sam,
what do you mean by "common seqs". Please describe more clearly what you have and what your goal is.
fin swimmer
Hi
Common mean sequence which is available in all fasta file. I want to find common sequences between all 12 fasta files.
Thanks in advance
Are there many sequences in each file or just one per file?
about 50 in each file
Doing quick
blat
searches is one option that should be pretty fast. Other option may be CD-HIT to do pair-wise comparisons.The unix commandline utility
comm
will be useful here, so long as you can sort your files first. (Assuming ‘common’ in this question means the sequences or headers are identical)make a list of the unique fasta sequences
cat *.mfa | grep -v ">" | sort -u > unique_sequences
loop each line in all .mfa file
cat unique_sequences | while read line; do grep -w $line *.mfa; done
you could easily go further and check which sequence is present 12 times