Question

Compairng multiple sequence and finding similarity.

0

Entering edit mode

8.3 years ago

EVR ▴ 610

HI,

I have two fasta files with same seqnames but with slightly different sequence names like follows:

File_1
    >trinity5_comp_3_0_1
    TTATGCATAT
    >trinity5_comp_9735_1_5
    AAAATGATGA
    >trinity5_comp_645_0_2
    TCGAATGCGA
    >trinity5_comp_3169_0_1
    AGGATATTAC
FIle2
    >trinity5_comp_3_0_1
    TTATGCATAT
    >trinity5_comp_9735_1_5
    AAAATGCCGA
    >trinity5_comp_645_0_2
    TAGAATGCGA
    >trinity5_comp_3169_0_1
    AGGATATTAC

I would like compare each sequence of File1 with respect to corresponding sequences in File2 and compute its percentage of similarity like follows:

trinity5_comp_3_0_1 100%
trinity5_comp_9735_1_5 80%
trinity5_comp_645_0_2 90%
trinity5_comp_3169_0_1 100%

I tried using cd-hit-est-2d but sequences are also compared with other sequences rather than its own corresponding sequences in file2. Kindly guide me.

Thanks in advance

DNA_Sequences Fasta Sequence_comparision • 2.4k views

ADD COMMENT • link updated 8.3 years ago by GouthamAtla 12k • written 8.3 years ago by EVR ▴ 610

0

Entering edit mode

You can blast your sequences, with the flag -max_target_seqs = 1, and also filter by % identity or set an e-value threshold.

ADD REPLY • link 8.3 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

But it doesn’t guarantee you that it will blast against its corresponding sequence with the same header

ADD REPLY • link 8.3 years ago by EVR ▴ 610

0

Entering edit mode

True. He is missing the gene cluster ids for each isoform in the assembly, however. @Tom, why do you have two trinity fasta? Are you comparing two different assemblies?

ADD REPLY • link 8.3 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

These two fasta files are generated using different approaches and I want to compare how different are they by comparing their percentage similarity

ADD REPLY • link 8.3 years ago by EVR ▴ 610

0

Entering edit mode

Do they follow same order always ?

ADD REPLY • link 8.3 years ago by GouthamAtla 12k

0

Entering edit mode

Yes they do follow same order

ADD REPLY • link 8.3 years ago by EVR ▴ 610

0

Entering edit mode

How about splitting the two files into individual sequences and then compare pair wise manner using cd-hit-est-2d ?

ADD REPLY • link 8.3 years ago by GouthamAtla 12k

0

Entering edit mode

Hi, I thought of doing that, but i have around 3000 sequences which could be tiresome. But anyhow I will try that method too

ADD REPLY • link 8.3 years ago by EVR ▴ 610

0

Entering edit mode

Its not tiresome, I updated my answer.

ADD REPLY • link 8.3 years ago by GouthamAtla 12k

score 1 · Answer 1 · 2016-08-22

1

Entering edit mode

8.3 years ago

GouthamAtla 12k

I do not know what is the scale of your data, but something like calculating edit-distance might help you.

from Bio import SeqIO
import editdistance,sys

fa1 = SeqIO.parse("1.fa", "fasta")
fa2 = SeqIO.parse("2.fa", "fasta")

for rec1,rec2 in zip(fa1,fa2):

    if ( rec1.id != rec2.id ):
        print "The sequences are not in order. Exiting"
        sys.exit()

    length=len(rec1.seq)
    dist=editdistance.eval(rec1.seq,rec2.seq)
    print rec1.id,100-((dist*100)/length)

output:

trinity5_comp_3_0_1 100
trinity5_comp_9735_1_5 80
trinity5_comp_645_0_2 90
trinity5_comp_3169_0_1 100

Assumes the sequences are in same order always. If your fasta sequences are very long, this is definitely not a good solution.

If you want to run cd-hit-est-2d on each pair of sequences:

Make a backup of your files.

cp 1.fa 1.fa.bkp
cp 2.fa 2.fa.bkp

Add a prefix to your fasta header. Check how it works for your system.

sed -i '' -e '/^>/s/$/.1/g' 1.fa
sed -i '' -e '/^>/s/$/.2/g' 2.fa

Then split the fasta file into individual sequences.

for fa in {1,2}.fa
do
curl https://raw.githubusercontent.com/gouthamatla/fasta_File_Manipulation/master/SplitFastaFile.py | python - ${fa}
done

Now you have all your sequences split with .1.fasta and .2.fasta suffix.

Then run cd-hit-est-2d in pairwise manner:

for sample in *.1.fasta 
do
base=`basename  ".1.fasta"`
echo "cd-hit-command $base.1.fasta $base.2.fasta" | parallel --jobs 4
done
rm *.1.fasta *.2.fasta

change the --jobs according to available cores.

ADD COMMENT • link 8.3 years ago by GouthamAtla 12k

0

Entering edit mode

I tried your python command, it worked. but as I am not that familiar with python, I wanted to write it to file, so I used file.write() but it throwed me error saying that TypeError: write() takes exactly 1 argument (2 given). How can make the python script to write to file. If possible a bound python script which takes two fasta file as input and write it into a file.

ADD REPLY • link 8.3 years ago by EVR ▴ 610

0

Entering edit mode

Python program definitely works but you should know how it is doing. To write to a file,

python script.py > out.txt

ADD REPLY • link 8.3 years ago by GouthamAtla 12k

0

Entering edit mode

I already figured out. I tried to delete that comment but I couldnt do it. thanks for your guidance.

ADD REPLY • link 8.3 years ago by EVR ▴ 610

0

Entering edit mode

Please validate the results with couple of sequences. Take few random pairs of sequences are run cd-hit on them to compare the outputs. It's very important.

ADD REPLY • link 8.3 years ago by GouthamAtla 12k