Hi,
I have two FASTA files with the same sequence names but slightly different sequences, as follows:
File1
>trinity5_comp_3_0_1
TTATGCATAT
>trinity5_comp_9735_1_5
AAAATGATGA
>trinity5_comp_645_0_2
TCGAATGCGA
>trinity5_comp_3169_0_1
AGGATATTAC
File2
>trinity5_comp_3_0_1
TTATGCATAT
>trinity5_comp_9735_1_5
AAAATGCCGA
>trinity5_comp_645_0_2
TAGAATGCGA
>trinity5_comp_3169_0_1
AGGATATTAC
I would like to compare each sequence in File1 with the corresponding sequence in File2 and compute the percentage similarity, as follows:
trinity5_comp_3_0_1 100%
trinity5_comp_9735_1_5 80%
trinity5_comp_645_0_2 90%
trinity5_comp_3169_0_1 100%
I tried using cd-hit-est-2d, but each sequence is also compared with other sequences rather than only with its corresponding sequence in File2. Kindly guide me.
Thanks in advance
You can BLAST your sequences with the flag -max_target_seqs 1, and also filter by % identity or set an e-value threshold.
But that doesn't guarantee that each query will hit its corresponding sequence with the same header.
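One way to guarantee that pairing is to skip the search step and match records by header directly. The following is a minimal sketch in plain Python, not a definitive implementation. It assumes that corresponding sequences have the same length and need no alignment, as in the example in the question, and it counts matching positions. The function names `read_fasta`, `percent_identity` and `compare_by_header` are made up for illustration.

```python
def read_fasta(lines):
    """Parse FASTA lines into a dict mapping header (first word after '>') to sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the record name.
            name = (line[1:].split() or [""])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def percent_identity(a, b):
    """Percentage of positions that match; assumes equal-length, unaligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences differ in length; align them first")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def compare_by_header(lines1, lines2):
    """Return {header: percent identity} for headers present in both inputs."""
    first = read_fasta(lines1)
    second = read_fasta(lines2)
    return {
        name: percent_identity(seq, second[name])
        for name, seq in first.items()
        if name in second
    }


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python compare.py File1.fasta File2.fasta
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
            for name, pct in compare_by_header(f1, f2).items():
                print(f"{name} {pct:.0f}%")
```

If the two versions of a sequence can differ in length, a position-by-position count is not meaningful and a pairwise alignment would be needed first; the sketch raises an error in that case rather than guessing.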
True. He is missing the gene cluster IDs for each isoform in the assembly, however. @Tom, why do you have two Trinity FASTA files? Are you comparing two different assemblies?
These two FASTA files were generated using different approaches, and I want to see how different they are by comparing their percentage similarity.
Do they always follow the same order?
Yes, they follow the same order.
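If the order really is identical, the two files can also be read in lockstep without loading everything into memory. This is a rough sketch under that assumption, again assuming equal-length sequences that need no alignment. `iter_fasta` and `compare_in_order` are hypothetical names. The header check is there to catch the case where the order turns out not to match after all.

```python
def iter_fasta(lines):
    """Yield (name, sequence) tuples from FASTA lines, one record at a time."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def compare_in_order(lines1, lines2):
    """Yield (name, percent identity) for records paired by position in the files."""
    for (n1, s1), (n2, s2) in zip(iter_fasta(lines1), iter_fasta(lines2)):
        if n1 != n2:
            raise ValueError(f"order mismatch: {n1!r} vs {n2!r}")
        if len(s1) != len(s2):
            raise ValueError(f"{n1}: lengths differ, an alignment would be needed")
        matches = sum(a == b for a, b in zip(s1, s2))
        yield n1, (100.0 * matches / len(s1) if s1 else 0.0)
```

Note that `zip` stops at the shorter input, so if one file has extra records at the end they are silently ignored; comparing the record counts beforehand would catch that.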
How about splitting the two files into individual sequences and then comparing them pairwise using cd-hit-est-2d?
Hi, I thought of doing that, but I have around 3000 sequences, which could be tiresome. Anyhow, I will try that method too.
It's not tiresome; I updated my answer.
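The splitting and pairing can be scripted so that 3000 sequences need no manual work. The sketch below is one possible way to do it, not necessarily what the updated answer contains. It writes each record to its own numbered file and builds one cd-hit-est-2d command per pair, using the `-i`, `-i2` and `-o` options. `split_fasta`, `cdhit_commands` and the file naming scheme are assumptions for illustration, and any identity threshold or word-size options would still have to be added.

```python
import os


def split_fasta(path, out_dir):
    """Write each record of a FASTA file to its own file; return the paths in order."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    try:
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    # A new header starts a new single-record file.
                    if out is not None:
                        out.close()
                    paths.append(os.path.join(out_dir, f"seq_{len(paths):05d}.fa"))
                    out = open(paths[-1], "w")
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths


def cdhit_commands(paths1, paths2, out_dir):
    """Build one cd-hit-est-2d argument list per pair of single-sequence files."""
    os.makedirs(out_dir, exist_ok=True)
    return [
        ["cd-hit-est-2d", "-i", a, "-i2", b,
         "-o", os.path.join(out_dir, f"pair_{i:05d}")]
        for i, (a, b) in enumerate(zip(paths1, paths2))
    ]
```

Each argument list could then be passed to `subprocess.run(cmd, check=True)` on a machine where cd-hit is installed. This relies on both files having the same record order, as confirmed above, and the identity values would still need to be read from cd-hit's cluster output.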