Question

Sequence similarity scores between two sets of genes from different genomes

7.7 years ago

avaneesh.t ▴ 20

I have a set of genes from the yeast genome (~3,000) and a set of genes from the human genome (~6,000). I want to align each yeast gene against each human gene and get a similarity score for each pair. The gene lengths differ, and many pairs will be very dissimilar.

1) How would I go about doing this, say with R? 2) Are there specific things I should take into account while doing my analysis?

sequence similarity alignment genome • 2.4k views

ADD COMMENT • link updated 7.7 years ago by Charles Yin ▴ 180 • written 7.7 years ago by avaneesh.t ▴ 20

Do you need to achieve this in R? There are loads of great command-line utilities for alignment.

ADD REPLY • link 7.7 years ago by Joe 21k

score 0 · Answer 1 · 2017-03-15

7.7 years ago

Benn 8.3k

Maybe inParanoid can help you further?

http://inparanoid.sbc.su.se/cgi-bin/index.cgi

There are R Bioconductor libraries available, but they are limited (only one yeast species: S. cerevisiae).

https://bioconductor.org/packages/release/BiocViews.html#___InparanoidDb

InParanoid and other databases give me a list of orthologs. While this would help me validate my pairwise "similarity scores" (orthologs should have higher similarity scores?), they do not tell me how similar non-orthologous genes are.

ADD REPLY • link 7.7 years ago by avaneesh.t ▴ 20

Do you want 3000 x 6000 similarity scores (18 M)??

ADD REPLY • link 7.7 years ago by Benn 8.3k

Yes. That is the idea. Though, now that you bring that up, I should probably try and target a smaller subset.

ADD REPLY • link 7.7 years ago by avaneesh.t ▴ 20

It is possible to run these 18M alignments on your computer, but how to interpret the results is something to consider.
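To get a feel for whether 18M alignments are practical, here is a rough back-of-the-envelope sketch in Python. It assumes dynamic-programming alignment whose cost scales with the product of the two sequence lengths. The mean gene length (1,500 bp) and the throughput (1e9 matrix cells per second) are made-up placeholders, not measurements, so substitute numbers from your own data and machine.

```python
def estimate_workload(n_a, n_b, mean_len_a, mean_len_b, cells_per_second):
    """Return (pairs, dp_cells, hours) for an all-vs-all pairwise alignment run.

    Assumes each alignment fills a full (len_a x len_b) dynamic-programming
    matrix, as Needleman-Wunsch or Smith-Waterman do.
    """
    pairs = n_a * n_b
    dp_cells = pairs * mean_len_a * mean_len_b
    hours = dp_cells / cells_per_second / 3600
    return pairs, dp_cells, hours


# 3000 yeast genes x 6000 human genes (from the question).
# Mean lengths and throughput below are hypothetical values for illustration.
pairs, cells, hours = estimate_workload(3000, 6000, 1500, 1500, 1e9)
print(f"{pairs:,} pairs, {cells:.2e} matrix cells, about {hours:.2f} h on one core")
```

The estimate grows linearly with each gene count, so halving either set halves the run time, which is one argument for starting with a smaller subset.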

If you want to do these 18M pairwise alignments, you can use the EMBOSS command-line tools. Depending on whether you want global or local alignment, you can use needle or water, respectively (needleall is the all-against-all version of needle). The results also contain the identity for each pair, so you'll need some bash skills to extract them in the right way (e.g., using grep).

For example:

needleall -auto true -asequence yeast.fasta -bsequence human.fasta \
-datafile EDNAFULL -outfile yeast_human.needleall -aformat markx0

grep "Identity:" yeast_human.needleall > yeast_human.needleall.identity
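The grep above keeps only the identity lines, so the gene names for each pair are lost. As a possible alternative, here is a small Python sketch that reads the per-alignment header comments and keeps names, identity and score together. It assumes the usual EMBOSS header layout (`# 1:`, `# 2:`, `# Identity:`, `# Score:`), so check it against your actual output file. The function name `parse_emboss_pairs` and the sample record (names and numbers) are made up for illustration.

```python
import re

# Header lines EMBOSS writes before each pairwise alignment (assumed layout).
NAME_RE = re.compile(r"^#\s*([12]):\s*(\S+)")
IDENT_RE = re.compile(r"^#\s*Identity:\s*(\d+)/(\d+)")
SCORE_RE = re.compile(r"^#\s*Score:\s*(-?[\d.]+)")


def parse_emboss_pairs(lines):
    """Yield (name_a, name_b, identical, aligned_length, score) per alignment."""
    names = {}
    ident = None
    for line in lines:
        m = NAME_RE.match(line)
        if m:
            names[m.group(1)] = m.group(2)
            continue
        m = IDENT_RE.match(line)
        if m:
            ident = (int(m.group(1)), int(m.group(2)))
            continue
        m = SCORE_RE.match(line)
        if m and ident is not None and len(names) == 2:
            # Score is the last header field we need, so emit the record here.
            yield names["1"], names["2"], ident[0], ident[1], float(m.group(1))
            names, ident = {}, None


# Made-up header block in the assumed layout, for demonstration only.
sample = """\
#=======================================
#
# Aligned_sequences: 2
# 1: yeast_gene_1
# 2: human_gene_1
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 120
# Identity:      60/120 (50.0%)
# Similarity:    60/120 (50.0%)
# Gaps:          20/120 (16.7%)
# Score: 210.5
#
#=======================================
"""

records = list(parse_emboss_pairs(sample.splitlines()))
for name_a, name_b, identical, length, score in records:
    print(name_a, name_b, f"{100 * identical / length:.1f}", score, sep="\t")
```

For a real run you would pass an open file handle, e.g. `parse_emboss_pairs(open("yeast_human.needleall"))`. Keep in mind that the identity denominator is the alignment length including gaps, so global alignments of genes with very different lengths will tend to report low identity.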

ADD REPLY • link 7.7 years ago by Benn 8.3k

score 0 · Answer 2 · 2017-03-15

For a large set of genomes, alignment may not be practical because it takes a very long time. You may want to consider an alignment-free method. My paper below describes one, with MATLAB code available; the link to the programs is in the paper. The method can handle DNA sequences of different lengths (by even scaling).

Yin, C., Chen, Y., & Yau, S. S. T. (2014). A measure of DNA sequence similarity by Fourier transform with applications on hierarchical clustering. Journal of Theoretical Biology, 359, 18-28.
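To illustrate what "alignment-free" means in practice, here is a minimal Python sketch of a different and much simpler technique: cosine similarity between k-mer count profiles. This is not the Fourier transform method from the paper above. The function names and the choice of k=4 are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k=4):
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    seq = seq.upper()
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(kmer for kmer in kmers if set(kmer) <= set("ACGT"))


def cosine_similarity(p, q):
    """Cosine of the angle between two k-mer count vectors (0.0 to 1.0)."""
    dot = sum(count * q[kmer] for kmer, count in p.items())
    norm_p = sqrt(sum(c * c for c in p.values()))
    norm_q = sqrt(sum(c * c for c in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0  # a sequence shorter than k has no profile to compare
    return dot / (norm_p * norm_q)


# Toy sequences of different lengths, made up for demonstration only.
a = kmer_profile("ATGGCGTACGTTAGCATGGCGTACG")
b = kmer_profile("ATGGCGTACGTTAGC")
print(cosine_similarity(a, b))
```

Because each sequence is reduced to a fixed-size count vector, sequences of different lengths can be compared directly, and each comparison costs far less than a full alignment. The trade-off is that the score ignores where in the gene the shared k-mers occur.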

score 0 · Answer 3 · 2017-03-15

Also, please see this paper for an improved even-scaling method and code.

Yin, C., & Yau, S. S. T. (2015). An improved model for whole genome phylogenetic analysis by Fourier transform. Journal of Theoretical Biology. doi:10.1016/j.jtbi.2015.06.033

https://www.mathworks.com/matlabcentral/fileexchange/52072-phylogenetic-analysis-of-dna-sequences-or-genomes-by-fourier-transform