Using R to calculate diversity between hundreds of orthologous sequences
9.9 years ago
Adrian Pelin ★ 2.6k

Hello,

I am interested in calculating diversity for a large number of genes in a given phylum.

I took all genes from the organism in question and used inParanoid to find true orthologues in 3 different taxa. I now have a table that looks like this:

LOC1 Taxa1_Orth_LOC1 Taxa2_Orth_LOC1 Taxa3_Orth_LOC1
LOC2 Taxa1_Orth_LOC2 Taxa2_Orth_LOC2 Taxa3_Orth_LOC2
...

Column 1 holds the name of the locus in my organism; columns 2-4 hold the names of the orthologous loci in taxa 1, 2 and 3.

I also have FASTA files with the ORFs of all loci from all of my taxa. So for Taxa #1 I have a FASTA file with the sequences Taxa1_Orth_LOC1, Taxa1_Orth_LOC2, and so on.
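A minimal sketch of how the table and the FASTA files described above could be joined into one set of sequences per orthologous group. It is written in Python rather than R, uses only the standard library, and assumes that the first word of each FASTA header matches the names in the table; the function names `read_fasta` and `group_orthologs` and the toy sequences are made up for illustration.

```python
def read_fasta(lines):
    """Parse FASTA text lines into {sequence_id: sequence}."""
    seqs = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the sequence ID is the first word of the header.
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {key: "".join(parts) for key, parts in seqs.items()}


def group_orthologs(table_lines, all_seqs):
    """Map the locus in column 1 to the sequences named in its row."""
    groups = {}
    for line in table_lines:
        ids = line.split()
        if not ids:
            continue
        # IDs with no sequence available are skipped, not treated as errors.
        groups[ids[0]] = {i: all_seqs[i] for i in ids if i in all_seqs}
    return groups


# Toy data standing in for the real table and FASTA files.
fasta = """>LOC1
ATGAAA
>Taxa1_Orth_LOC1 some description
ATGAAG
>Taxa2_Orth_LOC1
ATGCAA
""".splitlines()
table = ["LOC1 Taxa1_Orth_LOC1 Taxa2_Orth_LOC1 Taxa3_Orth_LOC1"]

groups = group_orthologs(table, read_fasta(fasta))
```

With real data, the lines would come from `open(path)` on each file, and the per-taxon dictionaries would be merged before grouping.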

Given the FASTA files and the table of true orthologues, how can I quickly calculate diversity using R? I know there are ways of doing it in codeml, but setting up each alignment would be a very difficult task.

Any thoughts on how this can be done?

Thank you,

Adrian

R diversity alignment • 2.5k views
9.9 years ago
Siva ★ 1.9k

If you are willing to consider options other than R, I would suggest needle or needleall from EMBOSS. These perform global pairwise alignment of sequence sets using the Needleman-Wunsch algorithm and report the global similarity and identity between each pair of sequences. Both programs take two input sequence files.
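To make concrete what a global alignment and an identity figure are, here is a small Needleman-Wunsch sketch in Python. It is not EMBOSS code and does not reproduce needle's scoring: the match, mismatch and linear gap scores are arbitrary assumptions, and identity is taken here as identical columns divided by alignment length.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return the two gapped strings."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aln_a, aln_b):
    """Identical, non-gap columns as a percentage of alignment length."""
    if not aln_a:
        return 0.0
    same = sum(x == y and x != "-" for x, y in zip(aln_a, aln_b))
    return 100.0 * same / len(aln_a)


aln = needleman_wunsch("ACGT", "AGT")   # ("ACGT", "A-GT")
identity = percent_identity(*aln)       # 75.0
```

The table is quadratic in sequence length, so for hundreds of full-length ORFs a compiled tool such as needle is the practical choice; the sketch only shows what is being computed.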

If you want to compare a sequence from your species of interest against each of its orthologs, use needle. Create a file with the sequence from your species of interest and another file with its ortholog sequences.

If you also want to compare the ortholog sequences among themselves (all-against-all), use needleall. Create a multi-FASTA file of all sequences belonging to a single orthologous group and supply that same file as both inputs. Expect redundant comparisons (seq1 vs seq2 and seq2 vs seq1).
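If the redundant comparisons are a concern, each unordered pair can be scored exactly once. The Python sketch below does this with `itertools.combinations` and then summarises a group as one minus the mean pairwise identity. That summary is only one possible reading of "diversity", and `aligned_identity` is a hypothetical stand-in that assumes the sequences are already aligned to equal length; any pairwise scoring function could be passed instead.

```python
from itertools import combinations


def aligned_identity(x, y):
    """Fraction of identical positions in two equal-length, pre-aligned
    sequences. Note: a column with a gap in both sequences counts as a match."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(p == q for p, q in zip(x, y)) / len(x)


def pairwise_scores(seqs, score=aligned_identity):
    """Score every unordered pair once: {(id1, id2): score}."""
    return {
        (a, b): score(seqs[a], seqs[b])
        for a, b in combinations(sorted(seqs), 2)
    }


# Toy orthologous group; the sequences are invented for illustration.
group = {
    "LOC1": "ATGAAA",
    "Taxa1_Orth_LOC1": "ATGAAG",
    "Taxa2_Orth_LOC1": "ATGCAA",
}

scores = pairwise_scores(group)
# One simple per-group summary: 1 minus the mean pairwise identity.
mean_divergence = 1 - sum(scores.values()) / len(scores)
```

For a group of n sequences this gives n(n-1)/2 comparisons instead of the n² produced by comparing a file against itself.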

Just a friendly suggestion about terminology ("true orthologues"). We cannot infer homology from sequence-similarity-based methods such as BLAST (inParanoid uses BLAST). At best, we can call the hits putative homologs. We need to do phylogenetic analyses before we can talk about homology.
