Dear all, I am a bit confused about how to interpret my OrthoFinder results. I have RNA-seq data for three species, A, B and C (A has a reference genome; B and C do not). My aim is to find the genes that exist in B and C but are absent in A. I think OrthoFinder should work for this. After the OrthoFinder analysis I got several output files; one is called "Orthogroups.GeneCount.csv", and I am not sure whether this is the one I need. The file looks like this:
           A   B   C   Total
OG0000000  28  30  13  71
OG0000001  26  31  1   58
OG0000002  6   49  0   55
OG0000003  13  40  0   53
OG0000004  18  16  18  52
OG0000005  29  19  4   52
OG0000006  18  33  0   51
OG0000007  4   46  0   50
OG0000008  28  18  4   50
I assume that OG0000002, OG0000003, OG0000006 and OG0000007 are the orthogroups that I need? (But I am not confident that I am right.)
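For what it's worth, the selection described in the aim (present in B and C, absent in A) can be sketched in a few lines of Python. This is only a sketch under assumptions: that the file is tab-delimited (I believe older OrthoFinder releases wrote tab-separated text despite the .csv extension) and that the first header cell is empty. The function names `read_gene_counts` and `select_orthogroups` are made up for illustration. Note that OG0000002, OG0000003, OG0000006 and OG0000007 have a zero in column C rather than in column A, so they are groups found in A and B but missing from C; groups matching the stated aim would need a zero in column A.

```python
import csv
import io


def read_gene_counts(handle, delimiter="\t"):
    """Parse a gene-count table into a list of dicts keyed by column name."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    # Assumption: the first header cell is empty (or a label); the remaining
    # cells are the species names followed by "Total".
    columns = ["Orthogroup"] + header[1:]
    return [dict(zip(columns, fields)) for fields in reader if fields]


def select_orthogroups(rows, absent, present):
    """IDs of orthogroups with no genes in `absent` and >= 1 gene in each of `present`."""
    return [
        row["Orthogroup"]
        for row in rows
        if int(row[absent]) == 0 and all(int(row[sp]) > 0 for sp in present)
    ]


# The rows from the post, rebuilt as tab-delimited text for the demo.
SAMPLE = "\n".join([
    "\tA\tB\tC\tTotal",
    "OG0000000\t28\t30\t13\t71",
    "OG0000001\t26\t31\t1\t58",
    "OG0000002\t6\t49\t0\t55",
    "OG0000003\t13\t40\t0\t53",
    "OG0000004\t18\t16\t18\t52",
    "OG0000005\t29\t19\t4\t52",
    "OG0000006\t18\t33\t0\t51",
    "OG0000007\t4\t46\t0\t50",
    "OG0000008\t28\t18\t4\t50",
])

rows = read_gene_counts(io.StringIO(SAMPLE))
in_bc_not_a = select_orthogroups(rows, absent="A", present=["B", "C"])
in_ab_not_c = select_orthogroups(rows, absent="C", present=["A", "B"])
print("present in B and C, absent in A:", in_bc_not_a)
print("present in A and B, absent in C:", in_ab_not_c)
```

On these nine rows the first list is empty and the second contains the four orthogroups named above. For the real file, replace the `io.StringIO(SAMPLE)` handle with `open("Orthogroups.GeneCount.csv", newline="")`.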
There is also another file called "Orthogroups.csv". Does it give the correspondence between the orthogroup numbers (e.g. OG0000000) and the IDs in the original input protein files?
Or is there another OrthoFinder output file I should look at? (There are a bunch of output files.)
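If "Orthogroups.csv" does hold that correspondence, reading it could look roughly like the sketch below. It assumes one row per orthogroup, one tab-separated column per species, and gene IDs separated by commas within each cell; I have not checked this against every OrthoFinder version. The gene IDs in the demo (`A_g1`, `B_t1`, ...) and the function name `read_orthogroups` are hypothetical.

```python
import csv
import io


def read_orthogroups(handle, delimiter="\t"):
    """Return {orthogroup_id: {species: [gene_id, ...]}} from an orthogroup table."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    species = header[1:]  # assumption: first header cell is empty or a label
    table = {}
    for fields in reader:
        if not fields:
            continue
        og_id, cells = fields[0], fields[1:]
        # Pad short rows so that species with no genes get an empty list.
        cells = cells + [""] * (len(species) - len(cells))
        table[og_id] = {
            sp: [gene.strip() for gene in cell.split(",") if gene.strip()]
            for sp, cell in zip(species, cells)
        }
    return table


# Hypothetical content; real IDs come from the input protein FASTA files.
SAMPLE_GROUPS = "\n".join([
    "\tA\tB\tC",
    "OG0000002\tA_g1, A_g2\tB_t1, B_t2, B_t3\t",
    "OG0000004\tA_g3\tB_t4\tC_t1, C_t2",
])

groups = read_orthogroups(io.StringIO(SAMPLE_GROUPS))
for og_id, members in groups.items():
    print(og_id, {sp: len(genes) for sp, genes in members.items()})
```

The per-species list lengths from this table should match the counts in "Orthogroups.GeneCount.csv", which is a quick sanity check that both files are being parsed the same way.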
Thanks in advance for your suggestions and have a great day!!
Thanks for your kind reply! I think I made a mistake with the OrthoFinder input files, which resulted in too many sequences in the output. For the data without a reference genome, I only used cd-hit and TransDecoder to get the protein file. Do I also need to use "get_longest_isoform_seq_per_trinity_gene.pl" to get unigenes? (I am not sure whether cd-hit and get_longest_isoform_seq_per_trinity_gene.pl are both necessary to get unigenes, or whether I only need one of them.) I would really appreciate any suggestions on that.
Thank you!!
cd-hit will resolve some of the redundancy, but it will likely not have much effect, as what Trinity already outputs should be more or less non-redundant (at least at the technical/sequence level). Running the cd-hit equivalent for proteins might help a little as well.
Getting one isoform per 'locus' will indeed help, so running this Perl script (I don't know it, to be honest) could do the trick.
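To make "one isoform per locus" concrete, here is a rough Python sketch of the idea. It is not the Perl script itself, which I have not looked at. It assumes Trinity-style IDs such as `TRINITY_DN1_c0_g1_i2`, where the part before `_i<number>` identifies the gene, and it also tolerates a TransDecoder-style `.p<number>` suffix; both patterns are assumptions about your headers. The function names and the demo sequences are made up.

```python
import re

# Assumption: isoform IDs end in "_i<number>", optionally followed by ".p<number>".
ISOFORM_SUFFIX = re.compile(r"_i\d+(?:\.p\d+)?$")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(records):
    """Keep the longest sequence for each gene; ties keep the first one seen."""
    best = {}
    for header, seq in records:
        seq_id = header.split()[0]
        gene_id = ISOFORM_SUFFIX.sub("", seq_id)
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


# Made-up demo input: two isoforms of one gene plus a single-isoform gene.
DEMO = """>TRINITY_DN1_c0_g1_i1 len=6
ATGAAA
>TRINITY_DN1_c0_g1_i2 len=9
ATGAAA
TTT
>TRINITY_DN2_c0_g1_i1 len=3
ATG
""".splitlines()

kept = longest_isoform_per_gene(read_fasta(DEMO))
for gene_id, (header, seq) in kept.items():
    print(gene_id, header.split()[0], len(seq))
```

Picking the longest sequence is just one possible rule; the most highly expressed isoform would be another, and which is more appropriate depends on the data.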
Keep in mind, though, that comparing transcriptome data with genomic data is always tricky and can give 'strange' results due to the inherent nature of those data types.