6 weeks ago
analyst
Hi all,
I am constructing a gene-based pangenome for a fungus, and I have used OrthoFinder. Please suggest which file should be used to find core and dispensable genes: Orthogroups.GeneCount.tsv or Orthogroups.tsv.
Note that row names are orthogroups and column names are genomes.
This is what the Orthogroups.GeneCount.tsv file looks like:
This is what the Orthogroups.tsv file looks like:
You should be looking for orthogroups that have at least one sequence from all species in Orthogroups.tsv, if my memory serves me right. The set of all those orthogroups will then be your "strict core" genome. I presume all the other orthogroups (that do not have contributions from all species) together represent the set of "dispensable" genes for this pangenome. You can play with cutoffs for what constitutes the core genome (e.g., a gene should be found in x% of all species), and you will get different results based on what you set as the cutoff.
Thank you so much, Dunois, for your valuable input. You mean the file with gene names, not the gene counts, right?
Regarding cutoffs, I am following the categories defined in this paper:
From the categories above, I infer that orthogroups were considered rather than individual sequences or genes, so I got confused: should I look for each orthogroup across all genomes (present or absent), or for the individual orthologous genes within each orthogroup?
Please provide your valuable suggestion.
Thanks a lot!
Yes, the Orthogroups.tsv file in this case will do.
An "orthogroup" is just a set of sequences (be it genes, transcripts, or proteins) that are evolutionarily related to one another and are found in a set of species of interest. As an orthogroup will contain sequences related to one another via both speciation and duplication events, you will not necessarily find sequences affiliated with the orthogroup in all species.
In your case, each row of Orthogroups.tsv represents an orthogroup (OGXXXXXXXX). For example, OG0000000 seems to have at least one sequence affiliated with it from all of your species (as far as I can tell from your screenshot), given that every column (i.e., every species) has an entry for this row. This is also an example of an orthogroup that you could count as contributing towards a "strict core" genome, given that all species (seem to) have sequences belonging to this orthogroup.
On the other hand, something like OG0000002 does not appear to have sequences from all member species. Depending on how many species do have entries in this row, this orthogroup could be either a "soft core" genome contributor or a "dispensable" set of genes.
To decide whether or not an orthogroup belongs to the core genome, you'll need to calculate, for each row (i.e., orthogroup), whether it is present in a sufficient number of genomes to warrant a particular classification (e.g., contributing to the "core" genome or not). You'll have to do some scripting in some language to get this done.
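The per-row calculation described above can be sketched in a few lines. This is a minimal Python illustration, separate from any code posted in the thread, and it assumes the layout shown in the question: a tab-separated file with orthogroups as rows, genomes as columns, and an empty cell where a genome contributes no sequence. The genome names, gene names, the function name classify_orthogroups and the 0.75 "soft core" cutoff are all made up for illustration; pick a cutoff that matches the paper you are following.

```python
import csv
import io

# Toy data in the assumed layout of Orthogroups.tsv: one row per orthogroup,
# one column per genome, genes within a cell separated by ", ".
# Genome and gene names are invented for this example.
TOY_TSV = (
    "Orthogroup\tgenomeA\tgenomeB\tgenomeC\tgenomeD\n"
    "OG0000000\ta1, a2\tb1\tc1\td1\n"
    "OG0000001\ta3\tb2\tc2\t\n"
    "OG0000002\ta4\t\t\t\n"
)


def classify_orthogroups(handle, soft_core_fraction=0.75):
    """Label each orthogroup by the fraction of genomes it occurs in.

    soft_core_fraction is an arbitrary illustrative cutoff, not a standard.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    n_genomes = len(header) - 1  # first column holds the orthogroup ID
    labels = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        orthogroup, cells = row[0], row[1:]
        # A genome counts as "present" if its cell lists at least one gene.
        n_present = sum(1 for cell in cells if cell.strip())
        if n_present == n_genomes:
            labels[orthogroup] = "strict core"
        elif n_present / n_genomes >= soft_core_fraction:
            labels[orthogroup] = "soft core"
        else:
            labels[orthogroup] = "dispensable"
    return labels


if __name__ == "__main__":
    for og, label in classify_orthogroups(io.StringIO(TOY_TSV)).items():
        print(og, label)
```

To run this on a real file, replace io.StringIO(TOY_TSV) with an open file handle, e.g. open("Orthogroups.tsv", newline="").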
Here's some code in R with some toy data that you should be able to adapt for your own analysis:
Should I look for orthogroups that have at least one sequence from all species, or at least one same sequence from all species?
Thank you
You're basically looking to check whether or not each orthogroup has at least one sequence from all species (in the case of the "core" genome).
You will not find the "same" sequence in different species.
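Since only presence or absence per species matters here, and not which gene it is, the same check can in principle be run on the counts as well. The Python sketch below assumes a count table laid out like Orthogroups.GeneCount.tsv as I understand it (an Orthogroup column, one count column per genome, and a trailing Total column); the genome names, the numbers and the function name genomes_with_orthogroup are invented for illustration.

```python
import csv
import io

# Toy count table in the assumed layout of Orthogroups.GeneCount.tsv.
# Genome names and counts are invented for this example.
TOY_COUNTS = (
    "Orthogroup\tgenomeA\tgenomeB\tgenomeC\tTotal\n"
    "OG0000000\t2\t1\t1\t4\n"
    "OG0000001\t1\t0\t3\t4\n"
)


def genomes_with_orthogroup(handle):
    """Return, for each orthogroup, the genomes with a count above zero."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Every column except the ID and the (assumed) Total column is a genome.
    genomes = [name for name in reader.fieldnames
               if name not in ("Orthogroup", "Total")]
    presence = {}
    for row in reader:
        presence[row["Orthogroup"]] = [g for g in genomes if int(row[g]) > 0]
    return presence


if __name__ == "__main__":
    for og, found_in in genomes_with_orthogroup(io.StringIO(TOY_COUNTS)).items():
        print(og, found_in)
```

Here OG0000000 is present in all three toy genomes even though each genome contributes different genes in different numbers, while OG0000001 is missing from genomeB.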