Question

constructing gene-based pangenome from orthofinder output file

0

Entering edit mode

6 weeks ago

analyst ▴ 50

Hi all,

I am constructing gene-based pangenome for fungus. I have used orthofinder. Please suggest which file should be used to find core and dispensible genes; Orthogroups.GeneCount.tsv or Orthogroups.tsv.

Note that row names are orthogroups and column names are genomes.

This is how Orthogroups.GeneCount.tsv file looks like enter image description here

This is how Orthogroups.tsv file looks like enter image description here

orthofinder pan-genome • 495 views

ADD COMMENT • link updated 27 days ago by Dunois ★ 2.8k • written 6 weeks ago by analyst ▴ 50

2

Entering edit mode

You should be looking for orthogroups that have at least one sequence from all species in Orthogroups.tsv if my memory serves me right. The set of all those orthogroups will then be your "strict core" genome. I presume all the other orthogroups (that do not have contributions from all species) together represent the set of "dispensable" genes for this pangenome. You can play with cutoffs for what constitutes the core genome (e.g., gene should be found in x% of all species) and you will get different results based on what you set as the cutoff.

ADD REPLY • link 6 weeks ago by Dunois ★ 2.8k

0

Entering edit mode

Thankyou so much Dunois for your valuable input. Do you mean the file with gene names not the gene counts right?

About cutoffs I am following this paper from where I am following this information:

The single-copy orthologs were used to generate a maximum-likelihood tree with Fusarium graminearum (GenBank accession: GCA_000240135.3) as outgroup. Clustered families were divided into five categories: core (present in all 35 genomes, 30.4% of the families), soft-core (present in exactly 34 genomes, 6.2%), dispensable (present in 3 to 33 genomes, 32.3%), peripheral (present in 2 or 3 genomes, 14.5%) and private (present in only 1 genome, 16.6%).. (copied text from paper)

From above categories I infer that orthogroups were considered not sequences or genes therefore I got confused whether I should look for orthogroup in all genomes (like present absent) or orthogenes under orthogroup ?

Please provide your valuable suggestion

Thanks a lot!

ADD REPLY • link 6 weeks ago by analyst ▴ 50

2

Entering edit mode

you mean the file with gene names

Yes, the Orthogroups.tsv file in this case will do.

I infer that orthogroups were considered not sequences or genes

An "orthogroup" is just a set of sequences (be it genes, transcripts, or proteins) that happen to be both evolutionarily related with one another and are found in a set of species of interest. As an orthogroup will contain sequences related to one another via both speciation and duplication events, you will not necessarily find sequences affiliated with the orthogroup in all species.

In your case here, in the Orthogroups.tsv, for example, each row of the file represents an orthogroup (OGXXXXXXXX). For example, OG0000000 seems to have at least one sequence affiliated with it from all of your species (as far as I can tell from your screenshot) given that every column (i.e., every species) has an entry for this row. This is also an example of an orthogroup that you could count as contributing towards a "strict core" genome given that all species (seem to) have sequences belonging to this orthogroup.

On the other hand, something like OG0000002 does not appear to have sequences from all member species. Depending on how many species do have entries in this row, this orthogroup could either be a "soft core" genome contributor or one that can be considered a "dispensable" (set of) genes.

To calculate whether or not an orthogroup belongs to the core genome (or not), you'll basically need to calculate (in your case) for each row (i.e., orthogroup) whether or not it is present in a sufficient number of genomes to warrant a particular classification (e.g., contributing to the "core" genome or not). You'll have to do some scripting in some language to get this done.

Here's some code in R with some toy data that you should be able to adapt for your own analysis:

#Freezing the RNG; you do not need to do this in your own code.
set.seed(seed = 2)
#Toy data
df <- data.frame(Orthogroup = paste0("OG", sprintf("%07d", 0000000:0000009)),
                 sp1 = sample(c("", LETTERS[1:3]), 10, replace = TRUE),
                 sp2 = sample(c("", LETTERS[1:3]), 10, replace = TRUE),
                 sp3 = sample(c("", LETTERS[1:3]), 10, replace = TRUE))
df[11,] <- c(Orthogroup = "OG00000010", sp1 = "A", sp2 = "B", sp3 = "C")
df
# Orthogroup sp1 sp2 sp3
# 1   OG0000000           B
# 2   OG0000001   B       B
# 3   OG0000002   A   A   B
# 4   OG0000003   A   B   C
# 5   OG0000004   C       B
# 6   OG0000005   C   B    
# 7   OG0000006       A   A
# 8   OG0000007       C    
# 9   OG0000008       C   C
# 10  OG0000009   C   A   C
# 11 OG00000010   A   B   C

#Calculating the number of genomes in each orthogroup.
df$nsp_in_grp <- rowSums(sapply(df[,-1], function(x){ifelse(x == "", 0, 1)}))
df
# Orthogroup sp1 sp2 sp3 nsp_in_grp
# 1   OG0000000           B          1
# 2   OG0000001   B       B          2
# 3   OG0000002   A   A   B          3
# 4   OG0000003   A   B   C          3
# 5   OG0000004   C       B          2
# 6   OG0000005   C   B              2
# 7   OG0000006       A   A          2
# 8   OG0000007       C              1
# 9   OG0000008       C   C          2
# 10  OG0000009   C   A   C          3
# 11 OG00000010   A   B   C          3

#Calculating orthogroup coverage as a fraction of all genomes.
df$perc_sp_in_grp <- df$nsp_in_grp/(ncol(df)-2)
df
# Orthogroup sp1 sp2 sp3 nsp_in_grp perc_sp_in_grp
# 1   OG0000000           B          1      0.3333333
# 2   OG0000001   B       B          2      0.6666667
# 3   OG0000002   A   A   B          3      1.0000000
# 4   OG0000003   A   B   C          3      1.0000000
# 5   OG0000004   C       B          2      0.6666667
# 6   OG0000005   C   B              2      0.6666667
# 7   OG0000006       A   A          2      0.6666667
# 8   OG0000007       C              1      0.3333333
# 9   OG0000008       C   C          2      0.6666667
# 10  OG0000009   C   A   C          3      1.0000000
# 11 OG00000010   A   B   C          3      1.0000000

#Classifying orthogroups into parts of the core, soft core, and dispensable 
#genome classifications based on what proportion of genomes they are found in.
df$classification <- ifelse(df$perc_sp_in_grp == 1, "core", 
                            ifelse(df$perc_sp_in_grp >= 0.6, "soft_core", 
                                   "dispensable"))
df
# Orthogroup sp1 sp2 sp3 nsp_in_grp perc_sp_in_grp classification
# 1   OG0000000           B          1      0.3333333    dispensable
# 2   OG0000001   B       B          2      0.6666667      soft_core
# 3   OG0000002   A   A   B          3      1.0000000           core
# 4   OG0000003   A   B   C          3      1.0000000           core
# 5   OG0000004   C       B          2      0.6666667      soft_core
# 6   OG0000005   C   B              2      0.6666667      soft_core
# 7   OG0000006       A   A          2      0.6666667      soft_core
# 8   OG0000007       C              1      0.3333333    dispensable
# 9   OG0000008       C   C          2      0.6666667      soft_core
# 10  OG0000009   C   A   C          3      1.0000000           core
# 11 OG00000010   A   B   C          3      1.0000000           core

#Checking the sizes of the core, soft core, and dispensable portions of
#this pangenome
table(df$classification)
# core dispensable   soft_core 
# 4           2           5

ADD REPLY • link 5 weeks ago by Dunois ★ 2.8k

0

Entering edit mode

Should I look for orthogroups that have at least one sequence from all species or at least one same sequence from all species?

Thankyou

ADD REPLY • link 6 weeks ago by analyst ▴ 50

2

Entering edit mode

You're basically looking to check whether or not each orthogroup has at least one sequence from all species (in the case of the "core" genome).

or at least one same sequence from all species?

You will not find the "same" sequence in different species.

ADD REPLY • link 27 days ago by Dunois ★ 2.8k