Hello everybody.
I'm currently working on identifying single-copy genes within a specific genome. I was using OrthoFinder, as recommended in some posts. However, OrthoFinder seems to require input from at least two species to perform its analysis, and my project is focused exclusively on one species.
Has anyone faced a similar issue or does anyone have experience in working around this? I'm interested in any strategies or alternative tools that are well-suited for identifying single-copy genes in a single-species context.
Thank you in advance
The chicken genome seems to be very complex, which makes your quest that much more difficult:
https://www.pnas.org/doi/full/10.1073/pnas.2216641120
https://academic.oup.com/mbe/article/39/4/msac066/6553873
https://www.nature.com/articles/s42003-023-05619-y
What is the organism?
I think orthologue clustering is the right approach, you just might have to 'hack' the process a bit.
If it's microbial, then you could try roary. I don't know for sure if it will accept just one genome, but if it does, then all you need to do is get the orthologue groups with only a single member. Another option could be to use 2 genomes, make sure they have distinct identifiers, run the analysis as usual, then just filter out/throw away the 'dummy' genome (it would need to be as closely related as possible so as not to skew the results too badly). Any clusters which have only one member with the identifier you care about will be the same result in essence.
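For the 'dummy genome' route, the filtering step could look something like the sketch below. It assumes a tab-separated orthogroup table with an `Orthogroup` column followed by one column per species, with genes in a cell separated by commas (the layout I'd expect from OrthoFinder's Orthogroups.tsv, but check your own output). The species column names and gene IDs here are made up for illustration.

```python
import csv
import io


def single_copy_in_species(tsv_text, species_column):
    """Return {orthogroup: gene} for groups holding exactly one gene
    from the species of interest, ignoring all other species columns."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        cell = (row.get(species_column) or "").strip()
        genes = [g.strip() for g in cell.split(",") if g.strip()]
        if len(genes) == 1:
            result[row["Orthogroup"]] = genes[0]
    return result


# Toy table: column names and gene IDs are hypothetical.
example = (
    "Orthogroup\tGallus_gallus\tDummy_species\n"
    "OG0000000\tgg_A, gg_B\tdm_1\n"
    "OG0000001\tgg_C\tdm_2, dm_3\n"
    "OG0000002\t\tdm_4\n"
    "OG0000003\tgg_D\t\n"
)

print(single_copy_in_species(example, "Gallus_gallus"))
```

Note that this only looks at genes that ended up in the table. Genes that the tool left unassigned to any group would need to be picked up separately.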
The 'right' way to do this, however, is probably just to do some simple clustering using e.g. CD-HIT and then find clusters with a single member (orthologue clustering tools are really just doing this under the hood, so it's less hacky this way).
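Pulling the single-member clusters out of a CD-HIT run could be done roughly like this. It's a minimal sketch that assumes the usual .clstr layout (a `>Cluster N` header, then one line per member with the sequence ID between `>` and `...`). The IDs in the example are invented, and it would misparse IDs that themselves contain `...`.

```python
import re


def singleton_clusters(clstr_text):
    """Return the IDs of sequences that are the only member of their cluster."""
    clusters = []
    for line in clstr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A header line opens a new, empty cluster.
            clusters.append([])
        else:
            # Member lines look like: "0<TAB>512aa, >geneA... *"
            match = re.search(r">(.+?)\.\.\.", line)
            if match and clusters:
                clusters[-1].append(match.group(1))
    return [members[0] for members in clusters if len(members) == 1]


# Toy .clstr content with hypothetical gene IDs.
example_clstr = (
    ">Cluster 0\n"
    "0\t512aa, >geneA... *\n"
    "1\t498aa, >geneB... at 93.50%\n"
    ">Cluster 1\n"
    "0\t301aa, >geneC... *\n"
    ">Cluster 2\n"
    "0\t250aa, >geneD... *\n"
)

print(singleton_clusters(example_clstr))
```

For a real run you would read the .clstr file written next to the CD-HIT output and pass its contents in.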
The organism is Gallus gallus.
Yes, I did what you mentioned. I downloaded another organism (Pan paniscus) and, as a result, OrthoFinder generated a folder called Single_Copy_Orthologue_Sequences. Inside there are many FASTA files, but I don't know if these sequences are really what I need or not... since I don't know how I can check them.
I would probably just start with a stringent re-alignment back to the genome and see if you get multiple hits. It will be a bit manual/laborious, but check a few by hand. A global aligner would probably be better than BLAST, but I'd start with BLAST and see what you get.
I don't see why they wouldn't be the right thing though, tbh.
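If checking by hand gets tedious, a first-pass screen of the BLAST results could be sketched as below. It assumes the standard 12-column tabular output (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and flags a query when two high-identity hits cover the same stretch of the query, which is what a second copy would look like. Hits covering different parts of the query (e.g. separate exons) are not flagged. The thresholds and IDs are arbitrary placeholders, not recommendations.

```python
from collections import defaultdict


def queries_with_overlapping_hits(blast_tab_text, min_identity=95.0, min_overlap=50):
    """Return the set of query IDs with at least two hits (>= min_identity)
    whose query coordinates overlap by at least min_overlap positions."""
    hits = defaultdict(list)
    for line in blast_tab_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid, pident = fields[0], float(fields[2])
        qstart, qend = sorted((int(fields[6]), int(fields[7])))
        if pident >= min_identity:
            hits[qseqid].append((qstart, qend))

    flagged = set()
    for qseqid, intervals in hits.items():
        max_end = None  # furthest query end seen so far
        for start, end in sorted(intervals):
            if max_end is not None and min(max_end, end) - start + 1 >= min_overlap:
                flagged.add(qseqid)
                break
            max_end = end if max_end is None else max(max_end, end)
    return flagged


# Toy hits with hypothetical IDs; columns follow the 12-column layout above.
rows = [
    ["geneA", "chr1", "99.0", "300", "3", "0", "1", "300", "1000", "1299", "1e-150", "540"],
    ["geneA", "chr5", "97.0", "281", "8", "0", "10", "290", "5000", "5280", "1e-130", "480"],
    ["geneB", "chr2", "100.0", "200", "0", "0", "1", "200", "100", "299", "1e-100", "370"],
    ["geneB", "chr2", "100.0", "150", "0", "0", "201", "350", "900", "1049", "1e-70", "278"],
    ["geneC", "chr3", "80.0", "300", "60", "0", "1", "300", "700", "999", "1e-40", "150"],
    ["geneC", "chr4", "99.0", "300", "3", "0", "1", "300", "50", "349", "1e-150", "540"],
]
example_blast = "\n".join("\t".join(r) for r in rows)

print(queries_with_overlapping_hits(example_blast))
```

Anything flagged this way is only a candidate for a closer look, not proof of a duplicate.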
Thank you Joe. I will do that.
You could simply cluster the sequences at, say, 90% sequence identity and 90% coverage. All those sequences that do not end up in a cluster should theoretically be single copy genes, I believe. But this is probably a very "weak" heuristic given that the cutoffs for clustering are entirely arbitrary.