How can I identify single-copy genes in a specific genome?
2
1
11 months ago
Hamtaro ▴ 50

Hello everybody.

I'm currently working on identifying single-copy genes within a specific genome. I was using OrthoFinder, as recommended in some posts. However, OrthoFinder seems to require input from at least two species to perform its analysis, whereas my project focuses exclusively on one species.

Has anyone faced a similar issue or does anyone have experience in working around this? I'm interested in any strategies or alternative tools that are well-suited for identifying single-copy genes in a single-species context.

Thank you in advance

genome single-copy genes • 2.7k views
ADD COMMENT
0

What is the organism?

I think orthologue clustering is the right approach, you just might have to 'hack' the process a bit.

If it's microbial, then you could try Roary. I don't know for sure whether it will accept just one genome, but if it does, all you need to do is pull out the orthologue groups with only a single member.

Another option could be to use 2 genomes, make sure they have distinct identifiers, run the analysis as usual, then just filter out/throw away the 'dummy' genome (it would need to be as closely related as possible so as not to skew the results too badly). Any clusters which have only one member with the identifier you care about will be the same result in essence.
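
A rough sketch of that filtering step, assuming the orthogroup tool writes a tab-separated gene-count table with an `Orthogroup` column followed by one column per species (to my knowledge OrthoFinder's `Orthogroups.GeneCount.tsv` looks like this, but check your own output). The species column names and the counts below are made up for illustration.

```python
import csv
import io

def single_copy_in_target(gene_count_tsv, target_column):
    """Return orthogroup IDs that contain exactly one gene from the target species.

    gene_count_tsv: an open file (or any iterable of lines) holding a
    tab-separated table with an 'Orthogroup' column and one count column
    per species. This layout is an assumption; adjust to your tool's output.
    """
    reader = csv.DictReader(gene_count_tsv, delimiter="\t")
    return [row["Orthogroup"] for row in reader if int(row[target_column]) == 1]

# Toy table in the assumed layout (not real data).
example = io.StringIO(
    "Orthogroup\tGallus_gallus\tDummy_species\tTotal\n"
    "OG0000000\t3\t2\t5\n"
    "OG0000001\t1\t1\t2\n"
    "OG0000002\t1\t4\t5\n"
    "OG0000003\t0\t1\t1\n"
)
single_copy = single_copy_in_target(example, "Gallus_gallus")
print(single_copy)
```

Note that this keeps groups where only the genome you care about has one member, regardless of how many copies the dummy genome has, which is a looser criterion than requiring one copy in every species.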

The 'right' way to do this, however, is probably just to do some simple clustering using e.g. CD-HIT and then find the clusters with a single member (orthologue clustering tools are really just doing this under the hood, so it's less hacky this way).
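
As an illustration of the "find clusters with a single member" step, here is a minimal sketch that reads a CD-HIT `.clstr` file and returns the IDs that ended up alone in their cluster. It assumes the usual `.clstr` layout (a `>Cluster N` header followed by one line per member, with the sequence ID between `>` and `...`); the function name and the toy records are hypothetical.

```python
import re

# In a .clstr member line the sequence ID sits between '>' and '...'.
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")

def singleton_clusters(clstr_lines):
    """Return the sequence IDs that are the only member of their cluster."""
    clusters = []
    current = None
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A header line opens a new, empty cluster.
            current = []
            clusters.append(current)
        else:
            match = MEMBER_RE.search(line)
            if match and current is not None:
                current.append(match.group(1))
    return [members[0] for members in clusters if len(members) == 1]

# Toy .clstr content (invented IDs, not real data).
example_lines = [
    ">Cluster 0",
    "0\t350aa, >geneA... *",
    "1\t348aa, >geneB... at 95.40%",
    ">Cluster 1",
    "0\t512aa, >geneC... *",
    ">Cluster 2",
    "0\t120aa, >geneD... *",
]
singletons = singleton_clusters(example_lines)
print(singletons)
```

With a real run you would pass the open `.clstr` file instead of the list. Which sequences come out as singletons depends entirely on the identity and coverage cutoffs given to CD-HIT.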

ADD REPLY
0

The organism is Gallus gallus.

Yes, I did what you mentioned. I downloaded another organism (Pan paniscus) and, as a result, OrthoFinder generated a folder called Single_Copy_Orthologue_Sequences. Inside there are many FASTA files, but I don't know whether these sequences are really what I need, since I don't know how to check them.

ADD REPLY
0

I would probably just start with a stringent re-alignment back to the genome and see if you get multiple hits. It will be a bit manual/laborious, but check a few by hand. A global aligner would probably be better than BLAST, but I'd start with that and see what you get.

I don't see why they wouldn't be the right thing though, tbh.

ADD REPLY
0

Thank you Joe. I will do that.

ADD REPLY
0

You could simply cluster the sequences at, say, 90% sequence identity and 90% coverage. All those sequences that do not end up in a cluster should theoretically be single copy genes, I believe. But this is probably a very "weak" heuristic given that the cutoffs for clustering are entirely arbitrary.

ADD REPLY
2
11 months ago
shelkmike ★ 1.4k

All genes are single-copy. What you mean is probably "single-copy orthogroups", i.e. orthogroups that have 1 gene in this species. An orthogroup is, by definition, a set of genes that originates from a single gene in the last common ancestor of a set of species. Hence, which orthogroups are single copy in your species depends on what other species you use for comparison.

ADD COMMENT
1

I interpreted it slightly differently - that OP might be interested in genes which are in the early stages of duplication/paralogy, in which case one can look within a single genome.

My experience is all microbial where this is very common though. I wouldn't have expected this to be particularly prevalent in higher organisms.

ADD REPLY
0

Basically, I'm doing a study about the length of telomeres. These sequences are not expressed, so I don't measure them in RNA but in DNA.

When I perform PCR, it will give me an estimate of how many times this sequence repeats. That data alone is not useful because I need to know the number of cells I have analyzed since these repetitions can be due to either very long telomeres in a sample or a very high number of cells I have introduced. I think the best way to determine the number of cells is by using single-copy genes or genes with a fixed copy number.
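
To make the normalisation idea concrete, here is a minimal sketch of a relative telomere-to-single-copy-gene ratio computed from qPCR Ct values with the common 2^-ΔΔCt approach. It assumes both primer pairs amplify at roughly 100% efficiency (product doubles each cycle) and that a reference sample is run alongside; the function name and all Ct values are hypothetical.

```python
def relative_ts_ratio(ct_telomere, ct_single_copy, ref_ct_telomere, ref_ct_single_copy):
    """Telomere signal normalised to a single-copy gene, relative to a reference sample.

    Assumes ~100% amplification efficiency for both primer pairs, so one
    cycle of difference corresponds to a two-fold difference in template.
    """
    delta_sample = ct_telomere - ct_single_copy
    delta_reference = ref_ct_telomere - ref_ct_single_copy
    return 2.0 ** -(delta_sample - delta_reference)

# Hypothetical Ct values: the sample's telomere signal appears one cycle earlier
# than the reference's, while the single-copy gene appears at the same cycle.
ratio = relative_ts_ratio(14.0, 22.0, 15.0, 22.0)

# Loading twice as much of the same DNA shifts both Cts by one cycle,
# so the normalised ratio does not change.
ratio_more_input = relative_ts_ratio(13.0, 21.0, 15.0, 22.0)
print(ratio, ratio_more_input)
```

The second call is the point of the single-copy gene: the amount of input DNA cancels out, so only the telomere signal per genome copy remains.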

ADD REPLY
1

In that case you don't need a real gene. You need a reliable marker that you can use as being a copy of 1 (or more based on ploidy). There are likely programs that will allow you to feed your sequence in and look for unique PCR products that you can test.

ADD REPLY
0
11 months ago
Mensur Dlakic ★ 28k

"Single-copy gene" is a relative concept. As you heard above, most genes in a single genome are single-copy. One needs to do a comparative analysis of at least hundreds of genomes to find which genes are single-copy across many species.

It sounds like you are interested primarily in telomere repeats. I am not sure that you can normalize that by amplifying a single-copy gene. For what it is worth, I would download the chicken proteome and do a BLASTP against itself. Whatever genes match only themselves and nothing else are guaranteed to be "single-copy." I am putting that in quotations because it would be single-copy within that genome, which is not how the term is normally used. Even proteins that have distantly related matches with relatively high E-values are likely to be "single-copy" for your purposes.
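
A minimal sketch of the "match only themselves" filter, assuming the self-vs-self BLASTP was written in the standard 12-column tabular format (`-outfmt 6`, where column 1 is the query, column 2 the subject and column 11 the E-value). The function name, the E-value cutoff and the toy hits are illustrative choices, not recommendations.

```python
import csv
import io

def proteins_without_other_hits(all_ids, blast_tabular, max_evalue=1e-5):
    """Return protein IDs whose only hit at or below max_evalue is themselves.

    blast_tabular: an open file (or iterable of lines) in BLAST tabular
    format 6. Hits weaker than max_evalue are ignored, so proteins with only
    distant, high-E-value matches are still reported.
    """
    has_other_hit = set()
    for row in csv.reader(blast_tabular, delimiter="\t"):
        if not row:
            continue
        query, subject, evalue = row[0], row[1], float(row[10])
        if query != subject and evalue <= max_evalue:
            # Mark both partners, in case the reciprocal hit was not reported.
            has_other_hit.update((query, subject))
    return sorted(set(all_ids) - has_other_hit)

# Toy tabular hits (invented IDs and scores, not real data).
rows = [
    ["P1", "P1", "100.0", "300", "0", "0", "1", "300", "1", "300", "0.0", "600"],
    ["P1", "P2", "85.0", "290", "40", "1", "1", "290", "5", "294", "1e-150", "500"],
    ["P2", "P2", "100.0", "294", "0", "0", "1", "294", "1", "294", "0.0", "590"],
    ["P3", "P3", "100.0", "210", "0", "0", "1", "210", "1", "210", "1e-140", "420"],
    ["P4", "P4", "100.0", "150", "0", "0", "1", "150", "1", "150", "1e-100", "300"],
    ["P4", "P3", "30.0", "50", "33", "2", "10", "59", "80", "129", "2.5", "25"],
]
example = io.StringIO("\n".join("\t".join(row) for row in rows))
unique_proteins = proteins_without_other_hits(["P1", "P2", "P3", "P4"], example)
print(unique_proteins)
```

One caveat: if the proteome contains several isoforms per gene, they will hit each other and hide genuinely single-copy genes, so it is safer to start from one protein per gene.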

ADD COMMENT
0

Thank you for your comment. I have tried to do what you suggested. The problem is that BLAST has a file size limit, and downloading the whole proteome and loading it into BLAST gives me a memory error.

ADD REPLY
0

The problem is that BLAST has a file size limit, and downloading the whole proteome and loading it into BLAST gives me a memory error.

This should not be a problem at all with any modern computer. I don't know exactly the number of proteins in a chicken genome, but I don't think it is more than tens of thousands. This should be a trivial BLASTP search, so I suspect something is not right in how you are doing it.

ADD REPLY
0

You won't be able to load a whole proteome into the BLAST web interface.

If you're doing it locally, have you made a BLAST database from the multifasta?
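
In case it helps, a local run could be sketched like this, assuming the BLAST+ command-line tools (`makeblastdb`, `blastp`) are installed and on your PATH. The file names, database name, E-value cutoff and thread count are placeholders, not recommendations.

```python
import subprocess

def build_blast_commands(fasta, db_name, out_path, threads=4):
    """Build the two BLAST+ command lines for a self-vs-self protein search."""
    make_db = ["makeblastdb", "-in", fasta, "-dbtype", "prot", "-out", db_name]
    search = [
        "blastp",
        "-query", fasta,
        "-db", db_name,
        "-outfmt", "6",          # 12-column tabular output
        "-evalue", "1e-5",       # placeholder cutoff
        "-num_threads", str(threads),
        "-out", out_path,
    ]
    return make_db, search

def run_self_blast(fasta, db_name, out_path, threads=4):
    """Run both steps; raises CalledProcessError if either command fails."""
    for command in build_blast_commands(fasta, db_name, out_path, threads):
        subprocess.run(command, check=True)

# Only print the commands here; call run_self_blast(...) to actually execute them.
make_db, search = build_blast_commands("chicken_proteome.faa", "chicken_prot", "self_hits.tsv")
print(" ".join(make_db))
print(" ".join(search))
```

The database step only needs to be done once per proteome; after that the search reads from the database rather than loading the whole FASTA through a web form.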

ADD REPLY
0

I am putting that in quotations because it would be a single-copy within that genome, which is not how the term is normally used.

Can you please elaborate on what you meant with this statement?

ADD REPLY
1

It's in the following sentence - typically a single copy gene is considered within the context of a group of genomes (the strain, species, genus etc) - i.e. a gene which, within the guardrails of the cutoffs chosen, only appears once in a single specific genome.

In the context above we're talking about a gene appearing within only one considered genome one or more times. The point Mensur is making is that in the conventional 'orthogroup way of thinking' all of the genes in that genome would be single copy (ignoring ploidy).

ADD REPLY
0

Okay, that makes sense. Had to re-read Dr. Dlakic's statements again carefully.

Thank you for the clarification.

ADD REPLY
1

Can you please elaborate on what you meant with this statement?

Single-copy genes are used for all kinds of phylogenetic analyses because they are by definition present in (almost) all genomes as a single copy. That means no ambiguity when searching for homologous proteins to align in many species, as there should be a clear-cut single candidate in each of them. For example, there are ~100 single-copy proteins shared by just about all bacterial species, which include ribosomal proteins, RNA and DNA polymerases, some proteases, and DNA repair and recombination proteins.

This was not the intended "single-copy" meaning of the OP, as they were talking within a single genome.

ADD REPLY
