Hi all,
I'm trying to get into the theory and practice of gene copy number variation (CNV) analysis, but there is something basic confusing me, which I couldn't yet figure out. Sorry if this is a dumb/trivial question - would appreciate your help anyway.
My confusion is about the terms 'gene copy' and 'paralog'. As far as I understand, paralogs are created when a gene undergoes duplication (never mind by what molecular mechanism) and the duplicates then start accumulating mutations as evolution proceeds. So, if gene X was duplicated to create another X, and that second X then changed to become X', are X and X' considered copies of the same gene, or are they paralogs? Is it a matter of applying some threshold on the sequence similarity between X and X', so that they are considered copies up to the point where they have diverged enough? Or maybe gene copies are expected to be perfect duplicates? If so, I'd guess that finding such gene pairs is very rare... Maybe it's a matter of function, so once X' gets a different function from X (neofunctionalization), it is considered a paralog? That would be a rather complex and difficult-to-measure definition...
To be clear, I'm interested in CNV analysis in the context of whole-genome sequencing data (not older technologies such as CGH), if that matters.
Could anyone clarify this point for me, or refer me to relevant literature? Thanks a lot!
Maybe look at biology SE?
In very simple terms, a gene copy is still ‘the same gene’; it is simply a copy. A paralog, however, may no longer be considered the same gene if it has drifted sufficiently since the duplication.
In your nomenclature, you could think of it as Gene X being copied to give Gene X1 and Gene X2. Eventually X2 might turn into X’, which has gone on to acquire a new function. I would personally say that X’ is no longer a copy.
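If it helps to make the "threshold on sequence similarity" idea from the question concrete, here is a minimal Python sketch. It is only an illustration, not an established rule: the function names (`percent_identity`, `label_pair`), the 95% cutoff and the two toy sequences are all made up for this example, and it assumes the two sequences have already been aligned to the same length (with `-` for gaps).

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip gapped columns instead of counting them as mismatches
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def label_pair(a: str, b: str, threshold: float = 95.0) -> str:
    """Label a pair by identity; the default cutoff is an arbitrary assumption."""
    if percent_identity(a, b) >= threshold:
        return "copy"
    return "diverged paralog"


if __name__ == "__main__":
    # Toy, invented sequences standing in for X and X'.
    x = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
    x_prime = "ATGGCCATTGTTATGGGCCGCTGAAAGGGTACCCGATAG"
    print(round(percent_identity(x, x_prime), 1), label_pair(x, x_prime))
```

Changing `threshold` changes the label, which is really the point: where a "copy" ends and a "paralog" begins is a convention you choose, not something the sequences decide for you.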