Question

Identifying gene copy number in genome assembly

5 weeks ago

pedro.rrmb • 0

Hi everyone,

I have a set of genes (which I know are from a certain species), and I would like to know how many copies of each gene are present in this species' genome. The natural approach to me would be to use blastn, since I want the actual nucleotide copies in the genome and not genes that produce similar proteins.

But I'm having a hard time working out which parameters I should focus on to decide which alignments represent a copy and which don't. Take the following example: using blastp with the ACT14 gene sequence as query and the refseq_reference_genome of cotton (NCBI) as subject: [image: BLAST results table]

What I would focus on is the '% identity' and the query coverage. Looking at the results, I would say there are 3 copies of that gene, one on chromosome NC_053447 and two on NC_053434, since those 3 alignments have 100% query coverage and an identity from 96.1% to 100%. The maximum query coverage of the other alignments is 61%, which doesn't strike me as an actual copy. But I'm not sure: are there other parameters I should be looking at? Should I be considering lower values of identity or coverage? Are there objective cutoffs I could use in case I find gradually higher values, like 80% for identity or coverage?

CNV assembly copy-number gene • 296 views

updated 5 weeks ago by lieven.sterck 15k • written 5 weeks ago by pedro.rrmb • 0

score 3 · Accepted Answer · 2024-11-29

Valid approach in my book! If we can assume your query sequences are from the same species, then your thresholds are also OK (100% might in some cases be a bit too strict, so perhaps loosen that value a bit).
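As a rough illustration of applying such thresholds, here is a minimal Python sketch that filters tab-separated BLAST hits by identity and query coverage. It assumes the search was run with a custom tabular format such as `-outfmt "6 qseqid sseqid pident qstart qend qlen"`; the function name `candidate_copies` and the default cutoffs (95% identity, 90% coverage) are placeholders chosen for the example, not recommended values.

```python
import csv
import io

# Assumed column order, e.g. from: -outfmt "6 qseqid sseqid pident qstart qend qlen"
COLUMNS = ["qseqid", "sseqid", "pident", "qstart", "qend", "qlen"]


def candidate_copies(tabular_text, min_identity=95.0, min_coverage=90.0):
    """Return (subject, identity, coverage) for hits passing both cutoffs.

    The cutoffs are illustrative assumptions; each row is treated as one HSP.
    """
    kept = []
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        qstart, qend, qlen = int(hit["qstart"]), int(hit["qend"]), int(hit["qlen"])
        # Percentage of the query covered by this single HSP
        coverage = 100.0 * (abs(qend - qstart) + 1) / qlen
        identity = float(hit["pident"])
        if identity >= min_identity and coverage >= min_coverage:
            kept.append((hit["sseqid"], identity, round(coverage, 1)))
    return kept
```

This only works per HSP, so it suits the simple case where each copy shows up as one uninterrupted alignment.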

Of course you took an easy case ;-) : the match seems to consist of a single HSP. Often, however, it will be split into 'exon' matches, and you will need to summarize some of those values to get an overall score for the whole gene (and do that in a meaningful way: sum the correct parts of the hit alignments so that you do not double-count some regions).
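One way to sum split hits without double counting is to merge the query intervals of all HSPs before adding up their lengths. The sketch below assumes each HSP is given as `(subject_id, qstart, qend)` with 1-based inclusive query coordinates; the function name `merged_query_coverage` is hypothetical.

```python
from collections import defaultdict


def merged_query_coverage(hsps, query_length):
    """Percent of the query covered per subject, counting overlaps only once.

    hsps: iterable of (subject_id, qstart, qend), 1-based inclusive (assumed).
    """
    by_subject = defaultdict(list)
    for subject_id, qstart, qend in hsps:
        by_subject[subject_id].append((min(qstart, qend), max(qstart, qend)))

    coverage = {}
    for subject_id, intervals in by_subject.items():
        intervals.sort()
        covered = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:
                # Overlapping or directly adjacent: extend the current block
                cur_end = max(cur_end, end)
            else:
                # Gap on the query: close the current block and start a new one
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
        coverage[subject_id] = 100.0 * covered / query_length
    return coverage
```

Note that grouping by subject alone would lump together two copies on the same chromosome, so in practice the HSPs would first need to be separated by their position on the subject before merging.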

An alternative approach (not sure if you have an annotated genome?) might be to build gene families and then analyse those for your genes of interest (the gene family construction will already compensate for the split-hit hurdle I mentioned above). This only applies if the genome is annotated; if you would first need to annotate it, you're likely better off with the blastn approach.

EDIT: I just noticed you mention 'blastp'. I assume you mean tblastn, as you are searching a protein input sequence (query) against a nucleotide DB (hits); blastp is protein vs protein. I mention this because the split hits described above will rarely occur in a blastp analysis but often in a tblastn one (at least for eukaryotes; for prokaryotes it happens much less).