Question

Estimate number of particular genes in my assembly

0

4.1 years ago

Gonçalo • 0

Hello everyone. I am new here, so forgive me if I am not doing things in the most correct way. After (I hope) successfully assembling the genome of a particular species, I need to estimate how many cytochrome P450 (CYP) genes are present in my assembly. I am really struggling to come up with a strategy for this task. Would you use BLAST to find regions of similarity with other known sequences? And how would I do that for a particular gene family (CYP in this case)? Any help would be very much appreciated. Thank you very much.

P.S. This is related to some MSc coursework.

Gonçalo

Assembly gene genome • 840 views

ADD COMMENT • link updated 4.1 years ago by NH ▴ 10 • written 4.1 years ago by Gonçalo • 0

1

Is there a related genome available that is annotated? You could start with that and compare. It would also give you some idea of how good your assembly is.

BLAST would be a good way to start looking at individual genes. Have you done gene predictions? Preferably do your comparisons at the protein level so you can have confidence in the results.

If the gene is expected to be multi-copy, your assembly may have collapsed those copies if you did not have long-read data, so keep that in mind.
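As a rough illustration of a protein-level search, the sketch below only builds the two BLAST+ command lines (`makeblastdb` and `tblastn`) that would search known CYP protein sequences against a nucleotide assembly. The file names, database name, and E-value cutoff are placeholders chosen for the example, not values from this thread, and the commands are printed rather than executed.

```python
import shlex


def build_blast_commands(assembly_fasta, cyp_proteins_fasta,
                         db_name="assembly_db", out_tsv="cyp_hits.tsv",
                         evalue="1e-10"):
    """Return (makeblastdb_cmd, tblastn_cmd) as argument lists.

    All file names and the E-value cutoff are illustrative assumptions.
    """
    # Build a nucleotide BLAST database from the assembled contigs.
    make_db = ["makeblastdb", "-in", assembly_fasta,
               "-dbtype", "nucl", "-out", db_name]
    # Search protein queries against the translated assembly.
    # "6 std qlen" = tabular output: the 12 standard columns plus query length.
    search = ["tblastn", "-query", cyp_proteins_fasta, "-db", db_name,
              "-evalue", evalue, "-outfmt", "6 std qlen", "-out", out_tsv]
    return make_db, search


if __name__ == "__main__":
    # Hypothetical input files; replace with your own paths.
    for cmd in build_blast_commands("my_assembly.fasta",
                                    "known_cyp_proteins.fasta"):
        print(shlex.join(cmd))
    # To actually run them, pass each list to subprocess.run(cmd, check=True)
    # on a machine where BLAST+ is installed.
```

Because `tblastn` aligns proteins to raw genomic sequence, a gene with introns usually appears as several separate hits (roughly one per exon), which matters when hits are later counted.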

ADD REPLY • link 4.1 years ago by GenoMax 147k

0

Thank you very much for your reply. I haven't done gene predictions yet. Would you suggest using something like MAKER for gene prediction and then using BLAST to find regions of similarity with the reference genome? My task is basically to perform a de novo assembly for the fire ant Solenopsis invicta because, apparently, the official genome assembly is quite fragmented. Then, as part of the same assessment, I am being asked to estimate the number of CYP genes in my assembly.

ADD REPLY • link 4.1 years ago by Gonçalo • 0

1

Running MAKER on your assembly would be fine. Since it is an annotation pipeline, it will produce value-added results for the whole genome. You may also want to run BUSCO to see whether your assembly is reasonably complete.

ADD REPLY • link 4.1 years ago by GenoMax 147k

score 1 · Answer 1 · 2020-10-07

Depending on how stringent the requirements are, BLAST may be a reasonable method. You would need to collect a set of known CYP gene sequences for your species from an online database, save them in a file, and BLAST them against your assembled genome. This will probably produce a large number of matches, but you can reduce them by filtering on identity, alignment length, coverage and so on, using the various options available in BLAST.
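A minimal sketch of that filtering step, assuming the hits were written in BLAST tabular format with the query length appended (`-outfmt "6 std qlen"`). The thresholds (40% identity, 50% query coverage, E-value 1e-10) are arbitrary placeholders, and merging overlapping hits into loci is one possible way to avoid counting the same region once per query, not an established standard.

```python
from collections import defaultdict


def parse_hits(lines):
    """Parse tab-separated BLAST lines (12 standard columns + qlen)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append({
            "query": f[0], "contig": f[1],
            "pident": float(f[2]), "length": int(f[3]),
            "qstart": int(f[6]), "qend": int(f[7]),
            # Subject start/end are reversed for minus-strand hits.
            "start": min(int(f[8]), int(f[9])),
            "end": max(int(f[8]), int(f[9])),
            "evalue": float(f[10]), "qlen": int(f[12]),
        })
    return hits


def filter_hits(hits, min_pident=40.0, min_qcov=0.5, max_evalue=1e-10):
    """Keep hits passing identity, query-coverage and E-value cutoffs."""
    kept = []
    for h in hits:
        qcov = (h["qend"] - h["qstart"] + 1) / h["qlen"]
        if (h["pident"] >= min_pident and qcov >= min_qcov
                and h["evalue"] <= max_evalue):
            kept.append(h)
    return kept


def count_loci(hits, max_gap=0):
    """Merge hits that overlap (or lie within max_gap bp) on the same
    contig and return the number of merged regions."""
    by_contig = defaultdict(list)
    for h in hits:
        by_contig[h["contig"]].append((h["start"], h["end"]))
    n_loci = 0
    for intervals in by_contig.values():
        intervals.sort()
        current_end = None
        for start, end in intervals:
            if current_end is None or start > current_end + max_gap:
                n_loci += 1          # start of a new locus
                current_end = end
            else:
                current_end = max(current_end, end)
    return n_loci


if __name__ == "__main__":
    # Hypothetical file name; produced by a search using -outfmt "6 std qlen".
    with open("cyp_hits.tsv") as handle:
        good = filter_hits(parse_hits(handle))
    print("Candidate CYP loci:", count_loci(good))
```

The per-hit coverage filter is crude: when proteins are searched against genomic DNA, a gene split across several exons gives several short hits that may each fail the coverage cutoff, so the count is only a first estimate. Raising `max_gap` merges nearby hits, at the risk of fusing tandem duplicates into one locus.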

This is a very simple method and it raises many follow-on questions, but if the coursework asks only for an estimate, it may well also ask about the strengths and limitations of this way of estimating gene numbers.

Good luck with your MSc!!