Hi,
I need to predict pseudogenes from the assembled genome of a catfish. For this, I need to predict the genes in the genome and build a parent protein set to search for similarity in the intergenic regions. There is a possibility that processed pseudogenes will be predicted as genes during gene prediction. Which software can be used for gene prediction that keeps pseudogenes out of the results? Thanks in advance.
How complete/polished is your catfish genome assembly?
Also: do you have some good quality RNA-Seq?
With a draft genome it would be hard to tell whether a gene X is missing, say, its first or last exon(s) because of a faulty duplication of a genomic region (a pseudogene) or because the exon is simply missing from the assembly. The same goes for a frameshift or stop codon: it could have been introduced by a sequencing error or by an inactivating mutation in a paralogue.
You may also get processed pseudogenes, in which the introns are missing.
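To make the frameshift/stop codon point concrete, here is a minimal Python sketch that counts in-frame internal stop codons in a candidate coding sequence and checks whether its length is a multiple of three. The function name `disablements` and the returned keys are made up for illustration, and the standard genetic code is assumed. Note that it cannot tell a sequencing error from a real inactivating mutation; it only flags candidates to inspect.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: nuclear genes).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def disablements(cds):
    """Report simple signs of a disrupted reading frame in a CDS string.

    Hypothetical helper: returns whether the length is not a multiple of 3
    and how many stop codons occur before the final codon.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X")  # "X" for codons with N or other symbols
        for i in range(0, usable, 3)
    )
    return {
        "length_not_multiple_of_3": len(cds) % 3 != 0,
        "internal_stops": protein[:-1].count("*"),  # the last codon may be the real stop
    }


if __name__ == "__main__":
    print(disablements("ATGAAATAA"))     # intact toy CDS
    print(disablements("ATGTAGAAATAA"))  # toy CDS with one premature stop
```

Any hit flagged this way is worth checking against the raw reads or RNA-Seq before calling it a pseudogene.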
Hi,
I don't know if there are tools designed to predict pseudogenes, but if I understand your post correctly, you could predict all the ORFs (using Artemis or ORF finder) and then compare/align the identified ORFs with each other (using BLAST or something similar) to see the sequence similarities between them and find some of the pseudogenes.
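As a rough illustration of the ORF step, here is a small Python sketch of a six-frame ORF scan. It is not what Artemis or ORF finder do internally, just a naive ATG-to-stop search; the function name `find_orfs` and the `min_len` default are assumptions for the example, and coordinates are 0-based on the strand being scanned.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=90):
    """Naive six-frame ORF scan (illustrative only).

    Returns tuples (strand, start, end, orf_sequence), where start and end
    are 0-based, end-exclusive positions on the scanned strand and the ORF
    runs from the first ATG in a frame to the next in-frame stop codon.
    """
    forward = seq.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    orfs = []
    for strand, s in (("+", forward), ("-", reverse)):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGCC", min_len=9):
        print(orf)
```

The resulting sequences could be written to a FASTA file and compared against each other with BLAST, as suggested above. On a eukaryotic genome with introns, a plain ORF scan like this will miss or fragment most multi-exon genes, so treat it as a first look only.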
I hope this is helpful.
Thanks for the answer. It is a draft genome of the walking catfish Clarias magur, covering 94 percent of the estimated genome size. Scaffolding and several rounds of iteration resulted in 3484 scaffolds. The primary assembly unit does not have any assembled chromosomes or linkage groups.
Looks like there are five Clarias genomes:
The most complete seems to be Clarias gariepinus, with ca. 42k proteins.
Before looking for pseudogenes in Clarias magur, I would try to get some idea of whether the contigs from your assembly can be ordered using the C. gariepinus chromosomes. I would also map all 42k proteins to your species using e.g. miniprot.
Just check that C. gariepinus is not somehow tetraploid, since then things get more complicated.
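Once the proteins are mapped, the alignments themselves can hint at pseudogene candidates. The Python sketch below summarizes a miniprot GFF3 file per hit. It assumes miniprot was run with GFF output and that the mRNA rows carry `Frameshift` and `StopCodon` attributes as described in the miniprot manual; please check this against the version you use. The function name `summarize_hits` and the output keys are made up for the example.

```python
def summarize_hits(gff_lines):
    """Collect per-alignment frameshift/stop counts from miniprot-style GFF3.

    Assumption: mRNA rows have an ID attribute and optional Frameshift and
    StopCodon attributes, and CDS rows point to them via Parent.
    """
    hits = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, embedded PAF lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if cols[2] == "mRNA" and "ID" in attrs:
            hits[attrs["ID"]] = {
                "contig": cols[0],
                "frameshifts": int(attrs.get("Frameshift", 0)),
                "stop_codons": int(attrs.get("StopCodon", 0)),
                "cds_segments": 0,
            }
        elif cols[2] == "CDS" and attrs.get("Parent") in hits:
            hits[attrs["Parent"]]["cds_segments"] += 1
    return hits


def disrupted(hits):
    """IDs of alignments with at least one frameshift or in-frame stop."""
    return sorted(k for k, v in hits.items() if v["frameshifts"] or v["stop_codons"])


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:  # path to the GFF3 file
        for hit_id in disrupted(summarize_hits(handle)):
            print(hit_id)
```

Hits that align as a single CDS segment although the parent gene in C. gariepinus has several exons may be processed pseudogene candidates, but comparing exon counts needs the C. gariepinus annotation, which this sketch does not read.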
Last but not least: to get a general feeling for the quality of your genome assembly and annotation, you could select the 10 largest contigs, align them to the other Clarias genomes, map the proteins, then take a look in a genome browser and compute some stats.
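For picking the 10 largest contigs and getting basic stats, something like the following Python sketch would do. The function names are made up for illustration, and it assumes a plain, uncompressed FASTA file.

```python
def contig_lengths(fasta_lines):
    """Map each sequence name (first word of the header) to its length."""
    lengths = {}
    name = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            name = (line[1:].split() or ["unnamed"])[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length of the contig at which half of the total assembly size is reached."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def largest(lengths, k=10):
    """The k longest sequences as (name, length) pairs, longest first."""
    return sorted(lengths.items(), key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as handle:  # path to the assembly FASTA
        sizes = contig_lengths(handle)
    print("sequences:", len(sizes))
    print("total bp:", sum(sizes.values()))
    print("N50:", n50(sizes.values()))
    for name, length in largest(sizes):
        print(name, length)
```

The names printed at the end can then be used to pull those sequences out for the alignments and protein mapping.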