Question

How to merge annotations from different software tools?


3.2 years ago

mzzzzzzzzzz ▴ 40

Hi all, I'm quite new to genome annotation. So far I have an annotation lifted over from the reference genome with Liftoff, an annotation predicted by BRAKER (with hints from short-read RNA-seq, reference proteins, and long-read RNA-seq), and a GTF file of transcripts generated from long-read analysis (SQANTI3-filtered results). I would like to merge all of these annotations, keeping every unique transcript.

My first question is whether this merge is appropriate, given that the BRAKER results already combine all of the hints and that the other two tools (Liftoff and SQANTI3) use the same evidence. My reason for wanting it is that BRAKER uses TSEBRA to select transcript models at the end, which may discard some of the evidence that the other tools retain. From what I can see, the Liftoff results have 3000 more transcripts than the BRAKER results after TSEBRA selection, and the SQANTI3-filtered results still have 60 more new genes than the BRAKER/TSEBRA results.

My second question is, if I do this merge, which tools can I use? I checked gffcompare and TAMA merge, and it seems that these tools only merge the transcripts. In particular, after gffcompare my GFF file only had exon records left (no CDS information at all). I haven't tried TAMA merge yet, because it needs a lot of file format conversion. Could you please give me some suggestions?

Also, after merging, I would like to keep the GTF/GFF file in annotation format, not transcriptome format. That is, I would like the file to have a line for each gene, a line for each transcript, and then all the exon, CDS and other features belonging to that transcript. How can I do that?

Last but not least, I would like the gene names in the final annotation file to be the same as in the reference. How can I do this? Should I use BLAST for this purpose?

Thanks a lot!

genome annotation • 4.2k views

ADD COMMENT • link updated 2.5 years ago by Ram 45k • written 3.2 years ago by mzzzzzzzzzz ▴ 40

score 3 · Answer 1 · 2022-04-13

You need a chooser/combiner tool such as MAKER or EvidenceModeler. See this page for some other tools: https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/annotation_tools_genome.md

There are also brute-force approaches, such as:

- agat_sp_merge_annotations.pl from AGAT, which merges genes (adding isoforms) when their CDS overlap and appends genes when they do not overlap.
- agat_sp_complement_annotations.pl from AGAT, which just appends to a reference annotation all genes from another annotation that do not overlap its CDS (useful when one annotation is better than the other, but the latter contains loci not found by the first approach).
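As a rough illustration of how the two AGAT scripts could be called, here is a minimal Python sketch that only builds the command lines. The `--gff`, `--ref`, `--add` and `--out` options follow the usage shown in the AGAT documentation; the file names (`liftoff.gff3`, `braker.gff3`, `sqanti3.gff3`) and the helper function names are hypothetical, and GTF inputs may need converting to GFF3 first.

```python
import shlex


def agat_merge_cmd(gff_files, out_path):
    """Build the command for agat_sp_merge_annotations.pl (one --gff per input)."""
    cmd = ["agat_sp_merge_annotations.pl"]
    for gff in gff_files:
        cmd += ["--gff", gff]
    cmd += ["--out", out_path]
    return cmd


def agat_complement_cmd(ref_gff, add_files, out_path):
    """Build the command for agat_sp_complement_annotations.pl.

    ref_gff is the annotation you trust most; each file in add_files only
    contributes genes whose CDS do not overlap the reference.
    """
    cmd = ["agat_sp_complement_annotations.pl", "--ref", ref_gff]
    for gff in add_files:
        cmd += ["--add", gff]
    cmd += ["--out", out_path]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    merge = agat_merge_cmd(["liftoff.gff3", "braker.gff3"], "merged.gff3")
    complement = agat_complement_cmd(
        "liftoff.gff3", ["braker.gff3", "sqanti3.gff3"], "complemented.gff3"
    )
    # Print shell-ready commands; nothing is executed here.
    print(shlex.join(merge))
    print(shlex.join(complement))
```

With AGAT installed, either list can be passed to `subprocess.run(cmd, check=True)`. Which annotation to use as `--ref` is a judgment call that depends on which source you trust most.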

score 0 · Answer 2 · 2022-04-13


3.2 years ago

liorglic ★ 1.5k

I'd recommend EvidenceModeler. It takes any number of GFF files from multiple sources, plus a weight for each source, which you have to assign based on how much you trust it. The only problem is that the combined gene models get new IDs, and AFAIK there is no way to tell it to keep the reference IDs, so you'll have to think of some post-analysis that will transfer the IDs where relevant.
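To make the weighting concrete, here is a minimal sketch that writes an EvidenceModeler-style weights file: three tab-separated columns giving the evidence class, the source name (which has to match column 2 of the corresponding GFF3 file), and the weight. The source names and the weight values below are assumptions for illustration, not recommendations; check column 2 of your own files and pick weights that reflect your trust in each source.

```python
# Evidence classes accepted in an EvidenceModeler weights file.
VALID_CLASSES = {
    "ABINITIO_PREDICTION",
    "OTHER_PREDICTION",
    "PROTEIN",
    "TRANSCRIPT",
}

# Hypothetical (class, source, weight) rows. The source strings must match
# column 2 of your GFF3 inputs; the weights are arbitrary examples.
WEIGHTS = [
    ("OTHER_PREDICTION", "Liftoff", 10),
    ("ABINITIO_PREDICTION", "AUGUSTUS", 5),
    ("TRANSCRIPT", "SQANTI3", 8),
]


def format_weights(rows):
    """Return the weights file content as tab-separated text, one row per line."""
    lines = []
    for evidence_class, source, weight in rows:
        if evidence_class not in VALID_CLASSES:
            raise ValueError(f"unknown evidence class: {evidence_class}")
        lines.append(f"{evidence_class}\t{source}\t{weight}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the file that would be handed to EvidenceModeler.
    with open("weights.txt", "w") as handle:
        handle.write(format_weights(WEIGHTS))
```

Note that EvidenceModeler expects its inputs in its own GFF3 layout, so the Liftoff, BRAKER and SQANTI3 files would likely need converting before this weights file is of any use.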

ADD COMMENT • link 3.2 years ago by liorglic ★ 1.5k


I guess I will have to run BLAST and change the names myself. Thanks for sharing your opinion! I will have a look at EvidenceModeler!
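Since the Liftoff annotation sits on the same assembly as the merged models, coordinate overlap might be a simpler first pass than BLAST. Below is a minimal sketch under several assumptions: both inputs are GFF3 with `gene` features carrying an `ID` attribute, the Liftoff gene IDs are the reference names you want to keep, and a merged gene takes a reference ID when the overlap covers at least half of the longer of the two genes on the same sequence and strand. The function names, the 0.5 threshold and the example records are all hypothetical.

```python
from collections import defaultdict


def parse_genes(gff_text):
    """Collect gene features as {(seqid, strand): [(start, end, gene_id), ...]}."""
    genes = defaultdict(list)
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = dict(
            field.split("=", 1) for field in cols[8].strip().split(";") if "=" in field
        )
        genes[(cols[0], cols[6])].append((int(cols[3]), int(cols[4]), attrs.get("ID")))
    return genes


def transfer_ids(merged_text, ref_text, min_frac=0.5):
    """Map merged gene IDs to the best-overlapping reference gene ID.

    The overlap fraction is measured against the longer of the two genes,
    so a short model inside a long gene does not match automatically.
    """
    merged = parse_genes(merged_text)
    ref = parse_genes(ref_text)
    mapping = {}
    for key, merged_genes in merged.items():
        for m_start, m_end, m_id in merged_genes:
            best_id, best_frac = None, 0.0
            for r_start, r_end, r_id in ref.get(key, []):
                overlap = min(m_end, r_end) - max(m_start, r_start) + 1
                if overlap <= 0:
                    continue
                longer = max(m_end - m_start + 1, r_end - r_start + 1)
                frac = overlap / longer
                if frac > best_frac:
                    best_id, best_frac = r_id, frac
            if best_id is not None and best_frac >= min_frac:
                mapping[m_id] = best_id
    return mapping


def gff_line(seqid, source, start, end, strand, gene_id):
    """Build one tab-separated GFF3 gene record for the example below."""
    return "\t".join(
        [seqid, source, "gene", str(start), str(end), ".", strand, ".", f"ID={gene_id}"]
    )


# Made-up records, for illustration only.
REF_GFF = "\n".join([
    gff_line("chr1", "Liftoff", 100, 500, "+", "GeneA"),
    gff_line("chr1", "Liftoff", 1000, 2000, "-", "GeneB"),
])
MERGED_GFF = "\n".join([
    gff_line("chr1", "EVM", 120, 520, "+", "evm.1"),    # overlaps GeneA
    gff_line("chr1", "EVM", 3000, 3500, "+", "evm.2"),  # no reference gene here
    gff_line("chr1", "EVM", 1000, 1900, "+", "evm.3"),  # opposite strand to GeneB
])

if __name__ == "__main__":
    print(transfer_ids(MERGED_GFF, REF_GFF))
```

Gene-span overlap is cruder than comparing CDS or exon structure, and this sketch does not resolve cases where several merged genes map to the same reference gene, so the resulting table is best treated as a list of candidates to review.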

ADD REPLY • link 3.1 years ago by mzzzzzzzzzz ▴ 40