Question

Simplify annotation file by collapsing gene isoforms into a single annotation per gene

2

3.1 years ago

citizen_852 ▴ 30

Hi,

I need a simplified annotation file that contains a single "complete" annotation for each gene of the human genome. In other words, what I need is similar to when an annotation track in the UCSC Genome Browser is changed from full view to dense (see images below). Does anyone know of a simple way to collapse individual gene isoforms of a gtf file into single "complete" gene annotations?

Thanks.

[Image: All isoforms]

[Image: Isoforms collapsed into single "complete" gene annotation]

gtf annotation gff UCSC • 4.1k views

ADD COMMENT • link updated 3.1 years ago by Carambakaracho ★ 3.3k • written 3.1 years ago by citizen_852 ▴ 30

score 2 · Answer 1 · 2021-11-08

2

3.1 years ago

liorglic ★ 1.5k

Maybe this script can help.

ADD COMMENT • link 3.1 years ago by liorglic ★ 1.5k

0

Thanks for the reply. From what I understand the script filters away short isoforms and only keeps the longest. While that works for some genes (like LSM4) this approach could result in losing the annotation of certain exons only present in shorter isoforms. I would like to obtain an annotation file that has one variant per gene with all known and putative exons.

ADD REPLY • link 3.1 years ago by citizen_852 ▴ 30

3

This would create a biologically invalid GFF, in my inflexible, database-influenced mind, so I guess you'd have to come up with a solution.

Options for handling gtf/gff:

ADD REPLY • link 3.1 years ago by Carambakaracho ★ 3.3k

0

Thank you for the resources! The collapse_annotation script looks promising.

I think it is a bit harsh to call it biologically invalid. Though we lose resolution of gene isoforms, we maintain all exons that are assigned to a particular gene. If you only need to know whether a sequence can be assigned to a particular gene, then isoforms just complicate the process.

ADD REPLY • link 3.1 years ago by citizen_852 ▴ 30

0

Admitted, the phrase is a bit extreme. I totally see the purpose; a colleague has done the same, though while tracing which exon comes from which isoform. Just in case you ever think about annotation of transcript/protein changes.

ADD REPLY • link 3.1 years ago by Carambakaracho ★ 3.3k

0

AGAT does not have any script to collapse features in this way (yet).

ADD REPLY • link 3.1 years ago by Juke34 9.0k

0

I have never seen this done before and am not sure I understand the benefit of doing it. But I guess that's up to you. In any case, this will probably require merging exons and coming up with specific rules to resolve conflicting mRNA models. I'm curious about the downstream purpose of this procedure.
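To make the "merging exons" step concrete, here is a minimal sketch, not a tested tool. It assumes a standard 9-column GTF with `exon` features and a `gene_id "..."` attribute; the function name `collapse_exons` is made up for illustration. It pools the exons of all isoforms of a gene and merges the ones that overlap or directly abut. It does not try to resolve conflicting CDS or mRNA models.

```python
import re
from collections import defaultdict

# Assumes GTF-style attributes such as: gene_id "ENSG..."; transcript_id "ENST...";
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def collapse_exons(gtf_lines):
    """Pool exons from all isoforms per gene and merge overlapping ones.

    Returns {(gene_id, chrom, strand): [(start, end), ...]} with
    1-based inclusive coordinates, sorted by start.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features are collapsed
        match = GENE_ID.search(fields[8])
        if match is None:
            continue  # no gene_id, nothing to group by
        key = (match.group(1), fields[0], fields[6])
        exons[key].append((int(fields[3]), int(fields[4])))

    merged = {}
    for key, intervals in exons.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            # Merge if the exon overlaps or directly abuts the previous one
            # (an illustrative choice; use `start <= out[-1][1]` for overlap only).
            if start <= out[-1][1] + 1:
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[key] = [tuple(iv) for iv in out]
    return merged
```

Usage would be along the lines of `with open("annotation.gtf") as fh: merged = collapse_exons(fh)`, where the file name is a placeholder. Writing the result back out as GTF, with one synthetic transcript per gene, is left out here.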

ADD REPLY • link 3.1 years ago by liorglic ★ 1.5k

0

I guess one benefit is to reduce the search space while maintaining information about the exons associated with individual genes.

Personally I wanted this type of annotation to make it super quick and easy to find overlap between a sequence and the most upstream and downstream exon of genes. Some genes have e.g. one or more isoforms with exon 1 positioned downstream of the most upstream exon of the gene as a whole.

Luckily, since I am only looking at exons at the extreme ends, it is not too complicated to extract these and I have come up with a procedure that works fine. But having an annotation file with collapsed gene isoforms would have simplified the procedure.
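For readers after the same thing, one possible way to pick the extreme exons is sketched below. This is an illustration under stated assumptions, not the procedure referred to above: `terminal_exons` is a hypothetical helper that takes all exons of one gene (pooled across isoforms) as 1-based inclusive `(start, end)` tuples and treats "upstream" as strand-aware.

```python
def terminal_exons(exons, strand):
    """Return (most_upstream, most_downstream) exon for one gene.

    exons:  iterable of (start, end) tuples pooled across all isoforms.
    strand: "+" or "-"; on the minus strand the exon with the largest
            end coordinate is the most upstream one.
    """
    exons = list(exons)
    if not exons:
        raise ValueError("gene has no exons")
    # Leftmost exon: smallest start (ties broken by smaller end).
    leftmost = min(exons, key=lambda e: (e[0], e[1]))
    # Rightmost exon: largest end (ties broken by larger start).
    rightmost = max(exons, key=lambda e: (e[1], e[0]))
    if strand == "-":
        return rightmost, leftmost
    return leftmost, rightmost
```

Because the exons are pooled per gene first, an isoform whose exon 1 lies downstream of the gene's true first exon does not affect the result.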

ADD REPLY • link 3.1 years ago by citizen_852 ▴ 30

score 1 · Answer 2 · 2021-11-09

1

3.1 years ago

lassefolkersen ▴ 60

I think RefSeq Select is exactly what you are looking for. It's a version of RefSeq that picks one transcript per gene based on multiple hierarchically scored criteria, plus manual curation. I often use it to avoid overly complicated VEP annotations of my VCF files. Another advantage over the other proposed solutions is that you avoid having to explain complex script setups and can just refer to the NCBI webpage.

ADD COMMENT • link 3.1 years ago by lassefolkersen ▴ 60

1

Thanks for the input!

Great tool. However, picking only a single transcript variant could result in losing the annotation of certain exons only present in one or more of the discarded variants. It is important for my analysis that I retain the annotation of the most extremely positioned exons. Luckily, I think that I have come up with a filtering procedure that works fine for this purpose.

ADD REPLY • link 3.1 years ago by citizen_852 ▴ 30

1

Do you care to share? Despite my earlier dismissal this might be of interest to others in this forum.

ADD REPLY • link 3.1 years ago by Carambakaracho ★ 3.3k