Dear all,
I want to download gene sequences, including UTR annotations, for a list of genes. For some reason I was sure that this information was provided for each genome at release, as the output of some gene model identification step. I was therefore sure that I had also extracted the UTR regions when I bulk-downloaded the gene sequences for my list using the Unspliced (Gene) option in BioMart/Ensembl. It turned out that those sequences were not annotated. It also turned out that I had no idea about this, as I have no experience with genome assembly/annotation :)
Can somebody tell me how this annotation is performed, and why it is released for some genomes and not for others? Is there a way to perform this annotation de novo, for example for a list of genes of interest? Thanks in advance.
This blog post from Ensembl talks about the UTR annotation process. It is from 2018 and some things may have changed, but the bulk of it should still be informative.
Thank you. Could you tell me whether this annotation is normally done by the genome paper authors or by the Ensembl staff? I don't think I've ever read about a UTR annotation pipeline, at least in plant papers.
The Ensembl pipeline is an automated system that combines multiple lines of evidence to produce best-guess annotations. In some cases this consists mostly of importing transcript structures from another database (I believe this is the case for fly, where FlyBase is the primary source of data, and Arabidopsis, where TAIR is the primary source of data). A small number of genomes have some manual annotation as well (human most obviously, but I think there is also manual annotation for mouse).
Without an external annotation, the primary sources of data are multispecies protein alignments and cDNA/EST sequencing, plus, in more recent years, NGS data. As you can see in the above blog post, for automatically annotated species (which is most of them), the UTR is simply the span between the start or stop codon and the ends of the cDNAs that have been aligned to the genome at that locus.
In less well annotated species, you may have quite a lot of genes where the evidence comes primarily from alignment of protein-coding sequences from other species, plus a small amount of patchy EST evidence. In this case there is likely no UTR annotated.
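To make the point above concrete, here is a minimal Python sketch of the idea that a UTR is just the exonic sequence lying outside the coding region. This is not Ensembl's pipeline code. The function name `utr_intervals` and the toy coordinates are made up for illustration, and coordinates are assumed to be 1-based and inclusive.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive genomic coordinates


def utr_intervals(exons: List[Interval], cds_start: int, cds_end: int,
                  strand: str = "+") -> Dict[str, List[Interval]]:
    """Return the exonic intervals that fall outside the CDS.

    cds_start and cds_end are the lowest and highest coding positions on the
    genome, regardless of strand.
    """
    upstream, downstream = [], []
    for start, end in sorted(exons):
        if start < cds_start:
            # Part (or all) of this exon lies before the first coding base.
            upstream.append((start, min(end, cds_start - 1)))
        if end > cds_end:
            # Part (or all) of this exon lies after the last coding base.
            downstream.append((max(start, cds_end + 1), end))
    if strand == "+":
        return {"five_prime": upstream, "three_prime": downstream}
    # On the minus strand the transcript reads right to left.
    return {"five_prime": downstream, "three_prime": upstream}


# Transcript model supported by an aligned cDNA that extends past the CDS.
with_cdna = utr_intervals([(100, 200), (300, 400), (500, 650)], 150, 560)

# Model built from protein alignments only: exons stop at the CDS boundaries.
protein_only = utr_intervals([(150, 200), (300, 400), (500, 560)], 150, 560)

print(with_cdna)
print(protein_only)
```

In the second case both lists come back empty, which is the situation described for poorly annotated species: without transcript evidence reaching beyond the start and stop codons, there is nothing to call a UTR.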
You don't want the Unspliced (Gene) option in BioMart/Ensembl to get the UTRs. The cDNA option will give you the full transcript sequence, covering both the coding and UTR sequences.
Remember that only coding genes have UTRs.
I'm looking for the entire gene sequences, as for my purpose I want information on both UTRs and introns. That's why I chose that option.
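For that use case, one possible approach is to keep the unspliced sequence and label it yourself from exon and coding coordinates, which BioMart can export as structure attributes. Below is a rough Python sketch under several assumptions: a single transcript on the plus strand, a sequence given in genomic orientation, and 1-based inclusive coordinates. The function name `label_gene_sequence` and the toy sequence are hypothetical.

```python
from typing import List, Tuple


def label_gene_sequence(seq: str, gene_start: int,
                        exons: List[Tuple[int, int]],
                        cds_start: int, cds_end: int) -> List[Tuple[str, str]]:
    """Cut an unspliced plus-strand gene sequence into labelled pieces."""
    pieces = []  # (start, end, label) in genomic coordinates
    previous_end = gene_start - 1
    for start, end in sorted(exons):
        if start > previous_end + 1:
            # Gap between consecutive exons is an intron.
            pieces.append((previous_end + 1, start - 1, "intron"))
        if start < cds_start:
            pieces.append((start, min(end, cds_start - 1), "5'UTR"))
        coding_lo, coding_hi = max(start, cds_start), min(end, cds_end)
        if coding_lo <= coding_hi:
            pieces.append((coding_lo, coding_hi, "CDS"))
        if end > cds_end:
            pieces.append((max(start, cds_end + 1), end, "3'UTR"))
        previous_end = end
    # Convert genomic coordinates to 0-based string slices.
    return [(label, seq[s - gene_start:e - gene_start + 1])
            for s, e, label in pieces]


# Toy gene: two exons (1-7 and 14-19), intron in lowercase, CDS at 3-17.
toy_seq = "CCATGAA" + "gtacag" + "GTAA" + "TT"
labelled = label_gene_sequence(toy_seq, 1, [(1, 7), (14, 19)], 3, 17)
for label, piece in labelled:
    print(label, piece)
```

For genes on the minus strand the downloaded sequence is normally reverse-complemented, so the coordinates would need to be flipped before slicing; that step is left out here. If the annotation has no UTRs for a gene, the exons simply start and end at the CDS boundaries and no UTR pieces are produced.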