Question

UCSC Table Browser: redundant CDS sequences in generated fasta

11.1 years ago

lilla.davim ▴ 180

Hello,

I used UCSC Table Browser to generate a fasta where each entry is a CDS within a given region. The parameters I used are the following:

group: Genes and gene predictions
track: Ensembl genes
table: ensGene
region: position chr20:250000-1000000
output format: sequence
output file: myfile.fasta

Then:

sequence type for Ensembl Genes : genomic

And in "sequence retrieval region options": CDS Exons (only) and "One FASTA record per region (exon, intron, etc.)".

Now in the resulting fasta some identical sequences occur several times, with the same range but a different ID, for instance:

>hg19_ensGene_ENST00000217233_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ATGCGAGCC......

> (...)

>hg19_ensGene_ENST00000449710_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ATGCGAGCC......

Why does this occur and how can I obtain a fasta where each entry is a unique range?

Thanks.

table-browser ucsc cds • 3.6k views

updated 3.7 years ago by Ram 45k • written 11.1 years ago by lilla.davim ▴ 180

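That gene has multiple transcripts, and each transcript that contains the same CDS exon gets its own FASTA record, which is what you're seeing. Download the whole genome and the annotation file and use R or BioPerl/Biopython to extract the unique CDS ranges yourself.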
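As a rough sketch of the annotation-based route: assuming the ensGene table has been saved as tab-separated text in UCSC's genePred layout with a leading bin column (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...), the coding part of each exon can be computed per transcript and collapsed into a set of unique intervals before any sequence is extracted. The function names and the two toy rows are made up for illustration; check the column order against your own download.

```python
from typing import Iterable, List, Tuple

# chrom, start (0-based), end (exclusive), strand
Interval = Tuple[str, int, int, str]


def cds_exons(row: str) -> List[Interval]:
    """Return the coding portion of every exon in one genePred row.

    Assumes the column order of the UCSC ensGene table, including the
    leading 'bin' column. Coordinates stay 0-based, half-open.
    """
    f = row.rstrip("\n").split("\t")
    chrom, strand = f[2], f[3]
    cds_start, cds_end = int(f[6]), int(f[7])
    starts = [int(x) for x in f[9].split(",") if x]
    ends = [int(x) for x in f[10].split(",") if x]
    out = []
    for s, e in zip(starts, ends):
        # Clip the exon to the CDS; exons lying fully in a UTR drop out.
        s, e = max(s, cds_start), min(e, cds_end)
        if s < e:
            out.append((chrom, s, e, strand))
    return out


def unique_cds_exons(rows: Iterable[str]) -> List[Interval]:
    """Collapse CDS exons from all transcripts into sorted unique intervals."""
    seen = set()
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue  # skip blank lines and the header line
        seen.update(cds_exons(row))
    return sorted(seen)


# Toy rows (hypothetical transcripts TX_A and TX_B sharing their first CDS exon).
toy_rows = [
    "0\tTX_A\tchr20\t+\t100\t500\t150\t450\t2\t100,300,\t200,500,",
    "0\tTX_B\tchr20\t+\t100\t600\t150\t550\t2\t100,400,\t200,600,",
]
unique = unique_cds_exons(toy_rows)
for chrom, start, end, strand in unique:
    # Add 1 to the start to get the 1-based range shown in FASTA headers.
    print(f"{chrom}:{start + 1}-{end} {strand}")
```

The shared first exon is reported once. The resulting intervals could then be written out as BED and used to pull sequence from the genome with whatever tool you prefer.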

updated 3.7 years ago by Ram 45k • written 11.1 years ago by Devon Ryan 105k

Ram · Answer 1 · 2014-07-04

This occurs because most genes have multiple alternative transcripts annotated, and the CDSs of these can overlap, partially or completely. Ensembl does annotate one transcript per gene as canonical. From their glossary: "For human, the canonical transcript for a gene is set according to the following hierarchy: 1. Longest CCDS translation with no stop codons. 2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons. 3. If no (2), choose the longest translation with no stop codons. 4. If no translation, choose the longest non-protein-coding transcript." So you could consider taking the CDSs from only these canonical transcripts.

However, the only way to do this, as far as I am aware, is by using the Ensembl Perl API. I am happy to provide you with some code to accomplish this, but you would have to install the Ensembl API yourself (the easiest way is to use the Ensembl virtual machine). If you decide to do this and have questions or run into problems with the API installation, please contact the Ensembl Helpdesk at helpdesk@ensembl.org.

Also, I don't know what your ultimate goal is, but you probably should ask yourself if just taking one CDS per gene is the right thing to do for what you want to accomplish.
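If unique ranges are all you need, the FASTA you already have can also be deduplicated after the fact. Below is a minimal sketch, assuming every header carries the `range=` and `strand=` fields in the format shown in the question; it keeps the first record seen for each range and drops later ones. The function names are illustrative, and the sequences in the demo are short placeholders, not the real CDS sequences.

```python
import re
from typing import Iterable, Iterator, Tuple

RANGE_RE = re.compile(r"range=(\S+)")
STRAND_RE = re.compile(r"strand=([+-])")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def unique_by_range(records: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Keep only the first record for each (range, strand) combination."""
    seen = set()
    for header, seq in records:
        rng = RANGE_RE.search(header)
        strand = STRAND_RE.search(header)
        # Fall back to the whole header if the expected fields are missing.
        key = (rng.group(1) if rng else header, strand.group(1) if strand else None)
        if key in seen:
            continue
        seen.add(key)
        yield header, seq


# Demo with placeholder sequences; the third record is a made-up entry.
demo = """\
>hg19_ensGene_ENST00000217233_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ATGCGAGCC
>hg19_ensGene_ENST00000449710_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ATGCGAGCC
>toy_record_0 range=chr20:400001-400009 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GGGTTTAAA
"""
kept = list(unique_by_range(read_fasta(demo.splitlines())))
for header, seq in kept:
    print(f">{header}\n{seq}")
```

To run it on a real file, pass an open file handle (for example `open("myfile.fasta")`) to `read_fasta` instead of the demo lines. Note that this only removes exact duplicate ranges; CDS exons that overlap partially will still appear as separate records.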