UCSC Table Browser: redundant CDS sequences in generated fasta
10.4 years ago
lilla.davim ▴ 160

Hello,

I used UCSC Table Browser to generate a fasta where each entry is a CDS within a given region. The parameters I used are the following:

  • group: Genes and gene predictions
  • track: Ensembl genes
  • table: ensGene
  • region: position chr20:250000-1000000
  • output format: sequence
  • output file: myfile.fasta

Then:

  • sequence type for Ensembl Genes : genomic

And in "sequence retrieval region options": CDS Exons (only) and "One FASTA record per region (exon, intron, etc.)".

Now in the resulting fasta some identical sequences occur several times, with the same range but a different ID, for instance:

>hg19_ensGene_ENST00000217233_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ATGCGAGCC......

> (...)

>hg19_ensGene_ENST00000449710_0 range=chr20:368655-368945 5'pad=0 3'pad=0 strand=+ repeatMasking=none
ATGCGAGCC......

Why does this occur and how can I obtain a fasta where each entry is a unique range?

Thanks.

table-browser ucsc cds • 3.3k views

That gene has multiple transcripts, which is what you're seeing. Download the whole genome and the annotation file and use R or BioPerl/Biopython.

10.4 years ago
Bert Overduin ★ 3.7k

This occurs because most genes have multiple alternative transcripts annotated, and the CDSs of these can overlap partially or completely. Ensembl annotates one transcript per gene as canonical (from their glossary: "For human, the canonical transcript for a gene is set according to the following hierarchy: 1. Longest CCDS translation with no stop codons. 2. If no (1), choose the longest Ensembl/Havana merged translation with no stop codons. 3. If no (2), choose the longest translation with no stop codons. 4. If no translation, choose the longest non-protein-coding transcript."), so you could consider taking only the CDSs from these. However, the only way to do this, as far as I am aware, is by using the Ensembl Perl API. I am happy to provide you with some code to accomplish this, but you would have to install the Ensembl API yourself (the easiest way is to use the Ensembl virtual machine). If you decide to do this and have questions or run into problems with the API installation, please contact the Ensembl Helpdesk at helpdesk@ensembl.org.

Also, I don't know what your ultimate goal is, but you should probably ask yourself whether taking just one CDS per gene is the right thing to do for what you want to accomplish.


Addendum: Reading back again, I think I may have misunderstood your question. Do you only want to get rid of those CDSs that are exactly identical, or also of overlapping ones? If the former, just filter your output file for unique locations; if the latter, my reply above still holds.
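For the "exactly identical" case, filtering on unique locations could look roughly like the Python sketch below. It is only an illustration, not something taken from the Table Browser itself: it assumes every header carries `range=` and `strand=` fields like the ones in the question, and the function names and the output file name are made up for the example. It keeps the first record seen for each range/strand pair and drops later ones, so the surviving ID is simply whichever transcript came first in the file.

```python
import re

# Header fields as they appear in the Table Browser output shown in the question.
RANGE_RE = re.compile(r"range=(\S+)")
STRAND_RE = re.compile(r"strand=(\S)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None and line:
            # Sequence lines may be wrapped, so collect and join them.
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def unique_by_range(records):
    """Keep only the first record for each (range, strand) pair."""
    seen = set()
    for header, seq in records:
        rng = RANGE_RE.search(header)
        strand = STRAND_RE.search(header)
        # Assumption: a header without range= is treated as its own key.
        key = (rng.group(1) if rng else header,
               strand.group(1) if strand else None)
        if key in seen:
            continue
        seen.add(key)
        yield header, seq


def dedupe_fasta(in_path, out_path):
    """Write a copy of in_path with duplicate ranges removed (one line per sequence)."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for header, seq in unique_by_range(read_fasta(fin)):
            fout.write(f">{header}\n{seq}\n")
```

Calling `dedupe_fasta("myfile.fasta", "myfile.unique.fasta")` would then write the filtered file (the second name is hypothetical). Note that this removes only exact duplicates; CDS exons that overlap but have different coordinates are all kept.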
