Question

Filtering transcripts by transcript support level (TSL)

1

9.4 years ago

jth ▴ 190

Hi,

I have a question about filtering Ensembl transcripts by transcript support level (TSL). I am currently collecting Ensembl transcripts for three separate purposes:

1. Calculating nucleotide distributions on exons, introns, and UTRs separately (from canonical transcripts, to avoid redundancy).
2. For a given genomic location, providing an annotation based on its position in a gene model.
3. RNA quantification (I recently read this: https://cgatoxford.wordpress.com/2015/10/21/improving-kallisto-quantification-accuracy-by-filtering-the-gene-set/)

I am inclined to filter out TSL 4 (the best supporting EST is flagged as suspect) and TSL 5 (no single transcript supports the model structure) for all three purposes, to obtain more accurate distributions, annotations, and quantification.
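
To make the criterion concrete, here is a minimal sketch of the filter applied to GTF lines. It assumes the attribute column carries a field of the form `transcript_support_level "5";`, possibly with trailing text after the number or the value `NA`; that format, the function names, and the choice to keep records without a numeric TSL are all assumptions for illustration, not a statement of how any particular release encodes it.

```python
import re
from typing import Iterable, Iterator, Optional

# Assumed attribute format: transcript_support_level "5"; the quoted value
# may also be "NA" or carry trailing text after the leading number.
_TSL_RE = re.compile(r'transcript_support_level "([^"]*)"')


def parse_tsl(attributes: str) -> Optional[int]:
    """Return the numeric TSL from a GTF attribute string, or None if absent/NA."""
    match = _TSL_RE.search(attributes)
    if match is None:
        return None
    parts = match.group(1).split()
    if parts and parts[0].isdigit():
        return int(parts[0])
    return None


def filter_gtf_lines(
    lines: Iterable[str], rejected: frozenset = frozenset({4, 5})
) -> Iterator[str]:
    """Yield GTF lines whose TSL is not in `rejected`.

    Header lines, malformed lines, and records without a numeric TSL
    (e.g. gene records or "NA") are passed through unchanged; that is a
    policy choice of this sketch.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9:
            yield line
            continue
        if parse_tsl(fields[8]) not in rejected:
            yield line
```

Whether records with no TSL should be kept or dropped is a separate decision; the sketch keeps them so that only transcripts explicitly labelled 4 or 5 are removed.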

When I filter by this criterion, 50,672 of 191,632 transcripts and 5,401 of 57,387 canonical transcripts on the autosomes are eliminated. Of those eliminated, 22,035 transcripts (~29% of all protein-coding transcripts) and 1,941 canonical transcripts (~10% of all protein-coding canonical transcripts) are protein coding. Because these numbers are fairly high and may particularly affect the first purpose, I have become suspicious of this strategy. The post linked above also shows an interesting result for quantification, which has left me more confused.
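
For scale, the overall fractions removed follow directly from the counts quoted above. The helper name below is just for illustration.

```python
def percent_removed(removed: int, total: int) -> float:
    """Percentage of `total` that `removed` represents, rounded to one decimal."""
    return round(100.0 * removed / total, 1)


# Counts quoted in the question (autosomal chromosomes only).
all_transcripts = percent_removed(50_672, 191_632)
canonical_transcripts = percent_removed(5_401, 57_387)

print(f"all transcripts removed:       {all_transcripts}%")
print(f"canonical transcripts removed: {canonical_transcripts}%")
```

So roughly a quarter of all transcripts, but under a tenth of canonical transcripts, are dropped.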

So, do you think this type of filtering is appropriate for the given purposes, or is it overly conservative and/or unnecessary?

Thanks!

ensembl genome sequence transcripts • 4.9k views

updated 2.7 years ago by Ram 45k • written 9.4 years ago by jth ▴ 190

1

Filtering out TSL 4 and 5 seems reasonable.

updated 5.3 years ago by Ram 45k • written 9.3 years ago by Rm 8.3k

score 1 · Answer 1 · 2017-02-24

1

8.1 years ago

Vasisht ▴ 190

Filtering out TSL 4 and 5 will also remove genes such as JAK1 and SMAD4, which have RefSeq and CCDS transcripts but no principal isoform with a TSL of 1 through 3. It may be better to filter by APPRIS annotation or by overlap with RefSeq/CCDS.
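
One way to express that idea is a "rescue" rule: drop a TSL 4/5 transcript only if it lacks independent support. The sketch below assumes GENCODE-style GTF attributes, with tags such as `tag "CCDS";` and `tag "appris_principal_1";`; those tag names and the function name are assumptions here, so check them against the annotation file you actually use.

```python
import re

# Assumed GENCODE-style attributes; verify against your own GTF.
_TSL_RE = re.compile(r'transcript_support_level "(\d+)')
_TAG_RE = re.compile(r'tag "([^"]+)"')


def keep_transcript(attributes: str) -> bool:
    """Keep a transcript unless it is TSL 4/5 AND has no CCDS/APPRIS-principal tag."""
    tags = set(_TAG_RE.findall(attributes))
    rescued = "CCDS" in tags or any(
        tag.startswith("appris_principal") for tag in tags
    )
    match = _TSL_RE.search(attributes)
    # Transcripts with no numeric TSL (missing or "NA") are not treated as low support.
    low_support = match is not None and int(match.group(1)) in (4, 5)
    return rescued or not low_support
```

With this rule, a TSL 5 transcript tagged as CCDS or APPRIS principal survives, while an untagged TSL 5 transcript is still removed.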