Question

Isoform Reconstruction

3

12.7 years ago

GPR ▴ 390

Dear community,

I have a rather broad question on which I would like to have your input. I am trained in structural biology and proteomics and, following the needs of my project, I have recently ventured into learning how to analyze transcript-level RNA-seq data. As with protein identification in proteomics, my fellow biologists seem recalcitrant to the idea that RNA-seq experimental data can be used to infer gene isoforms and, even better, to calculate their abundances. It is literally impossible to convey the message that the Excel spreadsheet they get at the end of the analysis is not a list of predictions that needs to be validated with a vast number of convoluted PCR and/or cloning experiments.

In the case of mass spec data, I have well-founded arguments to show that an MS/MS fragmentation pattern explains a peptide sequence or a phosphorylation site, for example, if the statistical parameters are good enough. In the case of isoform reconstruction from RNA-seq data, I am not sure I have all the arguments at hand. I would therefore appreciate it if any (or many) of you could give me the bioinformatician's point of view.

A few specific questions: Are the available bioinformatics tools (e.g. TopHat-Cufflinks-Cuffdiff) mature enough to reconstruct isoforms? Furthermore, in a reference-guided analysis, what can I make of novel isoforms, in particular those tagged with class_code "j"?

Your input will be appreciated.

G.

• 3.9k views

ADD COMMENT • link updated 12.7 years ago by Qdjm 1.9k • written 12.7 years ago by GPR ▴ 390

2

This post seems relevant: http://www.biostars.org/post/show/16649/how-are-rnaseq-transcripts-assigned/#16649

ADD REPLY • link 12.7 years ago by Malachi Griffith 20k

0

Very good thread, thanks a lot!

ADD REPLY • link 12.7 years ago by GPR ▴ 390

score 1 · Answer 1 · 2012-08-18

1

12.7 years ago

Qdjm 1.9k

As a computational biologist who understands a bit about how these programs work but who has never worked with this data, I'm not yet comfortable with any isoform reconstruction unless the splicing graph allows only one or a small number of possible isoforms. In most RNA-seq technologies, the reads cover only 1-3 exons, so you have to infer the existence and abundance of isoforms by matching exons and exon-exon junctions that have the same read depth. This seems problematic to me. That said, this is an active area of research, and people have published algorithms to infer isoform abundance. I suspect that they only work well when (a) only a small number of isoforms are expressed and (b) the isoforms all have relatively high abundances.
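To make the concern concrete, here is a minimal toy sketch (not how Cufflinks or any other tool actually works). It assumes a hypothetical gene with two independent cassette-exon events that are too far apart for any read to span both, so each read only reports "included" or "skipped" at one event. The feature names, isoform names, and abundances are all made up for illustration.

```python
# Toy model: each isoform is described by which local features it contains.
# "inc1"/"skip1" = first cassette exon included/skipped, likewise for the second.
FEATURES = ["inc1", "skip1", "inc2", "skip2"]
ISOFORMS = {
    "II": (1, 0, 1, 0),  # both exons included
    "IS": (1, 0, 0, 1),  # first included, second skipped
    "SI": (0, 1, 1, 0),  # first skipped, second included
    "SS": (0, 1, 0, 1),  # both skipped
}


def expected_coverage(abundance):
    """Sum each isoform's abundance over the features it contains."""
    totals = [0.0] * len(FEATURES)
    for name, level in abundance.items():
        for i, present in enumerate(ISOFORMS[name]):
            totals[i] += level * present
    return totals


# Two very different (hypothetical) isoform mixtures...
mix_a = {"II": 5, "IS": 5, "SI": 5, "SS": 5}
mix_b = {"II": 10, "IS": 0, "SI": 0, "SS": 10}

# ...produce the same per-feature read depth, so short reads alone
# cannot tell them apart.
cov_a = expected_coverage(mix_a)
cov_b = expected_coverage(mix_b)
print(dict(zip(FEATURES, cov_a)))
print(dict(zip(FEATURES, cov_b)))
print("indistinguishable:", cov_a == cov_b)
```

With only one alternative event, or with reads long enough to span both events, the same bookkeeping has a unique solution, which is why the small-splicing-graph case is the one that can be trusted.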

I do believe the splice boundary calls and I also believe the gene abundance levels (where all reads to the gene are counted, regardless of the isoform). What kind of novel isoforms are you talking about? If they contain a previously unobserved splice boundary that is supported by substantial read depth, I would be more confident than if they don't contain any new exons or splice boundaries.

ADD COMMENT • link 12.7 years ago by Qdjm 1.9k

1

I wanted to write this answer, but had to run in the morning. I am of the same opinion. I have also attended talks by developers who claimed that they predict genes with 1 isoform with 76% accuracy, and that this dropped to 45% with 2 isoforms! Just 2. Their line of argument is always about how their software performs 4% better than the other one. I don't want to blame them; they put in the effort, after all. Even among the existing software packages, some give you the minimal set of isoforms and others give you all possible combinations. We don't know much about the mechanism of splicing; that is, our understanding is not sufficient to dictate the way splicing happens. Until then, I don't think it's possible to decipher the absolute number of isoforms from RNA-Seq (or any other) data. It's like trying to find the combination of numbers that add up to 5, except that we don't know that they have to add up to 5. No offense to the honest efforts, but it is clear that the problem is still very much ill-posed.

ADD REPLY • link 12.7 years ago by Arun 2.4k

1

Thanks for your input; I will give all these comments serious consideration.

ADD REPLY • link 12.7 years ago by GPR ▴ 390

0

Thanks for your answer. I am talking about known splice isoforms, mainly the junction-based distribution of transcript abundance among the known isoforms of each gene. I am being conservative in my analysis by using a reference database (GTF file) to guide transcript reconstruction. Regarding coverage, I have a very good data set: about 400 million paired-end reads, plus a biological replicate with about 150 million paired-end reads. Your comments actually bring me back to a question I posted previously but for which I haven't gotten an answer so far: how low is too low when looking at isoform abundance? I am trying to cut it at 1 FPKM, even though Cuffdiff tags isoforms with less than 1 FPKM as significant. Is this a good cutoff? Thanks
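One way to get a feel for what a 1 FPKM cutoff means at these depths is to convert it back into raw fragment counts. The sketch below uses the plain textbook FPKM definition (fragments per kilobase of transcript per million mapped fragments). It is not Cuffdiff's actual estimate, which as far as I understand also involves effective lengths and probabilistic assignment of fragments to isoforms. The 2000 bp transcript length is hypothetical, and it assumes the quoted 400 million and 150 million figures are mapped fragments (read pairs), which the post does not state.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def fragments_at_fpkm(target_fpkm, transcript_length_bp, total_mapped_fragments):
    """Invert the formula: how many fragments a given FPKM corresponds to."""
    return target_fpkm * transcript_length_bp * total_mapped_fragments / 1e9


LENGTH_BP = 2000  # hypothetical isoform length, for illustration only
DEPTHS = {
    "deep sample": 400_000_000,  # assumed to be mapped fragments
    "replicate": 150_000_000,    # assumed to be mapped fragments
}

# Number of fragments that a 1 FPKM isoform of this length would attract.
support = {
    name: fragments_at_fpkm(1.0, LENGTH_BP, depth)
    for name, depth in DEPTHS.items()
}

for name, n in support.items():
    print(f"{name}: 1 FPKM on a {LENGTH_BP} bp isoform ~ {n:.0f} fragments")
```

Under these assumptions, 1 FPKM corresponds to several hundred fragments per isoform, so whether a cutoff is "too low" depends on sequencing depth and transcript length as much as on the FPKM value itself.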

ADD REPLY • link 12.7 years ago by GPR ▴ 390