Our lab is working on alternative splicing events in some non-model organisms.
However, there are few mRNAs (transcripts) in the annotation files: 1 transcript per gene in most cases.
We thought this was because the genome is poorly annotated, so we collected a lot of RNA-seq data and applied HISAT and StringTie to generate novel splice variants.
The output contains a lot more transcripts, reaching 2 transcripts per gene.
However, we are concerned about the new transcripts, because they extend beyond the gene region in the annotation file: in some cases a whole exon lies outside the gene region, and in a few cases the whole transcript does.
May I have your advice on other tools, or your opinion of the new transcripts?
Do you think they are acceptable?
Is there a handy tool to find new combinations of exons from RNA-seq data and a .gtf annotation file?
Thank you very much!
In a non-model organism, I suspect much of the annotation will come from aligning protein sequences from other organisms against the genome to look for homologs. Thus the annotated gene region will only contain the ORF of the gene and not the UTRs. UTRs are the worst-annotated part of any genome, even the human genome, so disagreements between sequencing data and annotation about the start and stop coordinates of a gene are to be expected. Alternate 5' and 3' exons are also not that uncommon. I would be inclined to trust that there is at least something in these transcripts, although be aware that many alternate transcripts are either non-protein-coding or unstable in the cell. Check the integrity of the ORF and look for introns in the putative 3' UTR.
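As a rough illustration of the ORF integrity check, here is a minimal sketch. It assumes you have already extracted the putative coding sequence of a transcript as a DNA string and that the standard stop codons apply; the function name `orf_problems` is made up for this example and is not part of any tool mentioned in this thread.

```python
# Hypothetical helper: lists the ways a putative CDS fails to be an intact ORF.
# Assumes the standard stop codons (TAA, TAG, TGA) and an ATG start.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_problems(cds):
    """Return a list of problems found in the coding sequence (empty if intact)."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("no ATG start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("premature stop codon")
    return problems


print(orf_problems("ATGGCCTAA"))
print(orf_problems("ATGTAAGCCTGA"))
```

An empty list means the sequence starts with ATG, stays in frame, and ends at its first stop codon. Anything else is worth a closer look before trusting the isoform as protein coding.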
Where you have a whole transcript outside the gene region, there are several possibilities:
1. It is a new protein-coding gene that has not been previously annotated, either because it has no homologs in other organisms, or is sufficiently diverged as not to show up, or is only expressed in a small number of conditions.
2. It is a lincRNA.
3. It is a product of non-specific background transcription.
4. You are looking at DNA contamination in your RNA preps (all RNA preps have some DNA contamination).
5. You are looking at a mapping artefact.
For the last two, one would not expect to see splice junctions in the new transcript. For this reason people are very skeptical of novel single-exon transcripts. For the first three, if you are interested in the transcript, it's all about doing the old-school biology, really.
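Since single-exon transcripts are the ones that deserve the most suspicion, it can help to list them. Below is a minimal sketch that counts `exon` records per `transcript_id` in GTF-formatted lines, such as those in an assembler's output. The function names and the toy records are made up for illustration, and it assumes standard tab-separated GTF with a `transcript_id "..."` attribute.

```python
import re
from collections import defaultdict


def exon_counts(gtf_lines):
    """Count exon records per transcript_id in GTF-formatted lines."""
    counts = defaultdict(int)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features are counted
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


def single_exon_transcripts(gtf_lines):
    """Return the sorted IDs of transcripts that have exactly one exon."""
    return sorted(t for t, n in exon_counts(gtf_lines).items() if n == 1)


# Toy records (hypothetical): G1.1 is spliced, G2.1 is a single exon.
example_gtf = [
    'chr1\tStringTie\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";',
    'chr1\tStringTie\texon\t100\t300\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";',
    'chr1\tStringTie\texon\t600\t900\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";',
    'chr1\tStringTie\texon\t2000\t2500\t.\t.\t.\tgene_id "G2"; transcript_id "G2.1";',
]

print(single_exon_transcripts(example_gtf))
```

For a real file you would pass an open file handle instead of the list. A transcript on this list is not necessarily an artefact, but it lacks the splice-junction evidence that argues against DNA contamination or mis-mapping.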
Thank you very much for your detailed answer!
What do you think about a whole new exon appearing before or after the annotated gene region? These actually make up a large portion of our output. I've checked, and some of them overlap a nearby gene while others are completely unannotated.
If I removed them all, there would not be much left.
I'd be careful with exons that overlap other genes. It's not that it's impossible (many human genes overlap), it's just that there are reasons why they might also be artefacts. Is your RNA-seq stranded? I'd also want to be careful with new exons 3' of the ORF: classically we think of exon boundaries more than 50 bp after the stop codon as triggering nonsense-mediated decay. There is no reason you shouldn't have exons 5' of the ORF, though. Many genes don't have their start codon in the first exon, or have alternate transcription start sites or whole alternate 5' exons.
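The 50 bp rule of thumb can be checked mechanically once you know where the stop codon falls in each transcript. The sketch below is only an illustration under stated assumptions: exon lengths are given in 5' to 3' transcript order, `stop_codon_end` is the 1-based transcript coordinate of the last base of the stop codon, and the function names and the default threshold of 50 are my own choices for the example.

```python
from itertools import accumulate


def downstream_junction_distances(exon_lengths, stop_codon_end):
    """Distances (nt) from the end of the stop codon to each downstream exon-exon junction."""
    # A junction sits after the last base of every exon except the final one.
    junctions = list(accumulate(exon_lengths))[:-1]
    return [j - stop_codon_end for j in junctions if j > stop_codon_end]


def violates_50nt_rule(exon_lengths, stop_codon_end, threshold=50):
    """True if any exon-exon junction lies more than `threshold` nt after the stop codon."""
    distances = downstream_junction_distances(exon_lengths, stop_codon_end)
    return any(d > threshold for d in distances)


# Exons of 200, 150 and 300 nt give junctions at transcript positions 200 and 350.
print(violates_50nt_rule([200, 150, 300], stop_codon_end=320))  # junction 30 nt downstream
print(violates_50nt_rule([200, 150, 300], stop_codon_end=250))  # junction 100 nt downstream
```

A transcript flagged this way is a candidate target for nonsense-mediated decay rather than a proven one, so treat the result as a reason to look more closely, not as a filter to apply blindly.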
We are from a bioinformatics lab, and the RNA-seq data were collected from different experiments in different databases; I think most are not stranded. We applied a pipeline that generates new transcripts, which achieves our initial aim of increasing the number of isoforms, yet we worry whether biologists will accept the result. I am now also searching for a pipeline that will only generate new combinations of exons. Many thanks for spending time with me!