Dear all, I'm not sure if this is a vague question, but here we go. I've got an annotation of non-protein-coding gene transcripts made up of repetitive elements. I'd like to check whether any of the predicted transcripts can be identified as "real" transcripts (in other words, I would expect enough RNA-seq reads to align to these annotated features that a bioinformatic tool would call the putative transcripts genuine transcripts). I've got access to 500 RNA-seq samples which I can use for this. What would be the best approach, considering I'd like these "new sequences" to be detected alongside regular protein-coding genes?
Something I started piecing together was as follows:
1. Simply cat forward / reverse reads together from all 500 samples.
2. Align reads using hisat2.
3. Run stringtie to identify transcripts using my annotation (of protein-coding genes + new genetic features) as "reference" annotation.
What would you do?
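For concreteness, the three steps above could be sketched roughly as follows. This is only a sketch that builds the command lines without running them; the file names, the index name, the `pooled` prefix and the thread count are placeholders I made up, and piping through `samtools sort` is my assumption, since stringtie needs a coordinate-sorted BAM.

```python
import shlex


def build_pipeline(r1_files, r2_files, index, annotation_gtf,
                   prefix="pooled", threads=8):
    """Return shell commands for a pooled cat -> hisat2 -> stringtie run.

    All names here are illustrative placeholders, not part of any tool.
    """
    if len(r1_files) != len(r2_files):
        raise ValueError("forward and reverse read lists must have equal length")
    q = shlex.quote
    r1 = f"{prefix}_R1.fastq.gz"
    r2 = f"{prefix}_R2.fastq.gz"
    bam = f"{prefix}.sorted.bam"
    out_gtf = f"{prefix}.stringtie.gtf"
    return [
        # Samples must be listed in the same order for both mates,
        # otherwise read pairs no longer line up after concatenation.
        "cat " + " ".join(q(f) for f in r1_files) + " > " + q(r1),
        "cat " + " ".join(q(f) for f in r2_files) + " > " + q(r2),
        # --dta makes hisat2 report alignments suited to transcript assemblers.
        f"hisat2 -p {threads} --dta -x {q(index)} -1 {q(r1)} -2 {q(r2)}"
        f" | samtools sort -@ {threads} -o {q(bam)} -",
        # -G supplies the combined annotation as the reference guide.
        f"stringtie {q(bam)} -G {q(annotation_gtf)} -p {threads} -o {q(out_gtf)}",
    ]


if __name__ == "__main__":
    for cmd in build_pipeline(["s1_1.fq.gz", "s2_1.fq.gz"],
                              ["s1_2.fq.gz", "s2_2.fq.gz"],
                              "genome_index", "genes_plus_repeats.gtf"):
        print(cmd)
```

Printing the commands first makes it easy to inspect them before committing to a run over 500 samples.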
How long are these repetitive elements, and are there SNPs with strong confidence in each repeat?
The putative genes in this annotation vary in length, from 100 to over 10,000 bp. They do have high-confidence SNPs in them. There is a program, Telescope, that quantifies expression of these features in RNA-seq data based on these genetic differences. This program is also the source of this annotation of putative transcripts.
Just to clarify, you want to check whether any of these repetitive elements have appreciable expression in your cohort of 500 RNA-seq samples? I've not heard of Telescope before, but it looks like it takes a similar approach to resolving multimappers as the more popular alternatives such as Salmon. If Salmon or Telescope can resolve the multimappers based on the SNPs, and there is an appreciable estimated abundance, I would probably be confident enough to say they are expressed. Perhaps Rob can add to this.
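As a rough illustration of what "appreciable estimated abundance" could mean in practice, here is a minimal sketch that keeps features reaching a TPM cutoff in a minimum fraction of samples. The cutoffs (`min_tpm`, `min_fraction`), the function name and the feature IDs are all assumptions for illustration, not recommendations from Salmon or Telescope; the input is assumed to be per-sample abundance estimates you have already collected.

```python
def expressed_features(tpm_by_feature, min_tpm=1.0, min_fraction=0.1):
    """Return {feature_id: fraction of samples with TPM >= min_tpm}
    for features that meet min_fraction.

    tpm_by_feature maps a feature ID to a list of per-sample TPM values.
    Both thresholds are arbitrary placeholders to be tuned for the cohort.
    """
    kept = {}
    for feature_id, tpms in tpm_by_feature.items():
        if not tpms:
            continue  # no samples recorded for this feature
        fraction = sum(t >= min_tpm for t in tpms) / len(tpms)
        if fraction >= min_fraction:
            kept[feature_id] = fraction
    return kept


if __name__ == "__main__":
    # Hypothetical feature IDs and TPM values across four samples.
    example = {
        "repeat_locus_A": [0.0, 0.0, 5.0, 2.0],
        "repeat_locus_B": [0.0, 0.0, 0.0, 0.2],
    }
    print(expressed_features(example, min_tpm=1.0, min_fraction=0.5))
```

Requiring expression in a fraction of samples rather than on average avoids calling a feature expressed because of a single outlier sample.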