Hi all,
Bit of an essay, apologies...
I've applied a novel transcript discovery pipeline to RNAseq data derived from a cell type grown as a monoculture with/without treatment in vitro, particularly focusing on lncRNAs:
Basic pipeline (open to comments/criticism!): STAR mapping of reads to GENCODEv26 indexed hg38 -> remove non-expressed transcripts from GENCODEv26 -> StringTie to merge abundance-filtered GENCODEv26 transcripts with new transcripts -> remove known ORFs -> CPC/HMMER/RNAcode -> annotated + new lncRNAs
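To make the "remove non-expressed transcripts" step concrete, here is a minimal Python sketch of one way it could be done, not necessarily how it was done here. It assumes per-transcript TPM values are already available as a dict, and the 1.0 TPM cutoff, the function name and the choice to drop gene-level records are all illustrative assumptions.

```python
import re

# Matches the transcript_id attribute in a GTF attribute column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def filter_gtf_by_expression(gtf_lines, tpm_by_transcript, min_tpm=1.0):
    """Keep GTF records whose transcript has TPM >= min_tpm.

    min_tpm=1.0 is an assumed cutoff, not a recommendation.
    Transcripts missing from tpm_by_transcript are treated as TPM 0.
    """
    kept = []
    for line in gtf_lines:
        if line.startswith("#"):
            kept.append(line)  # keep header comment lines
            continue
        match = TRANSCRIPT_ID.search(line)
        if match is None:
            # Gene-level records carry no transcript_id; dropped in this sketch.
            continue
        if tpm_by_transcript.get(match.group(1), 0.0) >= min_tpm:
            kept.append(line)
    return kept


# Toy example with two hypothetical transcripts.
example_gtf = [
    "##description: toy annotation",
    'chr1\tTOY\ttranscript\t100\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tTOY\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tTOY\ttranscript\t900\t1500\t.\t-\t.\tgene_id "G2"; transcript_id "T2";',
]
example_tpm = {"T1": 5.0, "T2": 0.1}
filtered = filter_gtf_by_expression(example_gtf, example_tpm)
```

The filtered records could then be written back out as the reference annotation for the merge step.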
This has yielded some nice data, seeing ~35% newly assembled lncRNAs in my differentially expressed genes. I also now basically have a customised annotation specific to this cell type.
I would now like to see whether this is relevant in some in vivo data. I have found tissue-level data which will contain a variable amount of the cell type I started with, as well as other cell types. My approach so far would be to use RSEM to quantify the reads in this dataset against my new customised annotation. A bit messy, but I think enough to show my lncs are active in a real-world situation, though I'm also having doubts, which any comments on the questions below may help with!
1) Are there any approaches to estimate cellular make-up in tissue-level data based on cell-specific markers?
2) Is this just too naive an approach to be useful?
3) I could run the pipeline again on the in vivo dataset, but it isn't stranded... would this mess up transcript discovery too much?
Would appreciate any input, thanks for reading :)
Thanks, some good food for thought in here.
Have overlapped my new assemblies to FANTOM CAT (seeing some degree of exonic overlap for the vast majority) but it would be good to do the latest GENCODE too.
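For anyone wanting to put a number on "degree of exonic overlap", a minimal Python sketch follows. It assumes exons are given as half-open (start, end) intervals already restricted to the same chromosome and strand. The function names are illustrative, and this is not the method used for the FANTOM CAT comparison above.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def exonic_overlap_fraction(query_exons, ref_exons):
    """Fraction of query exonic bases covered by any reference exon.

    Both inputs are assumed to be on the same chromosome and strand.
    """
    query = merge_intervals(query_exons)
    ref = merge_intervals(ref_exons)
    total = sum(end - start for start, end in query)
    if total == 0:
        return 0.0
    shared = 0
    for q_start, q_end in query:
        for r_start, r_end in ref:
            # Length of the intersection, or 0 if the intervals are disjoint.
            shared += max(0, min(q_end, r_end) - max(q_start, r_start))
    return shared / total


# Toy example: a two-exon new assembly against two reference exons.
fraction = exonic_overlap_fraction([(100, 200), (300, 400)],
                                   [(150, 250), (390, 500)])
```

In the toy example, 60 of the 200 query exonic bases are shared, so `fraction` is 0.3. For genome-scale comparisons an interval tool would be far faster than this nested loop.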