Hi everyone,
I'm trying to simulate mRNA-seq data to map back to my spliced genome graph for a baseline comparison (graph vs. linear reference). In short, I'm using a real mRNA-seq data set to match the error profile and RSEM to estimate expression, then using vg sim to simulate the reads. With all the intermediate steps and files, it's becoming a pain and I'm running into a lot of issues, so I had an idea and wanted to put it out there for feedback.
Currently, my graph represents the full genome (introns, exons, etc.) to which I have added splice junctions using vg rna.
IN THEORY, if I were to run the vg rna step again but remove any non-gene regions (-d, --remove-non-gene), which results in an exon-only graph, and ran vg sim using this version of the graph, would this produce mRNA-like data?
I may be completely off the mark but there's no harm in asking.
Yes, I believe this should be equivalent to using the full graph. It should probably be mentioned that there are still some features of real RNA-seq data that aren't modeled by vg sim, like intron retention in nascent mRNA, stochastic transcription of non-genic sequences, and expression of many ncRNAs. However, these limitations apply to both the full and exon-only graphs.
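For reference, here is a rough sketch of the exon-only workflow described in the question. It is not a tested recipe. All file names are hypothetical placeholders, and apart from -d (quoted in the question) the flags are my assumptions about current vg releases: -n for the annotation in vg rna, and -x, -n, -a and -F in vg sim. Please check them against vg rna --help and vg sim --help. Expression weighting from the RSEM output is left out. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical input files; replace with your own.
GRAPH=genome.vg            # full genome graph, before adding splice junctions
GTF=annotation.gtf         # transcript annotation
READS=real_reads.fastq     # real mRNA-seq reads used to fit the error profile

# Run a command and redirect its stdout to a file.
# With DRY_RUN=1 (the default) the command is only printed.
run_to() {
  local out=$1; shift
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$* > $out"
  else
    "$@" > "$out"
  fi
}

main() {
  # Add splice junctions and drop non-gene regions (-d, --remove-non-gene),
  # giving an exon-only graph. ASSUMPTION: -n supplies the GTF.
  run_to exon_only.vg vg rna -d -n "$GTF" "$GRAPH"

  # Simulate reads from the exon-only graph, fitting the error profile to
  # the real FASTQ (-F) and writing alignments (-a) as GAM.
  # ASSUMPTION: recent vg accepts the graph directly via -x; older
  # versions may need an xg index built first.
  run_to sim_reads.gam vg sim -x exon_only.vg -n 1000000 -a -F "$READS"
}

main "$@"
```

Running the script as-is prints the two commands, so you can check them before committing to a long run.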