I've been working on an RNA-seq pipeline that makes it easier for my lab members to run their own analyses, even though they aren't very tech-savvy. The pipeline works great for our lab because we only work on human samples, but it doesn't transfer well to other labs that use different model organisms such as mouse, rat, or Drosophila.
My question is: how do you write the differential gene expression (DGE) portion of an RNA-seq script so that it is applicable to any organism?
Currently, my pipeline takes in a GTF and creates a TxDb object from it. That solves some of my problems but opens up new ones, such as how to convert differently annotated gene identifiers into more conventional names (e.g. Ensembl IDs to gene symbols). It also makes automated gene ontology analysis extremely difficult, because I can never predict what kind of annotation is being fed into the script.
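To make the ID-to-symbol problem concrete: Ensembl- and GENCODE-style GTFs usually carry both `gene_id` and `gene_name` in the attribute column, so a mapping can sometimes be pulled from the same file that builds the TxDb. Below is a minimal Python sketch of that idea, assuming that attribute layout. GTFs from other sources may lack `gene_name`, in which case it just falls back to the ID. The function name and the example records are hypothetical, not part of the pipeline described above.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def gene_id_to_symbol(gtf_lines):
    """Build {gene_id: gene_name} from GTF lines.

    Falls back to the gene_id itself when a record has no gene_name.
    """
    mapping = {}
    for line in gtf_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        name = attrs.get("gene_name")
        if name:
            mapping[gene_id] = name
        else:
            # Keep any symbol seen earlier; otherwise use the ID as its own label.
            mapping.setdefault(gene_id, gene_id)
    return mapping


# Made-up records for illustration only (IDs and symbols are placeholders).
example_gtf = [
    "#!example header",
    '1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "GENE0001"; gene_name "SYMA"; gene_biotype "protein_coding";',
    '1\tsrc\texon\t100\t300\t.\t+\t.\tgene_id "GENE0001"; transcript_id "TX0001";',
    '2\tsrc\tgene\t50\t400\t.\t-\t.\tgene_id "GENE0002";',
]
symbols = gene_id_to_symbol(example_gtf)
```

This only covers the naming part; it does nothing for gene ontology, which still needs an organism-specific annotation source.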
Any suggestions?
For our premade pipelines, we dictate the genomes that are available. Then we can ensure that things like GO annotation have a chance at working. Adding more genomes just means a few lines in a config file, so it's not exactly difficult.
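As a rough illustration of the "few lines in a config file" idea, here is a minimal Python sketch of a genome registry. The JSON layout, keys, and file paths are all assumptions made up for this example, not the actual config format used by the pipelines mentioned above. The `orgdb` values are Bioconductor annotation package names that a downstream GO step could load.

```python
import json

# Hypothetical registry: the layout, keys, and paths are placeholders.
GENOME_CONFIG = """
{
  "GRCh38": {"organism": "Homo sapiens",
             "gtf": "/path/to/GRCh38.gtf",
             "orgdb": "org.Hs.eg.db"},
  "GRCm39": {"organism": "Mus musculus",
             "gtf": "/path/to/GRCm39.gtf",
             "orgdb": "org.Mm.eg.db"}
}
"""


def load_genomes(config_text):
    """Parse the registry text into a dict keyed by genome name."""
    return json.loads(config_text)


def resolve_genome(name, genomes):
    """Return the settings for a supported genome, or fail with a clear message."""
    try:
        return genomes[name]
    except KeyError:
        supported = ", ".join(sorted(genomes))
        raise ValueError(
            f"Unsupported genome {name!r}; choose one of: {supported}"
        ) from None


genomes = load_genomes(GENOME_CONFIG)
mouse = resolve_genome("GRCm39", genomes)
```

Because every supported genome is listed up front, the pipeline can refuse anything else early instead of failing halfway through the GO step. Supporting another organism then means adding one more entry to the registry.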
This was my original plan, but it brings up another issue. Do you also dictate which annotations can be used (Ensembl vs. UCSC vs. GENCODE, etc.)?
Yes, we/I dictate the annotations that are available. I manage the available packages/indices/annotations for the institute, so I get to play benevolent dictator :) This ends up cutting down on people doing silly things like mixing/matching chromosome naming conventions and organisms.
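For pipelines that can't lock things down this tightly, the chromosome-naming mix-up can at least be caught early. Here is a minimal Python sketch, assuming the only difference that matters is the `chr` prefix (UCSC-style `chr1` vs. Ensembl-style `1`). The function names are hypothetical, and real genomes have extra wrinkles (e.g. `chrM` vs. `MT`, scaffold names) that this does not handle.

```python
def naming_style(seqnames):
    """Classify sequence names as 'prefixed' (chr1), 'unprefixed' (1), or 'mixed'."""
    names = list(seqnames)
    if not names:
        raise ValueError("no sequence names given")
    prefixed = [n.startswith("chr") for n in names]
    if all(prefixed):
        return "prefixed"
    if not any(prefixed):
        return "unprefixed"
    return "mixed"


def check_compatible(index_names, gtf_names):
    """Raise if the index and the GTF disagree on naming; return the shared names."""
    index_names, gtf_names = set(index_names), set(gtf_names)
    index_style = naming_style(index_names)
    gtf_style = naming_style(gtf_names)
    if "mixed" in (index_style, gtf_style) or index_style != gtf_style:
        raise ValueError(
            f"Chromosome naming mismatch: index is {index_style}, GTF is {gtf_style}"
        )
    shared = index_names & gtf_names
    if not shared:
        raise ValueError("Index and GTF share no sequence names")
    return shared


# Illustrative inputs only.
ok = check_compatible(["chr1", "chr2", "chrX"], ["chr1", "chrX"])
```

Running a check like this before alignment or counting turns a silent "zero reads assigned" result into an immediate, readable error.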
I may just have to start leaning towards doing something similar.
Thanks Devon!