Hello, I'm hoping someone can provide some insight or point me in the right direction... I have very little programming knowledge and am fairly new to RNA-seq but I'm sure there must be an easier way to do what I need...
Using Trinity de novo assembly, I have assembled the paired-end reads from my RNA-seq data. I have also used the Trinity RSEM utility to calculate transcript abundance. I would now like to annotate, or identify by protein name, the transcripts most highly expressed in certain samples.
Currently, I import the RSEM output file (RSEM.genes.results), with FPKM values, into Excel as a tab-delimited file and sort by highest FPKM. Then I search for the gene ID corresponding to each FPKM value in the Trinity assembly output (.fasta). There I can find the corresponding sequence, which I then manually paste into nucleotide BLAST on the NCBI website... for each individual gene.
This is a very cumbersome and tedious approach, and I am certain there is a more automated way to do this. I have very limited programming experience, so I cannot quickly write a script to do the above for me... but I'm almost positive there must be some built-in Trinity function or other established script that can do this. What is the approach that is generally taken? I would be extremely grateful if you could point me in the right direction! Thank you for any help!
You're correct that this is a good candidate process for automation. I don't personally know of any existing tool that does exactly what you're wanting, but I'd be happy to write a quick script; it would be a good exercise for me.
If you can post Dropbox links to an example of your RSEM output file, and maybe an Excel file in progress, that would give me a full understanding of what you're trying to do.
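In the meantime, here is a minimal sketch of the manual steps described in the question: sort the RSEM table by FPKM, look up the matching sequences in the Trinity FASTA, and write them to a single FASTA file. It assumes the usual RSEM genes.results columns (`gene_id`, `transcript_id(s)`, `FPKM`) and that the FASTA headers start with the transcript ID; the function names and the tiny in-memory sample data are made up for illustration, so check them against your real files.

```python
import csv
import io


def top_genes_by_fpkm(rsem_handle, n=10):
    """Return (gene_id, [transcript ids], FPKM) for the n highest-FPKM genes."""
    reader = csv.DictReader(rsem_handle, delimiter="\t")
    rows = sorted(reader, key=lambda r: float(r["FPKM"]), reverse=True)
    return [
        (r["gene_id"], r["transcript_id(s)"].split(","), float(r["FPKM"]))
        for r in rows[:n]
    ]


def read_fasta(handle):
    """Read a FASTA file into a dict keyed by the first word of each header."""
    records, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def write_top_sequences(rsem_handle, fasta_handle, out_handle, n=10):
    """Write sequences of the n top genes as FASTA; return how many were written."""
    seqs = read_fasta(fasta_handle)
    written = 0
    for gene_id, transcript_ids, fpkm in top_genes_by_fpkm(rsem_handle, n):
        for tid in transcript_ids:
            if tid in seqs:  # skip ids that are missing from the assembly file
                out_handle.write(f">{tid} gene={gene_id} FPKM={fpkm}\n{seqs[tid]}\n")
                written += 1
    return written


if __name__ == "__main__":
    # Hypothetical toy data standing in for RSEM.genes.results and Trinity.fasta.
    rsem = io.StringIO(
        "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n"
        "g1\tt1\t300\t250\t10\t4.0\t5.0\n"
        "g2\tt2\t500\t450\t90\t99.0\t120.5\n"
    )
    fasta = io.StringIO(">t1 len=8\nACGTACGT\n>t2 len=6\nGGGCCC\n")
    out = io.StringIO()
    write_top_sequences(rsem, fasta, out, n=1)
    print(out.getvalue(), end="")
```

With real data you would replace the `io.StringIO` objects with `open(...)` calls on your own files. The resulting multi-sequence FASTA can then be submitted to BLAST in one batch, or used as the query for a local BLAST+ search, instead of pasting sequences one gene at a time.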
My group has recently developed a pipeline for transcriptome annotation (Annocript). The pipeline identifies both coding and non-coding RNAs and, after preliminary configuration (parameters, database download, additional software installation), is completely automated for all future runs. It is comparatively faster than current annotation pipelines and gives protein, domain, GO term, enzyme, and pathway annotation. It also estimates the ORF size and non-coding potential of each transcript in order to classify the transcript as coding or non-coding.