Hi all -- I'm new to RNA-seq and have had some issues assembling the reads. I'm looking for any advice or input on what might be the best way to handle my data.
My work is done in oocytes of a non-model organism without a reference genome.
I have performed RNA-IPs against two different proteins and processed the IPs for paired-end RNA-seq. My goal is to identify the transcripts associated with both of these proteins (the overlap). In addition, I have processed whole oocytes for RNA-seq. Everything was done with three biological replicates.
Since my work is in a non-model organism, I have been using Trinity to assemble my paired-end RNA-seq data. I think there are a few ways this can be done, and I could use any input on which method might be best. I've tried one and hit some problems, which is why I'm confused and wondering whether a different approach would be better.
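To make the "overlap" goal concrete, here is a minimal Python sketch, assuming each IP replicate has already been quantified into an RSEM `isoforms.results` table with the usual `transcript_id` and `TPM` columns. The function names, the `min_tpm` cutoff, the transcript IDs, and the numbers in the demo tables are all made up for illustration. A simple detection cutoff is only a first look, not a substitute for testing enrichment of each IP over the whole-oocyte samples.

```python
import csv
import io

# Column layout of an RSEM isoforms.results file.
HEADER = ["transcript_id", "gene_id", "length", "effective_length",
          "expected_count", "TPM", "FPKM", "IsoPct"]


def detected_transcripts(handle, min_tpm=1.0):
    """Return the transcript IDs with TPM >= min_tpm in one results table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["transcript_id"] for row in reader
            if float(row["TPM"]) >= min_tpm}


def shared_transcripts(replicates_a, replicates_b, min_tpm=1.0):
    """Transcripts detected in every replicate of IP A and of IP B."""
    sets_a = [detected_transcripts(h, min_tpm) for h in replicates_a]
    sets_b = [detected_transcripts(h, min_tpm) for h in replicates_b]
    return set.intersection(*sets_a) & set.intersection(*sets_b)


def demo_table(tpm_by_id):
    """Build a tiny in-memory results table (made-up values) for the demo."""
    lines = ["\t".join(HEADER)]
    for tid, tpm in tpm_by_id.items():
        lines.append("\t".join([tid, tid, "1000", "850.00", "0.00",
                                f"{tpm:.2f}", f"{tpm:.2f}", "100.00"]))
    return io.StringIO("\n".join(lines) + "\n")


# Hypothetical transcript IDs and TPM values, two replicates per IP.
ip_a = [demo_table({"t1": 12.0, "t2": 5.0, "t3": 0.0}),
        demo_table({"t1": 8.0, "t2": 0.2, "t3": 0.0})]
ip_b = [demo_table({"t1": 3.0, "t2": 9.0, "t3": 4.0}),
        demo_table({"t1": 2.5, "t2": 7.0, "t3": 6.0})]

shared = shared_transcripts(ip_a, ip_b, min_tpm=1.0)
print(sorted(shared))
```

With real data, the in-memory tables would be replaced by open file handles, one per replicate.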
OPTION 1: Assemble the whole oocyte transcriptome using Trinity and use this as a reference transcriptome.
- After assembly, I used Trinotate to cross-reference my assembly against a recently released protein database for my organism. I believe this assigned a protein annotation to the contigs.
- I then used the built-in Trinity plugins to align and estimate transcript abundance using RSEM for each IP sample separately
- I simply used the raw fastq files (left and right) for each IP (did they need to be assembled here??).
- Looking at the RSEM.isoforms.results output, I saw that in every IP a control transcript had an FPKM of 0, which I assume means it was not detected. This is obviously concerning, especially since I could detect my control transcript by qPCR in the same sample.
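One way to look more closely at a zero-FPKM result is to pull the control's row out of the results table and check whether it is absent from the reference altogether or present with zero counts. Below is a minimal Python sketch, assuming the standard RSEM `isoforms.results` columns (`transcript_id`, `length`, `expected_count`, `TPM`, `FPKM`). The function name, the contig IDs, and the table contents are hypothetical, and the control's ID in the Trinity assembly is assumed to be known already.

```python
import csv
import io


def lookup_transcripts(handle, wanted_ids):
    """Return {transcript_id: values} for the wanted IDs found in the table.

    IDs missing from the returned dict are not in the reference at all,
    which is a different situation from being present with zero counts.
    """
    wanted = set(wanted_ids)
    found = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        tid = row["transcript_id"]
        if tid in wanted:
            found[tid] = {
                "length": int(row["length"]),
                "expected_count": float(row["expected_count"]),
                "TPM": float(row["TPM"]),
                "FPKM": float(row["FPKM"]),
            }
    return found


# Made-up table standing in for one IP's RSEM.isoforms.results file.
example = io.StringIO(
    "transcript_id\tgene_id\tlength\teffective_length\t"
    "expected_count\tTPM\tFPKM\tIsoPct\n"
    "contig_control\tgene_control\t1500\t1350.00\t0.00\t0.00\t0.00\t0.00\n"
    "contig_other\tgene_other\t900\t750.00\t42.00\t15.30\t11.80\t100.00\n"
)

hits = lookup_transcripts(example, ["contig_control", "contig_missing"])
for tid in ["contig_control", "contig_missing"]:
    if tid in hits:
        print(tid, hits[tid])
    else:
        print(tid, "is not in the reference")
```

For a real run, `example` would be replaced by `open(...)` on the results file for each IP sample.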
OPTION 2: Assemble all IP reads together. In this case I would then map each IP's raw fastq file back to this "IP-transcriptome" to try and estimate transcript abundance using RSEM. (I'd ignore the whole oocyte data in this scenario)
OPTION 3: Assemble each IP individually and separately. I would then use TransDecoder, Trinotate, and BLAST to try to map these reads to the recently released protein database. I would use the protein database.fasta file as the reference in this case.
Which option seems best? Any idea why my first approach failed to show the control transcript?
This is my first time doing RNA-seq so I apologize if these questions are very naive! All advice is greatly appreciated. Thank you!
Thank you for your advice!
Hello Charles, I want to detect lncRNAs in some human (control and treatment) RNA-seq data in FASTQ format. I read the article at http://www.nature.com/articles/srep22698, which uses CLC Genomics, a de novo assembly pathway, and so on. I checked its data, and it is different from mine (in terms of library and format). Can I still use its workflow to detect lncRNAs?
I am using CLC Genomics to get differential gene expression.
Your attention would be really appreciated
If you are interested in human lncRNAs, you might want to start with the GENCODE annotations without doing a de novo assembly:
http://www.gencodegenes.org/releases/current.html
Pre-existing assemblies might also exist for your specific topic of interest, but you can also align your reads against the assembly that you have made and BLAT the sequences of highly expressed transcripts against the human genome (to see if they overlap known annotations). The quantification part should be something that you can do in CLC Bio. I can't provide more specific directions, but it is commercial software with its own tech support (support-clcbio@qiagen.com).
Not sure how you were comparing your human results to that Salmon paper (and CLC Bio should work with most sequencing platforms), but I would focus on the most highly expressed transcripts. If you do a transcriptome alignment to begin with, you can just focus on unaligned reads to try and see if there are any highly expressed novel lincRNAs.
Also, this should really be a separate question, and not a comment for this previous post that is only similar in that it involves a de novo assembly (in this case, in a non-model organism)