Hi Fellows,
I am planning to generate reference de novo assembly as follows:
- Creating reference assembly by pooling reads of all 6-samples i.e. WT and transgenic together in one fastq file.
- Run velveth with auto parameter for K from 27 to 71 on that single fastq file
- Afterward using velvethg for each value of K
- Running oases on contig.fa file generated from each value of K
- Transcript.fa file generated from each assembly are merged using Merged option in Oases
- Checking merged assembly, by mapping each sample reads back to merged assembly using bowtie.
Am I on right path?
Any suggestion would be appreciated.
Thanks
I read that comment in my post but I am confused about how to create de novo reference assembly. Thanks a lot
If you are talking about Trinity or Oases, you are talking about RNA-Seq, correct?
The reference here is not a genomic reference, but a transcript reference.
The output of Oases includes a transcripts.fa file. The extra link that I added to this response describes some strategies I used to create an optimal list of transcripts (parameters to make the transcripts.fa file smaller as well as select one sequence per gene). However, this might not be technically necessary.
eXpress aligns reads to a list of mRNAs (unlike aligners such as TopHat, which align to the DNA-based reference genome, and then require a program like cufflinks to provide mRNA quantification). If you are working with RNA-Seq data, this is the only option that you have (since you ideally shouldn't have any sequence from introns or intergenic sequence).
So, once you run the Oases command shown above, you can start referring to the eXpress manual. Also, if you are familar with Perl, I have a templete for this sort of pipeline run_RNA_Seq_de_novo.pl):
https://sites.google.com/site/cwarden45/scripts
I understand for my rna-seq data, the reference will be transcript reference. Any suggestion about RSEM vs eXpress. I will try to understnd and run script.
Thanks for all your feedback. Naresh
I haven't personally tried RSEM with de novo RNA-Seq data, but I know that is a popular algorithm. If you have time, you can just try them both.
In the script description, I also mention CLC Bio de novo. Although not specifically designed for RNA-Seq, I found it to be useful. In fact, it provides coverage statistics along with the contigs (so, a separate mRNA quantification step is not needed). If your institution has a license for CLC Bio Genomics Workbench, I would recommend trying that. Otherwise, Oases + eXpress is the only other thing that I can currently stand behind.
Thanks! Can you please give me any suggestion about -amos_file option while running velvet/Oases. Why it is useful and it's importance.
Best, Naresh
I don't think you should have to use that parameter.
I believe amos is a file format, so it is only necessary if you want to work with that file format:
http://wiki.bioinformatics.ucdavis.edu/index.php/Illumina_assembly_using_velvet
I think the only necessary parameters are listed in the "impatient people" section of the manual, which I think is fairly well documented (and there are lots of velvet tutorials on the web)
Thanks cwarden45.
Hi Cwarden45, I need your help. I used express for abundancy. And for downstream analysis I want to use edgeR. But I am having hard time understading about how to run edgeR package. Can you please guide me step by step. I would really appreciate that.
Thanks, Naresh.
Hi Cwarden45,
I need your help. I used express for abundancy. And for downstream analysis I want to use edgeR. But I am having hard time understading about how to run edgeR package. Can you please guide me step by step. I would really appreciate that.
Thanks, Naresh
How many samples do you have?
edgeR is meant for differenital expression - meaning you compare the expression patterns for the same gene in multiple samples divided into groups (most often, two groups).
The input for edgeR will be a table of read counts. Once you can produce this, I think the edgeR instructions are pretty good.
The potential problem is that this is typically done against a known reference genome. In other words, eXpress will tell you the abudances for transcripts in a particular sample, but it will not be sufficient to tell you what transcript IDs match between samples.
Using BLAST can help by identifying homologs to other, characterized genomes. For me, this was the end step: I sorted the transcripts by expression levels and checked if the type of genes made sense (for samples coming from muscle, fat, etc.). I have not yet had to define differentially expressed genes from a RNA-Seq de novo assembly study. You might be able to use your BLAST results to guess which transcripts match (or simply use BLAST to map the transcripts between samples instead of the NCBI nucleotide database). However, I don't have any advice in regards to how you can determine the optimal criteria to determine that transcripts from two samples are indeed the same (for example, I think it is likely that the regions of coveraged sequence will often vary - so, only portions of transcripts actually align).
Hi Cwarden45,
Thanks for your feedback.
I have 3 different samples. From express, which count did you used i.e. eff_count, est_count, etc for edgeR.
I agree with you about blast. I am also running blast2go on side to get sequence annotation.
Thanks in advance! Naresh
There are a few caveats (see below), but I would probably say tot_count is what you would use.
1) If you only have 3 samples, then you must have at least one group without a replicate. Although you might technically be able run edgeR and/or DESeq without replicates, I don't think the results will be as useful. You may just want to calculate fold-change values, without worry about p-values. If you do this, you should use FPKM values, or you could just rank the genes and compare differences in percentiles.
2) I never used edgeR - my goal was to determine what types of genes were expressed in a couple samples. So, I just wanted to see which genes were most highly expressed (using FPKM and eff_count). The relative expression in one sample to another sample didn't matter.
3) I technically didn't find this result to be the best: I actually found the de novo assembly program in CLC Bio (not specifically designed for RNA-Seq) to give the most reasonable top ~20 genes (using the coverage statistic provided by that program). However, I did think this was the best result out of the open-source tools that I tested. So, you may need to do some trial and error on your own. For example, using the Oases (i.e Velvet) contigs may actually work better than the Oases transcripts (even if you filter them). I would just make sure you understand what each eXpress metric means:
http://bio.math.berkeley.edu/eXpress/manual.html#expr
Good luck!
Thanks a lot cwarden45,