Hi,
I first assembled an 80 Mb genome using four Illumina libraries. After masking the repeat elements, I am now trying to generate gene models from RNA-Seq data using cufflinks. I have three single-end RNA-Seq Illumina libraries (isolated at three different time points, from the pure to the infection phase of the pathogen).
1). In the first run, I ran tophat2 and then cufflinks to get the transcripts from all three libraries individually. The tophat2 commands were:
nohup tophat2 --num-threads 35 --b2-very-sensitive -o $I/Tophat_out_MAsked_genome_lib1 $I/SCa_gtr_300_discarded_90_99per_Ns_for_Masked_Tophat.fa.index $D/lib1.fastq &
nohup tophat2 --num-threads 35 --b2-very-sensitive -o $I/Tophat_out_MAsked_genome_lib2 $I/SCa_gtr_300_discarded_90_99per_Ns_for_Masked_Tophat.fa.index $D/lib2.fastq &
nohup tophat2 --num-threads 35 --b2-very-sensitive -o $I/Tophat_out_MAsked_genome_lib3 $I/SCa_gtr_300_discarded_90_99per_Ns_for_Masked_Tophat.fa.index $D/lib3.fastq &
This generated the accepted_hits.bam files, which I used in Cufflinks like this:
cufflinks -o Cufflinks_all/ -p 30 -L Ph ./Tophat_out_MAsked_genome_All_RNA_Seq/accepted_hits.bam
This way I got around 800, 19,000 and 20,000 transcripts for lib1, lib2 and lib3, respectively. Then I merged the transcripts using cuffmerge; the command was:
cuffmerge -s $I/SCa_gtr_300_discarded_90_99per_Ns_for_Masked_Tophat.fa $I/assemblies.txt
Cuffmerge generated around 17,500 transcripts.
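For reference, the assemblies.txt passed to cuffmerge is a plain text list of the per-library transcripts.gtf files, one path per line. A minimal sketch of building it, assuming hypothetical Cufflinks output directories named Cufflinks_lib1 to Cufflinks_lib3 (these names are placeholders, not the ones from the run above):

```shell
#!/bin/sh
# Print the transcripts.gtf path inside each Cufflinks output directory given,
# one per line, skipping directories where the file is missing.
list_assemblies() {
  for dir in "$@"; do
    gtf="$dir/transcripts.gtf"
    if [ -f "$gtf" ]; then
      printf '%s\n' "$gtf"
    else
      printf 'warning: %s not found, skipping\n' "$gtf" >&2
    fi
  done
}

# Hypothetical per-library output directories; adjust to the real ones.
list_assemblies Cufflinks_lib1 Cufflinks_lib2 Cufflinks_lib3 > assemblies.txt
```

Checking that every expected library actually appears in the list is a quick way to confirm that all three assemblies go into the merge.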
2). In the second run, I pooled all three libraries and ran tophat2 and Cufflinks on the single dataset. This generated ~21,000 transcripts.
My question is: which strategy should I follow? I am also interested in finding differentially expressed genes using Cuffdiff. What should be the input .gtf file for Cuffdiff: the file generated by the first method or the second one?
I would really appreciate your comments on this. Thank you so much in advance!
Best regards and wishes,
Rahul
Are the three libraries from the same sample or three different samples? That will determine which method is preferred.
Hi, thanks for your reply! They are from three different time points of pathogen infection. But if my first goal is to find all genes in this species, should I use the gtf file from method 2? Later, I will also look for differentially expressed genes. Many thanks!
1) Why cuffmerge and not cuffcompare? See also this thread: RNA-seq with cuffdiff: use merged.gtf from cuffmerge or combined.gtf from cuffcompare?
2) Personally, I think both methods are reasonable, but maybe you can remove the first library, which gives only 800 transcripts. Maybe that's part of the relatively large difference in the total number of transcripts.
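To make the Cuffdiff part concrete: cuffdiff takes one annotation GTF followed by one BAM (or comma-separated group of BAMs) per condition, so whichever GTF you settle on simply goes in as the first argument. Below is a minimal dry-run sketch, assuming the cuffmerge default output merged_asm/merged.gtf and hypothetical tophat2 output directories, labels and thread count (all placeholders, not taken from your run). It only prints the command so the arguments can be checked before running anything:

```shell
#!/bin/sh
# Build (and only print) a cuffdiff command line.
# $1 = annotation GTF, remaining args = one BAM per condition (time point).
build_cuffdiff_cmd() {
  gtf=$1
  shift
  # -o output directory, -p threads, -L comma-separated condition labels.
  # Output directory, thread count and labels are placeholders; the number
  # of labels must match the number of BAM arguments.
  cmd="cuffdiff -o cuffdiff_out -p 8 -L T1,T2,T3 $gtf"
  for bam in "$@"; do
    cmd="$cmd $bam"
  done
  printf '%s\n' "$cmd"
}

# Hypothetical paths: one accepted_hits.bam per library / time point.
build_cuffdiff_cmd merged_asm/merged.gtf \
  Tophat_out_lib1/accepted_hits.bam \
  Tophat_out_lib2/accepted_hits.bam \
  Tophat_out_lib3/accepted_hits.bam
```

Swapping merged_asm/merged.gtf for the transcripts.gtf from the pooled run is the only change needed to compare the two annotations as Cuffdiff input.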