Question

Difference In Cufflinks Transcripts With Merging 3 Cufflink Outputs And By Pooling All 3 Libraries?

11.0 years ago

Rahul Sharma ▴ 660

Hi,

I first assembled a genome of size 80MB using four Illumina libraries. After masking the repeat elements, I am trying to generate the gene using RNA-Seq from cufflinks. I have three Single RNA-Seq Illumina libraries (Isolated at three different time points, from pure to infection phase of pathogen). In the first run, I first ran tophat2 and then cufflinks to get the transcripts from all three libraries individually.

1) In the first run, I ran tophat2 and then Cufflinks on each of the three libraries individually:

nohup tophat2 --num-threads 35 --b2-very-sensitive -o $I/Tophat_out_MAsked_genome_lib1 $I/SCa_gtr_300_discarded_90_99per_Ns_for_Masked_Tophat.fa.index $D/lib1.fastq &

nohup tophat2 --num-threads 35 --b2-very-sensitive -o $I/Tophat_out_MAsked_genome_lib2 $I/SCa_gtr_300_discarded_90_99per_Ns_for_Masked_Tophat.fa.index $D/lib2.fastq &

nohup tophat2 --num-threads 35 --b2-very-sensitive -o $I/Tophat_out_MAsked_genome_lib3 $I/SCa_gtr_300_discarded_90_99per_Ns_for_Masked_Tophat.fa.index $D/lib3.fastq &

Each run generated an accepted_hits.bam file, which I passed to Cufflinks like this:

cufflinks -o Cufflinks_all/ -p 30 -L Ph ./Tophat_out_MAsked_genome_All_RNA_Seq/accepted_hits.bam
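For the per-library step described above, the three Cufflinks calls could be generated in a loop. This is only a sketch: the output directory names `Cufflinks_lib1` to `Cufflinks_lib3` and the use of the library name as the `-L` label are assumptions, and the commands are printed rather than executed so they can be checked first.

```shell
#!/bin/sh
# Dry run: print one Cufflinks command per library instead of executing it.
# Output directories (Cufflinks_libN) and labels are hypothetical names;
# the TopHat directories follow the naming used in the tophat2 commands.
print_cufflinks_cmds() {
    for lib in lib1 lib2 lib3; do
        echo "cufflinks -o Cufflinks_${lib} -p 30 -L ${lib} Tophat_out_MAsked_genome_${lib}/accepted_hits.bam"
    done
}

print_cufflinks_cmds
```

Piping the printed lines to `sh` would run them once the paths look right.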

This way I got around 800, 19,000 and 20,000 transcripts for lib1, lib2 and lib3, respectively. I then merged the transcripts using cuffmerge with this command:

cuffmerge -s $I/SCa_gtr_300_discarded_90_99per_Ns_for_Masked_Tophat.fa $I/assemblies.txt
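The `assemblies.txt` file passed to cuffmerge is a plain list with one GTF path per line. A minimal sketch of how it might be built, assuming each library's Cufflinks output went to a hypothetical directory `Cufflinks_lib1`, `Cufflinks_lib2` or `Cufflinks_lib3` (Cufflinks writes its assembly to `transcripts.gtf` in the output directory):

```shell
#!/bin/sh
# Hypothetical per-library Cufflinks output directories (assumed names).
OUT_DIRS="Cufflinks_lib1 Cufflinks_lib2 Cufflinks_lib3"

# Start from an empty list, then add one transcripts.gtf path per line.
: > assemblies.txt
for d in $OUT_DIRS; do
    echo "$d/transcripts.gtf" >> assemblies.txt
done

# Show the resulting list for inspection.
cat assemblies.txt
```

The resulting file can then be given to cuffmerge as its final argument, as in the command above.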

Cuffmerge generated around 17,500 transcripts.

2) In the second run, I pooled all three libraries and ran tophat2 and Cufflinks on the single combined dataset. This generated ~21,000 transcripts.

My question is, which strategy should I follow? I am also interested in finding out differentially expressed genes using Cuffdiff. What should be the input .gtf file for Cuffdiff? The file generated by 1st method or the 2nd one?
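For reference, whichever GTF is chosen, the Cuffdiff call takes the same general shape: the annotation first, then one alignment file (or a comma-separated list of replicates) per condition. The sketch below is a dry run that only prints the command. It assumes the cuffmerge default output `merged_asm/merged.gtf`, a hypothetical output directory `Cuffdiff_out`, and treats each library as its own condition without replicates.

```shell
#!/bin/sh
# Dry run: build and print a Cuffdiff command without executing it.
MERGED_GTF="merged_asm/merged.gtf"   # assumed: cuffmerge default output path

# One BAM per condition, separated by spaces (replicates would be comma-separated).
BAMS=""
for lib in lib1 lib2 lib3; do
    BAMS="$BAMS Tophat_out_MAsked_genome_${lib}/accepted_hits.bam"
done

# Cuffdiff_out is a hypothetical output directory; labels follow condition order.
CUFFDIFF_CMD="cuffdiff -o Cuffdiff_out -p 30 -L lib1,lib2,lib3 $MERGED_GTF$BAMS"
echo "$CUFFDIFF_CMD"
```

Swapping `MERGED_GTF` for the GTF from the pooled run would be the only change needed to test the second strategy.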

I would really appreciate your comments on this. Thank you so much in advance!

Best regards and wishes,

Rahul

cufflinks rna-seq tophat2 • 3.9k views

ADD COMMENT • link updated 10.5 years ago by Biostar 20 • written 11.0 years ago by Rahul Sharma ▴ 660

Are the three libraries from the same sample or three different samples? That will determine which method is preferred.

ADD REPLY • link 11.0 years ago by Devon Ryan 104k

Hi, thanks for your reply! They are from three different time points of pathogen infection. If my first goal is to find all genes in this species, should I use the GTF file from method 2? Later, I will also look for differentially expressed genes. Many thanks!

ADD REPLY • link 11.0 years ago by Rahul Sharma ▴ 660

1) Why cuffmerge and not cuffcompare? See also this thread: "RNA-seq with cuffdiff: use merged.gtf from cuffmerge or combined.gtf from cuffcompare?"

2) Personally, I think both methods are reasonable, but maybe you could remove the first library, which gives only 800 transcripts. That may account for part of the relatively large difference in the total number of transcripts.

ADD REPLY • link 11.0 years ago by Fabio Marroni ★ 3.0k