tophat-cufflinks command pipeline
2
6
Entering edit mode
9.9 years ago
biolab ★ 1.4k

Hi, everyone,

I am a newhand working on RNA-seq data. Tophat-cufflinks method is widely used in analyzing RNA-seq data. I have installed these tools, and now need to run them. From tophat and cufflinks websites, I learned some basics as below, but I am not sure if my commands are right. I am wondering could anyone offers a pipeline or gives me a guidance, particularly the command options? THANKS a lot!

Step 1: generate a tophat_out folder with bam files

tophat <index> sample1_1.fq sample1_2.fq
tophat <index> sample2_1.fq sample2_2.fq

Step 2: generate .gtf files

cufflinks sample1/accepted_hits.bam
cufflinks sample2/accepted_hits.bam

Step 3: prepare a text file named assemblies.txt with following gtf files

sample1/transcript.gtf
sample2/transcript.gtf

Step 4: run cuffmerge to generate merged.gtf

cuffmerge assemblies.txt

Step 5: discovery of novel transcript

cuffcompare merged.gtf

Step 6: compare gene expressions of two samples

cuffdiff merged.gtf sample1/accepted_hits.bam sample2/accepted_hits.bam
cufflinks tophat • 27k views
ADD COMMENT
3
Entering edit mode

HERE IS MY PIPELINE FOR ucsc.hg19 reference genome and ucsc refgene gtf file:

./tophat2 \
  -o tp_out \
  --library-type fr-unstranded \
  -r 260 \
  --fusion-search \
  -G /home/refgene.gtf \
  -p8 \
  --b2-very-sensitive \
  /home/ucsc.hg19.fasta \
  /home/R1_001.fastq /home/R2_001.fastq \
  --rg-id Sample \
  --rg-sample sample \
  --rg-library rna-seq \
  --rg-platform Illumina

./cufflinks \
  -p 8 \
  --library-type fr-unstranded \
  -o cufflinks_out_gtf \
  -G /home/refgene.gtf \
  /home/tophat-/accepted_hits.bam

This I repeat for all samples.

./cuffmerge \
  -o all_samples_merged \
  -s /home/ucsc.hg19.fasta \
  assemblies.txt \
  -g /home/refgene.gtf \
  -p 8 \
  assemblies.txt

And finally cuffdiff.

./cuffdiff \
  -o all_samples_diff \
  -L 1,2,3,4,5,6,7,8,9,10 \
  -b /home/ucsc.hg19.fasta \
  -u all_samples_merged/merged.gtf \
  -p 8 \
  --library-type fr-unstranded \
  -c 5 \
  PATH/TO/ALL/MY/BAMS

And finally I run R - CummeRbund and have issue explained above . Sorry maybe for stupid questions.

Paul.

ADD REPLY
0
Entering edit mode

I'm trying to accomplish the same task, but I'm new at this field, so, how do I get the refgene gtf file and the ucsc.hg19 reference genome?

ADD REPLY
0
Entering edit mode

What organism?

ADD REPLY
0
Entering edit mode

For homo sapiens too.

ADD REPLY
0
Entering edit mode

Have you done a bit of Googling?!

http://hgdownload.cse.ucsc.edu/downloads.html#human

ADD REPLY
0
Entering edit mode

Yes, I did, but I was unable to find the .gtf file and also, at the beginning I thought it would be some specific kind of file other then the "normal" reference genome of human. Now I understood what it takes, thank you, and I'm sorry for my lack of knowledge at this point.

ADD REPLY
0
Entering edit mode

No worries. Hope it helped =)

ADD REPLY
0
Entering edit mode

You should really add some information on your data, in order to give proper guidance on what you need for your analysis.

ADD REPLY
5
Entering edit mode
8.4 years ago
pjmaguire3 ▴ 80

I wouldn't recommend using TopHat/Cufflinks. While they used to be the go-to option for RNA-Seq analysis, now a days there are much better alternatives out there, particularly STAR. Conveniently I just made a figure showing a short overview of my protocol, so I linked that below. Do note that RSEM might be a better option for getting read counts, but it does require that you run your STAR mapping through its wrapper (which is why I went with a more general approach).

Steps:

  1. Check quality of reads with FastQC. If they are good then proceed, but if the quality is low consider trimming the ends and then re-evaluating.
  2. Using your genome and gene annotation file, create a reference genome using STAR.
  3. Map your reads against the annotated reference genome (output to BAM to avoid needing to convert things later).
  4. Use featureCounts to get the read counts.
  5. Do a TMM transformation on the read data using Limma (this is far superior to FPKMs in normalizing the data, especially if the main focus is to identify differentially expressed genes).
  6. Use Limma to calculate the differentially expressed genes.

enter image description here

Hope that helps you, and feel free to shoot me some questions. Best of luck!

ADD COMMENT
1
Entering edit mode

Hi,

Could you please share your opinion about HISAT-Stringtie2-Ballgown pipeline? I am going to follow this pipeline and still in a little confusion regarding how it performs compared to others.

Thanks

ADD REPLY
1
Entering edit mode

As far as I am aware, there has not been a published study done comparing HISAT2 to STAR (HISAT came out in 2015, so it is relatively new overall), so I can't really appropriately compare them in a non-biased manner. But I imagine that HISAT2 wouldn't perform as well when it comes to sliced alignments since it uses the BWA algorithm to map reads. One of the things that is nice about HISAT2 is that it has a much smaller footprint than STAR, so if you're doing a lot of RNA-Seq analytics it could be very beneficial to decreased the required storage space for each sample run. HISAT2 also reports that it is slightly faster than STAR, but I would take that with a grain of salt since the author's were the ones who selected what datsets to use when bench marking the different mappers. Moral of the story, I still recommend STAR as the first choice but I consider HISAT2 to be a reasonably good second place holder. As for the rest of the pipeline I have no idea, I have never even heard of those two(2) software suites so it wouldn't be fair for me to comment one way or the other on their abilities.

This is a good read, if you haven't gone through it yet: Current state-of-the-art for RNA-Seq gene / transcript expression quantification OR just the tools of preference

I also recommend you read this paper if you are interested in comparing the different read normalization methods: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.

ADD REPLY
0
Entering edit mode

Thank you very much for your detail explanation (and your recommended article)!

ADD REPLY
0
Entering edit mode

Hi, It's an old thread but I found a paper talking about this recently: Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. In its conclution part, it said "Importantly, our analysis indicates that the explicit quantification of transcript isoforms leads to more accurate estimates of gene expression levels compared to the ‘count-based’ methods that are broadly used currently". But they evaluated the count-based method with "bedtools intersect", not htseq-count or featureCounts, so I am a bit skeptical of this conclusion. What do you think about this?

Thanks

ADD REPLY
3
Entering edit mode
9.9 years ago
Parham ★ 1.6k

Have you seen this protocol? It is very helpful.

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

ADD COMMENT
0
Entering edit mode

Thank you very much, Paul and Parham.

ADD REPLY

Login before adding your answer.

Traffic: 2435 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6