Question

How To Build A Basic RNA-Seq Pipeline

5

13.6 years ago

Travis ★ 2.8k

Hi all,

I've been looking around and there doesn't seem to be much information on the development of RNA-Seq pipelines for differential expression analysis. I am about to start work on setting up a basic skeleton pipeline.

As a very high level overview, how do the following steps look? Can anyone comment on/add/remove steps? I have also added some questions regarding the steps to aid in my own understanding.

1) Align reads to genome using TopHat/Bowtie
(perhaps use the new TopHat-Fusion to find fusion transcripts? I also guess it is important to ensure that the genome we use matches our preferred annotation source downstream, e.g. if we prefer Ensembl, we should use the Ensembl GRCh37 (NCBI build 37) sequence rather than UCSC hg19, so that chromosome names are consistent?)

2) Mark/remove duplicate reads.

3) Use Cufflinks to assemble transcripts.

4) Run Cuffdiff to assess differential expression.

5) Annotate transcripts
(unsure on how exactly this is done - can anyone comment? For example, what program might be used and what happens when we attempt to annotate novel-spliced or fully novel transcripts? Will these be recognised somehow?)

6) At this point I guess we have an annotated matrix that could be used in next gen or classical visualisation programs? Any suggestions on how to view?

next-gen sequencing rna gene • 16k views

ADD COMMENT • link updated 13.6 years ago by Radhouane Aniba ▴ 790 • written 13.6 years ago by Travis ★ 2.8k
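
On the point about keeping chromosome names consistent between the genome and the annotation source, a quick sanity check can be sketched in Python. This is a minimal sketch, not part of any of the tools mentioned in the thread: the function names and the toy inputs are made up for illustration, and it only compares FASTA header names against column 1 of a GTF.

```python
import io


def fasta_sequence_names(handle):
    """Return the set of sequence names found on FASTA header lines."""
    names = set()
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                # The name is the first whitespace-delimited token after '>'.
                names.add(fields[0])
    return names


def gtf_sequence_names(handle):
    """Return the set of values in column 1 (seqname) of a GTF file."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def naming_mismatches(fasta_handle, gtf_handle):
    """Return GTF sequence names that do not occur in the FASTA."""
    fasta_names = fasta_sequence_names(fasta_handle)
    return sorted(gtf_sequence_names(gtf_handle) - fasta_names)


# Toy inputs (hypothetical): UCSC-style names in the FASTA,
# one Ensembl-style name ("1") in the GTF.
toy_fasta = io.StringIO(">chr1 test\nACGT\n>chr2\nACGT\n")
toy_gtf = io.StringIO(
    '1\tensembl\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n'
    'chr2\tensembl\texon\t1\t4\t.\t+\t.\tgene_id "g2";\n'
)
missing = naming_mismatches(toy_fasta, toy_gtf)
print(missing)
```

With real files you would pass open file handles instead of the `io.StringIO` objects. Any name reported here is used in the annotation but absent from the reference, which is the mismatch that breaks the downstream steps.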

score 6 · Answer 1 · 2011-06-30

6

13.6 years ago

Mikael Huss 4.8k

Some comments on your steps:

1) TopHat is fine, but Bowtie (or BWA) only makes sense if you are mapping directly against the transcriptome (IMO). Mapping against the transcriptome may be a good idea for many applications, although mapping against the genome is much more common and I haven't seen any in-depth comparison of which one is more sensitive/specific. Apart from TopHat, there are other good spliced mapping methods such as MapSplice, SpliceMap and (especially) RUM.

Yes, you should take care to keep your reference genome and annotation "in sync".

2) Yes, at least for paired-end it's a good idea to remove duplicates.

3) In my opinion, assembling the transcripts with Cufflinks only makes sense if you don't have a good annotation. If you are sequencing human RNA, I would just run Cufflinks with a GTF file to quantify the expression of annotated transcripts. If you want to run DESeq or other count-based methods for differential expression later, you would use HTSeq or something similar here instead of Cufflinks.

4) Cuffdiff is probably OK, or you could use e.g. DESeq, which uses counts.

5)-6) Not sure I understand the questions.

ADD COMMENT • link 13.6 years ago by Mikael Huss 4.8k
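
To make the "count-based" idea in points 3) and 4) concrete, here is a toy illustration of what a read-counting step produces. It is not HTSeq's implementation: the gene coordinates, the half-open interval convention and the `no_feature` / `ambiguous` labels are all assumptions made for this sketch. A read is counted for a gene only if it overlaps exactly one gene.

```python
from collections import Counter


def count_reads(genes, reads):
    """Count reads per gene.

    genes: dict mapping gene_id -> (chrom, start, end), half-open intervals.
    reads: list of (chrom, start, end) alignments, half-open intervals.
    Reads hitting no gene or more than one gene are tallied separately.
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    for chrom, start, end in reads:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start < g_end and g_start < end
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["no_feature"] += 1
        else:
            counts["ambiguous"] += 1
    return counts


# Hypothetical annotation: geneA and geneB overlap on chromosome 1.
genes = {
    "geneA": ("1", 100, 200),
    "geneB": ("1", 150, 300),
    "geneC": ("2", 10, 50),
}
reads = [
    ("1", 100, 130),  # geneA only
    ("1", 160, 190),  # overlaps geneA and geneB
    ("1", 250, 280),  # geneB only
    ("2", 20, 45),    # geneC only
    ("2", 500, 530),  # outside every gene
]
counts = count_reads(genes, reads)
for name in sorted(counts):
    print(name, counts[name])
```

The resulting table of integer counts per gene is the kind of input that count-based methods such as DESeq expect, as opposed to the normalised expression estimates reported by Cufflinks.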

1

@Travis: BWA does gapped alignment, but the gaps are on the order of 1-10 bp; BWA does not handle gaps the size of introns. You need to use a splice-aware aligner when aligning to the genome. See my answer and Mikael's above for some aligner suggestions.

ADD REPLY • link 13.6 years ago by Sean Davis 27k
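
The difference between a small gap and an intron-sized gap shows up directly in the CIGAR string of a SAM record: `I` and `D` are short insertions and deletions, while `N` marks a skipped region of the reference, which is how spliced aligners report an intron. A minimal sketch for telling them apart (the helper names and the example CIGAR strings are made up for illustration):

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM spec.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]


def intron_lengths(cigar):
    """Lengths of reference skips (N), i.e. introns in a spliced alignment."""
    return [length for length, op in parse_cigar(cigar) if op == "N"]


def small_gap_lengths(cigar):
    """Lengths of insertions and deletions (I, D), i.e. ordinary gaps."""
    return [length for length, op in parse_cigar(cigar) if op in ("I", "D")]


examples = ["50M", "20M2D30M", "30M1500N20M"]
for cigar in examples:
    print(cigar, "introns:", intron_lengths(cigar), "gaps:", small_gap_lengths(cigar))
```

An aligner that only supports short gaps can produce the second kind of alignment but not the third, so a read covering two exons either goes unaligned or is aligned only partially.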

0

Why do Bowtie or BWA only make sense if mapping to the transcriptome?

ADD REPLY • link 13.6 years ago by Travis ★ 2.8k

0

The genome has introns between the exons, and Bowtie and BWA cannot map a read that spans one of those gaps.

ADD REPLY • link 13.6 years ago by Sean Davis 27k

0

But BWA does do gapped alignment, doesn't it?

ADD REPLY • link 13.6 years ago by Travis ★ 2.8k

0

Thanks for that. Off the cuff, it makes me wonder why anyone would use bowtie for RNA-Seq!

ADD REPLY • link 13.6 years ago by Travis ★ 2.8k

0

The more I think about this, the more I have to ask - do aligners like Bowtie/Eland discard intron-spanning reads?

ADD REPLY • link 13.6 years ago by Travis ★ 2.8k

0

Yes, but Solexa reads used to be shorter (~30 bp), so junction-spanning reads were uncommon. Software is always fighting the last war.

ADD REPLY • link 13.6 years ago by Jeremy Leipzig 23k
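
The read-length effect can be put into rough numbers with a back-of-the-envelope calculation. The sketch below assumes a single hypothetical transcript with four 150 bp exons and reads starting uniformly along the spliced transcript. Both are simplifications chosen for illustration, not measurements.

```python
def junction_spanning_fraction(exon_lengths, read_length):
    """Fraction of read start positions whose read crosses an exon-exon junction.

    Coordinates are positions along the spliced transcript; a read covers
    the half-open interval [start, start + read_length).
    """
    transcript_length = sum(exon_lengths)
    n_starts = transcript_length - read_length + 1
    if n_starts <= 0:
        return 0.0

    # Junctions sit at the cumulative exon boundaries (the final end is excluded).
    boundaries = []
    position = 0
    for length in exon_lengths[:-1]:
        position += length
        boundaries.append(position)

    spanning = 0
    for start in range(n_starts):
        end = start + read_length
        # The read crosses a junction if it has bases on both sides of it.
        if any(start < boundary < end for boundary in boundaries):
            spanning += 1
    return spanning / n_starts


exons = [150, 150, 150, 150]  # hypothetical exon lengths in bp
short_fraction = junction_spanning_fraction(exons, 30)
long_fraction = junction_spanning_fraction(exons, 100)
print(round(short_fraction, 3), round(long_fraction, 3))
```

Under these assumptions roughly 15% of 30 bp reads cross a junction, against roughly 59% of 100 bp reads, which is why an unspliced aligner loses far more data as reads get longer.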

Ram · Answer 2 · 2011-06-30

2

13.6 years ago

Sean Davis 27k

See this blog post for a quick start. See this question and consider alternative alignment algorithms. Consider reading some of these review articles.

ADD COMMENT • link updated 5.4 years ago by Ram 44k • written 13.6 years ago by Sean Davis 27k

0

Can you walk us through your GSNAP-based pipeline?

ADD REPLY • link 13.6 years ago by Jeremy Leipzig 23k

0

GSNAP is used for the alignment step only. After that, the workflow can be similar to those used for TopHat or any other aligner and could include the Cufflinks suite, DESeq, etc.

ADD REPLY • link 13.6 years ago by Sean Davis 27k
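
Since the downstream steps only need alignment files, the part of the pipeline after alignment can be written without reference to the aligner. The sketch below only builds and prints command lines for Cufflinks (quantifying against a GTF with `-G`) and Cuffdiff. The file names, sample layout and thread count are placeholders, and it assumes the alignments have already been converted to coordinate-sorted BAM files.

```python
import shlex


def cufflinks_cmd(bam, gtf, out_dir, threads=4):
    """Quantify annotated transcripts only (-G), writing results to out_dir."""
    return ["cufflinks", "-p", str(threads), "-G", gtf, "-o", out_dir, bam]


def cuffdiff_cmd(gtf, condition_bams, out_dir, threads=4):
    """Compare conditions; replicates of one condition are comma-separated."""
    labels = ",".join(condition_bams)
    groups = [",".join(bams) for bams in condition_bams.values()]
    return ["cuffdiff", "-p", str(threads), "-o", out_dir, "-L", labels, gtf] + groups


# Placeholder sample sheet: condition label -> replicate BAM files.
samples = {
    "control": ["ctrl_1.bam", "ctrl_2.bam"],
    "treated": ["trt_1.bam", "trt_2.bam"],
}

quant_cmd = cufflinks_cmd("ctrl_1.bam", "genes.gtf", "cufflinks_ctrl_1")
diff_cmd = cuffdiff_cmd("genes.gtf", samples, "cuffdiff_out")

# Print rather than execute, so the sketch runs without the tools installed.
print(shlex.join(quant_cmd))
print(shlex.join(diff_cmd))
```

To actually run a step you could hand the list to `subprocess.run(cmd, check=True)`. Swapping the aligner then only changes how the BAM files are produced.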

0

How do you divvy up hits to different overlapping transcripts?

ADD REPLY • link 13.6 years ago by Jeremy Leipzig 23k

score 2 · Answer 3 · 2011-06-30

2

13.6 years ago

Radhouane Aniba ▴ 790

SEQanswers.com has published an interesting post on a basic pipeline that you can refine later. I forgot the link to the post, but found the guide on the RNA-Seq blog:

CLICK HERE

Hope that helps

Radhouane

ADD COMMENT • link 13.6 years ago by Radhouane Aniba ▴ 790