Question

Rna-Seq Pipeline

45

14.5 years ago

brentp 24k

So, there are papers on designing an RNA-seq experiment and normalizing the data (Bullard et al. and the recent Genetics paper are good reads), but what do folks do for the actual pipeline?

I'm looking at

1. filter on quality (what are your quality/parameter cutoffs?)
2. any other pre-processing?
3. tophat
4. cufflinks
5. repeat 1-4 for a different set of reads and find differentially expressed genes (cuffdiff)
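For the quality-filtering step, here is a minimal sketch of a mean-quality filter over FASTQ records. The function names, the cutoff of 20 and the Phred+33 offset are illustrative assumptions, not recommendations from this thread; older Illumina data may use an offset of 64.

```python
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[Record]:
    """Yield records from a 4-line-per-record FASTQ stream.

    Stops at end of file (or at the first empty header line).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (offset is an assumption)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_by_mean_quality(
    records: Iterable[Record], min_mean: float = 20.0, offset: int = 33
) -> Iterator[Record]:
    """Keep only reads whose mean Phred score is at least min_mean."""
    for header, seq, qual in records:
        if mean_quality(qual, offset) >= min_mean:
            yield header, seq, qual
```

Whether to filter at all is a judgment call; several answers in this thread skip this step and let the aligner deal with low-quality reads.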

First, any steps I should add?

Second, there doesn't seem to be much out there about how to do this. I mean, I can read the manuals and execute the commands (steps 3 and 4 seem to be no problem), but I'm looking for any pointers to either:

fully documented pipelines with an explanation of the processing at each step
shell script(s) for going from reads to differentially expressed genes.
pubs where this is documented.

I realize each set of data will be different, but it'd be nice to base it on something.
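As a starting point, the tophat, cufflinks and cuffdiff steps could be sketched as a small script that only builds and prints the command lines. This is a sketch under assumptions, not a validated pipeline: the sample labels, file names and output directory names are hypothetical, and the name of TopHat's alignment output (`accepted_hits.bam` here) depends on the TopHat version. Only the common `-p`, `-G`, `-o` and `-L` options are used.

```python
import shlex
from typing import Dict, List


def build_commands(
    samples: Dict[str, str],
    bowtie_index: str,
    annotation_gtf: str,
    threads: int = 4,
) -> List[List[str]]:
    """Return argv lists, in execution order, for a two-or-more-sample run.

    samples maps a condition label to a FASTQ path (hypothetical names).
    """
    commands: List[List[str]] = []
    alignments: List[str] = []
    for label, fastq in samples.items():
        tophat_dir = f"{label}_tophat"
        # Spliced alignment of the reads against the Bowtie index.
        commands.append(
            ["tophat", "-p", str(threads), "-G", annotation_gtf,
             "-o", tophat_dir, bowtie_index, fastq]
        )
        # Assumed output name; older TopHat versions wrote a SAM file instead.
        bam = f"{tophat_dir}/accepted_hits.bam"
        # Abundance estimation against the supplied annotation.
        commands.append(
            ["cufflinks", "-p", str(threads), "-G", annotation_gtf,
             "-o", f"{label}_cufflinks", bam]
        )
        alignments.append(bam)
    # Differential expression across all conditions.
    commands.append(
        ["cuffdiff", "-p", str(threads), "-o", "cuffdiff_out",
         "-L", ",".join(samples), annotation_gtf] + alignments
    )
    return commands


if __name__ == "__main__":
    demo = {"control": "control.fastq", "treated": "treated.fastq"}
    for argv in build_commands(demo, "genome_index", "genes.gtf"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the commands rather than running them makes it easy to inspect each step before committing compute time; passing each argv list to `subprocess.run(argv, check=True)` would execute them in order.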

pipeline next-gen-sequencing rna rna-seq • 34k views

ADD COMMENT • link updated 18 months ago by Ram 44k • written 14.5 years ago by brentp 24k

score 11 · Answer 1 · 2010-05-21

11

14.5 years ago

Dstan ▴ 160

We're getting ready to publish a study in which we use RNA-seq, and we used a piece of software called GNUMAP. We did not apply any filtering on the read qualities, as we found that lower-quality reads simply didn't map as well. As for the post-mapping analysis, we're still waiting to hear back from our statistics colleagues on the model they've developed.

As for an out-of-the-box solution for RNA-seq, I'm not sure how much you'll be able to find.

ADD COMMENT • link 14.5 years ago by Dstan ▴ 160

2

hadn't heard of GNUMAP, checking it out now. i'm not expecting an out-of-the-box solution, just trying to make use of existing knowledge.

ADD REPLY • link 14.5 years ago by brentp 24k

score 10 · Answer 2 · 2010-05-21

10

14.5 years ago

Wjeck ▴ 490

No idea where these steps exist as a well-documented whole, but I can pass on our experience. We're doing a pretty massive amount of RNA-seq at our institution as part of The Cancer Genome Atlas, and our methods are along the lines you describe.

Bowtie/TopHat has been our best bet for spliced sequence alignment. I know the group working on this tried other techniques, such as mapping onto a reference "transcriptome", which has some advantages in terms of mapping but can be harder to deconvolute in cases where transcripts overlap.

ADD COMMENT • link 14.5 years ago by Wjeck ▴ 490

0

thanks, at least it's good to know you decided on a similar overall pipeline after looking around.

ADD REPLY • link 14.5 years ago by brentp 24k

score 6 · Answer 3 · 2010-06-30

6

14.4 years ago

Michael 55k

I think one important step that is missing here could be to

remove/condense (100%?) identical reads into one read

in the filtering step. A large number of reads could be, e.g., artifacts from a PCR step in the wet-lab pipeline. This can be done with, e.g., the FASTA collapser tool from the FASTX tools. For a quantitative approach I would prefer this, but I guess it's controversial. Any experiences with that?
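To make the idea concrete, here is a minimal sketch of collapsing 100% identical reads while keeping their counts. It is not the FASTX implementation; the function names are illustrative, and it assumes the read sequences fit in memory.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def collapse_identical(sequences: Iterable[str]) -> List[Tuple[str, int]]:
    """Collapse exactly identical reads into (sequence, count) pairs.

    Pairs are returned with the most frequent sequence first.
    """
    return Counter(sequences).most_common()


def duplication_rate(sequences: List[str]) -> float:
    """Fraction of reads that would be removed by collapsing."""
    if not sequences:
        return 0.0
    unique = len(set(sequences))
    return (len(sequences) - unique) / len(sequences)
```

Keeping the counts means the decision can be deferred: the collapsed set can be used for mapping while the original multiplicities remain available for quantification.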

Another filtering step can be to clip the reads, removing low-quality regions instead of only discarding whole reads.
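A minimal sketch of such clipping, assuming Phred+33 qualities and a plain "trim trailing bases below a threshold" rule. Real trimmers typically use sliding-window or running-sum schemes; the cutoffs and function names here are illustrative assumptions.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def trim_3prime(
    seq: str, qual: str, min_q: int = 20, offset: int = 33
) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def trim_reads(
    records: Iterable[Record],
    min_q: int = 20,
    min_len: int = 25,
    offset: int = 33,
) -> Iterator[Record]:
    """Trim each read at the 3' end and drop reads shorter than min_len."""
    for header, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q, offset)
        if len(seq) >= min_len:
            yield header, seq, qual
```

The minimum-length check matters because very short trimmed reads tend to map ambiguously.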

ADD COMMENT • link 14.4 years ago by Michael 55k

2

My understanding is that removing identical reads is a step that is typical for DNA analysis, but more controversial when it comes to RNA-Seq because the rationale for it is less clear here (are we only removing PCR artifacts, or also introducing a quantitative bias?).

ADD REPLY • link 11.5 years ago by jobinv ★ 1.1k

0

Note, I wrote this almost 3 years ago. Now, I wouldn't do it anymore for a differential analysis, with the argument that on average PCR artifacts should affect both conditions equally. That's possibly still controversial.

ADD REPLY • link 11.5 years ago by Michael 55k

0

I'll admit that I didn't see the date of the original answer :)

ADD REPLY • link 11.5 years ago by jobinv ★ 1.1k

Ram · Answer 4 · 2015-09-05

We make available open-access RNA-seq tutorials that cover cloud computing, tool installation, relevant file formats, reference genomes, transcriptome annotations, quality-control strategies, expression, differential expression, and alternative splicing analysis methods. These tutorials and additional training resources are accompanied by complete analysis pipelines and test datasets made available without encumbrance at http://www.rnaseq.wiki/.

This material was released alongside this publication:

Malachi Griffith, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi L. Griffith. 2015. Informatics for RNA-seq: A web resource for analysis on the cloud. PLoS Computational Biology 11(8):e1004393.

The Supplementary Information for this publication includes an extensive review of RNA-seq wet lab and analysis concepts, existing tools, common questions, etc.

All materials associated with this publication, including high resolution and original figure files, supplementary tables, etc. are available here

This publication was inspired by workshops that we have taught at CBW, CSHL, and NYGC over the last few years. These workshops are ongoing and we hope to maintain and expand the content in the coming years.

Ram · Answer 5 · 2013-06-18

2

11.4 years ago

wadunn83 ▴ 90

For anyone still interested in this type of thing:

If using TopHat/Cufflinks, the authors generally do not recommend removing poor-quality reads, since their process will simply down-weight the alignments of poor-quality reads, and sometimes those reads can actually help.

As for steps 3-5:

I have recently written a pipeline called Blacktie to do just this, plus do some automated analysis with cummeRbund.

Installation via pip:

[sudo] pip install -U blacktie

ADD COMMENT • link updated 6.2 years ago by Ram 44k • written 11.4 years ago by wadunn83 ▴ 90

0

Could you give a source for the top statement about pre-filtering reads for tophat? I've been trying to learn about this topic and haven't found a whole lot honestly.

ADD REPLY • link 11.0 years ago by kipp ▴ 50

score 1 · Answer 6 · 2013-06-10

1

11.5 years ago

Biojl ★ 1.7k

You may want to take a look at The Simple Fool's Guide to Population Genomics via RNA-Seq from the Palumbi lab. It's a functional, fully documented pipeline that starts from scratch.

http://sfg.stanford.edu/guide.html

Edit, PS: OK, yes, I didn't see that this post was from 3 years ago.

ADD COMMENT • link 11.5 years ago by Biojl ★ 1.7k

score 0 · Answer 7 · 2013-06-11

1. FastQC could be used for quality control.
2. Adapters may need to be removed before alignment, in case a long adapter affects the alignment result.
3 & 4. Other aligners may be worth a look, depending on the length of the reads (BWA, Bowtie, BFAST).
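To illustrate what adapter removal involves, here is a minimal sketch of exact-match 3' adapter clipping. It is not how any particular tool works: dedicated adapter trimmers tolerate mismatches and indels, which this does not. The function name and the `min_overlap` default are illustrative assumptions.

```python
def clip_adapter(seq: str, adapter: str, min_overlap: int = 3) -> str:
    """Clip a 3' adapter from a read using exact matching only.

    First looks for the full adapter anywhere in the read; failing that,
    looks for the longest prefix of the adapter that ends the read.
    """
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    shortest = max(min_overlap, 1)  # never test a zero-length overlap
    longest = min(len(adapter) - 1, len(seq))
    for overlap in range(longest, shortest - 1, -1):
        if seq.endswith(adapter[:overlap]):
            return seq[:-overlap]
    return seq
```

A very small `min_overlap` will clip many reads by chance, since a 1 to 2 base match at the read end is common, so the threshold trades lost sequence against leftover adapter.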

list of alignment software:

http://en.wikipedia.org/wiki/List_of_sequence_alignment_software

http://elements.eaglegenomics.com/

list of adaptor removal software:

http://bioscholar.com/genomics/tools-remove-adapter-sequences-next-generation-sequencing-data/