I am following GAGE workflow (1). The sample data is from ArrayExpress and it contains a set of control and a set of experiment (KD1 and KD2). I quote from the workflow(1)
As an example, below is the code I used to map, index and process the read data for the first sample (ERR127302). You can write a shell script to do so for all samples (ERR127302-9).
The code is:
tophat2 -p 8 -o tophat_out_1 ref/hg19 ERR127302_1.fastq.gz ERR127302_2.fastq.gz
If I include all files in the mapping script above, won't they just be shuffled all together?!!! I couldn't figure out how should I set in the pathway the control vs the experiment? Any help is appreciated.
I am a new user to Biostars, so I only have 5 posts per day. I may not be able to reply to all answers but I am reading your comments.
Thanks!
Reference
- RNA-Seq Data Pathway and Gene-set Analysis Workflows, Weijun Luo, February 6, 2014.
I see! Then how do I set control and experiment in the command lines? I cannot understand by just reading commands, since I am very new to this field. Should I rename the accepted hits files to control or experiment?! So many stupid questions I have!
You needn't rename the BAM files that tophat produces, just use the -o option from tophat to put them in different directories. You end up specifying which samples belong to which group within R, when you run the
gage()
command (see section 3.4 of the workflow you're looking at). The idea is that you have normalized counts in a matrix (cnts.norm
) with samples as columns. The column numbers for samples in the two groups are then provided as theref=
andsamp=
parameters.This line you mean?
So should I leave the ref.idx and samp.idx as they are or should I change them. The problem I am having is how does this matrix column numbers are set to ref and samp?
Yes, that's the line. The actual indices for the samples will depend on how you make the matrices. In the workflow that you're looking at, the order of the samples is the order of the file names in
fls
in section 3.2. Personally, I'd use htseq-count or featureCounts and then just load the counts, but that's me. I'd also just export rlog2 counts from DESeq2 rather than bothering to do things manually, but that's rather outside the purview of that workflow.