Question

How and what are the steps for performing RNAseq Time point analysis 3 biological replicates and 4 Technical replicates for each

0

Entering edit mode

7.4 years ago

macmath ▴ 170

Condition A, B, C I have 3 biological replicates, Each biological replicates have 4 Technical replicates

Below is just one Example 1KO-1a.fastq.gz 1KO-1b.fastq.gz 1KO-1c.fastq.gz 1KO-1d.fastq.gz 2KO-1a.fastq.gz 2KO-1b.fastq.gz 2KO-1c.fastq.gz 2KO-1d.fastq.gz 3KO-1a.fastq.gz 3KO-1b.fastq.gz 3KO-1c.fastq.gz 3KO-1d.fastq.gz

I tried to perform this analysis as below:

Genome Alignment : STAR 4 Technical replicates same time for each biological replicate

STAR --genomeDir reference/mm10ens84/STARindex/ --readFilesIn data/1KO-1a.fastq.gz,data/1KO-1b.fastq.gz,data/1KO-1c.fastq.gz,data/1KO-1d.fastq.gz,data/2KO-1a.fastq.gz,data/2KO-1b.fastq.gz,data/2KO-1c.fastq.gz,data/2KO-1d.fastq.gz,data/3KO-1a.fastq.gz,data/3KO-1b.fastq.gz,data/3KO-1c.fastq.gz,data/3KO-1d.fastq.gz --readFilesCommand zcat --outFileNamePrefix alignment_STAR/HFD-KO1 --outFilterMultimapNmax 1 --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --twopassMode Basic --runThreadN 15

Result : 1 Bam aligned file for each biological replicate

HTSEQ : Count

htseq-count -s no -r pos -t exon -f bam -s no -a 0 alignment_STAR/KO1hq.bam reference/mm10ens84/mm10ens84.gtf >KO1hq.counts

Correct me if I am wrong and suggest me how to proceed ahead.

Thank you in advance for your suggestion and time

RNA-Seq Time point • 2.7k views

ADD COMMENT • link updated 7.4 years ago by Michael 55k • written 7.4 years ago by macmath ▴ 170

3

Entering edit mode

How does STAR distinguish and group biological and technical replicates?

This is an interesting example, because if that works, it is not clear from the documentation how to achieve this, and should be clarified.

From the manual:

--readFilesIn
default: Read1 Read2
string(s): paths to files that contain input read1 (and, if needed, read2)

The only mention of that STAR might accept comma separated lists of replicates is found under a different option:

--outSAMattrRGline [...] Comma separated RG lines correspons to different (comma separated) input
files in -readFilesIn. Commas have to be surrounded by spaces, e.g.

I have seen the comma separated input work before, but I think the manual needs some clarification.

ADD REPLY • link 7.4 years ago by Michael 55k

0

Entering edit mode

Yeah, pretty sure OP's example doesn't actually work.

ADD REPLY • link 7.4 years ago by Devon Ryan 104k

0

Entering edit mode

Another comment: are you sure that your protocol is not strand specific? -s no -r pos -t exon -f bam -s no also you have doubled the parameter, be careful when trying out different combinations, to specify each parameter only once to avoid confusion, like in -s no -r pos -t exon -f bam -s yes, which one takes precedence?

ADD REPLY • link 7.4 years ago by Michael 55k

0

Entering edit mode

Thank you for all your valuable insight's. I'm naive in this analysis and with many experts in the field viewing these queries, I would be extremely grateful if any of you can suggest me the workflow with commands to analyse Technical and biological replicates from Fastq -> DEGs in different time points or different links (tutorials) associated with similar situation.

ADD REPLY • link 7.4 years ago by macmath ▴ 170

0

Entering edit mode

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.

ADD REPLY • link 7.4 years ago by GenoMax 147k

0

Entering edit mode

Take a look at DESeq2 vignette for examples of how to analyze RNAseq data.

ADD REPLY • link 7.4 years ago by GenoMax 147k

score 1 · Answer 1 · 2017-06-23

1

Entering edit mode

7.4 years ago

Michael 55k

As we have written in comments, we are not sure if the grouping of replicates will work correctly, it might if the RG are defined well, but you could also make a big mistake already here. You need to experiment a bit, but I think that it is advisable to stick as close as possible to the documented behavior and to run the STAR alignment process separately over each biological replicate, separating input files by commas.

STAR ... --readFilesIn data/1KO-1a.fastq.gz,data/1KO-1b.fastq.gz,data/1KO-1c.fastq.gz,data/1KO-1d.fastq.gz --outFileNamePrefix alignment_STAR/HFD-1KO ... # Assuming 1KO, 2KO, 3KO are biological replicates and a,b,c,d are technical replicates

STAR ... --readFilesIn data/2KO-1a.fastq.gz,data/1KO-2b.fastq.gz,data/2KO-1c.fastq.gz,data/2KO-1d.fastq.gz --outFileNamePrefix alignment_STAR/HFD-2KO ... 

....

Other remarks:

--outFilterMultimapNmax 1 # is that not a bit strict?

Make sure, you are counting from the right strand, most modern stranded-protocols are reverse sense
you are lacking any QC, out of painful experiences, you should QC every step otherwise there could be a big blunder. You can use MultiQC for that. Run
- fastQC on the fastq files, check for adapter contamination, if high you need to trimm adapters in addition
- include the STAR Log final out files (mapping rates should be >80%)
- run bamtools stats on the bam files and finally
- check for that the numbers of aligned sequences in the bam files match with the number of seqs in the input
- check the assignment rate of reads to genes from HTseq or featureCounts, any low percentage (~5%) should trigger a warning (should be >70%).

ADD COMMENT • link 7.4 years ago by Michael 55k

0

Entering edit mode

Thank you Michael Dondrup for your suggestion and valuable time. I performed fastQC on the data set. Regarding the outFilterMutimapNmax, what is your ideal suggestion while performing the alignment of such files. In short what will you personally do with these files? I am curious to know how can I make it as correct as possible without doing any blunder.

ADD REPLY • link 7.4 years ago by macmath ▴ 170

0

Entering edit mode

Kindly correct me if I am wrong, I checked my data alignment all were >80% and I could see alignment on both strands. Then I used samtools with the below command

samtools view -h -b -q 20 -F 4 AlignedKO1.bam >KO1hq.bam

HTSEQ : Count

htseq-count -r pos -t exon -f bam -a 0 alignment_STAR/KO1hq.bam reference/mm10ens84/mm10ens84.gtf >KO1hq.counts

Followed by DESeq2

sampleFiles<- c("KO1a.counts", "KO1b.counts", "KO1c.counts", "KO1d.counts", "KO2a.counts", "KO2b.counts", "KO2c.counts", "KO2d.counts", "KO3a.counts", "KO3b.counts", "KO3c.counts", "KO3d.counts", "WT1a.counts", "WT1b.counts", "WT1c.counts", "WT1d.counts", "WT2a.counts", "WT2b.counts", "WT2c.counts", "WT2d.counts", "WT3a.counts", "WT3b.counts", "WT3c.counts", "WT3d.counts")
sampleNames <- c("KO1a", "KO1b", "KO1c", "KO1d", "KO2a", "KO2b", "KO2c", "KO2d", "KO3a", "KO3b", "KO3c", "KO3d", "WT1a", "WT1b", "WT1c", "WT1d", "WT2a", "WT2b", "WT2c", "WT2d", "WT3a", "WT3b", "WT3c", "WT3d")
sampleCondition <- c("KO1", "KO1", "KO1", "KO1", "KO2", "KO2", "KO2", "KO2", "KO3", "KO3", "KO3", "KO3", "WT1", "WT1", "WT1", "WT1", "WT2", "WT2", "WT2", "WT2", "WT3", "WT3", "WT3", "WT3")
sampleTable <- data.frame(sampleName = sampleNames, fileName = sampleFiles, condition = sampleCondition)
treatments <- c("KO1", "KO2", "KO3", "WT1", "WT2", "WT3")
library("DESeq2")
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                       directory = directory,
                                       design = ~ condition)
colData(ddsHTSeq)$condition <- factor(colData(ddsHTSeq)$condition,
                                      levels = treatments)

I am not sure how to merge the technical replicates eg. "KO1a", "KO1b", "KO1c", "KO1d" into KO1 and how should the design look like. Looking forward for your suggestions.

ADD REPLY • link 7.4 years ago by macmath ▴ 170