Question

DESeq2 using samples with different unique alignment rates

0

Entering edit mode

6.3 years ago

Tash • 0

My aim is to do a differential expression analysis using paired end data with two conditions: controls and cases.

Unfortunately, I was having some issues with low unique alignment rates in all my 5 cases (approx. 40% mapping rate) due to rRNA contamination as confirmed by BBduk and BLASTing. I ran DESeq2 with 4 controls that have good alignment rates (90%) versus the 5 cases. I do end up with a fair amount of DEG's and a pathway analysis in IPA shows some promising results, but now I am unsure if I can trust these results. I did not control for anything as we only have females in the same age group. I have tried to change the design to include alignment rates but I get an error:

"Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed".

Since I do have more than double aligned reads in my controls I am wondering if I can do some normalisation to control for this discrepancy in DESeq2 or at an earlier step in the analysis?

My pipeline is:

- Trimmomatic

java -jar /opt/apps/bioinformatics/trimmomatic/0.36/trimmomatic-0.36.jar PE ${INPUTFILE_R1} ${INPUTFILE_R2} ${OUTPUTFILE_PAIRED_R1} ${OUTPUTFILE_UNPAIRED_R1} ${OUTPUTFILE_PAIRED_R2} ${OUTPUTFILE_UNPAIRED_R2} ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:TRUE SLIDINGWINDOW:4:15 MINLEN:25

- STAR

STAR --runThreadN 4 --genomeDir path/genome_dir --readFilesCommand zcat --readFilesIn ${INPUTFILE_R1} ${INPUTFILE_R2} --outReadsUnmapped Fastx --outSAMtype BAM Unsorted --outFileNamePrefix ${OUTPUTFILE}

- htseq

htseq-count -f bam -a 10 -s reverse --idattr=gene_name $INPUTFILE path/gencode.v28.annotation.gtf > $OUTPUTFILE

- DESeq2

dds<-DESeqDataSetFromMatrix(countData, colData, design = ~ condition) 
dds$condition <- relevel(dds$condition, ref = "control")
dds<-estimateSizeFactors(dds)
ddsB <- estimateDispersions(dds)
variancedata <- varianceStabilizingTransformation(ddsB)

dds<-DESeq(dds)
results_RNAseq_DEGs = as.data.frame (results(dds,contrast=c("condition", "cases", "control")))
res <- results(dds)
res_filtered=results_RNAseq_DEGs %>% subset(padj<0.05 & abs(log2FoldChange) > 0.5)

I hope that I have provided you with sufficient information.

Thanks in advance.

RNA-Seq R • 1.6k views

ADD COMMENT • link 6.3 years ago by Tash • 0

0

Entering edit mode

Does a PCA plot indicate any significant difference in the high rRNA samples?

ADD REPLY • link 6.3 years ago by Devon Ryan 104k

0

Entering edit mode

https://imgur.com/zhE0Iku

I linked to the PCA plot. There seems to be quite a difference.

ADD REPLY • link 6.3 years ago by Tash • 0

0

Entering edit mode

Please color that by rRNA level rather than group.

ADD REPLY • link 6.3 years ago by Devon Ryan 104k

0

Entering edit mode

Thanks for your replies Devon.

I coloured by unique alignment rates since I haven't checked exact rRNA content using bbduk for all the "badly" aligned files, only two. Pretty sure all of the 5 have similar contamination though because when I BLAST the overrepresented sequences they all give hits for rRNA.

https://imgur.com/a/x4WxSjH

It makes sense that this plot looks like the previous one as all the samples who have high RNA content are cases.

ADD REPLY • link 6.3 years ago by Tash • 0

1

Entering edit mode

OK, from that it's clear that rRNA contamination is not a main driver of any of the clustering. I wouldn't worry about incorporating that into your design then. If you need to do that in the future then ensuring it's a continuous variable will probably solve the error message you received.

ADD REPLY • link 6.3 years ago by Devon Ryan 104k