My aim is to do a differential expression analysis using paired end data with two conditions: controls and cases.
Unfortunately, I was having some issues with low unique alignment rates in all my 5 cases (approx. 40% mapping rate) due to rRNA contamination as confirmed by BBduk and BLASTing. I ran DESeq2 with 4 controls that have good alignment rates (90%) versus the 5 cases. I do end up with a fair amount of DEG's and a pathway analysis in IPA shows some promising results, but now I am unsure if I can trust these results. I did not control for anything as we only have females in the same age group. I have tried to change the design to include alignment rates but I get an error:
"Error in checkFullRank(modelMatrix) : the model matrix is not full rank, so the model cannot be fit as specified. One or more variables or interaction terms in the design formula are linear combinations of the others and must be removed".
Since I do have more than double aligned reads in my controls I am wondering if I can do some normalisation to control for this discrepancy in DESeq2 or at an earlier step in the analysis?
My pipeline is:
- Trimmomatic
java -jar /opt/apps/bioinformatics/trimmomatic/0.36/trimmomatic-0.36.jar PE ${INPUTFILE_R1} ${INPUTFILE_R2} ${OUTPUTFILE_PAIRED_R1} ${OUTPUTFILE_UNPAIRED_R1} ${OUTPUTFILE_PAIRED_R2} ${OUTPUTFILE_UNPAIRED_R2} ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:TRUE SLIDINGWINDOW:4:15 MINLEN:25
- STAR
STAR --runThreadN 4 --genomeDir path/genome_dir --readFilesCommand zcat --readFilesIn ${INPUTFILE_R1} ${INPUTFILE_R2} --outReadsUnmapped Fastx --outSAMtype BAM Unsorted --outFileNamePrefix ${OUTPUTFILE}
- htseq
htseq-count -f bam -a 10 -s reverse --idattr=gene_name $INPUTFILE path/gencode.v28.annotation.gtf > $OUTPUTFILE
- DESeq2
dds<-DESeqDataSetFromMatrix(countData, colData, design = ~ condition)
dds$condition <- relevel(dds$condition, ref = "control")
dds<-estimateSizeFactors(dds)
ddsB <- estimateDispersions(dds)
variancedata <- varianceStabilizingTransformation(ddsB)
dds<-DESeq(dds)
results_RNAseq_DEGs = as.data.frame (results(dds,contrast=c("condition", "cases", "control")))
res <- results(dds)
res_filtered=results_RNAseq_DEGs %>% subset(padj<0.05 & abs(log2FoldChange) > 0.5)
I hope that I have provided you with sufficient information.
Thanks in advance.
Does a PCA plot indicate any significant difference in the high rRNA samples?
https://imgur.com/zhE0Iku
I linked to the PCA plot. There seems to be quite a difference.
Please color that by rRNA level rather than group.
Thanks for your replies Devon.
I coloured by unique alignment rates since I haven't checked exact rRNA content using bbduk for all the "badly" aligned files, only two. Pretty sure all of the 5 have similar contamination though because when I BLAST the overrepresented sequences they all give hits for rRNA.
https://imgur.com/a/x4WxSjH
It makes sense that this plot looks like the previous one as all the samples who have high RNA content are cases.
OK, from that it's clear that rRNA contamination is not a main driver of any of the clustering. I wouldn't worry about incorporating that into your design then. If you need to do that in the future then ensuring it's a continuous variable will probably solve the error message you received.