Hi my lab sequenced 39 samples (some normal, some tumors) and I am now trying to analyze the data. Just to give a general overview of my workflow, I performed adaptor trimming using cutadapt, and then I used STAR to align the sequences to the GRCm38.p6 genome (our samples are from a BALB/cj background), and then finally I quantified the alignment using featurecounts. I am now using DESeq2 to perform the differential gene expression analysis but I am a bit wary of my data. Looking at the PCA plot it seems like everything is grouping about how I'd expect, however there are a handful of samples that are grouping far away and it seems like its kinda in a pattern
As you can tell it looks like C05, F08, F04, B02, and C09 all look like they are just shifted down and slightly to the right of where everything else seems to be grouped. This doesn't see like a biological effect to me because it spans a bunch of the groups. How should I go about trying to figure out what's going on. I looked at the sums of the genes and they are all relatively similar. I also tried comparing everything in the upper left vs everything in the bottom right and none of the genes really stand out to me. One thing I noticed is that all of the most significantly different genes between the two groups seem to be downregulated in the group that's in the bottom right. Any feedback or ideas would be highly appreciated!!
Looks like a batch or technical effect. Are there any commonalities in the metadata for those samples (e.g. collected or processed on different days, isolated using a different kit or method, etc)? Looking at the PC loadings for PC2 may reveal more.
Yeah that's what I was thinking initially as well but all of the samples were processed with the same method and they were processed across different batches.
I looked into the PC loadings and I think thats a little bit more informative. It seems like the top 4 loadings for PC2 are potentially a result of rRNA. This has been an issue with this dataset that I haven't been able to get around. Basically it seems like some of the samples have really high levels of ribosomal RNA and some of the rRNA seems to be aligning to off target genes. This makes it so some genes look super highly expressed but in reality when I look through IGV only one part of the gene is expressed and if I blast that region it seems to align with rRNA. When we did the sample prep we used a rRNA depletion kit so I'm not sure why the contamination is so high in some of these cases. Is there a good way to filter out these reads? How do I get around this issue?
You can try removing the reads prior to alignment. There are a few tools that do it - sortmerna and RiboDetector are pretty popular options.
Note they can take like a full day to run, so it's not quick, but it may help in some instances.
Prior to that, you might consider just assessing the rRNA content by counting the proportion of ribosomal reads in the samples to ensure it's actually an issue.
How about removing all the identifiers from rRNA from your count matrix and perform PCA if it makes any differences? If you think you still have a high ribosomal contamination in your data, please check put the discussion here: Cleaning RNA-Seq data from rRNA