Hi Biostars community,
I am new to bioinformatics and am learning how to analyze bacterial RNA-seq data (E. coli specifically) for differential gene expression (DGE). After searching and reading on the forum, I still have some questions about normalization with DESeq2. I am using data generated by others in the lab. Each condition includes 3 replicates. The RNA-seq data has a large percentage of rRNA contamination despite rRNA depletion. The input to DESeq2 is the raw counts of reads mapped to CDS (not including reads mapped to rRNA) from all conditions. My questions are:
- The total raw counts vary quite a bit between samples, ranging from 0.06 M to 9 M, as shown in the boxplot of raw counts. Is this variation too great for normalization and comparison in DGE analysis?
- Because of the large variation between samples, the sizeFactors are also all over the place, ranging from 0.02 to 9.04. My understanding is that the sizeFactors should be close to 1. What is an acceptable range for sizeFactors?
- After normalization, the sums of normalized counts are closer to each other across samples than the sums of raw counts, but I am still not comfortable with the level of difference: some are as low as 1.43 M, while others are as high as 5.9 M. Can I even use these normalized counts for downstream DGE analysis?
- The PCA plot shows that the replicates of the amp_6h samples cluster together, but the replicates of s_0h, untreated_3h, and amp_3h are scattered across PC1 or PC2. Is there any other processing I should try to make the data more interpretable prior to DGE analysis (such as removing the purple dot in the PCA plot that is far away from the other two purple dots)?
- What are the possible causes of this variation in the data? Moving forward, what should we take into consideration when prepping samples for RNA-seq to minimize within-sample variation?
Thank you so much for helping me clarify!
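P.S. To make my sizeFactor question concrete, here is a minimal Python/NumPy sketch of how I understand the median-of-ratios calculation behind the size factors. It is a simplified illustration and not DESeq2's actual code; the function name `size_factors` and the toy count matrix are made up for this example, and I am assuming that genes with a zero count in any sample are left out of the calculation.

```python
import numpy as np


def size_factors(counts):
    """Median-of-ratios size factors for a genes x samples matrix of raw counts.

    Hypothetical helper for illustration only; not part of DESeq2.
    """
    counts = np.asarray(counts, dtype=float)
    # Keep only genes with a nonzero count in every sample (assumption:
    # their geometric mean would otherwise be zero and the ratio undefined).
    keep = (counts > 0).all(axis=1)
    log_counts = np.log(counts[keep])
    # Per-gene log geometric mean across samples (the "pseudo-reference").
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Per-sample size factor: median over genes of count / geometric mean.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))


# Toy data: sample 2 is sequenced exactly 10x deeper than sample 1
# for the three genes that are nonzero in both samples.
raw = np.array([
    [10, 100],
    [20, 200],
    [30, 300],
    [0, 5],
])

sf = size_factors(raw)
normalized = raw / sf  # divide each sample (column) by its size factor
print(sf)
print(normalized)
```

In this toy case the two size factors differ by a factor of 10, mirroring the depth difference, and after dividing by them the three shared genes have equal normalized counts in both samples. As I understand it, the size factors are therefore relative to the other samples in the experiment, which is why very uneven depth like mine pushes them far from 1.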