I am using the DeSeq2 package in Rstudio to analyze some RNAseq data for two different strains (COH1 & A909) of group B Streptococcus. The pipeline my lab has established was easy enough to tweak to my usage, but what I would like to do is pool my A909 and COH1 data together for analysis since they are share the majority of their genomes. Unfortunately, the gene description files I have use each strains unique locus tags, so I cannot combine them. I have the raw reads and the genome reference files, but the initial alignment was actually done by the sequencing company and I am new to using R so I am unsure what steps I would need to take to combine the reads of the different strains. The code below is a sample of my current workflow for analysis of a single strain.
> #Create matrix and generate normalized read counts
DiabeticStatusMatrix <- DESeqDataSetFromMatrix(countData = CountDF, colData = MetaDF, design = ~Diabetic_status) #creates DESeq data matrix
DESeqDiabeticStatus <- DESeq(DiabeticStatusMatrix) #run DESeq2 on all samples: the program includes library size normalization
normCountsDiabeticStatus <- counts(DESeqDiabeticStatus, normalized = T) #make a list of normalized read counts
View(normCountsDiabeticStatus) #check output
write.csv(normCountsDiabeticStatus, "COH1_NormCounts_DiabeticStatus.csv") #export normalized read count
#Diabetic vs Nondiabetic analysis
DiabeticVsNonDEGs <- results(DESeqDiabeticStatus, contrast = c("Diabetic_status", "Diabetic","Nondiabetic"), alpha = 0.05, lfcThreshold = 1) #test for DEGs
print(DiabeticVsNonDEGs) #check output
summary(DiabeticVsNonDEGs) #summary of number of significantly up or down regulated genes
write.csv(DiabeticVsNonDEGs, "DEGs_DiabeticVsNon.csv") #Pull out all DEGs
SigDiabeticVsNon <- subset(DiabeticVsNonDEGs, padj < 0.05) #Pull out only significant DEGs that have a corrected p-value of < 0.05 (default is 0.1)
View(SigDiabeticVsNon) #check output
SigDvsN <- as.data.frame(SigDiabeticVsNon) #convert to data frame to merge with gene descriptions
View(SigDvsN) #check output
setDT(SigDvsN, keep.rownames = "Locustag") #changes locustags from row names to a data column for merging with gene descriptions
View(SigDvsN) #check that Locustag is now a column
SigDiabeticVsNon_GeneDesc <- merge(SigDvsN, GeneDescDF, by= "Locustag") #adds the gene names and descriptions to the significantly DEGs
write.csv(SigDiabeticVsNon_GeneDesc, "SigDEGs_DiabeticVsNon.csv") #export the list of significant DEGs with the gene descriptions
I am not sure what you mean by "combine". Are the samples expected to contain reads from both genomes i.e. is it a mixed sample. If it is not a mixed sample then by mixing two references and two datasets you are going to allow the reads to multi-map to both genomes (it looks like these are strains so very similar genomes). This would cause problems with counting. Why do you want to do this instead of analyzing respective datasets against respective genomes.
The experiment was culturing these strains (separately) in patient urine samples then extracting the RNA for subsequent RNASeq. For a single donor sample I would do a culture of A909 then a separate culture of the same donor sample for COH1. So the different strains were not mixed but they were both cultured in the same donor sample. Both strains do share a significant amount of their genomes, and this experiment did not have a very high n. Thus the idea of pooling the counts for homologous genes shared between the two strains to effectively double the n. I have analyzed each dataset against their respective genomes, but the results were a little bare so I was hoping this might be a way to dig a little deeper into the data and see if maybe we need to add more samples or if the results simply are what they are. I am brand new to using R and also RNASeq analysis so I hope that my explanation helps clear things up.
Out of curiosity do you achieve that, by using antibiotic resistance? Are these genomes being derived from those cultures? Where does the actual sample for RNA come from?
If the genomes are very similar then you could potentially use just one of the genomes and do the analysis. You will need to provide information about rationale of the experiment. Perhaps someone else will have more to say.
I am using sterile filtered urine from diabetic, prediabetic, and nondiabetic patients to see if the expression profile of group B strep is different between the patient status'. A culture (grown in THB) of the chosen strain is washed and then resuspended in 1.5mL of a patient urine sample, then is incubated for 2 hours before proceeding with the RNA extraction. Because the urine is sterile filtered the only (or at least vast majority) of RNA extracted from the sample is just that of the introduced bacteria. The strains are incubated in separate aliquots of the same donor sample. If I was to just pick one of the genomes to do the analysis with both data sets, I assume that means I would need to take the raw sequencing data and align it to the chosen annotated reference genome, correct?
Ah this makes more sense. So you are actually using two separate bacteria against the same donor sample. It still would not make sense to "add" the two genomes.
Yes. You can pick one of the genomes and align the samples against it to make comparison between the two "tests" possible. You could repeat this analysis using the second strain.
If you have the original sequence files, then it might it make sense to construct a pan-genome reference and then align to that?
I talked to someone who does this sort of thing in GroupA, and they said that sometimes the strains are close enough that you can use something like Roary (I think, don't quote me on that, I'm a eukaryote person) to construct a list of all the genes that are close enough accross the strains to count as core genome. She then took one of the reference genomes and added to it the genes that were in the second strain but not the first.
ALternatively, if you a table mapping the gene IDs in one strain to the gene IDs to gene IDs in the other with, for example, two columns called StrainOneID and StrainTwoID, you could do something like:
Which would convert the rownames for your strain 2 matrix to contain strain one gene names. You could then bind the matricies together to make a new matrix to feed into DESeq.
This would only really be valid where the genes in the two strains were more or less identical a few SNPs here and there isn't a problem, but changes in length or major changes in sequence composition would invalidate the analysis.