Hi Biostars,
I have two datasets covering 8 bacterial isolates:
- expression data across the whole genome for each isolate, analysed with deseq2
- allele data for intergenic regions up to 500 nt long up and downstream of the coding sequence for each locus in each isolate
What I want to do is compare the expression data to the intergenic variation data, and see if igr variation is influencing changes in gene expression. The structure of the latter is simply a number for each isolate depicting which allele it has. So a locus with 8 different upstream igrs would have the numbers 1-8 appear, one for each isolate. E.g. the below example locus sees 2 alleles in 3 isolates.
Isolate1 Isolate2 Isolate3
Locus0001. 1 2 1
I would like to make correlations across the whole dataset, and also perform each pairwise comparison between isolates. To that end, a colleague suggested I produce two matrices, one for each dataset, and run some stats comparing the matrices.
Each matrix would have the structure:
Isolate1_Isolate2 Isolate1_Isolate3 Isolate2_Isolate3
Locus1 1 1 0
Locus2 1 0 0
With cells coded in binary. The IGR matrix would default to 0 in each cell unless the two isolates had the same allele at that locus. The RNAseq matrix would default to 0 unless the two isolates had similar log2fold expression changes
My question is, is this a useful comparison to be making and if so, what stats would you run to compare the two matrices?
Furthermore, ignoring the pairwise comparisons, how would you go about correlating across the whole dataset? As the locus names are the same in both datasets my initial thought would be selecting loci with high expression changes and checking how much variation they have. So you'd end up with some analysis to see if loci with high variation are overrepresented in the highly differently expressed loci. Is there anything else you would do to correlate those sets?
Sorry for the lengthy post and thank you in advance!
That is an interesting dataset that you have.
I remember I tried doing something similar a while ago (correlating expression and variation data). It is of some interest in the field of evolutionary biology where the hypothesis (if I remember correctly) is - expression is negatively correlated with variation i.e., genes with high expression are important in cellular context, hence they would be resistant to mutations. See this paper for more details - https://academic.oup.com/mbe/article/36/9/2013/5506639 (Figure 3)
Admittedly I did not find similar correlation (at least not too strong) in my dataset.
I am not able to fully grasp the pairwise comparison that you mentioned, but I would have tried to do something similar here too. For example - check if variation in the upstream is also negatively correlated with variation. The paper that I show above looks only at coding regions but, I imagine for non-coding regions, you could use pi (nucleotide diversity) or window-based pi. Usually those upstream and downstream regions could be hubs of regulatory elements associated with the neighboring genes, so you could also try to make a GO-based analysis to see which terms are over-represented for 1)high diversity and high expression, 2)low diversity and high expression etc.
Thanks for your input, I'll absolutely check out that paper and investigate your suggestions more.
I think we're going with the slightly different hypothesis that variation in coding regions correlates with changes in expression, and initial analysis across the whole dataset supports this. If we split loci depending on whether they have any variation at all and compare them to loci without any variation in their igrs, we see the former group tends to have higher average expression changes. Haven't looked to see if that change in expression is negative or not as we've been using absolute values just for the initial look.
Your suggestion about splitting the data into high variation and high exp, high var and low exp etc is great, I'll definitely be doing that and looking into the GO terms that come out of that.
One issue we have is we are using a categorical measure of igr variation (if alleles are different they get different numbers) rather than quantifying the degree of variation. That could definitely be something we look into once deciding on a way to quantify that variation. My initial thought on a simple quantification would be number_of_mismatches/gene_length *100 to get a percentage variation. Could then average that over the 8 isolates or maybe check each comparison to see if greater divergence between a locus in any two isolates correlates with a greater expression change between the two at that locus.
So if I understand you correctly, you compare group A (loci with at least one variation) and group B (loci with no variation at all - highly conserved). And you see group A is more likely to have fluctuating expression values across your samples as compared to group B? Just thinking out loud - that is an interesting observation suggesting that the highly conserved loci do not change their expression intensities across samples (also intuitive to understand probably). If I were you, I would be tempted to check if there is an over-representation of house-keeping genes in group B (just a hunch).
Again I am thinking out loud - you could probably treat those 8 samples as a small population (given that they are biologically similar i.e. come from same species). The metric that you describe is intuitive (number_ofmismatches/gene_length*100), but if I were you then I would use pi instead which is kind of similar to what you describe but just over the whole population. There are some tools (packages in R too) where you can provide your set of sequences (all 8 samples) and get pi estimates. For regions that are longer than 100bp, I would rather use a sliding window pi estimate if there are some variation "hotspots" in the loci.
Yes thats correct, group A loci have greater expression changes than group B loci. Another analysis I'd like to do is to check the variation in the coding regions of both groups as well, see what that shows. Your point about housekeeping genes makes a lot of sense - I havent yet dug down into individual loci or a functional analysis but that would be interesting to see.
They do indeed. All the samples are from the same clonal complex, and half the samples are from the same patient as well, so very very similar. Pi is a good suggestion, that would help quantify the data and handle the pairwise comparison aspect as well potentially. Pegas is an r package I've found from a quick google. Forgive my ignorance but would I format the input list to calculate pi over the entire genome (so the input is
list(genome1, genome2, ...)
), or by locus? (list(locus1_isolate1, locus1_isolate2, ...)
) and just repeat for each locus and get pi on a locus by locus basis?I think given the level of similarity of these samples a sliding window would be okay not to do over the 100-500 nt regions I'm looking at.
Thank you greatly for your input, this has been very helpful!!
Would be very interesting to see if similar patterns are observed for coding regions. I am very curious to know what you find, please do share if you can (or if and when you publish)! The evolutionary hypothesis (from the paper) stating that "important" loci have higher expression and are under high constraint applies even strongly for coding region elements. If I may ask, what metric do you plan to use to quantify the variation in the coding region?
The genome basis will give you a rough overview of the picture of diversity over the whole genome, but (in my opinion), will give too much of a broad view to make a loci-specific comparison. I would rather do the following - have a loci-based pi estimate and assign each locus (which I presume means a stretch of non-coding region in the proximity of a coding region) to a coding region/gene. I would go one step further and would try to aim for a table with three columns per gene (for the two groups) - "diversity in the non-coding region", "diversity in the coding region" and "expression change".