Hi,
I have data from forty-five individuals sampled before and after treatment (paired samples) and would like to identify differentially edited sites between these conditions.
I intend to use a framework similar to the one used for finding differentially methylated sites and ASE (specifically edgeR).
My input count table looks like this:
                   ref1 edit1  ref2 edit2  ref3 edit3  ref4 edit4  ref5 edit5  ref6 edit6
Coordinate_1_A_G     10    90    11    54    19    65    16     2    18     0    12     2
Coordinate_2_T_C     20    91    65    94    55    79    62   602    58   224    64   575
Coordinate_3_T_C     16    65    18    77    15    82    16     5    18     7    17     6
Coordinate_4_A_G     16    15     3    15     5    13     1     6     8     0     9     1
Here ref1 = the number of unedited bases and edit1 = number of edited bases for the respective coordinate for patient1, and so on.
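To make the layout concrete, here is a minimal base R sketch that rebuilds the four example rows and computes the edited fraction, edit / (ref + edit), per site and sample. The object names (`counts`, `ref`, `edit`, `edit_fraction`) and the `sample1`..`sample6` labels are illustrative assumptions, not part of any package.

```r
# Counts copied from the example table; columns alternate ref/edit for samples 1-6.
counts <- matrix(
  c(10, 90, 11, 54, 19, 65, 16,   2, 18,   0, 12,   2,
    20, 91, 65, 94, 55, 79, 62, 602, 58, 224, 64, 575,
    16, 65, 18, 77, 15, 82, 16,   5, 18,   7, 17,   6,
    16, 15,  3, 15,  5, 13,  1,   6,  8,   0,  9,   1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Coordinate_1_A_G", "Coordinate_2_T_C",
      "Coordinate_3_T_C", "Coordinate_4_A_G"),
    paste0(rep(c("ref", "edit"), times = 6), rep(1:6, each = 2))
  )
)

# Split the interleaved columns into unedited (odd) and edited (even) counts.
ref  <- counts[, seq(1, ncol(counts), by = 2)]
edit <- counts[, seq(2, ncol(counts), by = 2)]

# Edited fraction per site and sample (hypothetical sample labels).
edit_fraction <- edit / (ref + edit)
colnames(edit_fraction) <- paste0("sample", 1:6)

print(round(edit_fraction, 2))
```

This is only a descriptive summary of the raw counts; it does not replace a model that accounts for coverage and the paired design.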
I would like to know the best way to model this.
Any thoughts??
I was going through this paper (https://rnajournal.cshlp.org/content/24/11/1481.short), and they use the design below:
design <- model.matrix(~0 + patient_id + treatment:allele)
to identify sites with condition-specific changes in the edited base counts, taking into account the unedited base counts for each sample.
I don't understand the nuances of the design matrix, but could you help me understand how this would differ from the design you have provided?
Many thanks for your guidance
The paper that you cite is counting sequence reads, not bases. I assume that is what you want to do also, although I find your references to "base counts" confusing.
The design matrix that I outlined is designed to test for differences in the proportion of edited vs unedited reads. The design matrix that you quote from the paper is designed to find differences in the abundance of edited and unedited reads separately. They are quite different analyses.
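One way to see what the formula quoted from the paper expands to is to build it on a toy sample sheet and inspect the resulting columns. The sketch below uses base R only; the three patients, the factor levels and the `samples` data frame are made up for illustration and are not taken from the paper or from your data.

```r
# Hypothetical sample sheet: one row per count column,
# i.e. per (patient, treatment, allele) combination.
samples <- expand.grid(
  allele     = c("ref", "edit"),
  treatment  = c("before", "after"),
  patient_id = c("p1", "p2", "p3"),
  stringsAsFactors = FALSE
)

# Fix the factor level order explicitly so the column names are predictable.
samples$allele     <- factor(samples$allele, levels = c("ref", "edit"))
samples$treatment  <- factor(samples$treatment, levels = c("before", "after"))
samples$patient_id <- factor(samples$patient_id, levels = c("p1", "p2", "p3"))

# The formula quoted from the paper, applied to the toy sheet.
design <- model.matrix(~0 + patient_id + treatment:allele, data = samples)

# Inspect which coefficients the formula creates and how many rows it has.
print(colnames(design))
print(dim(design))
```

Looking at the column names shows which terms are patient effects and which are treatment-by-allele terms, which is a useful check before fitting anything.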
The reason I used the term "base counts" is that the counts in my input matrix are the number of edited bases (edits: A->G, T->C, G->C, ...) out of the total number of bases aligned to that single site (coordinate), and ref is the number of unedited bases among the total aligned bases for each site.