I have RNA-seq data from four cell fractions derived from two developmental stages. There are 6 replicates (donors) for stage A and 5 for stage B.
I am trying to mitigate an obvious batch effect that is due to the donor rather than the cell fraction (see the list of samples below). When I use the design ~name, I can see the donor effect in the PCA plot: samples from the same donor cluster together more closely than samples from the same fraction.
I am having difficulty coming up with a design that mitigates the issue, because samples from stage A are not from the same donors as samples from stage B. I am interested in DE calls between different cell fractions within the same stage and between stages.
When I try to include the donor information in the design (~name + donor), I get an error that the model matrix is not full rank. I assume I need some kind of nested design, but I cannot wrap my head around it.
sample name donor stage fraction
A_4mm Amm A4 A mm
A_4mp Amp A4 A mp
A_4pm Apm A4 A pm
A_4pp App A4 A pp
A_5mm Amm A5 A mm
A_5mp Amp A5 A mp
A_5pm Apm A5 A pm
A_5pp App A5 A pp
A_7mm Amm A7 A mm
A_7mp Amp A7 A mp
A_7pm Apm A7 A pm
A_7pp App A7 A pp
A_1mm Amm A1 A mm
A_1mp Amp A1 A mp
A_1pm Apm A1 A pm
A_1pp App A1 A pp
A_2mm Amm A2 A mm
A_2mp Amp A2 A mp
A_2pm Apm A2 A pm
A_2pp App A2 A pp
A_3mm Amm A3 A mm
A_3mp Amp A3 A mp
A_3pm Apm A3 A pm
A_3pp App A3 A pp
B_1mm Bmm B1 B mm
B_1mp Bmp B1 B mp
B_1pm Bpm B1 B pm
B_1pp Bpp B1 B pp
B_2mm Bmm B2 B mm
B_2mp Bmp B2 B mp
B_2pm Bpm B2 B pm
B_2pp Bpp B2 B pp
B_3mm Bmm B3 B mm
B_3mp Bmp B3 B mp
B_3pm Bpm B3 B pm
B_3pp Bpp B3 B pp
B_5mm Bmm B5 B mm
B_5mp Bmp B5 B mp
B_5pm Bpm B5 B pm
B_5pp Bpp B5 B pp
B_6mm Bmm B6 B mm
B_6mp Bmp B6 B mp
B_6pm Bpm B6 B pm
B_6pp Bpp B6 B pp
I'm not sure this will solve your error, but the batch effect is usually specified first, e.g. ~batch + treatment, or in your case ~donor + name (if donor is your batch effect).
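As a sanity check on that suggestion, here is a small pure-Python sketch (not DESeq2 itself) that rebuilds the sample sheet from the question and computes the rank of the design matrix by hand. It assumes treatment coding with the first level of each factor dropped, which is the default in R's model.matrix, and the helper names (dummies, rank, indicator, donor_n) are made up for illustration. Under those assumptions, the order of the terms should not affect the rank: the likely cause of the error is that each donor belongs to exactly one stage while name already encodes the stage. The second half tries the recoding that the DESeq2 vignette describes for individuals nested within groups: number the donors within each stage and use ~ stage + stage:donor_n + stage:fraction, then drop the all-zero column that arises because stage B has one donor fewer.

```python
from fractions import Fraction

# Sample sheet rebuilt from the table in the question:
# every donor contributes all four fractions, and name = stage + fraction.
donors = ["A1", "A2", "A3", "A4", "A5", "A7", "B1", "B2", "B3", "B5", "B6"]
cell_fractions = ["mm", "mp", "pm", "pp"]
samples = [{"donor": d, "stage": d[0], "fraction": f, "name": d[0] + f}
           for d in donors for f in cell_fractions]


def dummies(values):
    """Indicator columns for a factor, dropping the first (reference) level."""
    levels = sorted(set(values))[1:]
    return [[1 if v == lev else 0 for v in values] for lev in levels]


def indicator(test):
    """One 0/1 column: 1 for every sample where test(sample) is true."""
    return [1 if test(s) else 0 for s in samples]


def rank(columns):
    """Matrix rank by exact Gaussian elimination (columns given as lists)."""
    rows = [[Fraction(c[i]) for c in columns] for i in range(len(columns[0]))]
    r = 0
    for col in range(len(columns)):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot here: this column depends on earlier ones
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


intercept = [[1] * len(samples)]
name_cols = dummies([s["name"] for s in samples])
donor_cols = dummies([s["donor"] for s in samples])

# ~ name + donor and ~ donor + name: same columns, different order.
additive = intercept + name_cols + donor_cols
reordered = intercept + donor_cols + name_cols
print(len(additive), rank(additive))    # 18 columns, rank 17
print(len(reordered), rank(reordered))  # 18 columns, rank 17

# Hypothetical recoding: number the donors 1..k within each stage.
donor_n = {}
for stage in ("A", "B"):
    in_stage = [d for d in donors if d[0] == stage]
    for k, d in enumerate(in_stage, start=1):
        donor_n[d] = k

# ~ stage + stage:donor_n + stage:fraction, built column by column.
nested = intercept + [indicator(lambda s: s["stage"] == "B")]
for stage in ("A", "B"):
    for k in range(2, 7):  # donor 1 of each stage is the reference
        nested.append(indicator(
            lambda s, st=stage, k=k: s["stage"] == st and donor_n[s["donor"]] == k))
    for f in cell_fractions[1:]:  # "mm" is the reference fraction
        nested.append(indicator(
            lambda s, st=stage, f=f: s["stage"] == st and s["fraction"] == f))

# Stage B has only five donors, so its donor_n == 6 column is all zeros.
nested = [c for c in nested if any(c)]
print(len(nested), rank(nested))        # 17 columns, rank 17
```

In R, the equivalent would presumably be to build this matrix with model.matrix, remove the all-zero column, and supply the result to DESeq2 as the vignette describes; that part is untested here. Note that the donor terms only help the fraction comparisons within a stage. A direct stage A versus stage B comparison is still a comparison between different donors.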
I will try that, thank you. I am still having trouble understanding how these GLM-based methods work.