This is more a modelling question using DESeq2 than a technical question about how to use DESeq2.
I have a somewhat complex set (or sets) of RNA-seq data on some parasitic worms:
I have 7 different populations of worms. Unfortunately, RNA was extracted and sequenced on 3 different occasions (denoted by lane). I'm guessing that it's impossible to control for lane by including it as a factor in the design, as I don't have each treatment (population) in each lane (see below). Is this correct? Should I just remove lane from the factor table, given that I can't fully account for it?
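To make the confounding concrete, here is a minimal sketch in plain Python (standard library only, not DESeq2 itself). It builds a treatment-coded design matrix for an additive lane + population model from the population-to-lane assignment listed further down, and compares the number of columns to the matrix rank. The function names are made up for illustration, and only one stringency (3 bioreplicates per population) is assumed.

```python
from fractions import Fraction

# Population -> lane, copied from the sample listing in this post.
LANE_OF = {"I": 1, "W": 1, "C": 1, "WB": 2, "CB": 2, "L": 3, "U": 3}


def design_matrix(samples, pop_levels, lane_levels):
    """Treatment coding: intercept, lane dummies, population dummies
    (the first level of each factor is the reference and gets no column)."""
    rows = []
    for pop, lane in samples:
        row = [1]
        row += [1 if lane == level else 0 for level in lane_levels[1:]]
        row += [1 if pop == level else 0 for level in pop_levels[1:]]
        rows.append(row)
    return rows


def matrix_rank(rows):
    """Rank via Gaussian elimination on exact fractions."""
    m = [[Fraction(x) for x in r] for r in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


pops = list(LANE_OF)
lanes = sorted(set(LANE_OF.values()))
# Assumption: three bioreplicates per population, a single mapping stringency.
samples = [(p, LANE_OF[p]) for p in pops for _ in range(3)]

X = design_matrix(samples, pops, lanes)
n_columns = len(X[0])
rank = matrix_rank(X)
print(n_columns, rank)  # prints "9 7"
```

The rank is lower than the number of columns because each lane column is just the sum of the columns for the populations run in that lane, so the lane effects cannot be separated from the population effects.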
Second, our reference genome strain was created from the I population, and all reads were mapped to this genome. We are concerned about how much the level of polymorphism between the strains, specifically between I and the rest, is influencing mapping rates (and, if polymorphism varies throughout the genome as we expect, whether this variation is biasing DE calls). In fact, we see far fewer reads mapping from the non-I populations. As a result, we increased the allowance for polymorphism when mapping reads with tophat2 (which does seem to mitigate the problem, based on mapping rates and the total number of DE genes between I and the other populations). So we have 3 count table entries for each population, denoted strict, medium, and lax, which represent how much polymorphism was allowed when the reads were mapped. Note, though, that each of the three bioreplicates in I strict is the same bioreplicate as in I medium and I lax; each bioreplicate was simply mapped at these three different settings.
The notation below is: Population, Lane, SNPs/indels allowed when mapping reads
(each entry represents all 3 bioreplicates)
I, 1, strict
I, 1, medium
I, 1, lax
W, 1, strict
W, 1, medium
W, 1, lax
C, 1, strict
C, 1, medium
C, 1, lax
WB, 2, strict
WB, 2, medium
WB, 2, lax
CB, 2, strict
CB, 2, medium
CB, 2, lax
L, 3, strict
L, 3, medium
L, 3, lax
U, 3, strict
U, 3, medium
U, 3, lax
So I guess my question is: is it possible to control for the changes in read mapping success rate that come from varying the polymorphism allowance? Or is it impossible to factor this in with DESeq2, given that these aren't "true" replicates but the same replicates mapped at different stringencies?
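To spell out what "not true replicates" means in terms of the count table, here is a small sketch in plain Python that enumerates the columns implied by the listing above. The library identifiers (e.g. `I_rep1`) are hypothetical names invented for illustration.

```python
from itertools import product

# Population -> lane, copied from the listing in this post.
LANE_OF = {"I": 1, "W": 1, "C": 1, "WB": 2, "CB": 2, "L": 3, "U": 3}
STRINGENCIES = ["strict", "medium", "lax"]
N_BIOREPS = 3

# One count-table column per (population, bioreplicate, stringency).
# "library" is a made-up identifier for the physical sequencing library.
columns = [
    {
        "population": pop,
        "lane": LANE_OF[pop],
        "biorep": rep,
        "stringency": stringency,
        "library": f"{pop}_rep{rep}",
    }
    for pop, rep, stringency in product(
        LANE_OF, range(1, N_BIOREPS + 1), STRINGENCIES
    )
]

n_columns = len(columns)
n_libraries = len({c["library"] for c in columns})
print(n_columns, n_libraries)  # prints "63 21"
```

The full table has three times as many columns as there are independent libraries, so treating every column as its own sample would count each library three times.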
Finally, if I shouldn't try to control for mapping success rate, would it be wise to choose one of the mapping stringencies and create a read count table for it on its own? (Medium gives me the highest mapping rates in all populations, and is the setting where I see the lowest number of DE genes between I and the other populations.) In other words, remove the other two levels from the table.
I have noticed that doing this (removing the other two levels) actually changes the base means, and thus the DE calls, by a pretty substantial margin. This is why I'm wondering which approach is the proper one!
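One likely reason the base means move: DESeq2 reports baseMean as the mean of normalized counts over all samples in the dataset, and the size factors used for normalization are themselves estimated (by median of ratios) from those same samples. Dropping columns therefore changes both. The sketch below is a simplified pure-Python re-implementation of that idea, not DESeq2's own code, and the counts are made-up toy numbers used only to show the effect.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.
    counts: dict mapping gene -> list of raw counts, one per sample."""
    # Only genes with no zero counts contribute to the geometric means.
    genes = [g for g, row in counts.items() if all(c > 0 for c in row)]
    n_samples = len(next(iter(counts.values())))
    log_geo_mean = {
        g: sum(math.log(c) for c in counts[g]) / n_samples for g in genes
    }
    return [
        math.exp(median(math.log(counts[g][j]) - log_geo_mean[g] for g in genes))
        for j in range(n_samples)
    ]


def base_means(counts):
    """Mean of size-factor-normalized counts across all samples."""
    sf = size_factors(counts)
    return {
        g: sum(c / s for c, s in zip(row, sf)) / len(row)
        for g, row in counts.items()
    }


# Toy counts (invented). Column order:
# A_strict, A_medium, A_lax, B_strict, B_medium, B_lax
toy = {
    "gene1": [100, 120, 110, 50, 90, 80],
    "gene2": [200, 210, 205, 150, 200, 190],
    "gene3": [10, 30, 25, 2, 20, 15],
}
medium_cols = [1, 4]
toy_medium = {g: [row[j] for j in medium_cols] for g, row in toy.items()}

full = base_means(toy)
medium_only = base_means(toy_medium)
for gene in toy:
    print(gene, round(full[gene], 1), round(medium_only[gene], 1))
```

With these toy numbers the base mean of every gene differs between the full table and the medium-only table, even though the medium columns themselves are unchanged.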
Thanks so much for your input. I'm somewhat in over my head in trying to make these decisions!
Andrew