Dear All,
I have RNAseq data of a hybrid yeast, which has a lot of gene conversion and loss of heterozygosity between two genomes. I also have RNAseq data of one of its parents.
I was able to phase only 300 genes out of 6000. What I want is to compare gene expression levels between the hybrid and this parent. Since only these 300 genes are phased, I get only 2% uniquely mapped reads in the hybrid, while in the parent it is around 90%.
So my question is whether it is legitimate to use DESeq2 on only this subset of 300 genes. I am wondering whether it is OK to compare such different library sizes.
Thanks,
In my experience, you may run into some normalization problems. Maybe you can try an ANOVA-type test(?). But somebody here with more DESeq2 experience should comment on your situation.
Hi Venu, you are right, my gut feeling says that conceptually it might not fit DESeq2. Regarding ANOVA, what do you suggest exactly? Thanks
Do you really need to phase the alignments to do this? If the two parents are quite similar I would think it'd be better to align to one genome (or use an allele-specific pipeline, ignoring the fact that you don't actually care about allele-specific expression) and use the counts from all of the genes.
Hi Devon, allele-specific expression is exactly what I had to do, that's why I phased the genes :) So now I want to compare the parent and the homeolog. Here we have quite a complex genome and the usual allele-specific pipelines fail, since >70% of the genome has undergone conversion and LOH.
Those 2% vs 90% aligned reads, are those referring to the entire genome or to the specific 300 genes? If the 300 genes of interest are similarly well covered, it may be feasible. You could use the standard kallisto/salmon - tximport - DESeq2 routine just using those 300 genes. At least technically, that should be doable.
The 2% refers to the 300 genes (the rest are multimappers), and the 90% refers to the entire genome. I used a STAR-DESeq2 pipeline.
What I want to try is mapping the parent only to the subset of these 300 genes and then using DESeq2.
In the parental line, when you map to the entire genome, what is the mapping rate on the subset of 300 genes? Could you maybe clarify a bit how you mapped? Ex:
So I think I will subset these 300 genes from the whole-genome mapping, normalize the library size based only on these genes, and compare it with the hybrid.
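To make the idea concrete, here is a minimal sketch of subset-based normalization using the median-of-ratios approach that DESeq2 uses for its size factors. DESeq2 itself is an R package, so this Python version only illustrates the arithmetic; the gene IDs, sample names and counts are made-up toy values, and `subset_counts` and `size_factors` are hypothetical helper names, not part of any library.

```python
import math
from statistics import median


def subset_counts(counts, gene_ids, keep):
    """Restrict each sample's count vector to the genes listed in `keep`."""
    idx = [i for i, g in enumerate(gene_ids) if g in keep]
    return {sample: [vec[i] for i in idx] for sample, vec in counts.items()}


def size_factors(counts):
    """Median-of-ratios size factors for a dict of sample -> raw counts."""
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Only genes with a non-zero count in every sample have a usable
    # geometric mean, so the others are skipped.
    usable = [g for g in range(n_genes)
              if all(counts[s][g] > 0 for s in samples)]
    if not usable:
        raise ValueError("no gene has non-zero counts in all samples")
    # Log of the per-gene geometric mean across samples.
    log_geo = {g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
               for g in usable}
    # Size factor = median ratio of a sample's counts to the geometric means.
    return {s: math.exp(median(math.log(counts[s][g]) - log_geo[g]
                               for g in usable))
            for s in samples}


# Toy data (assumption): g1-g3 are phased, g4 is not.
gene_ids = ["g1", "g2", "g3", "g4"]
phased = {"g1", "g2", "g3"}
raw = {
    "hybrid": [10, 20, 30, 0],       # few reads, only phased genes counted
    "parent": [40, 80, 120, 5000],   # deeper library, whole-genome mapping
}

sub = subset_counts(raw, gene_ids, phased)
sf = size_factors(sub)  # approximately 0.5 for hybrid and 2.0 for parent
normalized = {s: [c / sf[s] for c in sub[s]] for s in sub}
print(sf)
print(normalized)
```

In this toy case the parent is simply four times deeper on the phased genes, so after dividing by the size factors both samples end up on the same scale. With real data the size factors would be computed the same way by DESeq2 once the count matrix contains only the 300 phased genes; whether 300 genes are enough for stable dispersion estimates is a separate question this sketch does not address.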