Hi all, I have two datasets of RNA-seq samples: one was prepared with a strand-specific protocol (TruSeq) and the other with an unstranded protocol (Clontech’s SMART). I would like to use both datasets to increase the power of my study. I tried batch effect correction, but it did not go well: on a PCA plot I still see two clear groups separated by the protocol used. Is there a way to account for the difference between the protocols at the mapping/counting level? My understanding is that the principal difference between the two techniques is that the unstranded protocol will generate reads from both strands, even if only one strand was actually expressed. Is there a way to discard reads from strands that were not expressed by using my stranded dataset (assuming that a strand that is not expressed in the stranded dataset should not be expressed in the unstranded dataset either)? Thanks a lot!
Thank you for the detailed answer. I agree with your comments. I may not have been clear about what I would like to achieve with the conversion. I built a model to predict groups from gene expression using the strand-specific samples. I want to validate the model using the unstranded samples and some of the remaining stranded samples (I don't have many to begin with). I was hoping there is a way to compare expression levels between the strand-specific samples and the unstranded samples. Also, it's worth noting that my alignment algorithm uses splice site orientation, so I was able to infer the strand for the spliced unstranded reads. I know I may be losing a lot of information (such as novel genes), but I am not trying to detect genes, just to compare expression levels for selected genes. Thanks
At least for non-overlapping genes, the TPM values should be comparable if the experimental conditions were the same. If you see large differences there, the issue is most likely not just due to the different library prep types.
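To make that check concrete, here is a minimal Python sketch, not a definitive pipeline. It assumes you already have per-gene raw counts, gene lengths in bp, and gene coordinates; the function names `non_overlapping` and `tpm` and all the toy values are made up for illustration. It keeps only genes whose interval overlaps no other gene on either strand, then converts counts to TPM over that subset.

```python
def non_overlapping(genes):
    """Return the set of genes whose interval overlaps no other gene.

    genes: dict of name -> (chrom, start, end), inclusive coordinates.
    Strand is ignored on purpose, because an unstranded library cannot
    tell apart overlapping genes on opposite strands.
    """
    by_chrom = {}
    for name, (chrom, start, end) in genes.items():
        by_chrom.setdefault(chrom, []).append((start, end, name))

    keep = set()
    for items in by_chrom.values():
        items.sort()  # sort by start coordinate
        max_end_so_far = None  # largest end among earlier genes
        for i, (start, end, name) in enumerate(items):
            overlaps_prev = max_end_so_far is not None and start <= max_end_so_far
            overlaps_next = i + 1 < len(items) and items[i + 1][0] <= end
            if not overlaps_prev and not overlaps_next:
                keep.add(name)
            max_end_so_far = end if max_end_so_far is None else max(max_end_so_far, end)
    return keep


def tpm(counts, lengths_bp):
    """Convert raw counts to TPM over the genes present in `counts`."""
    rates = {g: counts[g] / lengths_bp[g] for g in counts}
    total = sum(rates.values())
    if total == 0:
        return {g: 0.0 for g in counts}
    return {g: r / total * 1e6 for g, r in rates.items()}


# Toy data (hypothetical): A and B overlap, C and D stand alone.
genes = {
    "A": ("chr1", 100, 500),
    "B": ("chr1", 400, 900),
    "C": ("chr1", 1000, 2000),
    "D": ("chr2", 100, 300),
}
counts = {"A": 30, "B": 70, "C": 100, "D": 50}
lengths_bp = {"A": 400, "B": 500, "C": 1000, "D": 500}

clean = non_overlapping(genes)
clean_counts = {g: counts[g] for g in clean}
clean_tpm = tpm(clean_counts, lengths_bp)
```

Note that TPM here is renormalized over the retained genes only, so the values are comparable between samples processed the same way but not with TPMs computed over the full annotation. Running this per sample and plotting stranded against unstranded TPMs for your selected genes would show whether the remaining differences are larger than the library prep alone would explain.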