Hi Biostars! I am coming to you with a relatively simple question, but one that I have surprisingly not found an answer to. I am working with a case cohort of samples that were prepared for RNA-seq as paired-end reads with a read length of 125 bp (each mate).
I am now wondering how to handle alignment with respect to control data that I have obtained through TCGA, which consists of paired-end 75 bp reads (sequenced on the same platform, same organism, same tissue and using a protocol very similar to that used for the case cohort). I am running the alignment through STAR (I know that RSEM is generally preferable, but unfortunately it is not an option in this case), and when preparing the index I generated it with --sjdbOverhang 124 (read length minus 1).
My question is: should I generate a separate index for the control cohort that matches the read length of the controls (i.e. --sjdbOverhang 74, from 75 - 1) and align them against that, or should I align the controls against the same index used for the case group? The alignments will subsequently be used for differential gene expression analysis (with read counting carried out using featureCounts from the Subread package).
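For concreteness, this is roughly how I am generating the indices (paths, file names and thread count below are placeholders); the first call is what I have already done for the cases, and the second is the separate 74 bp index I am asking about for the controls:

```python
# Minimal sketch of building one STAR index per read length, so each cohort
# could be aligned with a matching --sjdbOverhang (read length - 1).
# Genome/annotation paths are placeholders.
import subprocess

GENOME_FASTA = "GRCh38.primary_assembly.genome.fa"   # placeholder path
ANNOTATION_GTF = "gencode.annotation.gtf"            # placeholder path

def build_star_index(genome_dir: str, sjdb_overhang: int, threads: int = 8) -> None:
    """Run STAR in genomeGenerate mode with the given splice-junction overhang."""
    cmd = [
        "STAR",
        "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", GENOME_FASTA,
        "--sjdbGTFfile", ANNOTATION_GTF,
        "--sjdbOverhang", str(sjdb_overhang),
    ]
    subprocess.run(cmd, check=True)

# Case cohort: 125 bp reads -> overhang 124; TCGA controls: 75 bp reads -> overhang 74
build_star_index("star_index_125bp", sjdb_overhang=124)
build_star_index("star_index_75bp", sjdb_overhang=74)
```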
I apologize if this question is very basic or has been answered before, but I have previously only worked with case/control samples from the same source, and I have not found relevant information on working with case/control cohorts of different read lengths.
I would not even do the trimming. Any differential analysis of your own samples (cases or normals/controls) versus anything in TCGA (progressed cancers) is going to yield thousands of DEGs; I really doubt that read length is a relevant factor here, given the tremendous batch effects between unrelated datasets, especially at large sample sizes. It is going to be (and I really feel sorry saying that) a pointless analysis that by design should never have been done that way. For any case you need a control processed at the same time, same lab, same kit, same everything, or you cannot compare, especially if you expect many DEGs.
Thank you for the extensive answer, I highly appreciate it! I understood the basic limitations imposed by batch effects from previous experience with DNA-methylation analysis, but as I was not certain how sensitive RNA-seq data is in this context, I thought it best to ask here before I started downloading such enormous amounts of data. I was primarily hoping to construct a network model of the large-scale differences in DEGs between the sub-populations of my case set, but it seems constructing paired case/control sample sets is the way to go (as I feared). Given your comments, and the comments left by other users, I will have to focus on the case vs. case differences (as I am also investigating the effects of sub-populations within the data). Once more, thank you for taking the time to reply to such a "stupid" question. I am quite new to this type of analysis, and while there is an extensive amount of information out there on bioinformatics, it can be difficult to find guidelines explaining the correct approach and the "common sense" for a specific type of analysis.
If you are interested in network building, then case vs. case is definitely the way to go. In general, network building is an alternative to DGE. They tend to lead to the same conclusions, but where the appropriate design can be specified, DGE is generally more sensitive. Network approaches work well when specifying that kind of hypothesis (the correct experimental design and the identification of the most informative two-way comparison) isn't possible.
Thank you kindly for the explanation, it is highly appreciated. I suppose I was a bit confused because you quite often see people validating their findings against external data such as TCGA studies and the like.
I understand if you do not have the time to answer such basic questions, but are these approaches then "incorrect" in terms of statistical validity, or is it more accepted in the scientific community to perform two separate DEG analyses, one for the "local" cohort and one for the external cohort, and then compare their respective results to one another? I know that this approach is quite common, but would the biases described above not then carry over (i.e. the log2FC for a DEG being remarkably different between the two cohorts because of the previously mentioned batch effects)? To my understanding, the batch effects do not disappear after performing the DEG analysis through a program such as DESeq2 (and batch-effect removal software such as ComBat/RUV has recently been shown to introduce false positives into datasets).
I am asking because I am very interested in seeing how genes in certain sub-populations within my cohort compare both internally against other subtypes and in relative expression against a normal population (i.e. looking at the inner and outer joins of the DEGs for case1 vs. case2 and case1 vs. control). But after your thorough explanation above, that seems like a bad idea in general, which makes the prevalence of studies using external cohorts as part of their validation process all the more confusing.
I understand that many within the field of biology (myself included) have a limited understanding of the statistics behind the generated output (largely due to the black-box problem arising from the many advanced and often highly specific elements of bioinformatic analysis), but then it is all the more surprising that so many studies employing this methodology get published at all.
Again, thank you in advance for any answers you have the time to give; getting insights from someone with as much experience in the field as yourself is illuminating.
The assumption of these studies is that any batch effect within a batch affects both treatment and control equally, so that when you divide one by the other, the effects cancel out. This is almost certainly not strictly true, but people assume it works well enough to provide some insight. P-values from two different studies are probably not directly comparable due to differing power between the two studies, but there are meta-analysis techniques that can help you combine p-values from different studies. It is also generally assumed that log2FC will be in the same direction between two studies, even if not of exactly comparable magnitude, so it is valid to take your DE genes from one cohort of treatment vs. control and ask whether their log2FCs have the same sign as those you find in another study.
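As a rough illustration of the meta-analysis idea (the file and column names below are made up), combining per-gene p-values from two independent DE result tables with Fisher's method could look something like this:

```python
# Rough illustration (hypothetical file and column names) of combining
# per-gene p-values from two independent DE analyses with Fisher's method.
import pandas as pd
from scipy.stats import combine_pvalues

# Each table is assumed to have columns: gene, log2FC, pvalue
study_a = pd.read_csv("cohortA_deseq2_results.csv")   # placeholder file
study_b = pd.read_csv("cohortB_deseq2_results.csv")   # placeholder file

merged = study_a.merge(study_b, on="gene", suffixes=("_a", "_b"))

# Fisher's method combines the two independent p-values for each gene
merged["combined_p"] = merged.apply(
    lambda row: combine_pvalues([row["pvalue_a"], row["pvalue_b"]],
                                method="fisher")[1],
    axis=1,
)
```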
In general, any batch correction technique requires that batches overlap in the conditions they contain. The algorithm can then compare the distributions of two sets of samples that should be the same and, from how different those supposedly identical sets actually are, calculate a correction factor that can be applied to the rest of the batch (this is, of course, a massive oversimplification, but the idea holds).
If the samples don't overlap, there is simply no way a correction can be applied. If gene A is expressed twice as highly in the cancer cohort as in the normal cohort, is that due to the condition of interest, or is it due to the fact that the libraries were prepared using a different manufacturer's kit? No amount of statistics can answer that question.
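To make the "no overlap" point concrete, here is a toy example (made-up samples) showing that when condition and batch coincide completely, the design matrix is rank-deficient, so no linear model can separate the two terms:

```python
# Toy illustration (made-up samples) of full confounding: every case comes
# from batch 1 and every control from batch 2, so the batch term cannot be
# separated from the condition term.
import numpy as np
import pandas as pd

samples = pd.DataFrame({
    "condition": ["case", "case", "case", "control", "control", "control"],
    "batch":     ["b1",   "b1",   "b1",   "b2",      "b2",      "b2"],
})

# Intercept + dummy-coded condition and batch (drop_first avoids the usual
# dummy trap; any remaining rank deficiency is due to the confounding).
design = pd.get_dummies(samples, drop_first=True).astype(float)
design.insert(0, "intercept", 1.0)

print(design)
print("columns:", design.shape[1], "rank:", np.linalg.matrix_rank(design.values))
# rank (2) < columns (3): condition and batch are collinear, so no linear
# model can estimate a separate batch effect for this design.
```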
Even where applying batch correction is possible, as you note, it is not always successful. This is probably because batch effects are assumed to be linear, which is not always the case.
As to how these studies get published? I can really think of three explanations:
BTW - Most TCGA cancer types do have normals amongst the samples.
Again, thank you so much for these very clearly explained summaries. After reading some articles on the matter, I feel I have a much better grasp of the thought process behind constructing a more statistically valid experiment in the future. It truly is strange how, in a field as well established and prestigious as biology, there is so little focus on bioinformatics and the requirements essential to producing accurate findings.
Comparing the expression trends in my own data against the output of a case-control study from an external cohort of the same subtype then seems to be the best way to go (i.e. checking whether a gene is over- or under-expressed in my own subtype1 vs. subtype2, and whether this is consistent with the findings of the external case vs. control study, ideally with the same subtype as the case in the external cohort). Obviously any findings would have to be validated through wet-lab work, but thankfully arranging that will likely not be an issue. Thank you so much for the time and effort put into these answers!
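Something along these lines is what I have in mind (the file and column names are hypothetical): simply checking whether the genes significant in my subtype1 vs. subtype2 comparison change in the same direction in the external case vs. control results.

```python
# Sketch (hypothetical file and column names) of checking whether log2FC
# directions agree between my subtype1-vs-subtype2 comparison and an external
# case-vs-control comparison, for genes significant in my own cohort.
import numpy as np
import pandas as pd

internal = pd.read_csv("subtype1_vs_subtype2_deseq2.csv")   # gene, log2FC, padj
external = pd.read_csv("external_case_vs_control.csv")      # gene, log2FC, padj

sig = internal[internal["padj"] < 0.05]
merged = sig.merge(external, on="gene", suffixes=("_int", "_ext"))

merged["concordant"] = np.sign(merged["log2FC_int"]) == np.sign(merged["log2FC_ext"])
print(f"{merged['concordant'].mean():.1%} of my significant genes "
      "change in the same direction in the external cohort")
```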