Hi all
So, I'm working with some RNA-seq raw reads, both paired-end (2 x 150 bp) and single-end (1 x 75 bp). These come from the same samples, sequenced differently in 2 rounds. I want to get counts for each at the transcript level using salmon. For the DESeq2 differential expression analysis at the gene level, I just took the raw counts for each sample (sample_1_seq1 and sample_1_seq2) and summed them. This was after checking that the 2 experiments were correlated and similar. This way, I got just one count per sample.
With salmon, I get a quant.sf file for each sample. Inside I see the Length, EffectiveLength, TPM and NumReads for each transcript. Does anyone have any idea how I could get one single .sf file, kind of "merging" sample_1_seq1 and sample_1_seq2?
Maybe I can provide salmon with the 2 sets of reads from the beginning somehow?
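For illustration, the "sum the two runs" idea applied to quant.sf files could look roughly like the Python sketch below. It only sums the NumReads column per transcript. TPM and EffectiveLength are deliberately left out, because they depend on the library and cannot simply be added. The transcript names and numbers are made-up toy data, and the function names are hypothetical.

```python
import csv
import io


def read_num_reads(handle):
    """Return {transcript name: NumReads} from an open quant.sf handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["NumReads"]) for row in reader}


def sum_num_reads(run_a, run_b):
    """Sum NumReads per transcript across two runs of the same sample."""
    names = sorted(set(run_a) | set(run_b))
    return {name: run_a.get(name, 0.0) + run_b.get(name, 0.0) for name in names}


# Toy quant.sf contents (invented values, standard salmon column names).
HEADER = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
SEQ1 = HEADER + "tx1\t1000\t850.0\t10.0\t120.0\n" + "tx2\t500\t350.0\t5.0\t30.5\n"
SEQ2 = HEADER + "tx1\t1000\t930.0\t12.0\t80.0\n" + "tx2\t500\t430.0\t4.0\t9.5\n"

merged = sum_num_reads(
    read_num_reads(io.StringIO(SEQ1)),
    read_num_reads(io.StringIO(SEQ2)),
)
for name, reads in merged.items():
    print(name, reads)
```

For real files, the `io.StringIO(...)` objects would be replaced by `open("path/to/quant.sf")`.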
Thank you for your advice
There are a few ways to go about doing this technically. However, I would actually suggest _not_ merging the paired-end and single-end runs. Instead, I'd treat the sequencing protocol (SE vs PE) as a technical factor in the design matrix when you want to do differential testing in DESeq2.
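As a rough sketch of what that means in practice: each sequencing run stays a separate column in the count matrix, and the sample table gets an extra column recording the protocol. The Python snippet below just writes such a table as CSV. The conditions, the second sample, and the assignment of seq1 to PE and seq2 to SE are assumptions for illustration only. In DESeq2 the design would then be something along the lines of `~ protocol + condition`.

```python
import csv
import io

# One row per sequencing run. Sample names follow the question;
# the conditions and the PE/SE assignment are hypothetical.
RUNS = [
    ("sample_1_seq1", "sample_1", "control", "PE"),
    ("sample_1_seq2", "sample_1", "control", "SE"),
    ("sample_2_seq1", "sample_2", "treated", "PE"),
    ("sample_2_seq2", "sample_2", "treated", "SE"),
]


def build_coldata(runs):
    """Return a CSV sample table with the protocol as its own column."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["run", "sample", "condition", "protocol"])
    writer.writerows(runs)
    return out.getvalue()


coldata = build_coldata(RUNS)
print(coldata)
```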
I tend to agree with Rob here. However, if you really insist on merging them, I would propose taking the forward reads of the paired run (and only those!!), merging them with the SE reads, and running salmon on that combined input file. (I'm not sure, though, what the influence of the different read lengths will be.)
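A minimal sketch of the merging step, in Python: the R1 file of the paired run and the SE file are appended byte for byte into one file. This works for gzipped FASTQ too, since concatenated gzip streams are still a valid gzip file. The file names and read contents are toy placeholders.

```python
import gzip
import os
import shutil
import tempfile


def concat_fastq(inputs, output):
    """Append the input files to `output` byte for byte, in order."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Toy gzipped FASTQ files standing in for the real runs.
workdir = tempfile.mkdtemp()
r1 = os.path.join(workdir, "sample_1_seq1_R1.fastq.gz")
se = os.path.join(workdir, "sample_1_seq2.fastq.gz")
with gzip.open(r1, "wt") as fh:
    fh.write("@pe_read1/1\nACGTACGT\n+\nIIIIIIII\n")
with gzip.open(se, "wt") as fh:
    fh.write("@se_read1\nTTGGCCAA\n+\nIIIIIIII\n")

combined = os.path.join(workdir, "sample_1_combined.fastq.gz")
concat_fastq([r1, se], combined)

# Read the combined file back to confirm both reads are present.
with gzip.open(combined, "rt") as fh:
    combined_lines = fh.read().splitlines()
print(len(combined_lines) // 4, "reads in combined file")
```

The combined file would then go to salmon as single-end input, for example `salmon quant -i <index> -l A -r sample_1_combined.fastq.gz -o <outdir>`, with the index and output paths filled in for your setup.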
The thing here is that the experimental design is suboptimal, and your analysis, if you really want to do it properly (at least what I think would be proper), is limited by the "weakest link of the chain", which is 1 x 75 bp. I would therefore trim everything to 75 bp, keeping only R1, and then check for potential other batch effects using something like PCA or MDS. The latter is probably not necessary if it is indeed the exact same pool of cDNA you sequenced.
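To make the trimming step concrete, here is a small Python sketch that cuts every read (and its quality string) in a FASTQ stream down to at most 75 bases. It assumes plain 4-line FASTQ records and uses toy reads. In practice a dedicated trimming tool would normally do this job, so take it only as an illustration of the idea.

```python
import io


def trim_fastq(src, dst, max_len=75):
    """Copy 4-line FASTQ records from src to dst, truncating the
    sequence and quality lines to at most max_len characters."""
    while True:
        header = src.readline()
        if not header:
            break  # end of input
        seq = src.readline().rstrip("\n")
        plus = src.readline()
        qual = src.readline().rstrip("\n")
        dst.write(header)
        dst.write(seq[:max_len] + "\n")
        dst.write(plus)
        dst.write(qual[:max_len] + "\n")


# Toy input: one 150 bp read (gets trimmed) and one 60 bp read (left as is).
toy = (
    "@read1/1\n" + "A" * 150 + "\n+\n" + "I" * 150 + "\n"
    "@read2/1\n" + "C" * 60 + "\n+\n" + "I" * 60 + "\n"
)
out = io.StringIO()
trim_fastq(io.StringIO(toy), out)
trimmed_lines = out.getvalue().splitlines()
print([len(line) for line in trimmed_lines[1::4]])
```

For real data, `src` and `dst` would be file handles for the R1 file of the paired-end run and the trimmed output, opened in text mode (for example with `gzip.open(path, "rt")` for gzipped input).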
Thank you all for useful suggestions