Question

Merging the output of salmon for RNA-seq counting with paired-end and single-end reads simultaneously

0

6.0 years ago

nanoide ▴ 120

Hi all

So, I'm working with some RNA-seq raw reads, both paired-end (2 x 150 bp) and single-end (1 x 75 bp). These come from the same samples, but were sequenced differently in 2 rounds. I want to get counts for each at the transcript level using salmon. For DESeq2 differential expression analysis at the gene level, I just took the raw counts for each sample (sample_1_seq1 and sample_1_seq2) and summed them. This was after checking that the 2 runs were correlated and the 2 experiments were similar. This way, I got just one count for each sample.

With salmon, I got a quant.sf file for each sample. Inside I see the Length, EffectiveLength, TPM and NumReads for each transcript. Does anyone have any idea how I could get one single .sf file, kind of "merging" sample_1_seq1 and sample_1_seq2?
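For illustration, the count-summing idea described above could be sketched in Python roughly as follows. This is only a minimal sketch under assumptions: the function names and the tiny in-memory tables are made up, and only NumReads is summed. TPM and EffectiveLength are left out on purpose, since they do not simply add across runs with different read lengths.

```python
import csv
import io

def read_numreads(handle):
    """Return {transcript name: NumReads} from an open salmon quant.sf file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["NumReads"]) for row in reader}

def sum_numreads(counts_a, counts_b):
    """Add per-transcript read counts from two runs of the same sample."""
    names = sorted(set(counts_a) | set(counts_b))
    return {name: counts_a.get(name, 0.0) + counts_b.get(name, 0.0) for name in names}

# Made-up quant.sf contents standing in for sample_1_seq1 and sample_1_seq2.
HEADER = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
seq1 = io.StringIO(
    HEADER
    + "tx1\t1000\t850.0\t10.0\t100.0\n"
    + "tx2\t500\t350.0\t5.0\t20.5\n"
)
seq2 = io.StringIO(
    HEADER
    + "tx1\t1000\t926.0\t12.0\t40.0\n"
    + "tx2\t500\t426.0\t4.0\t9.5\n"
)

merged = sum_numreads(read_numreads(seq1), read_numreads(seq2))
for name, num_reads in merged.items():
    print(f"{name}\t{num_reads}")
```

With real data the two `io.StringIO` objects would be replaced by the opened quant.sf files of the two runs.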

Maybe I can provide salmon from the beginning with the 2 sets of reads somehow?

Thank you for your advice

RNA-Seq single-end paired-end salmon • 3.8k views

2

There are a few ways to go about doing this technically. However, I would actually suggest _not_ merging the paired-end and single-end runs. Instead, I'd treat the sequencing protocol (SE vs PE) as a technical factor in the design matrix when you want to do differential testing in DESeq2.
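As a rough sketch of what keeping the runs separate could look like, the snippet below builds a sample table with one row per sequencing run and a `protocol` column. Everything specific in it is an assumption for illustration: which round was PE and which was SE, the second sample, and the `condition` values are all hypothetical. In DESeq2 a table like this would typically serve as the column data, with a design along the lines of `~ protocol + condition`.

```python
import csv
import sys

# Assumption: seq1 is the paired-end round and seq2 the single-end round.
PROTOCOL_BY_RUN = {"seq1": "PE", "seq2": "SE"}
FIELDS = ["run_id", "sample", "protocol", "condition"]

def build_sample_table(conditions):
    """One row per (sample, sequencing round), with the protocol as its own column."""
    rows = []
    for sample, condition in conditions.items():
        for run, protocol in PROTOCOL_BY_RUN.items():
            rows.append({
                "run_id": f"{sample}_{run}",
                "sample": sample,
                "protocol": protocol,
                "condition": condition,
            })
    return rows

# Hypothetical samples and conditions, for illustration only.
table = build_sample_table({"sample_1": "control", "sample_2": "treated"})

writer = csv.DictWriter(sys.stdout, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(table)
```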

6.0 years ago by Rob 7.1k

2

I tend to agree with Rob here. However, if you really insist on merging them, I would propose taking the forward reads of the paired run (and only those!) and merging them with the SE reads, then running salmon on that combined input file. (I'm not sure, though, what the influence of the different read lengths will be.)
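A minimal sketch of that merging step, assuming gzipped FASTQ input, might look like this. The file names and the one-read demo files are hypothetical. The combined file would then be given to salmon as unmated reads (the `-r` option of `salmon quant`).

```python
import gzip
import os
import shutil
import tempfile

def concat_fastq_gz(inputs, output):
    """Append gzipped FASTQ files, in order, into one gzipped FASTQ."""
    with gzip.open(output, "wb") as out:
        for path in inputs:
            with gzip.open(path, "rb") as src:
                shutil.copyfileobj(src, out)

# Hypothetical file names; tiny demo files are written so the sketch runs as is.
workdir = tempfile.mkdtemp()
r1_path = os.path.join(workdir, "sample_1_seq1_R1.fastq.gz")  # forward reads of the PE run
se_path = os.path.join(workdir, "sample_1_seq2.fastq.gz")     # reads of the SE run
combined_path = os.path.join(workdir, "sample_1_combined.fastq.gz")

with gzip.open(r1_path, "wt") as fh:
    fh.write("@pe_read1/1\nACGTACGT\n+\nIIIIIIII\n")
with gzip.open(se_path, "wt") as fh:
    fh.write("@se_read1\nTTGCA\n+\nIIIII\n")

concat_fastq_gz([r1_path, se_path], combined_path)

# Count records in the combined file (4 lines per FASTQ record).
with gzip.open(combined_path, "rt") as fh:
    n_records = sum(1 for _ in fh) // 4
print(n_records)
```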

6.0 years ago by lieven.sterck 15k

2

The thing here is that the experimental design is suboptimal, and your analysis, if you really want to do it properly (at least what I think would be proper), is limited by the "weakest link of the chain", which is the 1x75 bp run. I would therefore trim everything to 75 bp, keeping only R1, and then check for other potential batch effects using something like PCA or MDS. That check is probably not necessary if it is indeed the exact same pool of cDNA you sequenced.
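The trimming step could be sketched as below, assuming plain four-line FASTQ records (no wrapped sequence lines). The function name and the demo reads are hypothetical. In practice a dedicated read trimmer would normally be used, and this only shows the idea of cutting R1 down to 75 bp.

```python
import io

def trim_fastq(src, dst, max_len=75):
    """Copy FASTQ records from src to dst, truncating bases and qualities to max_len."""
    while True:
        header = src.readline()
        if not header:
            break  # end of input
        seq = src.readline().rstrip("\n")
        plus = src.readline()
        qual = src.readline().rstrip("\n")
        dst.write(header)
        dst.write(seq[:max_len] + "\n")
        dst.write(plus)
        dst.write(qual[:max_len] + "\n")

def fastq_record(name, seq):
    """Build one made-up FASTQ record with a constant quality string."""
    return f"@{name}\n{seq}\n+\n{'I' * len(seq)}\n"

# Demo input: one 150 bp read (as in the PE run) and one read already shorter than 75 bp.
raw = io.StringIO(fastq_record("read1/1", "A" * 150) + fastq_record("read2/1", "C" * 60))
trimmed = io.StringIO()
trim_fastq(raw, trimmed)

# Sequence lines are the second line of every 4-line record.
lengths = [len(line) for i, line in enumerate(trimmed.getvalue().splitlines()) if i % 4 == 1]
print(lengths)
```

Reads shorter than the cutoff pass through unchanged, so the same function can be applied to both runs.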