Question

Can I merge fastq files of the same sample from two sequencers for gene expression analysis?

1

8.2 years ago

Megan ▴ 50

Hi, I am trying to follow the GDC mRNA analysis pipeline to calculate gene expression. I have several samples, and each of them was sequenced twice on different instruments. The first run produced 5G and the second 12G, and the read lengths differ. Can I merge the fastq files for alignment and the subsequent gene expression analysis?

Thanks!

RNA-Seq gene expression STAR • 3.4k views

ADD COMMENT • link updated 8.2 years ago by Petr Ponomarenko ★ 2.8k • written 8.2 years ago by Megan ▴ 50

0

What do 5G and 12G refer to?

ADD REPLY • link 8.2 years ago by GenoMax 150k

0

The sequencing depth of the two runs is different. 5G and 12G refer to the amount of data generated.

ADD REPLY • link 8.2 years ago by Megan ▴ 50

score 0 · Answer 1 · 2017-02-24

0

8.2 years ago

Petr Ponomarenko ★ 2.8k

No, especially if it is a paired-end experiment and you are aligning to the transcriptome. The best way is to merge as late (downstream) as possible, because within each pair you have different distributions and thus different parameters for the statistical models that are used in almost every step, and most of the time those parameters are automatically optimized from your data or part of it. For the same reason you want unimodal distributions, since that is what most of the tools are designed for. By combining data from different instruments with different read lengths, you end up with bimodal distributions all over the place. Please correct me if I am wrong: is there a reason to merge data as early as possible, and if so, what is the best way to do it?

ADD COMMENT • link 8.2 years ago by Petr Ponomarenko ★ 2.8k

0

If the read lengths and library prep were the same, there would be no issue with merging (technical sequencing replicates, e.g. from multiple lanes).

ADD REPLY • link 8.2 years ago by WouterDeCoster 47k

0

If the same sample library was run on two different sequencers, merging the data before alignment should be ok (you would still be able to recognize which sequencer the data came from). To be safe, one could add a factor to differentiate the sequencers when doing DE analysis (in case there is a "sequencer" effect).
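As a rough illustration of how reads in a merged file can still be traced back to their sequencer: the sketch below assumes Illumina-style read headers, where the first colon-separated field after `@` is the instrument ID, and plain 4-line FASTQ records. The instrument names `SEQ_A` and `SEQ_B` and the reads themselves are made up for the example.

```python
from collections import Counter
import io


def count_reads_per_instrument(fastq_handle):
    """Count reads per instrument ID, assuming Illumina-style headers."""
    counts = Counter()
    for line_number, line in enumerate(fastq_handle):
        # Assumes 4-line FASTQ records; the header is the first line of each.
        if line_number % 4 == 0:
            # Header looks like "@INSTRUMENT:RUN:FLOWCELL:LANE:..."
            instrument = line[1:].split(":", 1)[0]
            counts[instrument] += 1
    return counts


# Hypothetical merged FASTQ holding reads from two made-up instruments.
merged_fastq = io.StringIO(
    "@SEQ_A:1:FC1:1:1101:1000:2000 1:N:0:ACGT\n"
    "ACGTACGT\n+\nIIIIIIII\n"
    "@SEQ_B:7:FC2:2:1101:1500:2500 1:N:0:ACGT\n"
    "ACGTACGTACGT\n+\nIIIIIIIIIIII\n"
    "@SEQ_A:1:FC1:1:1101:1200:2100 1:N:0:ACGT\n"
    "TTGCAAGC\n+\nIIIIIIII\n"
)

per_instrument = count_reads_per_instrument(merged_fastq)
print(dict(per_instrument))
```

If the headers follow a different convention, the split on `:` would need to be adapted.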

ADD REPLY • link 8.2 years ago by GenoMax 150k

0

Hi, you mentioned "The best way is to merge as late as possible". What exactly does this mean? Here are two processes I'm thinking about.

Process 1

(1) Alignment (I'm using STAR) (2) Generate raw counts (3) Add the raw counts from the 5G and 12G runs together

Process 2

(1) Merge the fastq files from the 5G and 12G runs (2) Alignment using STAR (3) Generate raw counts

What are the potential artifacts of these two processes? Or is there any reference that deals with this issue? Thanks!
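For step (3) of Process 1, a minimal sketch of adding per-gene raw counts from the two runs. It assumes the counts come from STAR's `--quantMode GeneCounts` output (`ReadsPerGene.out.tab`: gene ID, then unstranded, strand-1 and strand-2 counts, preceded by four `N_` summary rows) and that the unstranded column is the right one for the library. The gene IDs and numbers are invented, and the tables are held in memory rather than read from real files.

```python
import csv
import io


def read_star_gene_counts(handle, column=1):
    """Parse a ReadsPerGene.out.tab-style table into {gene_id: count}."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        gene_id = row[0]
        # Skip the summary rows (N_unmapped, N_multimapping, ...).
        if gene_id.startswith("N_"):
            continue
        counts[gene_id] = int(row[column])
    return counts


def sum_counts(*count_tables):
    """Add raw counts gene by gene across any number of runs."""
    total = {}
    for table in count_tables:
        for gene_id, count in table.items():
            total[gene_id] = total.get(gene_id, 0) + count
    return total


# Made-up tables standing in for the 5G and 12G runs of one sample.
run_5g = io.StringIO(
    "N_unmapped\t100\t100\t100\n"
    "N_multimapping\t50\t50\t50\n"
    "N_noFeature\t20\t30\t40\n"
    "N_ambiguous\t5\t2\t1\n"
    "GENE_A\t10\t6\t4\n"
    "GENE_B\t0\t0\t0\n"
    "GENE_C\t35\t30\t5\n"
)
run_12g = io.StringIO(
    "N_unmapped\t240\t240\t240\n"
    "N_multimapping\t120\t120\t120\n"
    "N_noFeature\t45\t60\t90\n"
    "N_ambiguous\t12\t5\t3\n"
    "GENE_A\t25\t15\t10\n"
    "GENE_B\t3\t2\t1\n"
    "GENE_C\t80\t70\t10\n"
)

combined = sum_counts(read_star_gene_counts(run_5g),
                      read_star_gene_counts(run_12g))
print(combined)
```

Keeping the per-run tables around as well makes it possible to check the two runs against each other before relying on the summed counts.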

ADD REPLY • link 8.2 years ago by Megan ▴ 50

0

I meant as late as possible. By that measure, the first process is better than the second. Although I would try to estimate expression levels first and then merge, because the statistics are different.
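One simple way to look at each run separately before merging is to put both on a depth-normalized scale. The sketch below uses counts per million (CPM) as a stand-in for "expression levels"; that choice is an assumption made for illustration, not necessarily the estimate meant here, and the gene IDs and counts are invented.

```python
def counts_per_million(counts):
    """Scale raw counts so that each run sums to one million."""
    total = sum(counts.values())
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Made-up raw counts for the same sample from the two runs.
run_5g = {"GENE_A": 10, "GENE_B": 0, "GENE_C": 30}
run_12g = {"GENE_A": 25, "GENE_B": 5, "GENE_C": 70}

cpm_5g = counts_per_million(run_5g)
cpm_12g = counts_per_million(run_12g)

# Side-by-side view: large disagreements may hint at a run-specific effect.
for gene in run_5g:
    print(gene, round(cpm_5g[gene]), round(cpm_12g[gene]))
```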

ADD REPLY • link 8.2 years ago by Petr Ponomarenko ★ 2.8k