How to align the same samples that were sequenced in multiple flow cells and lanes
7.5 years ago
Matina ▴ 250

Hi,

I have a set of FASTQ files that I want to align to the reference genome. Each sample was sequenced on 2 different runs (flow cells) and 2 different lanes, so I have 4 files per sample. I am not sure when I should merge my files: before or after alignment? I read previous posts that suggest merging after alignment, but I am not sure what is best in my case. Could I merge the files using samtools? Or do I simply cat one onto the end of the other?

An example for sample1 is shown below (FC = flow cell, L = Lane)

sample1.FC1.L1

sample1.FC1.L2

sample1.FC2.L1

sample1.FC2.L2

Thanks a lot in advance!

RNA-Seq alignment • 4.6k views
7.5 years ago

When aligning, you should specify the lane information in the read group (RG); see, e.g.: How to choose the right RG,SM and LB values for alignment
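As a rough illustration of what a per-lane read group could look like, here is a minimal Python sketch that derives an @RG header line from the file naming used in the question (sample1.FC1.L1). ID, SM, LB, PU and PL are standard SAM header tags; the function name, the "lib1" library label and the ILLUMINA platform value are assumptions made for the example. How the resulting line is passed to the aligner depends on the tool, so check its documentation.

```python
import re


def read_group_from_name(name, platform="ILLUMINA"):
    """Build a SAM @RG header line from a name like 'sample1.FC1.L1'.

    Hypothetical helper: the naming pattern follows the question, the
    library label and platform are assumptions.
    """
    match = re.fullmatch(
        r"(?P<sample>[^.]+)\.(?P<flowcell>FC\d+)\.(?P<lane>L\d+)", name
    )
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    sample = match["sample"]
    flowcell = match["flowcell"]
    lane = match["lane"]
    fields = {
        "ID": f"{sample}.{flowcell}.{lane}",  # unique per flow cell and lane
        "SM": sample,                         # identical for all files of one sample
        "LB": f"{sample}.lib1",               # assumption: one library per sample
        "PU": f"{flowcell}.{lane}",           # platform unit: flow cell plus lane
        "PL": platform,                       # assumption: Illumina sequencing
    }
    return "@RG\t" + "\t".join(f"{key}:{value}" for key, value in fields.items())


for name in ("sample1.FC1.L1", "sample1.FC1.L2", "sample1.FC2.L1", "sample1.FC2.L2"):
    print(read_group_from_name(name))
```

All four files share the same SM value, so downstream tools can tell they belong to one sample, while ID and PU keep the flow cell and lane distinguishable.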

You can align and sort each pair of FASTQ files and merge them later; see, e.g.: Merging Bam Files
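To make the "merge later" step concrete, here is a small Python sketch that assembles a samtools merge command for the four per-lane alignments of one sample. The ".sorted.bam" and ".merged.bam" file names are hypothetical; the sketch only builds and prints the command, assuming each input has already been aligned and coordinate-sorted. This is the step to use instead of cat, which is not a valid way to combine sorted BAM files.

```python
import shlex


def merge_command(sample, flowcells=("FC1", "FC2"), lanes=("L1", "L2")):
    """Return the argument list for merging one sample's per-lane BAM files.

    Hypothetical helper: file names are assumed to follow
    '<sample>.<flowcell>.<lane>.sorted.bam'.
    """
    inputs = [
        f"{sample}.{flowcell}.{lane}.sorted.bam"
        for flowcell in flowcells
        for lane in lanes
    ]
    # samtools merge takes the output file first, then the input files.
    return ["samtools", "merge", f"{sample}.merged.bam", *inputs]


cmd = merge_command("sample1")
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) once samtools is installed and the input files exist.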


Thanks a lot for the reply. I was wondering why it is important to specify RG.


Read groups may be used to indicate which libraries are technical replicates of one another. That will help the variant caller decide how much variability comes from the instrument itself.

7.5 years ago

In general it is probably best to keep these separate, as they form technical replicates and will help you assess potential biases between runs. You would be able to merge the alignment files later.

If you were to perform a study that works best with maximal data (like a genome assembly), then merging them early on is recommended.


OK, I got it: align first and then merge. I want to do a simple differential expression analysis. Thanks a lot!


Hi! I wanted to follow up on this if you wouldn't mind elaborating a bit more.

When you say " You would be able the merge the alignment files later.", I understand you mean that by just selecting these as replicates for downstream analyses you are essentially merging them and so there's no need to literally merge the files together, right?

I'm mostly curious as to whether this is generally accepted, and whether I can report data generated this way just as if I had concatenated the files from the beginning. I have already done DEG and pathway analyses while keeping the "flowcell1" and "flowcell2" samples as technical replicates. In practice I understand that I won't obtain very different results from concatenating them at the beginning, but I still wonder whether such processing is assumed (and so expected) in published data.

Thanks in advance for your help!


When we talk about merging, what we all mean is that one could literally merge the files if that becomes necessary, for example if an analysis tool can only take a single BAM file.

Combining information from multiple alignment files is also technically a form of merging: not of the BAM files themselves, but of some subset of the information in them.
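One way to picture this second kind of merging is summing per-gene read counts across the technical replicates of a sample, without ever touching the BAM files. The sketch below is only an illustration under that assumption; the function name, gene names and numbers are hypothetical, and whether summing is appropriate depends on the downstream tool.

```python
from collections import Counter


def combine_counts(per_run_counts):
    """Sum per-gene counts over several runs of the same sample.

    Hypothetical helper: each element of per_run_counts maps a gene
    identifier to the read count obtained from one flow cell / lane.
    """
    total = Counter()
    for counts in per_run_counts:
        total.update(counts)  # Counter.update adds counts, it does not overwrite
    return dict(total)


# Made-up counts for two runs of one sample.
run1 = {"geneA": 10, "geneB": 0}
run2 = {"geneA": 5, "geneB": 2}
print(combine_counts([run1, run2]))
```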
