How to properly merge multiple runs into one bam file ?
2
2
Entering edit mode
8.9 years ago
Rob ▴ 150

Hi all,

I have 4 bam files which correspond to 4 runs of one library. I want to merge those 4 files in order to do some variant calling analysis with only one file per sample.

I tried some basic method, with samtools merge, and MergeSamFiles from picard tools, but I am now not sure what they commands do exactly, because at the end I don't find the same number of SNP with one way or the other.

So my question is: What is the method for merge multiple related bam files from the same sample in one? And what is the difference between merge from samtools and MergeSamFiles from picard tools?

I know this is a really basic question and I apologize for that, but I never did this kind of analysis before.

SNP genome samtools • 12k views
ADD COMMENT
3
Entering edit mode
8.9 years ago
abascalfederico ★ 1.2k

I am wondering whether the differences you get between using MergeSamFiles and samtools merge may be related to improperly handling read group (RG) information. Each of your BAM files should have a specific RG assigned. This information is taken into account during different steps of variant-calling pipelines: duplicates filtering, variant calling... I would inspect the merged BAM file to see whether RG tags have been properly assigned.

ADD COMMENT
0
Entering edit mode

Well, my bam files doesn't have @RG in the header, only the reference sequence dictionary (@SQ). Should I add an RG column by myself ? How is it formatted ?

ADD REPLY
0
Entering edit mode

Picard has a command to add RG tags (AddOrReplaceReadGroups). Run it for each BAM, assigning different RG tags, then merge the BAM files. I don't remember the details but looking for RG tags in biostars or google you will find the necessary details.

ADD REPLY
1
Entering edit mode

Hi, I finally added RG tags and it seems to work. (It didn't explain exactly why I had different results with Picard MergeSamFiles and samtools merge btw..)

Anyway, thanks for your help !

ADD REPLY
0
Entering edit mode

Good!

I know RG can make a difference if you are working with GATK (models are built separately for each RG - in my peripheral understanding of the thing). However, I don't know whether samtools/mpileup do use this info or not. If you ever find this, leave a note here ;-)

ADD REPLY
0
Entering edit mode

If it's there then it's used, if it's not then it's not used. The merging of files with read groups has historically worked better with picard.

ADD REPLY
0
Entering edit mode
8.9 years ago

Samtools and picard (MergeSamFiles) should produce essentially the same merged BAM file (there might be minor differences in some of the ordering, but that should be irrelevant). Are you sure you used the same setting when calling variants on the two versions of the merged file? This sounds more like an issue with the variant caller or its usage.

ADD COMMENT
0
Entering edit mode

I use the same pipeline for both files, so I suppose it wasn't the problem.

ADD REPLY
0
Entering edit mode

That's disturbing. Have you loaded both files with the VCF in IGV and looked at the discordant sites? Perhaps that will show what the problem is.

ADD REPLY
0
Entering edit mode

Well, some SNP appear or disappear depends on the file, others are the same.. It's difficult to say why just by looking the vcf files, even in IGV..

ADD REPLY
0
Entering edit mode

Does your variant-calling pipeline include duplicate marking/removal? And did you add RG tags before using SAMtools merge, or only for Picard MergeSamFiles? If only the latter, then (depending upon your duplicate filtering software) duplicate reads from different read groups might not be identified as such. Duplicate filtering would produce different data sets downstream of the merge function, which would impact variant calling.

ADD REPLY
0
Entering edit mode

Hi, my pipeline only contain variant-calling (with mpileup) and some filter (like vcfutils.pl varFilter). There is no RG tags in my bam files (even with picard MergeSamFiles). Should I add it ? The samtools doc for mpileup say "if no @RG lines are present, each alignment file is taken as one sample" that is suppose to be okay in my case since I merge my bam file before I use mpileup no? (Maybe it's also a bad idea to not remove the duplicate reads ?)

ADD REPLY
0
Entering edit mode

Use SAMtools tview to compare the aligned reads of the two different merged BAMs, at a few of the loci with VCF discrepancies. Either they are the same (and the problem is downstream) or not.

ADD REPLY

Login before adding your answer.

Traffic: 2506 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6