Hi all,
I have 4 BAM files which correspond to 4 runs of one library. I want to merge those 4 files so that I can do variant calling with only one file per sample.
I tried some basic methods, samtools merge and MergeSamFiles from Picard tools, but I am now not sure what these commands do exactly, because in the end I don't find the same number of SNPs with one as with the other.
So my question is: what is the right method for merging multiple related BAM files from the same sample into one? And what is the difference between samtools merge and MergeSamFiles from Picard tools?
I know this is a really basic question and I apologize for that, but I have never done this kind of analysis before.
Well, my BAM files don't have @RG lines in the header, only the reference sequence dictionary (@SQ). Should I add the RG information myself? How is it formatted?
Picard has a command to add RG tags (AddOrReplaceReadGroups). Run it on each BAM, assigning a different RG tag to each, then merge the BAM files. I don't remember the details, but if you search for RG tags on Biostars or Google you will find what you need.
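To make the steps above concrete, here is a minimal sketch that only builds and prints the Picard command lines (it does not run them). The file names, the sample and library names, and the location of picard.jar are all assumptions for illustration; older Picard releases shipped one jar per tool, so adjust the invocation to your install.

```python
# Sketch: build Picard commands to tag each run with its own read group,
# then merge the tagged files. Nothing is executed; the commands are printed.
# All file names, the sample/library names and the picard.jar path are hypothetical.

PICARD = ["java", "-jar", "picard.jar"]  # assumed location of the Picard jar


def add_rg_command(in_bam, out_bam, rg_id, sample, library="lib1", platform="ILLUMINA"):
    """Return an AddOrReplaceReadGroups command for a single BAM file."""
    return PICARD + [
        "AddOrReplaceReadGroups",
        f"I={in_bam}",
        f"O={out_bam}",
        f"RGID={rg_id}",      # read group ID, different for each run
        f"RGSM={sample}",     # sample name, identical for all runs of this sample
        f"RGLB={library}",    # library name, identical here (one library)
        f"RGPL={platform}",   # sequencing platform
        f"RGPU={rg_id}",      # platform unit; the run ID is reused as a placeholder
    ]


def merge_command(in_bams, out_bam):
    """Return a MergeSamFiles command that combines several BAM files."""
    return PICARD + ["MergeSamFiles"] + [f"I={b}" for b in in_bams] + [f"O={out_bam}"]


runs = ["run1.bam", "run2.bam", "run3.bam", "run4.bam"]
tagged = [b.replace(".bam", ".rg.bam") for b in runs]

commands = [
    add_rg_command(bam, out, rg_id=f"run{i}", sample="sample1")
    for i, (bam, out) in enumerate(zip(runs, tagged), start=1)
]
commands.append(merge_command(tagged, "sample1.merged.bam"))

for cmd in commands:
    print(" ".join(cmd))
```

Keeping RGSM the same across the four runs is what tells downstream tools that the reads belong to one sample, while the distinct RGID values keep the runs distinguishable.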
Hi, I finally added RG tags and it seems to work. (That still doesn't explain exactly why I got different results with Picard MergeSamFiles and samtools merge, by the way.)
Anyway, thanks for your help!
Good!
I know RG can make a difference if you are working with GATK (as far as I understand it, which is not very far, models are built separately for each RG). However, I don't know whether samtools/mpileup uses this information or not. If you ever find out, leave a note here ;-)
If the RG information is there, then it's used; if it's not, then it's not used. Merging files with read groups has historically worked better with Picard.
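A quick way to see whether the RG information is "there" is to look at the header text (for example the output of `samtools view -H file.bam`). The sketch below parses such text and lists the @RG lines; the example header is made up for illustration and is not taken from the files discussed in this thread.

```python
# Sketch: list the read groups declared in a SAM/BAM header.
# The header text below is a made-up example; in practice it would come
# from something like `samtools view -H file.bam`.


def read_groups(header_text):
    """Return one dict per @RG header line, mapping tag -> value."""
    groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG\t"):
            continue  # skip @HD, @SQ, @PG and other header lines
        fields = line.split("\t")[1:]  # drop the "@RG" record type
        groups.append(dict(field.split(":", 1) for field in fields))
    return groups


example_header = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:248956422",
    "@RG\tID:run1\tSM:sample1\tLB:lib1\tPL:ILLUMINA\tPU:run1",
    "@RG\tID:run2\tSM:sample1\tLB:lib1\tPL:ILLUMINA\tPU:run2",
])

groups = read_groups(example_header)
if groups:
    for rg in groups:
        print(rg["ID"], rg.get("SM", "(no SM tag)"))
else:
    print("No @RG lines in the header")
```

Each @RG line is tab-separated, with ID as the only mandatory tag and SM, LB, PL and PU as the ones most tools look at. An empty result here corresponds to the situation described earlier in the thread, a header with only @SQ lines.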