Hello,
I'm making genotype calls with GATK. I have bam files with multiple read groups i.e. multiple @RG fields:
@RG ID:2893626757 PL:illumina PU:C2CDLACXX.8-AGGCGAAG-AGGCGAAG LB:H_LS-LL-A6FR-10B-01D-A31U-09-lib1 PI:0 DS:paired end DT:2013-09-25T19:00:00-0500 SM:H_LS-LL-A6FR-10B-01D-A31U-09 CN:WUGSC @RG ID:2893626761 PL:illumina PU:C2CDLACXX.7-AGGCGAAG-AGGCGAAG LB:H_LS-LL-A6FR-10B-01D-A31U-09-lib1 PI:0 DS:paired end DT:2013-09-25T19:00:00-0500 SM:H_LS-LL-A6FR-10B-01D-A31U-09 CN:WUGSC
I need to realign these files with bwa mem, I'm doing so by converting the bam file into an inter-levelled fastq and carrying out the alignment. the -R option in bwa allows me to add one @RG field, I'm however unsure on how to add the other read group fields, is there a way to make these additions at the alignment ste?. I know read groups can be added by picard but is there a way to add multiple read groups at once?
Thank you, Vakul
You can't have multiple RG on the same read (or any tag for that matter), so you must mean you want to mark some reads in your BAM as belonging to group X, and some reads as belonging to group Y.
Basically, I don't know how to do this without writing something custom in python, that writes out a new BAM file adding the RGID for group X or Y depending on some read criteria (which you will have to define, probably parsing the QNAME). All the tools I know of will mark all the reads in a file as being from a particular read group, and that info is maintained during a merge, because this is the simplest way to do this.
Sorry for being unclear, yes as you said i need to mark some reads belonging to group X and other to group Y.
Alternative to writing custom script is possible to split the original bam file by group, convert them to fastq files align them separately add the read group information to the individual bam files and then merge them so that the reads have their original read group information?
Oh, so they already have the right RGIDs, you just want to preserve those IDs after alignment when you suspect they will get some different RGID due to your mapping pipeline? In that case you probably want MergeBamAlignment from Picard. It does exactly that. Merges a new mapped BAM with an old BAM with more metadata in it (as well as a bunch of other things to "normalize" the BAM to the standards.)
But there's still a lot of uncertainty in your post about what you have, what you want, etc. I think if you were to add some more detail to the post (like snippits from the BAMs you have), that would help a lot and might bring up more/better solutions to your issue.
I have a BAM file with two read groups
1.@RG ID:2893626757 PL:illumina PU:C2CDLACXX.8-AGGCGAAG-AGGCGAAG LB:H_LS-LL-A6FR-10B-01D-A31U-09-lib1 PI:0 DS:paired end DT:2013-09-25T19:00:00-0500 SM:H_LS-LL-A6FR-10B-01D-A31U-09 CN:WUGSC
To run GATK on this file I want to realign it with BWA MEM, so to keep read group information about both read groups I was wondering if the right approach will be to split the bam file by reads, convert the resulting split bam files to fastq and map them separately (while providing their appropriate read groups to bwa with -R) and then merge the resultant files so that I have a realigned BAM file with the same read group information as the original BAM file.
You've basically repeated the original post and not provided any new read information, but I understand you now :P
Yes, what you want to do will work perfectly well. You could also just convert the whole existing BAM to FASTQ, map that 1 BAM (don't specify -R because that info will be lost anyway), and then use MergeBamAlignment to make a third BAM with the mapping info from the second and the metadata from the first. Then delete the first and second BAMs.
Thank you so much, your suggestion is much simpler than what I was planning to do. The documentation of MergeBamAlignment alignment suggests that the original bam file needs to be unaligned, is that necessary, will it make a difference if my original bam file is already aligned?
I dont think so, but to be honest im not 100% sure. Perhaps try with a little <1Gb BAM file to test what happens? Sorry this is something I really should know, but I dont. Please reply if you find out :)
I'm trying it out right now, I'll post what I find here.
I tried it out with a smaller bam file I created by extracting reads from chr1. I shuffled the file converted it to fastq and aligned it with bwa and coordinate sorted it with picard to obtain the final bam. When i tried to merge this bam file with the one i started from picard threw an error. I'll be grateful if you can help me make sense of it. I have messagte below
\INFO 2016-04-16 18:59:25 SamAlignmentMerger [test.bam]
WARNING 2016-04-16 18:59:25 SamAlignmentMerger Exception merging bam alignment - attempting to sort aligned reads and try again: Underlying iterator is not queryname sorted: HWI-ST467_114519247:4:2108:16313:13007 1/2 100b aligned read. > HWI-ST467_114519247:4:1202:13489:43853 2/2 100b aligned read.
INFO 2016-04-16 18:59:35 SamAlignmentMerger Read 1000000 records from alignment SAM/BAM.
INFO 2016-04-16 18:59:42 SamAlignmentMerger Read 2000000 records from alignment SAM/BAM
<similar rows="">
< '' >
< '' >
< " >
[Sat Apr 16 19:02:37 EDT 2016] picard.sam.MergeBamAlignment done. Elapsed time: 3.20 minutes. Runtime.totalMemory()=2718433280 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" java.lang.IllegalStateException: Aligned record iterator (HWI-ST434_117786517:6:1101:10000:140664) is behind the unmapped reads (HWI-ST729_114206781:7:2103:17869:148598) at picard.sam.AbstractAlignmentMerger.mergeAlignment(AbstractAlignmentMerger.java:428) at picard.sam.SamAlignmentMerger.mergeAlignment(SamAlignmentMerger.java:143) at picard.sam.MergeBamAlignment.doWork(MergeBamAlignment.java:248) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Oh no - so close! At the end you coordinate sorted it but I think it needs to be queryname sorted :)
I tried that as well, still gave me the same error. I think it might be because the first bam file is aligned to GRCh37 while my alignments are fine to hg19, which I need to use as the reference genuine for my analysis.
Ah, could be - or maybe one is sorted with samtools and the other with picard? They have different methods of sorting qname :(
I sort both files with samtools and give it a short. Though the more I think about it the whole split by read group, align and merge approach seems far less complicated, though its probably going to eat a whole bunch of space on the cluster with all the temporary files :(
I'm wondering where you landed on this @vakul.mohanty. I'm currently facing a similar challenge.
Thanks, Ben
I stuck to the simple split, align and merge approach. it takes a little longer and you have to run multiple instances of the alignment but it works well enough on the cluster.