Hi there,
OK, so I have some Illumina whole-genome BAMs. They were aligned by Illumina using CASAVA, but we wanted to re-run the alignments ourselves using bwa, so I used bam2fastq to extract the paired-end sequences. These were then successfully aligned with bwa mem to produce a SAM file, which was converted to a new sorted and indexed BAM file.
So far so good.
I want to use tools from the GATK and will need to add read group data in order to do so. Each BAM I have represents a single sample from a single library prep, but the samples were run on multiple lanes (typically 3), as indicated by the FASTQ files and by the QNAME field in the SAM file.
For example: HS2000-1259_127:3:1210:15640:52255
With, I believe, 3 being the flowcell lane.
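To make the read name layout concrete, here is a minimal Python sketch that pulls the lane out of a name like the one above. It assumes the older five-field, colon-separated layout (instrument_run:lane:tile:x:y); newer Illumina read names have more fields and keep the lane in a different position. The function name lane_from_qname is just for illustration.

```python
def lane_from_qname(qname: str) -> str:
    """Return the lane field from a five-field, colon-separated read name.

    Assumed layout (not verified against every Illumina pipeline version):
    instrument_run:lane:tile:x:y
    """
    # Drop a leading "@" (FASTQ header) and a trailing "/1" or "/2" mate suffix.
    name = qname.lstrip("@").split("/")[0]
    fields = name.split(":")
    if len(fields) != 5:
        raise ValueError(f"unexpected read name format: {qname!r}")
    return fields[1]


print(lane_from_qname("HS2000-1259_127:3:1210:15640:52255"))
```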
As the sample was the same and it's from the same library prep, can I ignore the fact that the reads were run on different lanes, or is it necessary to tag each read individually with a read group according to its flowcell lane?
If it is necessary to tag each read separately, am I correct in thinking that Picard's AddOrReplaceReadGroups is not capable of doing this (or at least not without splitting the BAM up first, running Picard, then remerging)? And if it isn't, has someone already written something to carry out this task? (I'm sure I could probably whip something up, but there's no point reinventing the wheel!)
Thanks in advance.
The simplest method is to set the read group with
bwa sampe -r '.....'
Hi Pierre,
I think I need to explain a bit more. The files we received from Illumina were already aligned by Illumina (not us) using CASAVA. We want to re-run the alignments using bwa, so we've extracted FASTQ files from the BAMs we received. While I can use the -r method (or the equivalent -R option in bwa mem, which is what I'm using), that still doesn't account for the fact that the FASTQ files contain reads from several lanes (unless I'm missing something).
Doing it this way, would it in fact be necessary to split out the lanes from the FASTQ files (i.e. create 3 FASTQ files, one for each lane), align them separately while adding the appropriate read groups, then merge the resulting SAM files?
But you could split your FASTQ files per lane, couldn't you? See "how to split reads for different flowcell lanes in fastQ files?". Then align each pair of FASTQ files and merge later with Picard's MergeSamFiles.
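As a rough illustration of the per-lane split, here is a small Python sketch (not the method from the linked post). It assumes plain four-line FASTQ records and the five-field read names shown earlier in the thread, with the lane as the second colon-separated field. The function name and the output naming scheme (prefix.laneN.fastq) are made up for the example.

```python
def split_fastq_by_lane(handle, out_prefix):
    """Write each 4-line FASTQ record to a per-lane file.

    Assumes well-formed 4-line records and headers such as
    "@HS2000-1259_127:3:1210:15640:52255/1" (lane = second field).
    Returns a dict mapping lane -> output path.
    """
    outputs = {}  # lane -> open output file handle
    try:
        while True:
            record = [handle.readline() for _ in range(4)]
            # Stop at end of file (or at a trailing blank line).
            if not record[0].strip():
                break
            lane = record[0][1:].split(":")[1]
            if lane not in outputs:
                # Hypothetical naming scheme, chosen only for this sketch.
                outputs[lane] = open(f"{out_prefix}.lane{lane}.fastq", "w")
            outputs[lane].writelines(record)
    finally:
        for out in outputs.values():
            out.close()
    return {lane: out.name for lane, out in outputs.items()}
```

Running it once on each mate file should keep the pairs in sync, provided both files list the reads in the same order, since the record order within each lane is preserved.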
Thanks Pierre, that's what I thought. Assuming that I haven't actually done this though (which I haven't in this case, but I can alter the pipeline for the remaining samples), should I then use another tool (or make something myself) to add the read group info to the SAM file on a per-lane basis?
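In case it helps, this is roughly what I have in mind if I end up writing it myself: a Python sketch that works on text SAM lines, emits one @RG header line per lane, and tags each read with RG:Z according to the lane in its QNAME. The read group ID scheme (sample.lane), the PU value, and the function name are my own assumptions, and the lanes have to be supplied up front so the header can be written before the reads.

```python
def add_lane_read_groups(sam_lines, sample, library, lanes, platform="ILLUMINA"):
    """Yield SAM lines with one @RG header per lane and an RG:Z tag per read.

    Assumes QNAMEs like "HS2000-1259_127:3:1210:15640:52255", where the
    second colon-separated field is the lane. IDs are "<sample>.<lane>".
    """
    known = {str(lane) for lane in lanes}
    header_done = False
    for line in sam_lines:
        line = line.rstrip("\n")
        if line.startswith("@"):
            yield line  # pass existing header lines through unchanged
            continue
        if not header_done:
            # Emit the @RG lines once, just before the first alignment.
            for lane in lanes:
                rg_id = f"{sample}.{lane}"
                yield "\t".join(["@RG", f"ID:{rg_id}", f"SM:{sample}",
                                 f"LB:{library}", f"PL:{platform}",
                                 f"PU:{rg_id}"])  # PU is a placeholder here
            header_done = True
        fields = line.split("\t")
        lane = fields[0].split(":")[1]
        if lane not in known:
            raise ValueError(f"lane {lane!r} has no @RG header")
        # Drop any existing RG tag among the optional fields (column 12 on).
        fields = [f for i, f in enumerate(fields)
                  if i < 11 or not f.startswith("RG:Z:")]
        fields.append(f"RG:Z:{sample}.{lane}")
        yield "\t".join(fields)
```

The input could be the text output of samtools view -h, with the result converted back to BAM afterwards.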