I'm aligning reads to a genome to later do some variant calling. I have mapped some reads to a genome using bwa and created the corresponding dictionary
bwa index -p ref.fa ref.fa
bwa mem ref.fa forward.fastq.gz reverse.fastq.gz -t 16 > mapped_file.sam
samtools view -bS mapped_file.sam -o mapped_file.bam
java -jar picard.jar CreateSequenceDictionary \
R=ref.fa \
O=ref.dict
samtools faidx ref.fa
And then proceeded to sort and remove duplicates
java -jar picard.jar SortSam \
I=mapped_file.bam \
O=sorted_file.bam \
SORT_ORDER=coordinate
java -jar picard.jar MarkDuplicates \
REMOVE_DUPLICATES=true \
ASSUME_SORT_ORDER=coordinate \
I=sorted_file.bam \
O=clean_file.bam \
M=metric_file.txt
And finally tried to realign the sequences using GATK, but the programmed complained that my input bam file is malformed.
java -jar GenomeAnalysisTK.jar \
-T RealignerTargetCreator \
-R ref.fa \
-I clean_file.bam \
-o clean_file.intervals
I ran ValidateSamFile, as GATK recommends, and got this error
## HISTOGRAM java.lang.String
Error Type Count
ERROR:MISSING_READ_GROUP 1
WARNING:RECORD_MISSING_READ_GROUP 324404
To solve this problem I tried running AddOrReplaceReadGroups and FixMateInformation on my clean_file.bam, as the GATK webpage recommends (http://gatkforums.broadinstitute.org/gatk/discussion/7571/errors-in-sam-bam-files-can-be-diagnosed-with-validatesamfile), to no avail. I also ran ValidateSamFile directly in my Bra output and got a similar error message. Is there an explanation for this problem? Has my BWA output been corrupted in any stage?
The RG tag has to be added manually, so it's not really a "corrupt" BAM... you just haven't added RG metadata and GATK needs that. AddOrReplaceReadGroups is the answer. Why didn't it, er, "avail"? :)