Gatk Has Issues With Sorted Bam File
1
3
Entering edit mode
13.5 years ago
Oliver ▴ 240

Hello folks,

I try to establish the following pipeline stated here in this board:

http://biostar.stackexchange.com/questions/1269/what-is-the-best-pipeline-for-human-whole-exome-sequencing

Since time changes and so did the parameter for the tools, I had to do some changes to the commands (see the foot of this posting). But the pipeline has issues with one of the last two steps. When I run the IndelGenotyper:

java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T IndelGenotyperV2 -R ../reference/hg19_chr15.fa -I output/xxx.sorted.realigned.bam -o output/xxx_indel.txt -verbose output/xxx_indel_statistics.txt

it breaks with an error message:

Missorted Input SAM/BAM file SAMFileReader{output/xxx.sorted.realigned.bam}: file sorted in coordinate order but coordinate is required; Read XXX:2:1:10032:2323#0 out of order on the contig

and that I may use ReorderSam from the picard tools to get an appropriate sorting. Confusingly, the following step in the pipeline runs well with this bam file:

java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../reference/hg19_chr15.fa -I output/xxx.resorted.realigned.bam -o output/xxx.vcf.calls -stand_call_conf 30.0 -stand_emit_conf 10.0

Question here: Why does the GATK takes this file in one step but reject the same file in the other?

So I used the picard tool ReorderSam to reorder the BAM file after the 7. step in the pipeline:

java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T IndelRealigner -R ../reference/hg19_chr15.fa -I output/xxx.sorted.bam -targetIntervals output/xxx.intervals -o output/xxx.sorted.realigned.bam
java -jar /opt/biosw/picard-tools-1.45/ReorderSam.jar I=output/xxx.sorted.realigned.bam O=output/xxx.resorted.realigned.bam REFERENCE=../reference/hg19_chr15.fa

Sadly, this changed nothing, I keep getting this error message:

Missorted Input SAM/BAM file SAMFileReader{output/xxx.resorted.realigned.bam}: file sorted in coordinate order but coordinate is required; Read XXX:2:1:10032:2323#0 out of order on the contig

I also tried to resort it with the samtools:

samtools sort output/xxx.sorted.realigned.bam output/xxx.resorted.realigned

Now more confusingly the file runs well with the first command which failed in the previous step (IndelRealignerV2) but fails in the last step (UnifiedGenotyper) with this error message:

SAM/BAM file SAMFileReader{output/xxx.resorted.realigned.bam} is malformed: Read XXX:2:1:6409:5134#0 is either missing the read group or its read group is not defined in the BAM header, both of which are required by the GATK

So I could do two sortings, keeping the old one for the UnifiedGenotyper and sort with the samtools again for the IndeRealignerV2, but I don't think this is an acceptable workaround. So does anyone has an idea what the point of the GATK toolkit is?

As you may have noticed, I took only a part of the whole genome to speed up the pipeline (we want to run it with less data, then we could use the whole set). I can't image that this is the issue since its only about to order some entries. But I'm not sure.

Any help appreciated, Oliver

Here my current pipeline:

 1.  bwa aln -f output/xxx.sai -t 4 ../reference/hg19_chr15.fa fastq/xxx.fastq
 2.  bwa samse -f output/xxx.sam ../reference/hg19_chr15.fa output/xxx.sai fastq/xxx.fastq
 3.  samtools import ../reference/hg19_chr15.fa output/xxx.sam output/xxx.bam
 4.  samtools sort output/xxx.bam output/xxx.sorted
 5.  samtools index output/xxx.sorted.bam output/xxx.sorted.bam.bai
 6.  java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ../reference/hg19_chr15.fa -I output/xxx.sorted.bam -o output/xxx.intervals
 7.  java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T IndelRealigner -R ../reference/hg19_chr15.fa -I output/xxx.sorted.bam -targetIntervals output/xxx.intervals -o output/xxx.sorted.realigned.bam
(8.a samtools sort output/xxx.sorted.realigned.bam output/xxx.resorted.realigned)
(8.b java -jar /opt/biosw/picard-tools-1.45/ReorderSam.jar I=output/xxx.sorted.realigned.bam O=output/xxx.resorted.realigned.bam REFERENCE=../reference/hg19_chr15.fa)
 9.  samtools index output/xxx.resorted.realigned.bam output/xxx.resorted.realigned.bam.bai
10.  java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T IndelGenotyperV2 -R ../reference/hg19_chr15.fa -I output/xxx.resorted.realigned.bam -o output/xxx_indel.txt -verbose output/xxx_indel_statistics.txt
11.  java -jar /opt/biosw/GenomeAnalysisTK-1.0.4705/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../reference/hg19_chr15.fa -I output/xxx.resorted.realigned.bam -o output/xxx.vcf.calls -stand_call_conf 30.0 -stand_emit_conf 10.0
exome next-gen sequencing gatk bam coordinates • 12k views
ADD COMMENT
8
Entering edit mode
13.5 years ago
Russh ★ 1.2k

Had this yesterday don't use 'samtools sort' use picard tools 'SortSam' at that step, with 'SORT_ORDER=coordinate' you have to use 'coordinate' to get the right header. I believe 'samtools sort' is equivalent to 'picard: SortSam ... SORT_ORDER=queryname' and GATK fails with 'queryname' sorted files. Well, samtools better be equivalent to that method, because it puts 'queryname' in the sort method flag of the resulting .sam file

For me, picard tools is a lot neater than samtools in a lot of ways

For example, Depending on your plans in GATK, you might find that you require 'readgroups' in your bam file which aren't currently present. These can be added, and the bam can be sorted, in one step using the picard tools 'AddOrReplaceReadGroups' program

All the best,

Russ, Liverpool

ADD COMMENT
2
Entering edit mode

Sorry, I ploughed into answering your question without reading the final half of it. You definitely need AddOrReplaceReadGroups from picard tools. Use that at an early stage and you should fly through unified genotyper when it comes around

ADD REPLY
1
Entering edit mode

you could use it at step 4, because you can use AddOrReplace... to both add read groups, sort your bam (using SORT_ORDER=coordinate) and index the bam at the same time (CREATE_INDEX=TRUE).

I don't think picard likes unmapped reads to be present in your bam, so you might have to filter them out prior to running this step.

I recently used this java -jar ~/tools/picard-tools-1.45/AddOrReplaceReadGroups.jar INPUT=realigned.sorted.temp.bam OUTPUT=realigned.sorted.temp2.bam SORT_ORDER=coordinate RGLB=1 RGPL=solid RGPU=1 RGSM=ID439_01 VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true

ADD REPLY
0
Entering edit mode

Hello Russ, thank you for your answer. This AddOrReplaceReadGroups thing worked well, at least with a fraction of the reads (the whole set of reads crashs for some reason I have to find out). Can you tell what it exactly does and what you mean with early stage? We put it right in between 7 and 9 in the above pipeline (without any of the 8* steps). According to the description of the picard tools, it replaces all read groups in the INPUT file with a new read group and assigns all reads to this read group in the OUTPUT. I can't really think of what that exactly means.

ADD REPLY

Login before adding your answer.

Traffic: 1089 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6