Hi
I am trying to run a read-pair GATK walker (e.g. CountPairs) on a BAM file. I thought it would be easy, but I'm stuck on issues related to BAM file sorting and indexing.
If I understood correctly, the pipeline should be:
1. sort the BAM file using the Picard SortSam tool with "queryname" as the sort order,
2. index the resulting BAM file with samtools index,
3. run the GATK walker on the resulting BAM file.
But step 2 fails with the following error:
[bam_index_core] the alignment is not sorted (SRR003480.10000060): 23501638 > 19966740 in 22-th chr
[bam_index_build2] fail to index the BAM file.
I also tried the pipeline with the "coordinate" sort order. In that case samtools index works, but GATK complains that it can only process queryname-sorted files.
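In case it is useful for diagnosis: as far as I understand, each BAM file declares its sort order in the SO tag of the @HD header line (unknown, unsorted, queryname or coordinate). Below is a minimal sketch for reading that tag. The helper name sort_order is made up for illustration, and the usage line in the comment assumes samtools is on the PATH.

```shell
# Hypothetical helper: read a SAM header on stdin and print the value of
# the SO: tag found on the @HD line (prints nothing if there is no SO tag).
sort_order() {
  awk -F'\t' '/^@HD/ {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^SO:/) { print substr($i, 4); exit }
  }'
}

# Usage (assumes samtools is installed):
#   samtools view -H NA11992_sorted_coordinate.bam | sort_order
```

This only reports what the header declares. It does not check that the reads are actually in that order.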
Both sets of commands are listed after my signature, if you want to have a look.
Apart from the troubleshooting, I would really appreciate an explanation of what these sort orders mean. Thanks in advance for any help!
BTW, I know there is a similar post here. I tried using AddOrReplaceReadGroups instead of SortSam, but it didn't help.
Regards, Pascal
Using SO=queryname
$ java -Xmx1024m -jar ~/tools/picard/SortSam.jar I=NA11992.chrom22.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam O=NA11992_sorted_queryname.bam SO=queryname
$ samtools index NA11992_sorted_queryname.bam
[bam_index_core] the alignment is not sorted (SRR003480.10000060): 23501638 > 19966740 in 22-th chr
[bam_index_build2] fail to index the BAM file.
Using SO=coordinate
$ java -Xmx1024m -jar ~/tools/picard/SortSam.jar I=NA11992.chrom22.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam O=NA11992_sorted_coordinate.bam SO=coordinate
$ samtools index NA11992_sorted_coordinate.bam
$ java -Xmx2g -jar ../../gatk/GenomeAnalysisTK-1.2-60-g585a45b/GenomeAnalysisTK.jar -R ../references/human_g1k_v37.fasta -T CountPairs -o output.txt -I NA11992_sorted_coordinate.bam
[...]
##### ERROR MESSAGE: Missorted Input SAM/BAM files: files are not sorted in queryname order; Read pair walkers can only walk over query name-sorted data. Please resort your input BAM file.
One thing that is not clear to me from the GATK FAQ: it states that the BAM file "must be sorted in coordinate order (not by queryname and not unsorted)". But when I run GATK against such a file, it complains: "Missorted Input SAM/BAM files: files are not sorted in queryname order". That is not consistent, is it?