I ran several bam files through a pipeline with CleanSam, SortSam, and MarkDuplicates without a problem.
However, one of the input files gave me the following error with CleanSam:
ERROR: Record 2106053, Read name A00187:414:HMYCYDSXY:3:1426:13367:11083, Alignment start (21157039) must be <= reference sequence length (21154825) on reference 7
Because all of the bam files were generated from libraries from the same dataset using the same pipeline and aligned/mapped to the same reference genome, I'm having difficulty knowing where to begin troubleshooting this error. The Picard script that I used is:
"java -Xmx" . $mem . "g -Djava.io.tmpdir=`pwd`/tmp -jar " . $picard . "CleanSam.jar INPUT=" . $BFile[$i] . ".bam OUTPUT=" . $BFile[$i] . "clean.bam";
Where $BFile holds the prefixes from a glob list of the input bam file names (S1.bam ... S8.bam).
Any suggestions on where to start? Since I'm using the same reference genome for this as for the alignment I don't understand how it's possible to get coordinates outside the range of the reference genome length.
try to use
VALIDATION_STRINGENCY=LENIENT
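As a rough sketch of where the option goes, here is the CleanSam command from the question rebuilt as an argument list with VALIDATION_STRINGENCY=LENIENT appended. It is written in Python rather than the Perl of the original script, and the function name, the parameter names and the /opt/picard path are made up for illustration; only the Picard arguments themselves come from this thread.

```python
import os


def clean_sam_command(prefix, picard_dir, mem_gb, stringency="LENIENT"):
    """Build the legacy-style Picard CleanSam command as an argument list.

    prefix, picard_dir and mem_gb are hypothetical stand-ins for the
    $BFile[$i], $picard and $mem variables in the original Perl script.
    """
    tmp_dir = os.path.join(os.getcwd(), "tmp")
    return [
        "java",
        f"-Xmx{mem_gb}g",
        f"-Djava.io.tmpdir={tmp_dir}",
        "-jar",
        os.path.join(picard_dir, "CleanSam.jar"),
        f"INPUT={prefix}.bam",
        f"OUTPUT={prefix}clean.bam",
        # The suggested option: report validation problems but keep going.
        f"VALIDATION_STRINGENCY={stringency}",
    ]


# Example with made-up values; join with spaces to get a shell command line.
print(" ".join(clean_sam_command("S1", "/opt/picard", 4)))
```

Keeping the arguments in a list makes it harder to introduce a stray space such as "OUTPUT= file", which Picard would read as two separate arguments.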
Could you please explain why I'm getting this error message to begin with?
Additionally, I assume that I will have to use this for every stage of the pipeline, i.e. CleanSam, SortSam, MarkDuplicates, etc.?
For example, if you have a read mapped at the end of chr1 and that read contains some clipped bases, then its unclipped 3' end will be greater than the length of chromosome 1.
The best thing is to look at the read
A00187:414:HMYCYDSXY:3:1426:13367:11083
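To find every such record rather than just the first one Picard reports, something like the following could help. It is only a sketch: it assumes the BAM has been converted to SAM text with the header included (for instance with samtools view -h), it checks only the condition in the error message (POS greater than the LN of the reference in the @SQ header), and the function name is made up. Treating "7" as the reference name in the demo is also an assumption; Picard may be reporting an index.

```python
def find_out_of_range(sam_lines):
    """Yield (read_name, ref_name, pos, ref_len) for each record whose
    alignment start (POS) is greater than the reference length (LN)
    declared in the @SQ header lines."""
    ref_len = {}
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Header section: collect reference names and lengths from @SQ.
            if line.startswith("@SQ"):
                tags = dict(f.split(":", 1) for f in line.split("\t")[1:])
                ref_len[tags["SN"]] = int(tags["LN"])
            continue
        fields = line.split("\t")
        name, rname, pos = fields[0], fields[2], int(fields[3])
        # Unmapped reads have RNAME "*" and are skipped.
        if rname in ref_len and pos > ref_len[rname]:
            yield name, rname, pos, ref_len[rname]


# Small demo built from the numbers in the error message above.
demo = [
    "@SQ\tSN:7\tLN:21154825",
    "A00187:414:HMYCYDSXY:3:1426:13367:11083\t0\t7\t21157039\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
for hit in find_out_of_range(demo):
    print(*hit, sep="\t")
```

On a real file, the same function could be fed sys.stdin with the output of samtools view -h piped in. If many reads are reported on the same reference, that points to a header or reference mismatch rather than one odd read.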
...yes, but...
use
samtools sort
instead of SortSam
use
sambamba markdup
instead of MarkDuplicates
Running the bam files through the pipeline with VALIDATION_STRINGENCY=LENIENT does generate the desired cleaned/sorted files. However, when I attempt to run the resulting bam file through GATK, I get the error:
If as you suggest the reads were "overhanging" the reference sequence, would I expect to see this error due to mismatches between the mapped read coordinates and the reference sequence, and if so, are there any arguments I can pass to GATK to correct this error?
Note: this error is not due to using inconsistent reference genomes. I used the same Drosophila melanogaster reference genome for this alignment as for all other libraries. The only difference is that I had to use the lenient validation stringency during the sort/clean/markduplicates phase of the pipeline for this particular library.
The reference used to map the reads is not the same as the one you're using for GATK.
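One way to check this directly is to compare the contig names and lengths in the BAM header (the @SQ lines, e.g. from samtools view -H) with the .fai index of the FASTA given to GATK (created by samtools faidx). A minimal sketch follows; the function names and the contig names and lengths in the demo are made up for illustration.

```python
def parse_sq(header_lines):
    """Map contig name -> length from the @SQ lines of a SAM/BAM header."""
    contigs = {}
    for line in header_lines:
        if line.startswith("@SQ"):
            tags = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
            contigs[tags["SN"]] = int(tags["LN"])
    return contigs


def parse_fai(fai_lines):
    """Map contig name -> length from a .fai index (columns: name, length, ...)."""
    contigs = {}
    for line in fai_lines:
        if line.strip():
            name, length = line.rstrip("\n").split("\t")[:2]
            contigs[name] = int(length)
    return contigs


def compare_contigs(bam_contigs, ref_contigs):
    """Return a list of human-readable differences between the two sets."""
    problems = []
    for name, length in bam_contigs.items():
        if name not in ref_contigs:
            problems.append(f"{name}: in BAM header but not in reference index")
        elif ref_contigs[name] != length:
            problems.append(
                f"{name}: BAM header length {length} != reference length {ref_contigs[name]}"
            )
    for name in ref_contigs:
        if name not in bam_contigs:
            problems.append(f"{name}: in reference index but not in BAM header")
    return problems


# Demo with made-up contigs: one length differs between header and index.
header = ["@HD\tVN:1.6", "@SQ\tSN:chrA\tLN:1000", "@SQ\tSN:chrB\tLN:500"]
fai = ["chrA\t1000\t6\t60\t61", "chrB\t480\t1030\t60\t61"]
for problem in compare_contigs(parse_sq(header), parse_fai(fai)):
    print(problem)
```

An empty result for the problematic library would rule out a header/reference mismatch and point back at the individual reads instead.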
I'm using the same Drosophila reference genome for mapping as for GATK, which is why I'm not sure why I'm getting this error message from the latter.
Could you please clarify what you're stating as the phrasing is ambiguous: do you mean to suggest that I'm getting this error message because I'm failing to use the same reference genome for both mapping and GATK, or because I'm using the same reference genome but shouldn't be? Presumably you mean the first, but as I said, I'm using the same reference genome throughout the pipeline.
That is why I think that this error is somehow a consequence of using VALIDATION_STRINGENCY=LENIENT in the sort/clean steps, as that is the only thing that has changed in the pipeline.