Question

How To Filter Out Reads With A Bad Cigar M Operator

1

Entering edit mode

11.4 years ago

munch ▴ 310

My aim was to run freebayes on my .bam files generated with bwa, but I got the error:

Error: cannot construct subsequence with negative offset or length < 1(attempting start = 66 and length = -8)

So I think there is something wrong with my .bam file. The output of ValidateSamFile.jar from picard is like this:

ERROR: Read groups is empty
ERROR: Record 42921, Read name SRR033861.69041, CIGAR M operator maps off end of reference
ERROR: Record 42922, Read name SRR033861.809038, CIGAR M operator maps off end of reference
ERROR: Record 42923, Read name SRR033861.607221, CIGAR M operator maps off end of reference
ERROR: Record 42949, Read name SRR033861.481003, CIGAR M operator maps off end of reference
...

I have tried to filter this reads out with samtools view -f 0x200 myfile.bam to filter out reads not passing quality controls, but there is no read with this flag.

Is there any tool to filter out this violation of the .bam file? Or you know the reason for the Error message?

Thank you for your efforts in advance

Phil

samtools cigar bam picard • 5.0k views

ADD COMMENT • link updated 11.4 years ago by KCC ★ 4.1k • written 11.4 years ago by munch ▴ 310

0

Entering edit mode

Can you please share. how did you fix this ?

ADD REPLY • link 10.1 years ago by Chirag Nepal ★ 2.4k

score 1 · Answer 1 · 2013-07-10

Looks like the read is on the end of the chromosome, so start position plus read length is longer than the reference. Perhaps you can just write a script that looks to see what the start position of the read is and drops it if it's longer than the length of the chromosome. To be fully correct, I could parse the CIGAR expression for the exact length of your read in the reference, but I'm going to ignore that for now.

    #syntax: python thisprogram.py infile.sam <read length> outfile.sam

    import sys

    #chromosome ends
    chromosomes = {'chr1': 2000000, 'chr2': 5000000}

    input = open(sys.argv[1])

    readlength = int(sys.argv[2])

    output = open(sys.argv[3],'w')

    for line in input:
            if (not line.startswith('@')):
                    data = line.strip().split("\t")

                    chr   = data[2]
                    start = int(data[3])

                    end = start + readlength

                    if(end >= chromosomes[chr]):
                            continue

                    output.write(line)
            else:
                    output.write(line)

This code is untested. It should work however. You could add the code to parse the CIGAR string with a little more work if that's needed.