Question

CIGAR and query sequence are of different length when trying to convert from sam to bam?

0

Entering edit mode

8.6 years ago

joreamayarom ▴ 140

I am trying to convert a sam file to bam using the line of code described here, but samtools is throwing the following error at me:

samtools view -bT reference.fasta sequences.sam > sequences.bam

[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 102
[main_samview] truncated file.

The offending line looks like this:

SRR808297.2571281       99      gi|309056|gb|L20934.1|MSQMTCG   747     80      101M    =       790     142     TTGGTATAAAATTTAATAATCCCTTATTAATTAATAAACTTCGGCTTCCTATTCGTTCATAAGAAATATTAGCTAAACAAAATAAACCAGAAGAACAT      @@CFDDFD?HFDHIGEGGIEEJIIJJIIJIGIDGIGDCHJJCHIGIJIJIIJJGIGHIGICHIICGAHDGEGGGGACGHHGEEEFDC@=?CACC>CCC      NM:i:2  MD:Z:98A1A

I have seen this error reported elsewhere, it seems to arise from a bug in samtools and solutions don't seem to be universal. In my case, I am not sure which is my source of error, but I have a likely culprit: This sequence seems to be two bases shorter than the rest of the sequences with the same CIGAR. Could this be the error? Is that is so, how can I solve it? Bear in mind that my file contains paired end reads. Furthermore, I have several more files producing the same error and have to automatize a solution in a fast an efficient script.

samtools sam bam • 11k views

ADD COMMENT • link 8.6 years ago by joreamayarom ▴ 140

1

Entering edit mode

the tool/software that produced your sam file "sequences.sam" is bugged.

ADD REPLY • link 8.6 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

The read says '101M' which is cigar-slang for 101bp can be mapped to the genome. The read's DNA however is only 98bp long. This means the read is telling samtools two different things, so it raised an error. I feel bad that you look at all the other reads to figure out why this one was erroneous :(

ADD REPLY • link 8.6 years ago by John 13k

score 1 · Answer 1 · 2016-09-24

1

Entering edit mode

8.6 years ago

harold.smith.tarheel ★ 5.0k

Picard (available here) command ValidateSamFile can be used to identify problematic SAM files. A discussion of possible errors and repairs can be found here.

ADD COMMENT • link 8.6 years ago by harold.smith.tarheel ★ 5.0k

1

Entering edit mode

ValidateSamFile detects the erros, but there is little info in your link on how to solve this particular issue. John is right, the Cigar string is of different length than some of my reads. My guess now is that I will have to parse the file using Perl or Python and eliminate the lines in which the Cigar doesn't match the length of the sequence...

ADD REPLY • link 8.6 years ago by joreamayarom ▴ 140

2

Entering edit mode

You could certainly go down that path, but i'd recommend you find the culprit that caused the error in the first place, because the others might have the CIGAR/read length correct by chance, but be wrong for other reasons.

FYI this is why bioinformatics sometimes sucks - but dont worry when you've figured it out you'll be better for it :)

ADD REPLY • link 8.6 years ago by John 13k

1

Entering edit mode

Fixing the problem depends upon the types of errors observed and why they occurred. For example, if the only reported issue is short CIGAR strings due to alignment off the end of the reference, then you can use Picard's CleanSam to fix. If there is a variety of error types but the number of affected reads is small, you may decide to remove those reads (ValidateSamFile MODE=VERBOSE will return the error-containing read IDs, which you can use to filter).

Personally, I would try to figure out why the SAM file is corrupted and/or repeat the alignment (if you inherited/downloaded the SAM file, convert to FASTQ using Picard SamToFastq then map with your favorite aligner).

ADD REPLY • link 8.6 years ago by harold.smith.tarheel ★ 5.0k