I am trying to convert a sam file to bam using the line of code described here, but samtools is throwing the following error at me:
samtools view -bT reference.fasta sequences.sam > sequences.bam
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 102
[main_samview] truncated file.
The offending line looks like this:
SRR808297.2571281 99 gi|309056|gb|L20934.1|MSQMTCG 747 80 101M = 790 142 TTGGTATAAAATTTAATAATCCCTTATTAATTAATAAACTTCGGCTTCCTATTCGTTCATAAGAAATATTAGCTAAACAAAATAAACCAGAAGAACAT @@CFDDFD?HFDHIGEGGIEEJIIJJIIJIGIDGIGDCHJJCHIGIJIJIIJJGIGHIGICHIICGAHDGEGGGGACGHHGEEEFDC@=?CACC>CCC NM:i:2 MD:Z:98A1A
I have seen this error reported elsewhere, it seems to arise from a bug in samtools and solutions don't seem to be universal. In my case, I am not sure which is my source of error, but I have a likely culprit: This sequence seems to be two bases shorter than the rest of the sequences with the same CIGAR. Could this be the error? Is that is so, how can I solve it? Bear in mind that my file contains paired end reads. Furthermore, I have several more files producing the same error and have to automatize a solution in a fast an efficient script.
the tool/software that produced your sam file "sequences.sam" is bugged.
The read says '101M' which is cigar-slang for 101bp can be mapped to the genome. The read's DNA however is only 98bp long. This means the read is telling samtools two different things, so it raised an error. I feel bad that you look at all the other reads to figure out why this one was erroneous :(