It's a representation of the positional alignment pattern of one sequence against another: the match/mismatch [M], deletion [D] and insertion [I] patterns of an alignment, as well as upstream and downstream padding, and other possibilities. See a table here:
[?]
A few simple examples:
[?]
Reference  AAAAAAAAAA
Query      AAAAAAAAAA
CIGAR      MMMMMMMMMM or 10M

Reference  AAAAAAAAAAT
Query      AAAAA-AAAAT
CIGAR      MMMMMDMMMMM or 5M1D5M

Reference  AAAAA-AAAAT
Query      AAAAAAAAAAT
CIGAR      MMMMMIMMMMM or 5M1I5M
[?]
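To make the two notations above concrete, here is a minimal Python sketch that derives the per-column operation string from a gapped reference/query pair and then run-length encodes it. The function names are made up for illustration, and it only handles the M, D and I cases shown in the examples (no clipping, padding or skipped regions).

```python
from itertools import groupby


def ops_from_alignment(reference: str, query: str) -> str:
    """Return one CIGAR operation per alignment column.

    Assumes both strings are the same length and use '-' for gaps.
    """
    if len(reference) != len(query):
        raise ValueError("aligned sequences must have equal length")
    ops = []
    for ref_base, query_base in zip(reference, query):
        if ref_base == "-":
            ops.append("I")  # base present in the query only: insertion
        elif query_base == "-":
            ops.append("D")  # base present in the reference only: deletion
        else:
            ops.append("M")  # aligned column: match or mismatch
    return "".join(ops)


def compress_cigar(expanded: str) -> str:
    """Run-length encode an expanded string such as 'MMMMMDMMMMM'."""
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(expanded))


print(compress_cigar(ops_from_alignment("AAAAAAAAAAT", "AAAAA-AAAAT")))
```

Run on the second example above, this prints 5M1D5M.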
Bit 0x4 is the only reliable place to tell whether the fragment is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10 and 0x100 and the bit 0x20 of the next fragment in the template.
In other words, don't even look at the CIGAR unless bit 0x4 is unset. If I read your alignments above correctly (a little tough with a mixture of white space and tabs), both of the reads you report are unmapped, so the CIGAR strings are irrelevant.
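As a rough illustration of that rule, the following Python sketch reads the FLAG column of a tab-separated SAM line and only returns the CIGAR when bit 0x4 is clear. The function name and the example lines are invented for this sketch; the column positions (FLAG second, CIGAR sixth) follow the standard SAM layout.

```python
UNMAPPED = 0x4  # FLAG bit that marks the read as unmapped


def cigar_if_mapped(sam_line: str):
    """Return the CIGAR field, or None when the unmapped bit is set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: FLAG
    if flag & UNMAPPED:
        return None        # CIGAR (and POS, MAPQ, ...) cannot be trusted
    return fields[5]       # column 6: CIGAR


# Made-up records: identical except for the FLAG value.
mapped = "r1\t16\tchr1\t100\t37\t18M\t*\t0\t0\tACCGCGCAACAGACATCA\tggggggggggfgggeggg"
unmapped = "r1\t20\tchr1\t100\t37\t18M\t*\t0\t0\tACCGCGCAACAGACATCA\tggggggggggfgggeggg"
print(cigar_if_mapped(mapped), cigar_if_mapped(unmapped))
```

The second record carries a position, a MAPQ and a CIGAR, but because 0x4 is set (20 = 16 + 4) the sketch ignores them.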
Spec, schmec. BWA apparently reports mapq for unmapped reads (not sure about cigar) and this breaks Picard, which apparently does make assumptions for MAPQ.
written 13.5 years ago by Ketil, updated 5.3 years ago by Ram
Not for all mapped reads. bwa concatenates the reference multi-FASTA before aligning, and in the case of a read that hangs off the edge of one reference contig onto another, it would be a little misleading for the aligner to say that it has no idea where that read goes. So it gives a map position and a mapq, and sets the unmapped flag too, so you can see clearly that something isn't quite right. It's a feature, not a bug (meant mostly seriously).
That is bad output. e.g.: what does "0GBGTGM=GGMGMRYMD==" in the sequence column mean?
Plus, that CIGAR string is nonsensical.
I would check that read in your sequence files and make sure it looks normal.
Then try rerunning. If you can reproduce this with just a few reads, including that one, I would contact the tophat developers.
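One way to spot records like this automatically is to test the SEQ and CIGAR columns against the patterns the SAM specification gives for them, and to check that the CIGAR accounts for exactly as many query bases as SEQ contains. The sketch below is only an illustration: the function name is invented, and it says nothing about why a record is broken.

```python
import re

# Field patterns as given in the SAM specification.
SEQ_RE = re.compile(r"\*|[A-Za-z=.]+")
CIGAR_RE = re.compile(r"\*|([0-9]+[MIDNSHPX=])+")
QUERY_CONSUMING = set("MIS=X")  # operations that use up bases of SEQ


def check_record(seq: str, cigar: str) -> list:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if not SEQ_RE.fullmatch(seq):
        problems.append("SEQ contains characters outside [A-Za-z=.]")
    if not CIGAR_RE.fullmatch(cigar):
        problems.append("CIGAR is not a series of <count><op> pairs")
    elif cigar != "*" and seq != "*":
        pairs = re.findall(r"([0-9]+)([MIDNSHPX=])", cigar)
        query_len = sum(int(n) for n, op in pairs if op in QUERY_CONSUMING)
        if query_len != len(seq):
            problems.append(
                f"CIGAR covers {query_len} query bases, SEQ has {len(seq)}"
            )
    return problems


print(check_record("0GBGTGM=GGMGMRYMD==", "*"))
```

The string quoted above fails the SEQ pattern because of the leading 0, which is not a legal sequence character.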
I have checked the original read, which is much shorter than normal ones (18 nt).
@DBV2SVN1_64_7_2105_9755_2866_0_2
ACCGCGCAACAGACATCA
+
ggggggggggfgggeggg
Could this be the reason that tophat produced such strange output?
Are you sure this is output from tophat? That second SAM line is obviously broken; I guess it's what samtools produces from a mangled BAM file.
Here's a wild guess: you ran samtools to make a BAM file, but then your hard disk became full. So you deleted a few files and everything seemed fine. Unfortunately, samtools has no error handling to speak of, so your BAM file is now missing data in the middle, and decoding it produces very weird effects.
Looks to me like a corrupted or truncated file. Are you sure it's correct? I.e., does it continue with more reads with sensible content?
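A quick first test for truncation is to look for the empty BGZF block that a complete BAM file ends with (28 fixed bytes). The Python sketch below checks only that; the function name is invented, and a file with data missing in the middle, as guessed above, could still pass. As far as I know, samtools quickcheck performs a similar end-of-file test.

```python
import os

# The 28-byte empty BGZF block that terminates a complete BAM file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600" "4243" "0200"
    "1b00" "0300" "00000000" "00000000"
)


def has_bgzf_eof(path: str) -> bool:
    """Return True if the file ends with the BGZF end-of-file marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BGZF_EOF), os.SEEK_END)  # jump to the last 28 bytes
        return handle.read() == BGZF_EOF
```

Calling has_bgzf_eof on the suspect BAM file and getting False would point to a file that was cut short while being written.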