Question

What Does This Cigar Mean?

2

Entering edit mode

13.4 years ago

Gangcai ▴ 230

Dear All, Following are part of the output of tophat(v1.3.0):

DBV2SVN1_64_7_2105_14622_191104_0_2     256     I       4589    3       50M     *       0       0   AGTGTTAGTGTTGTTGTCTGTTGCTTGGTCAACATCATCTTGTCGTTTTT       HFHFHFHFHFHHHHHFHHFHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHE   NM:i:0  NH:i:2  CC:Z:II CP:i:4527909    HI:i:0
DBV2SVN1_64_7_2105_9755_2866_0_2        272     I       4600    1       3421396b        *       0   0GBGTGM=GGMGMRYMD==      dqj�Oc!ijd!�9I��i�

Does anybody known the meaning of CIGAR "3421396b"? Thanks.

samtools tophat cigar • 5.8k views

ADD COMMENT • link updated 13.4 years ago by Marvin ▴ 900 • written 13.4 years ago by Gangcai ▴ 230

0

Entering edit mode

Looks to me like corrupted or truncated file. Are you sure it's correct? I.e. does it continue with more reads with sensible content?

ADD REPLY • link 13.4 years ago by Ketil 4.1k

score 5 · Answer 1 · 2011-06-16

5

Entering edit mode

13.4 years ago

2184687-1231-83- ★ 5.1k

It's a representation of positional alignment pattern of one sequence against another: the match/mismatch [M], deletion [D] and insertion [I] patterns of an alignment, as well as upstream and downstream padding, an other possibilities. See a table here:

[?]

A few simple examples: [?] Reference AAAAAAAAAA Query AAAAAAAAAA CIGAR MMMMMMMMMM or 10M

Reference AAAAAAAAAAT Query AAAAA-AAAAT CIGAR MMMMMDMMMMM or 5M1D5M

Reference AAAAA-AAAAT Query AAAAAAAAAAT CIGAR MMMMMIMMMMM or 5M1I5M [?]

(not sure what you G and E are in the example)

ADD COMMENT • link 13.3 years ago by 2184687-1231-83- ★ 5.1k

1

Entering edit mode

As far as I know your cigar operations should have a 1 in front of the D and I. It makes parsing much easier.

ADD REPLY • link 13.4 years ago by Gww ★ 2.7k

0

Entering edit mode

...and it's what samtools does, which also makes it a good idea. And the spec says: *|([0-9]+[MIDNSHPX=])+, so the number appears to be mandatory.

ADD REPLY • link 13.4 years ago by Ketil 4.1k

Ram · Answer 2 · 2011-06-16

3

Entering edit mode

13.4 years ago

Sean Davis 27k

From the SAM v1.4 spec:

Bit 0x4 is the only reliable place to tell whether the fragment is unmapped. If 0x4 is set, no assumptions can be made about RNAME, POS, CIGAR, MAPQ, bits 0x2, 0x10 and 0x100 and the bit 0x20 of the next fragment in the template.

In other words, the don't even look at the CIGAR unless bit 0x4 is not set. If I read your alignments above correctly (a little tough with a mixture of white space and tabs), both of the reads you report are unmapped so the CIGAR strings are irrelevant.

ADD COMMENT • link 13.4 years ago by Sean Davis 27k

0

Entering edit mode

Thanks for fixing the output in the original question. Brentp is right that there is a problem with that SAM line.

ADD REPLY • link 13.4 years ago by Sean Davis 27k

0

Entering edit mode

yeah, I wrapped the output in code tags so it would render more nicely.

ADD REPLY • link 13.4 years ago by brentp 24k

0

Entering edit mode

Thanks again for making the original more easy to read.

ADD REPLY • link 13.4 years ago by Gangcai ▴ 230

0

Entering edit mode

Spec, schmec. BWA apparently reports mapq for unmapped reads (not sure about cigar) and this breaks Picard, which apparently does make assumptions for MAPQ.

http://seqanswers.com/forums/archive/index.php/t-4246.html

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 13.4 years ago by Ketil 4.1k

0

Entering edit mode

Not for all mapped reads. bwa concatenates the reference multi-fasta before aligning, and if the case of a read that hangs off the edge of one reference contig onto another, well, it would be a little misleading for the aligner to say that it has no idea where that read goes, so it gives a map position, and a mapq, and sets the unmapped flag too, so you can see clearly that something isn't quite right. It's a feature, not a bug (meant mostly seriously)

ADD REPLY • link 13.3 years ago by Swbarnes2 ★ 1.6k

score 2 · Answer 3 · 2011-06-16

2

Entering edit mode

13.4 years ago

brentp 24k

That is bad output. e.g.: what does "0GBGTGM=GGMGMRYMD==" in the sequence column mean?

Plus that cigar string in nonsensical.

I would check that read in your sequence files and make sure it looks normal. Then try rerunning, if you can reproduce this with just a few reads--including that one, I would contact the tophat developers.

ADD COMMENT • link 13.4 years ago by brentp 24k

0

Entering edit mode

I have checked the original reads, which has much shorter length than normal ones(18nt). @DBV2SVN1_64_7_2105_9755_2866_0_2 ACCGCGCAACAGACATCA + ggggggggggfgggeggg

Could this be the reason that tophat output such strange output?

ADD REPLY • link 13.4 years ago by Gangcai ▴ 230

0

Entering edit mode

probably, doesn't tophat try to split at 25bp?

ADD REPLY • link 13.4 years ago by brentp 24k

0

Entering edit mode

If tophat produces that, regardless of input, I'd say it's a bug in tophat.

ADD REPLY • link 13.4 years ago by Ketil 4.1k

score 0 · Answer 4 · 2011-08-04

Are you sure this is output from tophat? That second SAM line is obviously broken; I guess it's what samtools produces from a mangled BAM file.

Here's a wild guess: you ran samtools to make a BAM file, but then your hard disk became full. So you deleted a few files and everything seemed fine. Unfortunately, samtools has no error handling to speak of, so your BAM file is now missing data in the middle, and decoding it produces very weird effects.