Question

SAM file: same reads ID, slightly different sequence

1

Entering edit mode

8.0 years ago

davide.chiarugi ▴ 20

After aligning with STAR data coming from RNAseq to a bacterial genome, I notice something (for me) unexpected in the SAM output file. As it can be seen from the lines below, some reads having the same ID (first column) have alightly different sequences (10th column) . Why so ?

HWI-ST1299:109:C3ELBACXX:7:1104:21137:23785 16  gi|15674250|ref|NC_002737.1|    1578076 0   45M *   0   0   CCCCAACTACATCAGGCGTTCTAGGGCTTAACTGCTGTGTTCGGC   IEGIGHEGFGGBJJJJJJJIIJJJIHGHEHIHCJJJJIHFHHHFD   NH:i:6  HI:i:1  AS:i:44 nM:i:0
HWI-ST1299:109:C3ELBACXX:7:1104:21137:23785 256 gi|15674250|ref|NC_002737.1|    269351  0   45M *   0   0   GCCGAACACAGCAGTTAAGCCCTAGAACGCCTGATGTAGTTGGGG   DFHHHFHIJJJJCHIHEHGHIJJJIIJJJJJJJBGGFGEHGIGEI   NH:i:6  HI:i:2  AS:i:44 nM:i:0
HWI-ST1299:109:C3ELBACXX:7:1104:21137:23785 272 gi|15674250|ref|NC_002737.1|    1331147 0   45M *   0   0   CCCCAACTACATCAGGCGTTCTAGGGCTTAACTGCTGTGTTCGGC   IEGIGHEGFGGBJJJJJJJIIJJJIHGHEHIHCJJJJIHFHHHFD   NH:i:6  HI:i:3  AS:i:44 nM:i:0
HWI-ST1299:109:C3ELBACXX:7:1104:21137:23785 256 gi|15674250|ref|NC_002737.1|    22063   0   45M *   0   0   GCCGAACACAGCAGTTAAGCCCTAGAACGCCTGATGTAGTTGGGG   DFHHHFHIJJJJCHIHEHGHIJJJIIJJJJJJJBGGFGEHGIGEI   NH:i:6  HI:i:4  AS:i:44 nM:i:0
HWI-ST1299:109:C3ELBACXX:7:1104:21137:23785 256 gi|15674250|ref|NC_002737.1|    28064   0   45M *   0   0   GCCGAACACAGCAGTTAAGCCCTAGAACGCCTGATGTAGTTGGGG   DFHHHFHIJJJJCHIHEHGHIJJJIIJJJJJJJBGGFGEHGIGEI   NH:i:6  HI:i:5  AS:i:44 nM:i:0
HWI-ST1299:109:C3ELBACXX:7:1104:21137:23785 256 gi|15674250|ref|NC_002737.1|    84283   0   45M *   0   0   GCCGAACACAGCAGTTAAGCCCTAGAACGCCTGATGTAGTTGGGG   DFHHHFHIJJJJCHIHEHGHIJJJIIJJJJJJJBGGFGEHGIGEI   NH:i:6  HI:i:6  AS:i:44 nM:i:0

RNA-Seq alignment SAM format • 2.8k views

ADD COMMENT • link 8.0 years ago by davide.chiarugi ▴ 20

0

Entering edit mode

Thanks for your answer Pierre. My problem is that, if I am not wrong, the 10th column should report the sequence of the read and, thus, I'm expecting that given the same ID in the first column I should have the same read in the 10th column. This case is confusing me, because, apparently, all the reads, even though multi-mapped (as indicated by the SAM flags) look like to map equally well (MAPQ =0, AS=44 for all of them, nM=0 for all of them). In such a case, according to the STAR developers, the primary alignment flag is assigned randomly. Thus, I'm wondering how could it be that reads with (slightly) different sequences:

1) have the same ID 2) map equally well

ADD REPLY • link 8.0 years ago by davide.chiarugi ▴ 20

0

Entering edit mode

I should have the same read in the 10th column.

wrong: the sequence has been reverse-complemented because the flag says it's on the reverse strand.

map equally well

yes, and a random read is chosen as the 'primary' alignment.

ADD REPLY • link 8.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thanks you indeed for your prompt response. Probably I am misunderstanding something: what does the 10th column exactly report ? The sequence of the "sequenced read" or the sequence to which the read has been aligned ?

ADD REPLY • link 8.0 years ago by davide.chiarugi ▴ 20

3

Entering edit mode

It's the bases in the sequenced read, presented as they would appear on the reference's positive strand. Thus if two reads have been mapped to the same position, they will look the same (modulo sequencing errors etc) in the SAM file, regardless of which strands actually went into the sequencing machine.

[Equivalently to Pierre's reply, but differently phrased.]

ADD REPLY • link 8.0 years ago by John Marshall 3.1k

0

Entering edit mode

The sequence of the "sequenced read" or the sequence to which the read has been aligned ?

its 'the sequence of the read, rev-compl is the read was mapped on the reverse strand, truncated if the read is hard-clipped (see CIGAR string).

ADD REPLY • link 8.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Thank you indeed for the clarifications !

ADD REPLY • link 8.0 years ago by davide.chiarugi ▴ 20

score 2 · Answer 1 · 2016-12-07

2

Entering edit mode

8.0 years ago

Pierre Lindenbaum 164k

see http://broadinstitute.github.io/picard/explain-flags.html

reads with flag = 256 are secondary alignments, 272= secondary alignments+ reverse strand.

ADD COMMENT • link 8.0 years ago by Pierre Lindenbaum 164k