Question

Sam file arrangement

0

3.3 years ago

aenna_p • 0

Hello,

I have a question regarding the length information of reads obtained from BAM files. I have converted BAM files into BED files and kept the read sequence. So, it looks something like this:

Chr 6791    7891    TCGAATATCAGGGTGCCCTCTGGCAAGGGCTTGCCCAGCGTACGTCAC    -
Chr 6966    7304    ATTGATGAGGGATGTGGGTGGATGGATGATGATGGAAATATGATATGC    +

I always assumed that columns 2 and 3 give the start and end positions of the read alignment, so column 3 minus column 2 should be the read length. However, if I count the characters in the DNA string (column 4) with the nchar() function in R, I get a different value.
Can anyone explain what I am missing?

Thank you!

BAM BED • 955 views

updated 3.3 years ago by ATpoint 85k • written 3.3 years ago by aenna_p • 0

score 1 · Answer 1 · 2021-08-01

1

3.3 years ago

ATpoint 85k

Alignment length != read length. Reads might get soft-clipped, and parts of the read might align elsewhere, depending on how the aligner handles clipping and non-primary alignments.
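
To make the distinction concrete, here is a minimal Python sketch, assuming the standard SAM CIGAR semantics (M, I, S, =, X consume read bases; M, D, N, =, X consume reference bases). The function name `cigar_lengths` and the example CIGAR strings are made up for illustration and are not taken from the data in the question.

```python
import re

# Operations that consume bases of the read (query) and of the reference,
# following the SAM specification. Hard clips (H) and padding (P) consume neither.
QUERY_OPS = set("MIS=X")
REF_OPS = set("MDN=X")

def cigar_lengths(cigar):
    """Return (read_length, reference_span) for a CIGAR string."""
    read_len = 0
    ref_span = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(count)
        if op in QUERY_OPS:
            read_len += n
        if op in REF_OPS:
            ref_span += n
    return read_len, ref_span

# Hypothetical CIGAR strings, for illustration only.
for cigar in ["48M", "10S38M", "21M2D22M", "20M1000N28M"]:
    read_len, ref_span = cigar_lengths(cigar)
    print(cigar, "read length:", read_len, "reference span:", ref_span)
```

In a BED file derived from a BAM file, column 3 minus column 2 would correspond to the reference span, while counting characters in the sequence column gives the read length. Soft clipping makes the span shorter than the read; deletions or skipped regions (D or N) make it longer.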

0

Thank you! I do understand why read length may be larger than alignment length. But I still do not understand how sometimes the alignment length can be larger than the read length. Can you explain this further?

3.3 years ago by aenna_p • 0

2

Alignment:  GATCGATCACTGACGTATCTAGGCGATCAGTCGTACGTATCACTA
Read:       GATCGATCACTGACGTATCTA  CGATCAGTCGTACGTATCACTA

Here is a simple example of a deletion in the read relative to the reference. It makes the alignment two bp longer than the read, because the start and end of the alignment on the reference define the coordinates.
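
The same example can be checked with a few lines of Python. This is only a sketch: the two-base gap in the read is written as `--` instead of spaces so it stays visible, and the variable names are made up for illustration.

```python
# Reference segment covered by the alignment (the "Alignment" row above).
reference = "GATCGATCACTGACGTATCTAGGCGATCAGTCGTACGTATCACTA"

# The read, with the two deleted bases drawn as "-" for illustration.
read_row = "GATCGATCACTGACGTATCTA" + "--" + "CGATCAGTCGTACGTATCACTA"

# The span on the reference is what end minus start in the BED file reports.
alignment_span = len(reference)

# The read length only counts bases actually present in the read.
read_length = len(read_row.replace("-", ""))

# Bases present in the reference but missing from the read.
gap_start = read_row.index("-")
deleted = reference[gap_start:gap_start + read_row.count("-")]

print(alignment_span, read_length, deleted)
```

This prints `45 43 GG`: the alignment spans 45 reference bases, the read has 43 bases, and the two missing bases account for the difference.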