Question

SAM CIGAR parsing and position counter in R

1

Entering edit mode

6.5 years ago

niu.shengyong ▴ 70

I have a SAM file like this:

SRR400619.13    0       NC_000913.3     3440402 40      1S25M10S        *       0       0       NTGCCGTTGGTTTCCATTTCGATGACNNNNNNNNNN    %-:9986986666:5#####################    HI:i:1  NH:i:1  NM:i:0
SRR400619.14    0       NC_000913.3     3442282 20      26M10S  *       0       0       CACATTTGGCAACTTCGTCACGCAGCNNNNNNNNNN    BBCBBB@@BCACC@BBB8BB################    HI:i:1  NH:i:1  NM:i:1

I hope to build up a counter of each position. For example, in the first row, I have start position 3440402 and CIGAR string 1S25M10S. Hence, the counter will increase by 1 in the positions of 3440403 to 3440427. Is there any R package or script can do it?

PS: I hope to extract the alignment length and increase the corresponding position in my counter array. For example, if the start position is 100, and CIGAR is 5M1S, then the count would be [100] 1 [101] 1 [102]1[103]1[104]1[105]0

Simon

SAM CIGAR parsing R • 2.7k views

ADD COMMENT • link 6.5 years ago by niu.shengyong ▴ 70

2

Entering edit mode

Just to be sure you want to extract from cigar the alignment length (so 25 in the first row). And add it to the start position. Could you rephrase your question please ?

ADD REPLY • link 6.5 years ago by Nicolas Rosewick 11k

0

Entering edit mode

Hi Nicolas, thanks for your advice and fast reply! I hope to extract the alignment length and increase the corresponding position in my counter array. For example, if the start position is 100, and CIGAR is 5M1S, then the count array would be [100] 1 [101] 1 [102]1[103]1[104]1[105]0

ADD REPLY • link 6.5 years ago by niu.shengyong ▴ 70

0

Entering edit mode

What about insertion, deletion and splice junction encoded within CIGAR ? If I understand well you want to report all position where one read aligned.

ADD REPLY • link 6.5 years ago by Nicolas Rosewick 11k

0

Entering edit mode

Yes, you are right. I want to report all position where one read aligned. I also need to consider all other situation in CIGAR string including insertion, deletion, splice, mismatch, etc. (like the operators list here http://bioinformatics.cvr.ac.uk/blog/tag/cigar-string/

ADD REPLY • link 6.5 years ago by niu.shengyong ▴ 70

0

Entering edit mode

What is the actual problem you are trying to solve? Are you attempting to get per-base read depth information? There's already R packages that will give you that.

ADD REPLY • link 6.5 years ago by d-cameron ★ 2.9k

0

Entering edit mode

I'm now building up a machine learning based tool. It takes the per-base read information as training dataset. The input is like the following array: [1] 1 [2] 1 [3] 0 .... etc. which means the first and second positions are mapped with one read, and there is no read mapped in the position 3.

Thanks!

ADD REPLY • link 6.5 years ago by niu.shengyong ▴ 70

0

Entering edit mode

Using something like Rsamtools::pileup will be much easier than having to parse read CIGAR strings yourself.

ADD REPLY • link 6.5 years ago by d-cameron ★ 2.9k