Entering edit mode
6.5 years ago
niu.shengyong
▴
70
I have a SAM file like this:
SRR400619.13 0 NC_000913.3 3440402 40 1S25M10S * 0 0 NTGCCGTTGGTTTCCATTTCGATGACNNNNNNNNNN %-:9986986666:5##################### HI:i:1 NH:i:1 NM:i:0
SRR400619.14 0 NC_000913.3 3442282 20 26M10S * 0 0 CACATTTGGCAACTTCGTCACGCAGCNNNNNNNNNN BBCBBB@@BCACC@BBB8BB################ HI:i:1 NH:i:1 NM:i:1
I hope to build up a counter of each position. For example, in the first row, I have start position 3440402 and CIGAR string 1S25M10S. Hence, the counter will increase by 1 in the positions of 3440403 to 3440427. Is there any R package or script can do it?
PS: I hope to extract the alignment length and increase the corresponding position in my counter array. For example, if the start position is 100, and CIGAR is 5M1S, then the count would be [100] 1 [101] 1 [102]1[103]1[104]1[105]0
Simon
Just to be sure you want to extract from cigar the alignment length (so 25 in the first row). And add it to the start position. Could you rephrase your question please ?
Hi Nicolas, thanks for your advice and fast reply! I hope to extract the alignment length and increase the corresponding position in my counter array. For example, if the start position is 100, and CIGAR is 5M1S, then the count array would be [100] 1 [101] 1 [102]1[103]1[104]1[105]0
What about insertion, deletion and splice junction encoded within CIGAR ? If I understand well you want to report all position where one read aligned.
Yes, you are right. I want to report all position where one read aligned. I also need to consider all other situation in CIGAR string including insertion, deletion, splice, mismatch, etc. (like the operators list here http://bioinformatics.cvr.ac.uk/blog/tag/cigar-string/
What is the actual problem you are trying to solve? Are you attempting to get per-base read depth information? There's already R packages that will give you that.
I'm now building up a machine learning based tool. It takes the per-base read information as training dataset. The input is like the following array: [1] 1 [2] 1 [3] 0 .... etc. which means the first and second positions are mapped with one read, and there is no read mapped in the position 3.
Thanks!
Using something like Rsamtools::pileup will be much easier than having to parse read CIGAR strings yourself.