Hi,
I have tried a "home-made" method to extract the number of nucleotides per read from a SAM file, but it does not work correctly for all reads, probably because of some deletions.
grep -e "pattern" my.sam | cut -f 2,3,4,5 > output.txt
To explain: I search for reads containing a pattern and then extract some information for every read, such as Chromosome, Position, Strand, and Sequence. Then I use R to count the number of characters in the "Sequence" column, which gives me the number of nucleotides per read. However, the sequence sometimes contains a deletion "-", which counts as a character, so I get some misplaced reads. I don't want to get rid of these reads. My alignment parameters allow only 1 mismatch, so I expect at most a single deletion or insertion, if that matters.
Is there any way to use the CIGAR string of each read to extract the correct number of aligned nucleotides per read?
Thanks for your time, Ioannis
That's not a SAM file. The specification doesn't allow a hyphen in the sequence field. https://samtools.github.io/hts-specs/SAMv1.pdf
I see, but maybe I didn't make it clear. In the Sequence column I do not get any "-" or other symbols, but for some aligned reads the sequence at the reported chromosome and position differs from the genomic sequence by one nucleotide.
Example:
Genomic sequence: AGTCTAGTACCC
Aligned sequence: AGCTAGTACCC
In that case there is a single deletion, and I wonder whether this information is recorded in the SAM file and can somehow be extracted.
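For what it's worth, in a SAM file that follows the specification, this information is in the CIGAR string (column 6). Below is a minimal Python sketch of how it could be read, not a tested pipeline. The function names `cigar_lengths` and `summarize_sam_line`, the example read, and the CIGAR `2M1D9M` are made up for illustration; a real aligner may place the deletion at a different position. It assumes tab-separated SAM lines with the 11 mandatory fields.

```python
import re

# CIGAR operations as defined in the SAM specification:
#   M, =, X  consume both read and reference
#   I, S     consume the read only
#   D, N     consume the reference only
#   H, P     consume neither
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_lengths(cigar):
    """Return (read_bases, ref_span, inserted, deleted) for one CIGAR string.

    read_bases: read nucleotides that take part in the alignment
                (M/=/X plus I; soft-clipped bases are not counted)
    ref_span:   reference positions covered (M/=/X plus D and N)
    """
    read_bases = ref_span = inserted = deleted = 0
    for count, op in CIGAR_RE.findall(cigar):
        n = int(count)
        if op in "M=X":
            read_bases += n
            ref_span += n
        elif op == "I":
            read_bases += n
            inserted += n
        elif op == "D":
            ref_span += n
            deleted += n
        elif op == "N":
            ref_span += n
        # S, H and P change neither count
    return read_bases, ref_span, inserted, deleted


def summarize_sam_line(line):
    """Pull chromosome, position, CIGAR-derived lengths and SEQ from one SAM line."""
    fields = line.rstrip("\n").split("\t")
    rname, pos, cigar, seq = fields[2], int(fields[3]), fields[5], fields[9]
    return (rname, pos) + cigar_lengths(cigar) + (seq,)


# Hypothetical alignment record matching the example above (one deleted base).
example = "read1\t0\tchr1\t100\t60\t2M1D9M\t*\t0\t0\tAGCTAGTACCC\t*"
print(summarize_sam_line(example))
```

For the hypothetical CIGAR `2M1D9M`, this reports 11 read bases spanning 12 reference positions with one deleted base, which matches the 11-nt read against the 12-nt genomic sequence in the example. Header lines (starting with `@`) would need to be skipped before calling `summarize_sam_line`, and an unavailable CIGAR (`*`) simply yields zeros.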