Question

Finding the number of aligned nucleotides per read in a sam file

0

Entering edit mode

6.7 years ago

ioannis ▴ 50

Hi,

I have tried a "home-made" method in order to extract the number of nucleotides per read out of a SAM file but is not working correctly for all the reads due to some deletions (I guess).

grep -e "pattern" my.sam | cut -f 2,3,4,5 > output.txt

To explain, I search for reads with a pattern and then extract some information from every read such as Chromosome, Position, Strand, and Sequence. Then I use R software to count the number of characters of the "Sequence" column and I get the number of nucleotides per read. However, the sequence sometimes might contain a deletion "-" which counts as a character and I get some misplaced reads. I don't want to get rid of these reads. My alignment parameters allow only 1 mismatch so I expect to get a single deletion or addition if that matters.

Is there any way to use the cigar strings of each read and extract the correct number of aligned nucleotides per read?

Thanks for your time, Ioannis

SAM CIGAR stings • 2.2k views

ADD COMMENT • link updated 6.3 years ago by jilguero888 ▴ 20 • written 6.7 years ago by ioannis ▴ 50

0

Entering edit mode

However, the sequence sometimes might contain a deletion "-" which counts as a character a

that's not a SAM file. The specification doesn't allow a hyphen. https://samtools.github.io/hts-specs/SAMv1.pdf

String *|[A-Za-z=.]+ segment SEQuence

ADD REPLY • link 6.7 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I see, but maybe I didn't make it clear. In the Sequence column I do not get any "-" or any other symbols but I get some aligned reads with chromosome and position that do not much the genomic sequence by one nucleotide.

Example: Genomic sequence: AGTCTAGTACCC Aligned sequence: AGCTAGTACCC

In that case there is a single deletion and I wonder if within the SAM file this information can be found and somehow extracted.

ADD REPLY • link 6.7 years ago by ioannis ▴ 50

0

Entering edit mode

6.3 years ago

jilguero888 ▴ 20

Hi, you can use the raw output from samtools mpileup (without options), and on the 5th column you have the aligned nucleotides at every position.

Get all the nucleotides from a bam file that is aligning to specific position in the genome

ADD COMMENT • link 6.3 years ago by jilguero888 ▴ 20

1

Entering edit mode

the question was "number of aligned bases per reads" not "number of reads per position"

ADD REPLY • link 6.3 years ago by Pierre Lindenbaum 166k

score 3 · Accepted Answer · 2018-10-16

3

Entering edit mode

6.7 years ago

Pierre Lindenbaum 166k

using bioalcidaejdk http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html and looping over the cigar string:

java -jar dist/bioalcidaejdk.jar -e 'stream().filter(R->!R.getReadUnmappedFlag() && R.getCigar()!=null).forEach(R->println(R.getReadName()+"\t"+R.getContig()+"\t"+R.getStart()+"\t"+R.getCigar().getCigarElements().stream().filter(CE->CE.getOperator().isAlignment()).mapToInt(CE->CE.getLength()).sum()));' input.bam

ADD COMMENT • link 6.7 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

Sorry for taking some time to answer but I am trying to understand how everything works without knowing java basics. The script you shared is working perfectly.

Thanks Pierre

ADD REPLY • link 6.6 years ago by ioannis ▴ 50