I have been working on a script for identifying frequent 'end points' of paired end reads from a SAM file on a genome for determining uneven cut frequencies, which I need to integrate the orientation and 1st/2nd read information into. The 2nd column of a SAM file gives the detail on the orientation with a number (as described here and here).
I hadn't used a bioperl import method because all I was doing was pulling the genomic start position of the read and building a hash by frequency, but I need to now tune it for directionality and mate pair.
My question is whether there is a better way to do this than:
## Numbers are random examples, not correctly done at the moment.
if ($line[1] == 99){
$coordinate = $start
}elsif($line[1] == 147){
$coordinate = $start - $readlength
}elsif($line[1] == 83){
$coordinate = $start + $readlength
}.........etc
Given that a picture paints a thousand words, basically what I want is to pull out the coordinates of the red circles from my SAM file. It's part of a bigger perl script, so I'd prefer to avoid other tools, but if they exist I should look at them.
what is -/+36 the read length ? it 's not as simple for the negative-strand read because you'll have to walk to the cigar string (indel and insertion)...
Sorry, those numbers would vary based upon the read length. I jut know that in my data here, the reads are all trimmed to 36bp.
From my bowtie mapping, the forward and reverse reads are independent, not pair-matched, so I think they can be independently looked at without needing to worry about indels, right? I considered mapping them together but as I'm interested in the end points, it's not essential that everything has a pair and I wanted to avoid throwing any more out.
"From my bowtie mapping, the forward and reverse reads are independent, not pair-matched, so I think they can be independently looked at without needing to worry about indels, right? " : no if there is an indel in the read (cigar string contains 'I' or 'D' ) or even clipping you can't just add the values.
Ah, ok I see. I was only thinking of indels between the paired reads. I can include that in the calculation, but I'm starting to think that the bioperl SAM module will be better suited for this.