I want to write a program that converts SAM files to genome coverage (so wiggle or bedgraph format). So, my question is related to prrocessing the output of the aligner. My program would work a little bit like the genomeCoverageBed
function in bedtools
genomeCoverageBed -bg -d -ibam reads.bam -g genome.csv
However, I wouldn't have to do the extra step of translating from SAM to BAM.
Now, it's reasonably straightforward to scan through a SAM file and pick out the strand and location of a tag. The length of the read can be inferred. Of course, one will often know the length of the reads anyway.
My question is how to handle the hard/soft clipping in terms of the length of the tag. Presumably, taking the clipping into account would mean dropping a few bases at the start or the end, thus having a shorter read. This would affect the tag count totals in the output to my function.
In DNA-seq, it seems like it doesn't make much sense to take clipping into account, because the location of the read is what mattered. Any feedback would be appreciated.
I think of read clipping as something that is done by the aligner. Perhaps you are talking about read trimming (prior to alignment)? Could you clarify?