Hi guys !
From a bam alignment file, I want to compute the ratio between the number of reads terminating and overlapping at all genomic positions (called psi-ratio in this figure).
For the number of reads overlapping, it's easy (its basically the coverage, given by samtools depth
for instance), but for the reads terminating I'm a bit lost. I guess this has to do with the POS field of the sam/bam alignment (1-based leftmost mapping POSition) but I can't go much further than that... Any help or resource will be appreciated !
Subsidiary question : what if the input data is paired-end and I want the ratio between the FRAGMENTS terminating and overlapping ?
For those interested in why I want to do that, the idea is very similar to the recently described psi-seq : I want to detect positions on RNAs that blocks the reaction of reverse transcription during the library preparation.
Thanks a lot !
Thanks for the advice, I didn't know about this tool. It looks quite straightforward. I'll try it asap !
Hi again !
I tried your method, it looks like it works great but I have a few questions.
1- What is the name of the file containing the code for the methods like
getAlignmentEnd()
?2- What does
getAlignmentEnd()
do exactly ? Does it return the positions of the last base of a read ? or both the last and first one ?3- What do you mean exactly by "unclipped alignments" ? I ran your code on some paired end data (.bam) with both
getUnclippedEnd
andgetAlignmentEnd
and the output is exactly the same.4- I usually use samtools depth to compute the coverage, but it does some filtering on the reads. For consistency (my end goal is to do the ratio), is it possible to compute the read coverage with your method ?
Thanks a lot for your help already. And even if it's a bit hard to get in, your tool looks really great !
PS : as u probably guessed, I don't know java.
https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMRecord.java
https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMRecord.java#L590
get the alignment start and scan the CIGAR string of the SAM until it reaches the end of the read on the REF genome.
'Last' is base in 3' on the REF
What is difference between soft-clipped and hard-clipped in SAM specification? ?
so, you don't have clip :-)
4- I usually use samtools depth to compute the coverage, but it does some filtering on the reads. For consistency (my end goal is to do the ratio), is it possible to compute the read coverage with your method ?
yes but it's out of the scope of your original question; You could use bioalcidae to generate a BED graph of your data and then use bedtools genomecov ?
I'm honestly impressed. thanks =) One last question : "Last is base in 3' on the REF" ... So with a read pair I get this, right ? (the red lines would be the positions returned by getAlignmentEnd)
If this is correct, then it'll be a little bit more complicated to get exactly what I want, but I think I can figure it out with all the details you provided !
NO: for the read on the right, alignment end is always on the genomic REF 3', so alignmentEnd is always on the right.