Read alignments terminating at genomic positions.
1
1
Entering edit mode
8.6 years ago

Hi guys !

From a bam alignment file, I want to compute the ratio between the number of reads terminating and overlapping at all genomic positions (called psi-ratio in this figure).

image description

For the number of reads overlapping, it's easy (its basically the coverage, given by samtools depth for instance), but for the reads terminating I'm a bit lost. I guess this has to do with the POS field of the sam/bam alignment (1-based leftmost mapping POSition) but I can't go much further than that... Any help or resource will be appreciated !

Subsidiary question : what if the input data is paired-end and I want the ratio between the FRAGMENTS terminating and overlapping ?

For those interested in why I want to do that, the idea is very similar to the recently described psi-seq : I want to detect positions on RNAs that blocks the reaction of reverse transcription during the library preparation.

Thanks a lot !

alignment bam psi-seq sam samtools • 2.1k views
ADD COMMENT
2
Entering edit mode
8.6 years ago

for the reads terminating I'm a bit lost.

using bioalcidae : https://github.com/lindenb/jvarkit/wiki/BioAlcidae

$ java -jar dist/bioalcidae.jar input.bam \
    -e 'while(iter.hasNext()) { var rec = iter.next(); if(rec.getReadUnmappedFlag()) continue; out.println(rec.getContig()+"\t"+rec.getAlignmentEnd()); }'   | LC_ALL=C sort -k1,1 -k2,2n | LC_ALL=C  uniq -c 

      1 rotavirus   52
      1 rotavirus   56
      1 rotavirus   64
      1 rotavirus   66
      1 rotavirus   67
      2 rotavirus   68
      1 rotavirus   69
      6 rotavirus   70
      8 rotavirus   71
      5 rotavirus   72
      5 rotavirus   73
      7 rotavirus   74
      4 rotavirus   75
      6 rotavirus   76
      7 rotavirus   77
      5 rotavirus   78
      8 rotavirus   79
      3 rotavirus   80
     10 rotavirus   81
      4 rotavirus   82
      6 rotavirus   83
      7 rotavirus   84
      4 rotavirus   85
      3 rotavirus   86
     10 rotavirus   87
      5 rotavirus   88
      6 rotavirus   89
      4 rotavirus   90
      9 rotavirus   91
      7 rotavirus   92
      4 rotavirus   93
      1 rotavirus   94
      5 rotavirus   95
      5 rotavirus   96
      6 rotavirus   97
      7 rotavirus   98
      3 rotavirus   99
      4 rotavirus   100
(...)

you might use getUnclippedEnd instead of getAlignmentEnd if you want the unclipped alignments.

ADD COMMENT
0
Entering edit mode

Thanks for the advice, I didn't know about this tool. It looks quite straightforward. I'll try it asap !

ADD REPLY
0
Entering edit mode

Hi again !

I tried your method, it looks like it works great but I have a few questions.

1- What is the name of the file containing the code for the methods like getAlignmentEnd() ?

2- What does getAlignmentEnd() do exactly ? Does it return the positions of the last base of a read ? or both the last and first one ?

3- What do you mean exactly by "unclipped alignments" ? I ran your code on some paired end data (.bam) with both getUnclippedEnd and getAlignmentEnd and the output is exactly the same.

4- I usually use samtools depth to compute the coverage, but it does some filtering on the reads. For consistency (my end goal is to do the ratio), is it possible to compute the read coverage with your method ?

Thanks a lot for your help already. And even if it's a bit hard to get in, your tool looks really great !

PS : as u probably guessed, I don't know java.

ADD REPLY
1
Entering edit mode

1- What is the name of the file containing the code for the methods like getAlignmentEnd() ?

https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMRecord.java

2- What does getAlignmentEnd() do exactly ? Does it return the positions of the last base of a read ? or both the last and first one ?

https://github.com/samtools/htsjdk/blob/master/src/java/htsjdk/samtools/SAMRecord.java#L590

get the alignment start and scan the CIGAR string of the SAM until it reaches the end of the read on the REF genome.

'Last' is base in 3' on the REF

3- What do you mean exactly by "unclipped alignments" ?

What is difference between soft-clipped and hard-clipped in SAM specification? ?

I ran your code on some paired end data (.bam) with both getUnclippedEnd and getAlignmentEnd and the output is exactly the same.

so, you don't have clip :-)

4- I usually use samtools depth to compute the coverage, but it does some filtering on the reads. For consistency (my end goal is to do the ratio), is it possible to compute the read coverage with your method ?

yes but it's out of the scope of your original question; You could use bioalcidae to generate a BED graph of your data and then use bedtools genomecov ?

ADD REPLY
0
Entering edit mode

I'm honestly impressed. thanks =) One last question : "Last is base in 3' on the REF" ... So with a read pair I get this, right ? (the red lines would be the positions returned by getAlignmentEnd) getAlignmentEnd Schema

If this is correct, then it'll be a little bit more complicated to get exactly what I want, but I think I can figure it out with all the details you provided !

ADD REPLY
0
Entering edit mode

. So with a read pair I get this, right ?

NO: for the read on the right, alignment end is always on the genomic REF 3', so alignmentEnd is always on the right.

ADD REPLY

Login before adding your answer.

Traffic: 2267 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6