How to extract coordinate of hard and soft clipped long reads ?
3
0
Entering edit mode
6.1 years ago
BioGeek ▴ 170

I wanted to extract the genome coordinate of all those location where longreads are soft/hard clipped.

Genome
                          ] [
__________________________________________________________________________________
--- ----------------------- --------------------------------- --------------------
--------------------------- -----------------------------------------
---- ---------------------- -------------------- --------------------------------
---------------------------- ------------------------------  -----------
  ------------------------ --------------------------------------------------------
 ------- ------------------ ------------------------- ----------------------------

                           ^

the arrow (^) where all the long reads are soft/hard clipped. Interestingly these region have few bases overlaps ??

longreads BAM clipped mapping • 2.7k views
ADD COMMENT
0
Entering edit mode

@BioGeek, did you find a way to do that ? I have the same problem ! Thanks

ADD REPLY
0
Entering edit mode
6.1 years ago
Vitis ★ 2.6k

Probably by reading and analyzing the CIGAR strings, finding all operations before "H" and "S" and calculating the clipping locations based on mapping locations and operation lengths before clipping.

For example, the CIGAR for one read is 120M31S, the mapping position for the read is 12345, then the clipping location might be in the area close to 12345 + 120 = 12465.

ADD COMMENT
0
Entering edit mode
6.1 years ago

using Bioalcidae http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html : stream the bam, for each read extract the unclipped/start|end and clipped start/end. If there is a clip, print the BED record(s).

$ java -jar dist/bioalcidaejdk.jar -e 'stream().filter(R->!R.getReadUnmappedFlag()).forEach(R->{int x1=R.getUnclippedStart(),x2=R.getStart();if(x1<x2) println(R.getContig()+"\t"+(x1-1)+"\t"+(x2)+"\t"+R.getReadName());x1=R.getEnd();x2=R.getUnclippedEnd();if(x1<x2) println(R.getContig()+"\t"+(x1-1)+"\t"+(x2)+"\t"+R.getReadName());});'  src/test/resources/S1.bam 

RF01    567 576 RF01_508_1107_4:0:0_0:0:0_af
RF01    1679    1682    RF01_1679_2192_3:0:0_3:0:0_a8
RF01    1886    1889    RF01_1821_2315_3:0:0_2:0:0_98
RF01    1989    1993    RF01_1445_1994_3:0:0_3:0:0_5e
RF02    407 410 RF02_98_411_1:0:0_3:0:0_2e
RF02    446 450 RF02_382_836_3:0:0_0:0:0_82
RF02    1811    1821    RF02_1811_2313_5:0:0_2:0:0_74
RF02    2220    2224    RF02_1737_2289_3:0:0_4:0:0_5e
RF03    739 741 RF03_327_808_1:0:0_3:0:0_1e
RF03    813 822 RF03_309_823_1:0:0_4:0:0_82
(...)
ADD COMMENT
0
Entering edit mode

RF01 567 576 RF01_508_1107_4:0:0_0:0:0_af

I think these are longreads location! Is this possible to get reference chromosome/scaffold/contig location ?

ADD REPLY
0
Entering edit mode
6.1 years ago
Carambakaracho ★ 3.3k

In case you're really only interested in the start position, this might do the trick

perl -nwe 'next if /^@/;@a=split;if($a[5]~=/^\d*H*(\d+)S/{$a[3]+$1; print "$a[0]\t$a[3]"}' <samtools view my.bam >out.txt
ADD COMMENT

Login before adding your answer.

Traffic: 2997 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6