I used bowtie2 to do the alignments.
Do you think that the methodology described below is correct?
1) To extract the reads that align only once
grep -E "@|NM:" alignments.sam | grep -v "XS:" > unique.sam
Now I am looking for a way to find which reads' alignments don't overlap with any other alignment.
2) Converting to bam & sorting:
samtools view -Sb unique.sam > unique.bam
samtools sort unique.bam -o unique.sorted.bam
3) Then view the sorted file:
samtools view GCF_000090985.2_k15_local_verySensitive.unique.sorted.bam | head -n 4
sl157419384 0 NC_013038.1 67349 22 35S30M10S * 0 0 ACCCCTTTGCCCAGNTCCCCTTTGCCCAGNNCCCCTTTGCCCCCTCCCCCTTTGCCCCNTTCCCCTTTGCCCCNT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:57 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:23C6 YT:Z:UU
sl157419385 0 NC_013038.1 67349 22 20S30M25S * 0 0 TCCCCTTTGCCCAGNNCCCCTTTGCCCCCTCCCCCTTTGCCCCNTTCCCCTTTGCCCCNTGCCCCTTTGCCCNGA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:57 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:23C6 YT:Z:UU
sl157419386 0 NC_013038.1 67349 22 5S30M40S * 0 0 NCCCCTTTGCCCCCTCCCCCTTTGCCCCNTTCCCCTTTGCCCCNTGCCCCTTTGCCCNGATCCCCTTTGCCCNGA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:57 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:23C6 YT:Z:UU
sl158880407 16 NC_013038.1 180163 22 22S9M3I40M1S * 0 0 NCCAGCCGAGGAGGTAGCAGCCGAGGAGGNTAGAGCCGAGGAGGNCNGAGCCGAGGAGGAGGGAGCCGAGGAGGN IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII AS:i:59 XN:i:0 XM:i:5 XO:i:1 XG:i:3 NM:i:8 MD:Z:7G11A0G0G16T10 YT:Z:UU
...
I think that the 4-th column represents the starting position of the alignment on the genome.
4) Since my reads have constant length, I intend to iterate all over these starting positions, add the length and compare with the next line to see if any overlapping happened.
What do you think about these steps? Do they make sense, or I am missing something?
Thanks
I wrote http://lindenb.github.io/jvarkit/BamTile.html to get a tiling path of reads. But there will be a minimal overlap. I neededb I can add a new option to remove any overlapping .
@Pierre Lindenbaum Thank you for your comments. I haven't still figured out what exactly a "tiling path of reads" is but this seems helpful. Apparently I would be interested in overlapping removal as well :) Could you provide some reference about the tiling path thing for me to check out?
Figure 10 on this page. It is the minimum number of reads/genomic entities that provide contiguous coverage for a region.
I've updated my tool to prevent an overlap (option '-n') http://lindenb.github.io/jvarkit/BamTile.html
this is a wrong method if you have any deletion, soft clip in your read.
and you do have soft clipping !
35S30M10S