Hi all!
I am trying to find a tool that will allow me to count the number of indel in certain regions specified in a bed file. I have aligned my reads but I have no idea about which tool can extract the number of indels for each region. The reads are much longer than the regions indicated in the bed file, so I am wondering if there is a way to get that information for a sub portion of my mapped reads.
Thank you in advance!
thanks for your answer! I was wondering if it is possible to get the information related to number of indel for each region included in a bed file, using just the bam file. In the bam file, the CIGAR string should indicate the total number of INDELs that you have for each read, but the value is calculated considering the full reads. However, I would like to extract that value on a sub portion of the read (based on the regions indicated in the bed file).
show us an example of output.
thanks for helping! I am working with PacBio data, so the reads are long. For example, I want to check the number of INDEL in this region: chr1:3985539-3985593. I know that one read overlaps that region. From the alignment file, I can extract all the information of that read and among them I can extract the CIGAR string. Eg (sorry, I had to cut out some part of the read and CIGAR, otherwise I exceed the maximum number of character):
However the CIGAR string is related to the full read, since I want to check the number of INDEL in just this location chr1:3985539-3985593, so a sub portion of that read, I am wondering if there is a tool to extract that information.
Thanks!
As Pierre suggested, an example output would be helpful (that was actually the input haha).
Anyways, I would suggest you do something like
samtools view -b -h -L bedfile.bed bamfile.bam
to subset your bam based on the bed file.Now, how to call indels from a CIGAR string I am not entirely sure. I don't know what you are working on, or exactly why you need to call indels from CIGAR strings, but I definitely recommend you do a regular variant calling instead. Bear in mind, that a CIGAR string is informative for each individual read condition, but cannot (and is not intended to) account for the presence or absence of a variant in the whole sample (this is actually what variant callers are for).
If you still need to do it from the CIGAR string, you could build some custom script to count for Ns and Is in each string, and you would need to decide what to do with the alignment gaps (these might as well be part of INDELs).
Hope it helps! Daiana