extract indel subportion of mapped reads
1
0
Entering edit mode
2.5 years ago
pingu77 ▴ 20

Hi all!

I am trying to find a tool that will allow me to count the number of indel in certain regions specified in a bed file. I have aligned my reads but I have no idea about which tool can extract the number of indels for each region. The reads are much longer than the regions indicated in the bed file, so I am wondering if there is a way to get that information for a sub portion of my mapped reads.

Thank you in advance!

bed indel subreads • 1.2k views
ADD COMMENT
2
Entering edit mode
2.5 years ago
daianagan ▴ 40

HI!

I am not sure if I understood you correctly. But I guess you could do your variant calling and then use a tool such as VCFtools (http://vcftools.sourceforge.net/man_latest.html) which can extract your variants based on a bed file (--bed) and if you want only indels you can use --keep-only-indels.

Hope it helps!

Best,

Daiana

ADD COMMENT
0
Entering edit mode

thanks for your answer! I was wondering if it is possible to get the information related to number of indel for each region included in a bed file, using just the bam file. In the bam file, the CIGAR string should indicate the total number of INDELs that you have for each read, but the value is calculated considering the full reads. However, I would like to extract that value on a sub portion of the read (based on the regions indicated in the bed file).

ADD REPLY
0
Entering edit mode

show us an example of output.

ADD REPLY
0
Entering edit mode

thanks for helping! I am working with PacBio data, so the reads are long. For example, I want to check the number of INDEL in this region: chr1:3985539-3985593. I know that one read overlaps that region. From the alignment file, I can extract all the information of that read and among them I can extract the CIGAR string. Eg (sorry, I had to cut out some part of the read and CIGAR, otherwise I exceed the maximum number of character):

seqID10001 0    chr1    3976171 60  23M1I136M1I346M1D14M1I40M1I90M1D21M1D53M1D12M1I22M1I83M1D143M1I130M1D62M1D16M1I215M1D13M1D48M1I116M1D292M2I45M1D29M1D80M1I235M1D28M1D10M1D79M1D125M1D75M1D180M1D161M1D91M1I81M1D19M1D172M1D58M1D90M1D111M1D111M1D161M1I98M1D38M1D4M1D50M1D281M1D5M1D23M1I83M1D22M1D65M1I180M1I10M1D71M1I17M1D87M1D38M1D99M1D48M1D93M1D130M1D203M1I64M1D92M1D92M1D235M1D11M1D137M1I219M1I131M1D44M1D103M1D36M1D169M1D114M1D145M1D10M1D44M1D13M1D50M1D78M1D62M1D69M1D52M1D10M1D401M1D180M1I115M1I46M1D60M1D28M1D157M1D152M1D52M1I117M1D46M1D27M1D131M1D139M1D129M1D101M1D98M8D166M1I16M2D143M1D89M255I203M1I78M1I56M1I34M1D24M...71M1I212M1I108M7D90M1D204M   *   0   0   TTCTTCTTTGCCTTCTGCCATGATTTGTAAGTTTCCCGAGGCCTCCTCAGCCATGCAGAACTGTCAATTAAACATCTTTCTTTTATAAATTACCAGTCTCGGGTGGTATCTTTATCGCAGTGTGAGAAAAAGTGAATATGGGCAGAAACACATGGCCAGTCCTCCAGGAGCAGACAGGAGGCCCGAAGGCCGGAGCGGAGCCAGGGAGAGGAAAGCGAGGGACGCTGAAGCCGCGTCCCGAGGGCCTGGAGGGCGGCTCTGGTGACCTTGGCTCTTGCTCTGTGTGAGCCAGGAGCCATGTGAGGGTTTCGAGCAGAGGAGTAACATGGTCTGATTTCATTGTTTTGGGCTGGAGGGACTGGAAGATGGAGTTGCCGCTGAGCTGAGAGATGGCTGGGGACAGCTCTGGGGAATGTGGATTTGGACGCGGTAATATGCAATGCTTATTGGACATCCCAGTGGCGCTACTGGGGAATCTTTAGGAAAAGTGGACCTGACATTTGGAGGAAGATCACAGATGGAAAAACGTGCTTTCGGGAGATTTGAGTGTAGGGCTGAGGCTGGGACGCGCTCACCAGGGGTGAGTGGGCAGAGAGAGGGAGAGATCCCAGAATGAAAACCCGGGTCCCACTGTTAGGAGGCAGGGGAGGAGTGGATCCACAAGGGCAAAGCAGCACGTCCCCTGACTACGGGGTCACCATTCACCTCGCAGGCCATGTGAATCGCACCCTGGAACTGTGGCAGGGCCCACCTGTGTGGCCACCACGCAGTGACCCTGAGACAAGGAGTGACCACAGAAGTGCATGGCAAAGCAGAAAGGTGTGACGTCTTGGAGGACGCAGGAGGAAGCACGCCCAGCGGACAGGAGCCATCACGCGGCTGCGGGCTGCTGCTGAGAACCTGGCGTGGGCCACCGTCGTCAGCAGCGTGGAGGTAACTGACGACCTTGGCCAAGAAGTTGGGTGGCAGCGTGGAGGTCATGCTCAGAGATGAGTGGGCGCAAGAGACGGTGGCGGGAGAGGGAGTGGAGATCATAGGCGTGGATGGCTCTTATGAGCCTTTTCTCTGTAGAGGGGAGTGGGGGAATGGGGGAAGCTAGGAGGGAGCTGTGCCTTGTGCGTGTGTGCTGTGTGCTTATCAGTTAAGGGGAAAAACAGCGGCAGGTGAGTTTCGTATGTTGCTGGGATGACGTGGCAGAGGGGACTATCCCACGAGGGAGCAGACGGAAGCAGGCTTGCTGGGGTGATGGCCTGGGCAGGGTCGAGGGGGTGAACTTAGTGCACAGTGGGAGGGCTGGCGCTGGAGAGGGGCCTGGACTGGCCCCTGGGGTGACAAGGGAGAAGGTGGAAGGCATGGGCACTGAGGTGGGTTGTGGAACCTCCTGGAAACTCCCATTTGAGACAGTGACATGATGCACCACTCAGGAGCTAGCGTGTGGAGGGGCTGGGGGCCGGGCCAGCAGGGAGGATGAGGGGCCCCCCTTGAGATGCATGGGCTGAATGTAAAGTGAGGCCGATGGCATGAGCTGTTCCACATCCTCGTTGAGCCGAAGGATGCGGGTGCTGCGAGTTGGGGGATGGGCGACTGTCCCTGGACTGAAGTTTTGCAGGCTGCCAAAGCGACAGGGCATGAGGGTCTTGC...AGACTACTCCCTGTATTTTCTTGACAGTGGCACACCCATGATCTAACCTGCCTTATCCTCAAGAAAATTCACTACTTCTGGGTCAGCACCTGAGTGTTCCCTGAGGGCCAAAGTTAGAAAATCCAACCATGCTCAGGCCAGCTGAGACGGGAAACTCCCATGCCCACCTCACAGCACTGTACCCATCACAGCCTCATCTCACTCAGCCTGGCTGCCCCCCACCACCTGTCCAAACAGCCTCCCTGCCACCACCCCAGACCCCGGGTCGTACTGCCACAGCCTCCCTCTCCCTTCCAACATGAACACCCTGAGAAGCAAGGGCTGGCACCTCCAGGGCAGCTACATTGCAAAAGCGCTTGACCACAAAAGCCCCGTTCACCGAAGTGTTACATGATTCGCTCAGCAGAAAGCTGGTCCTCACCCATGTCATGTATGCAGCCCCCAGGACATGCGTCTGGGGGATGTCGGGGCCTCCTTGTGCAGTAGGTAGTTCCAGGTTGTGCTGCGTGAGGAAGGGGTGCCCCTGGGGGCAGGGTGAGGAAAGGTGACACCTGAGAGGCAACTGCTCTGTGGCCTCGAGCTCGCTGCTGGATGTGGGGAGTCGGCCATGGGGACCAGTGCCTCTGGACTCCAGGCCAGGCCATGGGCACTGGGCACAGCTTCCTTTTCCATGACAAGGACCAGTTCCCTTGGAGCATGGCTAGCCCTGGCCTCTCTGACCGTGCCCACTGCTCATCCGATCCAGCAGGGTCACTCTGTGTGATCGCTGTGGTAGGAGCAAGGGGACTCAGCTGAAACCAAAAATTCCTCCTAGGCTGAGACGTTTGGTTACTGTGGTGGGCAGAGTAATGGCCCCCTAAGTTGTCTGCGTCCAATTCCCAGAGCCAGTGGATATGTCACCTGGCAGGGCAGAGGGGGACTAAGGTCGTAGATGGAGTTTAGGTTGCTACATGAACTGACCTGAAAATAAGGAGGTGGCCCTGGATTATCTGGGTGGCCCAGAGTCATCACAAAGGTCCTCACATGGGGAAGAGGAGGCAGGAGAGGAGGTGAGAGAAATGCTGTGCTGGCCTGACTGCGGATGGAGGAAGGGCCGAGCCGGGGGATGAGAGCGGCTGCTGGATGCTGCAAACGGCGAGGAACAGATCCCCCCCAGAGCCGCAGAGGGAGCGCAGCCCTGCAGACCCTTGCTTATAGGCGAGGGGGCCCTGTTTCAGACTTCTGACCTCCAGAGCTACGAGGCAATACTTTCATGTTGTTTTAAGATGCTAAATTTCCAGTGATTTGTTGCAGCAGCAGTAGGAAACTACATTAGAGCGTTCCTTTGCGTCTCAGCCGGGCATGGTGGCTCAAGCCTGTAATCCCAGCACTTTGGGAGCCTGAGGTGGGTGGATCACTTGAGGCAAGGAGTTTGAGAACAGCCTGACCAATATGGTGAAACCACGTCTCTATCAAAAGTACAAAAATTAGCTGGGCGTGGTGGCATGTATCGGTAGTCCTAGCTACTTGGGAGGCTGAGGCAGGAGAATTGCTTGAACCTGGGAGCCTGGGAGGCAGAGGTTGCAGTGAGCCGAGATTATGCCACTGCACTCCAACCTGGGTGACAGAGCAATGACTCTGTCTAAAAAAAAAAAAGAGTTCCTTTTGTCTTTTGAAGTTGCCACATTAAGGAATGAAACAACACTGAAGTTTGTTATGGTCCATAAAAAAAAAAAGTCTGGCAGGCATGGTGGCTTACGCCTACAATCCCAGCACTTTGGGAGGCCAAGGCAGGAGGGATGGCTTGAGTCCAGGAGTTTGAGACCAGCCTGGGCAATGGAGTGTGACCTTATTTCTACAAGAAAAAGTCTTTGGGAGGCCGAGGTGGGTGGATCACTTCAGGTCAGGAGTTCAAGACTAGCCTGCCCAACATGGCAAAACTCCATCCCTACTAAAAGTACAAAAAAATTAGCCAGGCATGGTGGCACACACCTGTAATCCCAGCTACTCCGGGAGGCTGAGGCATGAGAATCGTTTAAATCTCAGAGGCTGAGCCTGTAGCGAGCCAAGATGGCTCCACTGTACTCCAGCCTGGGCAACACAGCAAGACTCCCTCTCAAAAAAAAGTCTTAAGAAGTTTATAATTACAAACATCTGCTATTCTATTTATCCTTCCATCAGCAGGTAATAGCAACGCTATTGATAGCAGCCAGGCCAGGAGGCTGGAACCAGGCTGCGGAATTGCGGTGTACCTGGGCTGAGCCTGGGTGTGGGTCGTTCCCCTTGCTGGGCCTCAGTTTCCTCATCTGTGCAAGGCTGGGAGTGAGGCTGGTCCATCCCTGCTTCAGCCTCTCAGGAGACTTGTCGTCTGAAGTCACTGGGACTCCACGCTCTGTCTGAGTGTCTTGGGGG

However the CIGAR string is related to the full read, since I want to check the number of INDEL in just this location chr1:3985539-3985593, so a sub portion of that read, I am wondering if there is a tool to extract that information.

Thanks!

ADD REPLY
0
Entering edit mode

As Pierre suggested, an example output would be helpful (that was actually the input haha).

Anyways, I would suggest you do something like samtools view -b -h -L bedfile.bed bamfile.bam to subset your bam based on the bed file.

Now, how to call indels from a CIGAR string I am not entirely sure. I don't know what you are working on, or exactly why you need to call indels from CIGAR strings, but I definitely recommend you do a regular variant calling instead. Bear in mind, that a CIGAR string is informative for each individual read condition, but cannot (and is not intended to) account for the presence or absence of a variant in the whole sample (this is actually what variant callers are for).

If you still need to do it from the CIGAR string, you could build some custom script to count for Ns and Is in each string, and you would need to decide what to do with the alignment gaps (these might as well be part of INDELs).

Hope it helps! Daiana

ADD REPLY

Login before adding your answer.

Traffic: 1782 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6