Hi,
I have a bam file from RNA-Seq data and I would like to have some overview statistics on the occurring variations (number of mismatches, which transitions exactly, number of deletions/insertions..)
However, rather than computing this on the pile-up-level (coverage-level), I would like to have this over all reads for each positions: e.g. on positions 1 of all alignments of all reads 10% have a match, 20% a mismatch, 30% an insertion and so on.. In the ideal case, one could even explore this information in further details: such as how many reads have a transition from A-G at position 1.
I know that pysamstats does something along these lines, however on the coverage-level. One option would be to exploit pysam and just compute these statistics.
I just wanted to be sure that I do not overlook a tool which is already offering something along these lines...
So, I would appreciate any hints!
Thanks a lot,
Stefanie
Personally I don't think there is a tool that can help you directly. There are some functions within samtools/bcftools that check for variant distance bias but I am not sure how easily you can manipulate it for your purpose. A reverse engineering approach would be to first identify variants including SNPs and indels. Knowing variant sites will reduce your space search drastically. Once you have the vcf file, you can pick up each variant from the vcf file and extract reads overlapping that position from the bam file. For each read you can extract the start position from the bam file and then calculate the position of your variant on that read. Of course, different reads will have different positions of your variant unless they have same starting positions ( normally those reads are filtered). In the end, you can sum up all this information to answer your questions.