Hi, I hope everyone is doing well. I'm new to bioinformatics so I'm sorry if my question is silly. I'm trying to quantify ADAR (A to I) editing index in my RNA-seq data following a specific pipeline (link attached below). Basically, ADAR editing is a defence mechanism against dsRNA. it alters the Adenosines into inosines. inosines are usually interpreted by the translation machinery as guanosines. so in order to measure the editing index, I need to count the number of A-G mismatch and compare it to the total A-A matches. the pipeline recommended marking all matches and mismatches in the BAM files using samtools 1.8 mpileup with the following parameters: --B --ff SECONDARY -d 100000 -l<regions BED>. that gave me a pileup file (very short example of how it's look like is below)
**chr1 15265 C 83 <<><<<<<<<><<<<>>>>>>><>>>>><>>>>>>>><<<>><><><>
>>>><<><<><<<<<><><><<<>>>>>>>><<<< FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
chr1 15266 C 83 <<><<<<<<<><<<<>>>>>>><>>>>><>>>>>>>><<<>><><><>
>>>><<><<><<<<<><><><<<>>>>>>>><<<< FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
chr1 15267 A 83 <<><<<<<<<><<<<>>>>>>><>>>>><>>>>>>>><<<>><><><>
>>>><<><<><<<<<><><><<<>>>>>>>><<<< FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
chr1 15268 C 83 <<><<<<<<<><<<<>>>>>>><>>>>><>>>>>>>><<<>><><><>
>>>><<><<><<<<<><><><<<>>>>>>>><<<< FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
chr1 15269 T 83 <<><<<<<<<><<<<>>>>>>><>>>>><>>>>>>>><<<>><><><>
>>>><<><<><<<<<><><><<<>>>>>>>><<<< FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF**
Then I should count A to G mismatches and A-A matches. May I ask for any advice on how can I quantify these matches and mismatches from mpileup file? someone has recommended VARScan for me but the pipeline's publisher disagreed, saying these mismatch is constitute only 1% of the genome and varient calling tools are not good for this purpose. (sadly they didn't recommend any alternative way)
for interested people, the link for the pipeline is https://www.nature.com/articles/s41592-019-0610-9#Sec8 it's in the method section - index calculation.
Many thanks in advance. Surar
Thank you very much dear Mathias for your answer and really sorry I wasn't clear in my question. ADAR editing is a defence mechanism against dsRNA. it alters the Adenosines into inosines. inosines are usually interpreted by the translation machinery as guanosines. so in order to measure the editing index, I need to count the number of A-G mismatch and compare it to the total A-A matches. the pipeline suggested converting the BAM files into mpileup and then counting these matches/mismatches. but I didn't know how to do that so I asked. will read about perbase tool you have suggested, and will edit the question.
Many thanks, Surar.
You are welcome! I have to admit that I had overlooked that you linked a paper that describes what an ADAR is, but the answer was hopefully helpful nonetheless.
I recommend perbase instead of pileup, because the latter's output is a bit bulky. It is read like this:
At chromosome 1, position 15268 the reference base is a C and there are 83 reads that align there. However, all 83 reads have a reference skip (< >) here. The most likely reason is that this particular base is located inside an intron, because you have a transcriptome sequencing aligned to a reference genome. The F respectively : denote the basecall quality of this base in the respective reads.
So in order to generate your mutational profile from this, you would need to filter the exon positions (which then show a . for a match and a letter if mutated) count the number of occurrences for a particular letter and also potentially consider the basecall quality.
In contrast, the perbase output:
would directly give you already the counts. For a particular position, you can see the REF_BASE, the DEPTH and the actually called bases. In this example 26 out of 26 reads would have a G instead of a A and therefore possibly indicate the modification or a mutation.
This table can then be further processed with other tools and indexed with tabix for a fast access.
Thank you very much for your detailed answer, it's much appreciated. yes, I had a look at the perbase tools and it's great, it's exactly what I needed. And I totally agree that the pileup files are so bulky, I ran out of memory after having only half of the files converted into pileup.
Best regards, Surar.