Question

Counting mismatches from mpileup files

0

Entering edit mode

2.6 years ago

S.O.T.AL-HASHIMI2 ▴ 60

Hi, I hope everyone is doing well. I'm new to bioinformatics so I'm sorry if my question is silly. I'm trying to quantify ADAR (A to I) editing index in my RNA-seq data following a specific pipeline (link attached below). Basically, ADAR editing is a defence mechanism against dsRNA. it alters the Adenosines into inosines. inosines are usually interpreted by the translation machinery as guanosines. so in order to measure the editing index, I need to count the number of A-G mismatch and compare it to the total A-A matches. the pipeline recommended marking all matches and mismatches in the BAM files using samtools 1.8 mpileup with the following parameters: --B --ff SECONDARY -d 100000 -l<regions BED>. that gave me a pileup file (very short example of how it's look like is below)

**chr1  15265   C   83  <<><<<<<<<><<<<>>>>>>><>>>>><>>>>>>>><<<>><><><>
>>>><<><<><<<<<><><><<<>>>>>>>><<<< FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
chr1    15266   C   83  <<><<<<<<<><<<<>>>>>>><>>>>><>>>>>>>><<<>><><><>
>>>><<><<><<<<<><><><<<>>>>>>>><<<< FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
chr1    15267   A   83  <<><<<<<<<><<<<>>>>>>><>>>>><>>>>>>>><<<>><><><>
>>>><<><<><<<<<><><><<<>>>>>>>><<<< FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
chr1    15268   C   83  <<><<<<<<<><<<<>>>>>>><>>>>><>>>>>>>><<<>><><><>
>>>><<><<><<<<<><><><<<>>>>>>>><<<< FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
chr1    15269   T   83  <<><<<<<<<><<<<>>>>>>><>>>>><>>>>>>>><<<>><><><>
>>>><<><<><<<<<><><><<<>>>>>>>><<<< FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF**

Then I should count A to G mismatches and A-A matches. May I ask for any advice on how can I quantify these matches and mismatches from mpileup file? someone has recommended VARScan for me but the pipeline's publisher disagreed, saying these mismatch is constitute only 1% of the genome and varient calling tools are not good for this purpose. (sadly they didn't recommend any alternative way)

for interested people, the link for the pipeline is https://www.nature.com/articles/s41592-019-0610-9#Sec8 it's in the method section - index calculation.

Many thanks in advance. Surar

editing RNA-seq Mpileup RNA A-I • 1.5k views

ADD COMMENT • link 2.6 years ago by S.O.T.AL-HASHIMI2 ▴ 60

score 2 · Answer 1 · 2022-04-07

2

Entering edit mode

2.6 years ago

Matthias Zepper 5.0k

I don't fully understand what you are trying to find, but I think perbase is the tool you are looking for. It is insanely fast and outputs you similar details with base coordinate precision.

Good luck with your analysis and welcome in the world of bioinformatics!

ADD COMMENT • link 2.6 years ago by Matthias Zepper 5.0k

1

Entering edit mode

Thank you very much dear Mathias for your answer and really sorry I wasn't clear in my question. ADAR editing is a defence mechanism against dsRNA. it alters the Adenosines into inosines. inosines are usually interpreted by the translation machinery as guanosines. so in order to measure the editing index, I need to count the number of A-G mismatch and compare it to the total A-A matches. the pipeline suggested converting the BAM files into mpileup and then counting these matches/mismatches. but I didn't know how to do that so I asked. will read about perbase tool you have suggested, and will edit the question.

Many thanks, Surar.

ADD REPLY • link 2.6 years ago by S.O.T.AL-HASHIMI2 ▴ 60

1

Entering edit mode

You are welcome! I have to admit that I had overlooked that you linked a paper that describes what an ADAR is, but the answer was hopefully helpful nonetheless.

I recommend perbase instead of pileup, because the latter's output is a bit bulky. It is read like this:

chr1    15268   C   83  <<><<<<<<<><<<<>>>>>>><>>>>><>>>>>>>><<<>><><><>
>>>><<><<><<<<<><><><<<>>>>>>>><<<< FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

At chromosome 1, position 15268 the reference base is a C and there are 83 reads that align there. However, all 83 reads have a reference skip (< >) here. The most likely reason is that this particular base is located inside an intron, because you have a transcriptome sequencing aligned to a reference genome. The F respectively : denote the basecall quality of this base in the respective reads.

So in order to generate your mutational profile from this, you would need to filter the exon positions (which then show a . for a match and a letter if mutated) count the number of occurrences for a particular letter and also potentially consider the basecall quality.

In contrast, the perbase output:

REF     POS     REF_BASE        DEPTH   A       C       G       T       N       INS     DEL     REF_SKIP        FAIL    NEAR_MAX_DEPTH
chr1    734964  A       26      0       0       26       0      0       0       0       0       0   false

would directly give you already the counts. For a particular position, you can see the REF_BASE, the DEPTH and the actually called bases. In this example 26 out of 26 reads would have a G instead of a A and therefore possibly indicate the modification or a mutation.

This table can then be further processed with other tools and indexed with tabix for a fast access.

ADD REPLY • link 2.6 years ago by Matthias Zepper 5.0k

1

Entering edit mode

Thank you very much for your detailed answer, it's much appreciated. yes, I had a look at the perbase tools and it's great, it's exactly what I needed. And I totally agree that the pileup files are so bulky, I ran out of memory after having only half of the files converted into pileup.

Best regards, Surar.

ADD REPLY • link 2.6 years ago by S.O.T.AL-HASHIMI2 ▴ 60