Hi all,
I am trying to filter mitochondrial reads from my ATAC-seq data after trimming with Trimmomatic and then aligning with Bowtie2. After searching through many pipelines, I have found 2 ways that people often do this (both using inputs that are sorted and indexed BAM files):
1) with samtools and grep; (see example below)
samtools view -@ 5 -h <input.bam> | grep -v MT | samtools sort -@ 5 -O bam -o <output.bam>
2) using a Python script called removeChrom; (see example below); code can be found here: https://github.com/harvardinformatics/ATAC-seq/tree/master/atacseq
samtools view -h <input.bam> | python ~/removeChrom.py - - MT | samtools view -b - > <output.bam>
I have tried both these options, and I ran samtools idxstats after each (see example below) in order to check the number of reads mapped to the mitochondria to make sure I had actually filtered them out.
samtools idxstats <output.bam> > <output.bam.idxstats>
After using option 1 (samtools and grep), I found that all the mitochondrial reads were gone when I checked with samtools idxstats, but that there were also reads missing from the other chromosomes as well. Upon some further investigation, I think this is because I have reads where the 7th field in the SAM/BAM file (RNEXT) is "MT", while the 3rd field (RNAME) is a different chromosome, and these are being filtered out.
With option 2 (the python script), this does not happen- only reads from MT are "gone" when I re-check samtools idxstats (but there is also an increase in reads assigned to an asterisk). After having some help looking at the Python script (I am very new to all coding!), my understanding is that the code puts an asterisk in the RNAME field (3rd field) and a "0" in the MAPQ field (5th field) if RNAME is MT (regardless of what RNEXT is), and that these can then be actually filtered out later using samtools view based on the MAPQ score. However, if MT is only in the RNEXT field (and not also in RNAME) then these will not get a "0" in MAPQ field. [However, they do get an * placed in the RNEXT field and "0"s placed in PNEXT and TLEN field, but I am not sure what the significance of this is and whether that means they would also end up filtered out.]
After all this, my question is- based on all this, what is the better method if I want to remove mitochondrial reads/are these methods really ending up doing different things? It is my understanding that the RNAME and RNEXT fields should generally be the same since this would indicate that the paired reads are aligned to the same chromosome- is this a correct assumption? If so, then what does it mean when the RNAME field is a nucelosomal chromosome and RNEXT is MT? Would this mean a poor alignment, and it is fair to remove these reads as I would in option 1 with samtools & grep? Are the alterations (adding 0s) that the removeChrom.py script does to PNEXT and TLEN significant/would these affect how these reads are treated downstream in analysis?
I would really appreciate any help at all!!! Thank you!!
I think it would be fair (and conservative) to remove read pairs where one maps to the mitochondrial genome. It should be a relatively very small number of reads.