I'm working on Bash scripts for a ChIP-seq pipeline for my lab. Although the ENCODE guidelines recommend removing duplicates, some people here prefer to keep duplicates and instead filter reads by MAPQ value. I am writing a script to do that.
All the examples I have seen so far, online and in other people's scripts, filter SAM files (for example with awk or grep: since MAPQ is the 5th column of a SAM file, extracting it is not a challenge; it becomes a simple text-processing problem). However, in the pipeline I'm working on, the inputs are sorted BAM files (another script in the lab does the mapping, sorting and conversion to BAM).
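For reference, the SAM-based approach mentioned above could look roughly like this. It is only a sketch: the toy SAM records and the threshold of 20 are made up for illustration, and it assumes a tab-separated SAM file with header lines starting with `@`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical toy SAM file: one header line and three alignments.
# Fields are tab-separated; column 5 is MAPQ.
sam=$(mktemp)
printf '@SQ\tSN:chr1\tLN:1000\n' > "$sam"
printf 'r1\t0\tchr1\t10\t42\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"
printf 'r2\t0\tchr1\t20\t3\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"
printf 'r3\t16\tchr1\t30\t20\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"

min_mapq=20   # assumed threshold, adjust as needed

# Keep header lines (starting with "@") and alignments with MAPQ >= min_mapq.
filtered=$(awk -F'\t' -v q="$min_mapq" '/^@/ || $5 >= q' "$sam")
printf '%s\n' "$filtered"
rm -f "$sam"
```

Here the header, `r1` (MAPQ 42) and `r3` (MAPQ 20) are kept, and `r2` (MAPQ 3) is dropped.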
So I was wondering: is there a way to filter by MAPQ directly on the sorted BAMs I receive, without having to ask for the SAM files? Thank you!
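One way this can be done directly on BAM input is with `samtools view`, which has a `-q` option for a minimum MAPQ. The following is a minimal sketch, assuming samtools is installed and the BAMs are coordinate-sorted; the function name and the file names are placeholders, not part of any existing lab script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# filter_bam_by_mapq IN.bam OUT.bam MIN_MAPQ  (hypothetical helper name)
# Writes alignments with MAPQ >= MIN_MAPQ to OUT.bam and indexes the result.
filter_bam_by_mapq() {
    local in_bam=$1 out_bam=$2 min_mapq=$3
    # -b: write BAM; -q: skip alignments with MAPQ smaller than the given value
    samtools view -b -q "$min_mapq" -o "$out_bam" "$in_bam"
    # Indexing assumes coordinate-sorted input; filtering keeps the input order.
    samtools index "$out_bam"
}

# Example call, assuming sample.sorted.bam exists (placeholder file name):
# filter_bam_by_mapq sample.sorted.bam sample.q20.bam 20
```

If specific MAPQ values are needed rather than a minimum, one option is to stream the BAM as text with `samtools view -h`, filter on column 5 with awk, and convert back with `samtools view -b`, so no SAM file has to be stored on disk.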
Thank you, now it's all clear :) One more question: if, instead of filtering a BAM by mapping quality, I wanted to filter out reads with q<20 from a FASTQ file (so FASTQ filtering, not BAM), what would be the recommended way to do it? Thank you again.
FASTQ files contain a quality score per base. Maybe you mean reads with an average quality < 20?
Considering both possibilities, per-base filtering and filtering by average read score, what is your recommendation? The goal is a higher-quality FASTQ file, but I'm not sure which approach is best.
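To make the per-read option concrete, here is a rough sketch of filtering by mean quality with awk. It assumes Phred+33 encoding and strict 4-line FASTQ records (no wrapped sequence lines); the two toy reads and the cutoff of 20 are invented for illustration. It takes the plain arithmetic mean of the Phred scores, which is a simplification, and it is not meant as a replacement for a dedicated QC or trimming tool.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical toy FASTQ: read_hi has all bases at Phred 40 ("I"),
# read_lo has all bases at Phred 2 ("#"), assuming Phred+33 encoding.
fq=$(mktemp)
printf '@read_hi\nACGT\n+\nIIII\n@read_lo\nACGT\n+\n####\n' > "$fq"

min_mean_q=20   # assumed cutoff on the mean per-base quality of a read

kept=$(awk -v min="$min_mean_q" '
    # Lookup table: printable ASCII character -> Phred score (Phred+33).
    BEGIN { for (i = 33; i <= 126; i++) phred[sprintf("%c", i)] = i - 33 }
    # Buffer the four lines of the current record.
    { rec[NR % 4] = $0 }
    # Every fourth line is the quality string: compute its mean score.
    NR % 4 == 0 {
        n = length($0); sum = 0
        for (i = 1; i <= n; i++) sum += phred[substr($0, i, 1)]
        if (n > 0 && sum / n >= min)
            printf "%s\n%s\n%s\n%s\n", rec[1], rec[2], rec[3], rec[0]
    }
' "$fq")
printf '%s\n' "$kept"
rm -f "$fq"
```

With these toy reads only `read_hi` is printed. A per-base variant would instead test each score inside the loop (for example, dropping or trimming a read when any base falls below the cutoff), which is stricter than the mean-based filter.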