Hi all,
I know this is a basic question, or in my mind sounds "basic", but I can not figure out the solution. I have a bam file that has been already filtered according to the quality:
samtools view -@ 6 -b -q 30 raw.bam > filtered.bam
I would like to remove from filtered.bam
, all the reads whose mate is not present (because it has been removed during quality filter).
Is there any tool to perfom this? Or I should create a script, search for each read ID and look if the mate is present? Also, it could be nice to do this while doing the quality filter, but I don't think that its possible.
Any suggestion?
EDIT: I think that I could use something like this:
awk '{if (x[$1]) { x_count[$1]++; print $0; if (x_count[$1] == 1) { print x[$1] } } x[$1] = $0}' filtered.bam
Your awk script will not work because the information (forward/reverse' is in the sam-flag (not in the name) and because you cannot put a large bam in memory.
Thanks for pointing out the memory problem, hopefully my specific bam is quite small. I guess I could also check the sequence field and see if the sequence of "two" mates is equal or not, and if it is, discard them.
You may want to check whether the aligner you're using will even give mates different MAPQs if they're not mapped as singletons (many won't). Your aligner might also append the MQ auxiliary tag. These would make the task much faster and let you avoid any sorting.