Hi all! I'm trying to find common reads between two bam files. I first sorted the bam files by read name and then used comm (https://linux.die.net/man/1/comm). Here is my code:
samtools sort -n unmapped_reads1.bam -o unmapped_reads1_sorted.bam
samtools sort -n unmapped_reads2.bam -o unmapped_reads2_sorted.bam
comm -12 unmapped_reads1_sorted.bam unmapped_reads2_sorted.bam > common_seqs.bam
Comm spits out these two messages:
comm: file 2 is not in sorted order
comm: file 1 is not in sorted order
and the resulting output file (common_seqs.bam) is empty. Anyone experienced this? or recommend any other ways to do this task?
Thank you!
When I use the sorting commands, my resulting sorted bam files are a series of lines like this (not the typical bam format with sequences of reads)
yes, that's what you want if you want to use 'comm'...
Oh i see! I should be more clear. I want my output to still be .bam just with all of the common sequences
once you have the common read names you can go back to the bam and get the reads: Extract reads from bam/sam files using read id
I will try that thank you!!!
define "common sequences"
my apologies for my bad explaining. I started off with paired end sequencing data. I mapped these sequences to two different reference genomes separately, and each time I extracted only the unmapped reads. (those two reference genomes represent possible contaminant species).
So now i have two unmapped reads files (unmapped_reads1.bam, unmapped_reads2.bam). I would need to keep sequences that are common to BOTH files as my final output, in bam format.