Question

Filter sam/bam files

0

Entering edit mode

6.4 years ago

second_exon ▴ 210

Hi:

I mapped single-end sequenced GBS reads onto two similar genomes. I have two sam files one for each genome. I'm interested to filter sam/bam files based on these conditions:
1. reads unique to a genome/reads mapped only on one genome
2. reads are not unique/mapped on both genomes.

Are there any existing implementations for this? I want to know before I go ahead and implement this in pysam.

GBS sam bam filter samtools • 1.7k views

ADD COMMENT • link updated 6.4 years ago by swbarnes2 14k • written 6.4 years ago by second_exon ▴ 210

1

Entering edit mode

Compare BAM files ?

ADD REPLY • link 6.4 years ago by Eric Lim ★ 2.2k

0

Entering edit mode

Were the reads mapped to each genome separately? I don't think there is a tool that does what you want, but BBSplit from BBTools comes close - you would have to do the mapping again

If you end up coding your own solution, I think it would be wise to sort the sam / bam by name, so parsing both files simultaneously is simpler and uses less memory.

ADD REPLY • link 6.4 years ago by h.mon 35k

0

Entering edit mode

Thanks for the suggestion! Yes, they were mapped separately. I will look into BBTools.

ADD REPLY • link 6.4 years ago by second_exon ▴ 210

1

Entering edit mode

6.4 years ago

swbarnes2 14k

I have two sam files one for each genome.

This is going to be over-stringent. Some reads will map perfectly to one genome, and map with discrepancies to the other. You probably want to count those for the first genome, not throw them away becasue the aligner forced them to align to the second genome.

Better to combine the genomes, align to that, and then you can fish out reads that aligned uniquely.

ADD COMMENT • link 6.4 years ago by swbarnes2 14k

score 3 · Accepted Answer · 2018-06-25

3

Entering edit mode

6.4 years ago

Pierre Lindenbaum 164k

your reads are single end ?

samtools view -F 4 in1.bam | cut -f 1 | sort | uniq > r1.txt 
samtools view -F 4 in2.bam | cut -f 1 | sort | uniq > r2.txt

reads mapped only on one genome

comm -3 r1.txt r2.txt

reads are not unique/mapped on both genomes.

comm -12 r1.txt r2.txt