How to efficiently remove a list of reads from BAM file?
5
5
Entering edit mode
8.9 years ago
Tao ▴ 540

Hi Guys,

I have a BAM file, and a big read list. What I want to do is to remove the reads in the read list from the BAM file. I can transform Bam to Sam file and then use a Python script to remove unwanted reads. And then transform Sam to Bam again. But I am wondering if there is a more efficient way, which I mean faster, easier, and memory-efficient, to achieve this goal?

Any advice is appreciated!

Tao

RNA-Seq BAM samtools Sam • 17k views
ADD COMMENT
10
Entering edit mode
8.9 years ago

picard FilterSamReads

READ_LIST_FILE (File) Read List File containing reads that will be included or excluded from the OUTPUT SAM or BAM file. Default value: null.

ADD COMMENT
0
Entering edit mode

Exciting! That's what I'm looking for!

Many Thanks! Pierre.

ADD REPLY
5
Entering edit mode
8.9 years ago
Vivek ★ 2.7k
samtools view -h sample1.bam | grep -vf read_ids_to_remove.txt | samtools view -bS -o sample1_filter.bam -​

http://genometoolbox.blogspot.com/2013/06/remove-list-of-reads-from-bam-file.html

ADD COMMENT
4
Entering edit mode

Hi Vivek,

Thanks for your nice reply. I also found this method, but unfortunately, it's kind of some inefficient. I think may be 'grep' is always doing global search. Here, given a read ID, grep has to confirm all the bam context doesn't contain this read ID(pattern). So it's much slower when a big list of unwanted reads occur. While in my script, I only need to confirm the first column(reads ID) of the Bam file doesn't contain those unwanted reads. But, anyway, your method is a good choice to handle small read list. Thanks!

Tao

ADD REPLY
2
Entering edit mode

this can be made faster by adding the -@ parameter in samtools view. Given a multicore machine, utilize threads to make samtools (implemented in newer versions) faster.

ADD REPLY
1
Entering edit mode

That's a good idea!

ADD REPLY
3
Entering edit mode
8.9 years ago

The fastest and easiest solution is probably to use BBMap + samtools:

filterbyname.sh in=mapped.bam out=filtered.bam names=names.txt include=false

Samtools needs to be in the path. The memory usage depends on the number of names; the speed doesn't (well, not much).

ADD COMMENT
1
Entering edit mode
2.3 years ago
slvz ▴ 10

Indeed, this works:

samtools view -h sample1.bam | grep -vf read_ids_to_remove.txt | samtools view -bS -o sample1_filter.bam -​

If possible, adding --threads to samtools view is advisable. \ Using awk instead of grep should be faster.

samtools view ... | awk 'FNR==NR{a[$1];next} {if(!($1 in a)) {print $0 }}' read_ids_to_remove.txt - | samtools view ...

ADD COMMENT
0
Entering edit mode

not

grep -vf read_ids_to_remove.txt

but:

LC_ALL=C grep -vFwf read_ids_to_remove.txt
ADD REPLY
1
Entering edit mode
2.3 years ago

Since version 1.12, samtools has an option -N / --qname-file for filtering on the read name field:

samtools view -N read_ids_to_remove.txt -U filtered.bam -o /dev/null input.bam

(Add threading -@ / --threads options to taste, as described in other answers.)

ADD COMMENT

Login before adding your answer.

Traffic: 1547 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6