Extracting Subsets Of Reads From A Bam File
5
8
Entering edit mode
12.4 years ago
ijon.tichy ▴ 80

Hi, I have a large paired-end dataset in the BAM format and a list of read IDs which belong to a single mate of a pair. What I want to do is to extract their second mates from the whole dataset. Could you please advise me some efficient ways to do this like using, let's say, Bio-SamTools or something like that? Something memory- and time-efficient. Thanks!

bam samtools sequencing • 20k views
ADD COMMENT
0
Entering edit mode

how many reads-IDs do you have, does it fit in memory ?

ADD REPLY
0
Entering edit mode

Just now it's only about 200. But for other datasets it could be much more, 10^3-10^5.

ADD REPLY
8
Entering edit mode
12.4 years ago

2018: use picard https://broadinstitute.github.io/picard/command-line-overview.html#FilterSamReads


The following C++ code should filters only print the reads contained in your file (I've added to my variation toolkit http://code.google.com/p/variationtoolkit )

http://code.google.com/p/variationtoolkit/source/browse/trunk/src/bamgrepreads.cpp

Compilation:

g++ -O3 -Wall -I  path/to/samtool -L path/to/samtool bamgrepreads.cpp -lbam -lz

Usage:

./a.out -R file_containing_the_reads_name.txt (stdin|bam1 bam2...)

Options:

 -f INT   required flag
 -F INT   filtering flag
 -R FILE reads file
 -e  only one match per name (goes faster)
ADD COMMENT
0
Entering edit mode

Thank you! It works great!

ADD REPLY
0
Entering edit mode

Does this take advantage of the queryname sorting order of a .bam file, or does it simply search through all the reads? Thanks!

ADD REPLY
1
Entering edit mode

no, it searches 'all' the reads in the bam.

ADD REPLY
0
Entering edit mode

Hi, I am trying to use this script, but somehow I can not compile it. seems the samtools I tried is not suitable for this script. Any suggestion that may be helpful?

Here are two error messages of my efforts.


shaoy@Darwin ~/rd/script $ g++ -O3 -Wall -I ~/rd/software/samtools-0.1.19 -L ~/rd/software/samtools-0.1.19 bamgrepreads.cpp -lbam -lz In file included from /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/../../../../include/c++/4.7.2/ext/hash_set:61:0, from bamgrepreads.cpp:15: /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/../../../../include/c++/4.7.2/backward/backward_warning.h:33:2: warning: #warning This file includes at least one deprecated or antiquated header which may be removed without further notice at a future date. Please use a non-deprecated interface with equivalent functionality instead. For a listing of replacement headers and interfaces, consult the file backward_warning.h. To disable this warning use -Wno-deprecated. [-Wcpp] /home/shaoy/rd/software/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt': /home/shaoy/rd/software/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create' /home/shaoy/rd/software/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy': /home/shaoy/rd/software/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join' collect2: error: ld returned 1 exit status

shaoy@Darwin ~/rd/script $ g++ -O3 -Wall -I ~/rd/software/samtools-1.0 -L ~/rd/software/samtools-1.0 bamgrepreads.cpp -lbam -lz In file included from /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/../../../../include/c++/4.7.2/ext/hash_set:61:0, from bamgrepreads.cpp:15: /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/../../../../include/c++/4.7.2/backward/backward_warning.h:33:2: warning: #warning This file includes at least one deprecated or antiquated header which may be removed without further notice at a future date. Please use a non-deprecated interface with equivalent functionality instead. For a listing of replacement headers and interfaces, consult the file backward_warning.h. To disable this warning use -Wno-deprecated. [-Wcpp] In file included from bamgrepreads.cpp:41:0: /home/shaoy/rd/software/samtools-1.0/bam.h:48:25: fatal error: htslib/bgzf.h: No such file or directory compilation terminated.
ADD REPLY
1
Entering edit mode

Whoa that's an old post. This C++ package is now obsolete, you'd better have a look at: https://github.com/lindenb/jvarkit/wiki/SamGrep

ADD REPLY
0
Entering edit mode

and later: picard contains a tool to filter the reads using a list of read names.

ADD REPLY
5
Entering edit mode
10.0 years ago
Fabio Marroni ★ 3.0k

I just ran in the same need, and I solved it using picard-tools:

java -jar FilterSamReads.jar INPUT=input.sam FILTER=includeReadList READ_LIST_FILE=reads_list.txt OUTPUT=selected_polym.sam

You can also give a look to this post.

ADD COMMENT
1
Entering edit mode
12.4 years ago
Geparada ★ 1.5k

If you're a python programmer, take a look to Pysam. There is a function called "mate" that can do what you're looking for.

ADD COMMENT
0
Entering edit mode
12.4 years ago
samtools view | grep -f

would work, though I can't vouch for how fast it will be.

It might be faster to grep out those reads from the fastq, and just realign them.

ADD COMMENT
0
Entering edit mode

That's exactly what I want to avoid. My datasets are very large and 'grep' takes a lot of time.

ADD REPLY
0
Entering edit mode
3.6 years ago

Same as previous answers but another option to run Picard script with GATK. It works for me.

gatk FilterSamReads \
-I input.bam\
-O output.bam \
--READ_LIST_FILE read_names.txt \
--FILTER includeReadList
ADD COMMENT

Login before adding your answer.

Traffic: 1783 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6