Question

How To Extract A Subset Of Reads In Fastq Using An Id List?

6

13.2 years ago

Luke ▴ 240

Hello! I obtained a list of unmapped read IDs from my BAM file, and I want to remap only those unmapped reads with different parameters. How can I extract this subset of reads from my original FASTQ file? Thank you in advance, Luke

fastq bam • 20k views

updated 3.4 years ago by Ram 45k • written 13.2 years ago by Luke ▴ 240

0

I have a post here which addresses part of this question

7.7 years ago by steve ★ 3.5k

3

13.2 years ago

Arun 2.4k

I prefer writing my own little snippets. However, it's also possible using Biopieces. This reply is from SEQanswers (by maasha), pasted here for convenience.

First you need a file with the FASTQ sequence names you are interested in - or IDs if you like - one per line. And then:
read_fastq -i in.fastq | grab -E ids.txt | write_fastq -xo out.fastq
Check out grab for details.
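
For anyone who would rather write their own little snippet, the same idea (an ID list, one per line, used to pick records out of a FASTQ file) can be sketched in plain Python. This is only a sketch, not what Biopieces does internally. It assumes strict four-line FASTQ records and that the ID is the header text up to the first whitespace; the function names `load_ids` and `filter_fastq` are made up for this example.

```python
import io

def load_ids(handle):
    """Read one read ID per line into a set; a leading '@' is dropped."""
    return {line.strip().lstrip("@") for line in handle if line.strip()}

def filter_fastq(in_handle, out_handle, ids):
    """Copy the four-line records whose ID is in `ids`; return the count kept."""
    kept = 0
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of input
        seq = in_handle.readline()
        plus = in_handle.readline()
        qual = in_handle.readline()
        # Assumption: the ID is the header up to the first whitespace, minus '@'.
        fields = header[1:].split()
        read_id = fields[0] if fields else ""
        if read_id in ids:
            out_handle.write(header + seq + plus + qual)
            kept += 1
    return kept

# Tiny in-memory demo; with real data, pass open("ids.txt"), open("in.fastq")
# and open("out.fastq", "w") instead of the StringIO objects.
demo_ids = load_ids(io.StringIO("read2\n"))
demo_out = io.StringIO()
filter_fastq(io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n"),
             demo_out, demo_ids)
print(demo_out.getvalue(), end="")
```

Because the IDs sit in a set, each lookup is constant time and the FASTQ is streamed once. Mate suffixes such as /1 and /2 are not handled here and would have to match the ID list exactly.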

updated 3.4 years ago by Ram 45k • written 13.2 years ago by Arun 2.4k

2

13.2 years ago

swbarnes2 15k

It is simpler to go back to the original .bam and just pull out the entries that are unmapped. samtools view -f 4 should do it: -f 4 keeps only records with the "unmapped" flag bit (0x4) set. Then you can use something like Picard's SamToFastq to go back to fastq format, if you need to. (Some software, like Velvet, is fine with using .bam as input.)
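
To make the flag test concrete, here is a rough Python sketch of what "keep flag 4, then convert to fastq" amounts to when applied to SAM text lines (for example, the output of samtools view). It is not a replacement for samtools or Picard: it ignores pairing entirely, and the function name `unmapped_to_fastq` is invented for this example. It does undo the reverse-complementing that SAM applies when bit 0x10 is set.

```python
UNMAPPED = 0x4   # SAM flag bit: segment unmapped
REVERSE = 0x10   # SAM flag bit: SEQ is stored reverse-complemented

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def unmapped_to_fastq(sam_lines):
    """Yield FASTQ records (as strings) for SAM lines with the unmapped bit set."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment line
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]
        if not flag & UNMAPPED:
            continue  # mapped read, skip it
        if flag & REVERSE:
            # Restore the original read orientation.
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        yield f"@{name}\n{seq}\n+\n{qual}\n"
```

For paired-end data both mates share a read name, so a real conversion also has to look at bits 0x40 and 0x80 to split them into two files; that is one reason to prefer the existing tools.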

updated 3.4 years ago by Ram 45k • written 13.2 years ago by swbarnes2 15k

0

13.2 years ago

Luke ▴ 240

I've found a quick solution with cdbfasta and cdbyank tools.

First you have to index your fastq with cdbfasta; then you can pull out the reads by ID from the indexed fastq with cdbyank. For more info: http://sourceforge.net/projects/cdbfasta/
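
The index-then-fetch idea can be illustrated in a few lines of Python. This is only a sketch of the general technique, not how cdbfasta or cdbyank are implemented: it keeps the index in an in-memory dict rather than on disk, assumes four-line FASTQ records, and the names `build_index` and `fetch` are made up for the example.

```python
import io

def build_index(handle):
    """Map each read ID to the byte offset of its record (binary-mode handle)."""
    index = {}
    while True:
        offset = handle.tell()
        header = handle.readline()
        if not header:
            break  # end of file
        for _ in range(3):
            handle.readline()  # skip sequence, '+' and quality lines
        fields = header[1:].split()
        if fields:
            index[fields[0].decode()] = offset
    return index

def fetch(handle, index, read_id):
    """Seek straight to a record and return its four lines as bytes."""
    handle.seek(index[read_id])
    return b"".join(handle.readline() for _ in range(4))

# In-memory demo; with a real file, use open("reads.fastq", "rb") instead.
demo = io.BytesIO(b"@r1\nAAAA\n+\nIIII\n@r2 extra\nCCCC\n+\nJJJJ\n")
demo_index = build_index(demo)
print(fetch(demo, demo_index, "r2").decode(), end="")
```

The one-off indexing pass pays for itself when the same FASTQ is queried repeatedly, since each lookup is then a single seek instead of a scan of the whole file.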

Thank you,
Luke

updated 3.4 years ago by Ram 45k • written 13.2 years ago by Luke ▴ 240

Ram · Accepted Answer · 2015-05-27

11

10.2 years ago

Brian Bushnell 20k

I also wrote a program for this purpose, distributed with BBMap. Usage:

filterbyname.sh in=reads.fq out=filtered.fq names=names.txt include=t

The include flag will toggle between including or excluding the names in names.txt (which can, alternately, be another fastq or fasta file). This also supports paired input/output, and names being substrings or superstrings of read IDs.

updated 3.4 years ago by Ram 45k • written 10.2 years ago by Brian Bushnell 20k

1

Thank you for this excellent tool, which is ridiculously fast compared to the scripts I've been using to achieve this goal.

10.0 years ago by CraigM ▴ 90