Hello! I obtained a list of unmapped reads IDs from my BAM file and I want to remap only the unmapped reads with other parameters. How can I extract the subset of unmapped reads from my original fastq file? Thank you in advance, Luke
Hello! I obtained a list of unmapped reads IDs from my BAM file and I want to remap only the unmapped reads with other parameters. How can I extract the subset of unmapped reads from my original fastq file? Thank you in advance, Luke
I also wrote a program for this purpose, distributed with BBMap. Usage:
filterbyname.sh in=reads.fq out=filtered.fq names=names.txt include=t
The include
flag will toggle between including or excluding the names in names.txt
(which can, alternately, be another fastq or fasta file). This also supports paired input/output, and names being substrings or superstrings of read IDs.
I prefer writing my own little snippets. However, it's possible using biopieces. This reply is from seqanswers (by maasha), pasted here for convenience.
First you need a file with the FASTQ sequence names you are interested in - or IDs if you like - one per line. And then:
read_fastq -i in.fastq | grab -E ids.txt | write_fastq -xo out.fastq
Check out grab for details.
It is simpler to go back to the original .bam, and just pull out the .bam entries that are unmapped. samtools view -f4
should do it. Then, you can use something like Picard's SamToFastq to go back to fastq format, if you need to. (Some software, like velvet, is fine with using .bam as input)
I've found a quick solution with cdbfasta and cdbyank tools.
First you have to index your fastq with cdbfasta, then you can search for the IDs in fastq with cdbyank. For more info http://sourceforge.net/projects/cdbfasta/
Thank you,
Luke
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
I have a post here which addresses part of this question