Hi All,
I have mapped genomic data using bwa. What is the most efficient way to extract all the mapped reads (in fastq format)?
Thanks
Hi All,
I have mapped genomic data using bwa. What is the most efficient way to extract all the mapped reads (in fastq format)?
Thanks
I see this as a two step process:
To do this as efficiently as possible, using BBTools:
reformat.sh in=reads.sam out=mapped.fq mappedonly
Also, BBMap has a lot of options designed for filtering, so it can output in fastq format and separate mapped from unmapped reads, preventing the creation of intermediate sam files. This approach also keeps pairs together, which is not very easy using samtools for filtering.
bbmap.sh ref=reference.fa in=reads.fq outm=mapped.fq outu=unmapped.fq
You can first filter your raw bam file for mapped reads only: samtools view -F 4 in.bam -o out.bam
.
The -F
option discards everything with the given flag, and flag 4 refers to all unmapped reads (see: http://broadinstitute.github.io/picard/explain-flags.html) and then use bedtools bamtofastq to get the fastq from the bam.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
BBTools is the new GATK :)
Hi Brian, If I use
reformat.sh in=reads.bam out=mapped.fq mappedonly
, it works fine but treat mydata as 'Input is being processed as unpaired
'.When I specified
verifypaired=T
(reformat.sh in=reads.bam out=mapped.fq mappedonly verifypaired=T
), I get the following errors:Also note that I am using directly the bam file. Is this ok?
As long as samtools is installed, you can directly use bam input.
"verifypaired" only works with fasta or fastq input, though, since the sam format offers no guarantees about read order (paired reads are not necessarily next to each other). The reads will be output in the same order as they exist in the sam file. The message about input being processed as unpaired indicates that reads are being filtered individually, rather than keeping pairs together; so if in a pair one read is mapped and the other is not, the mapped read will be kept and the unmapped read will be discarded.
Hi Brian, Thanks a lot for this. My goal is to keep the mapped pairs different files and also to keep the reads where one pair is mapped and the other one is not in one of the files. I am doing this because I have to remap these mapped reads (mapped on few genes ~ 2000 genes) onto the whole genome.
Is it possible to do? or if you have any other suggestions then please suggest.
Well... BBMap, by default, will send unmapped pairs to the "outu" stream and mapped pairs to the "outm" stream, keeping pairs together. It considers a pair mapped if either read is mapped (this behavior can be changed with the "pairedonly" flag). The output outm/outu streams can create fastq files instead of sam, which is often useful. Reformat cannot currently apply this filter to an existing sam/bam file.