I had exactly the same problem when I moved from bowtie to bowtie2. For me the reason was because our facility started to routinely output 101bp paired-end sequence. I will detail below how I replicated the -m1
flag setting in bowtie2, but the overall conclusion is that I no longer use this method and instead allow mapping of reads to multiple locations as long as the read pairs are concordant and high quality.
1) Replicating -m1 in bowtie2 [ not recommended ]
This is mostly based upon Devon Ryan's answer to C: How To Extract Reads-Pairs Aligned Concordantly Exactly 1 Time?
also here is another great discussion: How to extract unique mapped results from Bowtie2 bam results?
You can run bowtie2 with default settings, but employ '-k 2', which will report up to two mapped location per read/pair.
The resulting SAM file can then be filtered using the XS:i flag, which indicates the second best mapping location, i.e. it identifies non-uniquely mapping reads. Below is some dummy code to illustrate:
# remove non-primary mapping reads (-F 256) and retain properly paired, i.e. concordant read pairs (-f 2)
samtools view -Sh -f 2 -F 256 sample.sam | grep -v "XS:i:" > sample_PP_filterXS.sam
# see below for check_pairs.py code, remove orphaned reads when one is deleted due to multi-mapping
./check_pairs.py < sample_PP_filterXS.sam > sample_PP_filterXS_rm-orphan.sam
# SAM to BAM
samtools view -bS -o sample_PP_filterXS_rm-orphan.bam sample_PP_filterXS_rm-orphan.sam
# sort BAM ---> index and flagstat as required
samtools sort sample_PP_filterXS_rm-orphan.bam sample_PP_filterXS_rm-orphan_sorted
check_pairs.py
#!/usr/bin/env python
import csv
import sys
f = csv.reader(sys.stdin, dialect="excel-tab")
of = csv.writer(sys.stdout, dialect="excel-tab")
last_read = None
for line in f :
#take care of the header
if(line[0][0] == "@") :
of.writerow(line)
continue
if(last_read == None) :
last_read = line
else :
if(last_read[0] == line[0]) :
of.writerow(last_read)
of.writerow(line)
last_read = None
else :
last_read = line
2) Learning to love multi-mapped reads [ recommended - "by me" ]
To cut a long story short after various discussions with colleagues and the reading of an illuminating blog post by Heng Li (http://lh3lh3.users.sourceforge.net/mapuniq.shtml) it was decided we should stop worrying about multi-mapping reads from bowtie2, as long as they belonged to proper-pairs (a.k.a concordant) and are of good quality.
Again here is some dummy code:
# SAM to BAM
samtools view -bS sample.sam > sample.bam
# retain properly paired, i.e. concordant read pairs (-f 2) and high quality reads (-q30)
samtools view -f2 -q30 -bS sample.sam > sample_PP_Q30.bam
# sort BAM ---> index and flagstat as required
samtools sort sample_PP_Q30.bam sample_PP_Q30_sorted
Conclusion
Although it feels safer to only use uniquely mapping reads when using paired-end reads it is OK to allow multi-mapping as long as the pairs are checked to there relative distance, orientation and quality. At least in examples I have seen mapping "looks" better in ChIP-seq analysis, where "real" peaks remain, but "odd" looking peaks disappear over repetitive regions.
For short single end reads I would still use bowtie1 '-m1' :)
I hope this helps as I agonised over this for a long time.
Not sure about bowtie2... but you can do that with BBMap, by using the "outm" flag instead of "out". For example...
You can specify any or all of outm, outu, and/or out depending on which reads you want to collect, and output them in other formats like fastq if you want, instead of sam.
I have also encountered that problem switching from Bowtie to Bowtie2. In my case, I would like to keep reads which fall under a certain -m parameter. I have not yet worked out a way to do this.