Hello all,
I am currently involved in a project where I have human blood and plasma samples, and I wish to examine the nonhuman DNA sequences in these samples. These samples were all sequenced as 50 bp reads.
Originally, I used a combination of Bowtie2 and BLAST -- Bowtie2 for computational subtraction of human reads and BLAST for alignment of the leftover reads.
BLAST is just too slow for the number of reads I have, so I then decided to use SNAP. SNAP is very fast, but it doesn't seem to be suited for 50 bp reads. Namely, it indexes the genome at seeds of size ~20 and they recommend that the seed size be less than a quarter of the read size. In my case, that would mean seeds of less than 12.5. SNAP only allows seeds as low as 16, so that's out of the question -- also lower seed length decreases performance of the aligner.
What alignment tools out there would be best suited for my purposes? I have investigated things like RINS, PathSeq, READSCAN as well as alignment tools like Bowtie, BWA, etc., but I'm not too sure they are suited for 50 bp reads. Would anyone happen to have any advice they could offer?
Thank you all.
Are there any filtering capabilities associated with BBMap? For example, BLAST and SMALT have a percent identity option, and BLAST has other options such as e-value.
Basically, is there a way I can control sensitivity of the nonhuman alignments, or perhaps only consider alignments above a certain threshold?
BBMap has various filtering options such as
idfilter
(don't allow mappings below X% identity),strictmaxindel
(don't allow indels longer than this),perfectmode
(only allow perfect alignments),semiperfectmode
(only allow alignments that are perfect, but Ns in the reference are allowed),kfilter
(only allow sites with at least k consecutive bases matching),minhits
(require at least this many seed matches),ambig
(controls behavior of reads that map ambiguously to multiple locations),po
(only allow paired alignments), and various others. You can adjust the seed length from 8 to 15. It's also possible to map to multiple references simultaneously to filter out reads that map better to another reference, with BBSplit, which uses BBMap. For example, if you were interested in a human parasite, you could map with BBSplit usingref=parasite.fasta,human.fasta
as the reference files, and get two output sets, one with reads that better match the parasite, and one with reads that better match human.Hi Brian,
When I run
bbmap.sh
withref
,in
, andout
arguments, I get the following error:That's probably due to the spaces between
out
and=
. Command should be:I'll try to make the error message more clear.
Hi Brian,
What if I wanted to map to multiple references at the same time? For example, I wanted to use a bacteria reference genome set, but I kept getting errors of this nature:
So then I split the reference up into multiple fasta files. How can I align to all of them at the same time?
If you do not include a
-Xmx
parameter, BBMap will (try to) automatically determine how much memory is available and use that much. If you do include it, BBMap will use that much memory; if your reference is too big to fit, it will crash with ajava.lang.OutOfMemoryError: Java heap space
error. So, if you are using a computer with a large amount of memory, either increase the value of-Xmx
to around 85% of the physical memory of that computer, or exclude it entirely to allow autodetection.BBMap requires approximately 6 bytes per reference base for large genomes. You can get a very precise estimate of how much memory BBMap will need by running
stats.sh ref=genome.fasta k=13
(since the exact amount of memory depends on the kmer length).-Xmx24g
is specifically for processing of the human genome, which is about 3Gbp, as 3Gbp*6bytes/bp = 24GB. But very small genomes, like a single bacteria, somewhat more memory is needed per bp of reference.You can reduce the amount of memory needed to approximately half (around 3 bytes per reference base) by using the flag
usemodulo
, with very slightly lower sensitivity.If you do not have enough memory for BBMap, I suggest Bowtie2 or bwa-mem - both of those use less memory than BBMap - closer to 1 byte per bp. However, they have lower sensitivity, and cannot detect long indels (or intronic splices), so whether they are a good choice depends on your use-case.
Great, thanks for all you replies so far. I am a little bit confused about how to use BBSplit.
If I have, say, 8 genomes, would the command be something like...
bbsplit.sh ref=1.fa, 2.fa, ... , 8.fa in=reads.fq, basename=o%.sam
? Do I have to specify build numbers if I'm going to be using fasta files directly as the reference? Will this require enough memory to load all the reference genomes at once? In which case, is there a "usemodulo" flag?I ask because right now, I'm aligning using BBMap to each genome separately. But then I get reads that mapped to multiple genomes, and I'm not sure how to deal with that.
BBSplit does need to load all references simultaneously. And yes, a sample command would be:
Note that there are no spaces in the list of references. All of the references will be indexed under a single build number (if you don't specify one, the default is 1). And you can use the 'usemodulo' flag with BBSplit, but that only reduces the total memory usage by about 50%.
To index and run in separate steps, which is a little more straightforward, the command would be (including the optional usemodulo and build flags):
So, "build=1" in the first line tells it where to store the index, and "build=1" in the second line tells it which index to use. "usemodulo" has to be specified both when building the index and when mapping, if you want to use it.
hello Brain, I downloaded BBMap in windows(i do not anything about Linux and it is so difficult for me to understand it) for align reads (fastq data format). please help me Run BBMap in windows,I see you an expert of this software
Your attention would be really appreciated.
Please take a look at this thread to see how to use BBMap. You will need Java. @Brian is the author of BBMap Suite.