NOTE: This collection was originally posted at SeqAnswers.com. Creating a copy here to preserve the information.
Part II is available here: Yes .. BBMap can do that! - Part II randomreads (read simulation), demuxbyname/filterbyname, bbsplit (read binning/decontamination) and pileup (coverage stats)
Part III is available here: Yes .. BBMap can do that! - Part III clumpify (mark (and dedupe) duplicates without alignment), mutate (create mutant genomes) and other miscellaneous tools
Brian Bushnell is author of BBMap suite.
BBMap suite has become an essential part of the bioinformatics tool arsenal for many ...
There are common options for most BBMap suite programs and depending on the file extension the input/output format is automatically chosen/set.
Note: For most programs in BBTools, you can add the parameter config=foo.txt
. Every line in "foo.txt" will be added as if it were a command-line parameter, regardless of whitespace, or the length of the file.
This may be convenient if you use a certain set of parameters for different set of analyses.
I have tried to categorize the program options under relevant programs. This classification is not perfect.
BBMap.sh
- Mapping Nanopore reads
bbmap.sh
has a length cap of 6kbp. Reads longer than this will be broken into 6kbp pieces and mapped independently.
$ mapPacBio.sh -Xmx20g k=7 in=reads.fastq ref=reference.fa maxlen=1000 minlen=200 idtag ow int=f qin=33 out=mapped1.sam minratio=0.15 ignorequality slow ordered maxindel1=40 maxindel2=400
The maxlen
flag shreds them to a max length of 1000; you can set that up to 6000. But I found 1000 gave a higher mapping rate.
- Using Paired-end and single-end reads at the same time
BBMap itself can only run single-ended or paired-ended in a single run, but it has a wrapper that can accomplish it, like this:
$ bbwrap.sh in1=read1.fq,singletons.fq in2=read2.fq,null out=mapped.sam append
This will write all the reads to the same output file but only print the headers once. I have not tried that for bam output, only sam output
Note about alignment stats: For paired reads, you can find the total percent mapped by adding the read 1 percent (where it says "mapped: N%") and read 2 percent, then dividing by 2. The different columns tell you the count/percent of each event. Considering the cigar strings from alignment, "Match Rate" is the number of symbols indicating a reference match (=) and error rate is the number indicating substitution, insertion, or deletion (X, I, D).
- Exact matches when mapping small reads (e.g. miRNA)
When mapping small RNA's with BBMap use the following flags to report only perfect matches.
ambig=all vslow perfectmode maxsites=1000
It should be very fast in that mode (despite the vslow flag). Vslow mainly removes masking of low-complexity repetitive kmers, which is not usually a problem but can be with extremely short sequences like microRNAs.
- Important note about BBMap alignments
BBMap is always non-deterministic when run in paired-end mode with multiple threads, because the insert-size average is calculated on a per-thread basis, which affects mapping; and which reads are assigned to which thread is nondeterministic. The only way to avoid that would be to restrict it to a single thread (threads=1), or map the reads as single-ended and then fix pairing afterward:
bbmap.sh in=reads.fq outu=unmapped.fq int=f
repair.sh in=unmapped.fq out=paired.fq fint outs=singletons.fq
In this case you'd want to only keep the paired output.
bbsplit.sh
is based on BBMap, so it is also non-deterministic in paired mode with multiple threads. BBDuk and Seal (which can be used similarly to BBSplit) are always deterministic.
reformat.sh
- Count k-mers/find unknown primers
$ reformat.sh in=reads.fq out=trimmed.fq ftr=19
This will trim all but the first 20 bases (all bases after position 19, zero-based).
$ kmercountexact.sh in=trimmed.fq out=counts.txt fastadump=f mincount=10 k=20 rcomp=f
This will generate a file containing the counts of all 20-mers that occurred at least 10 times, in a 2-column format that is easy to sort in Excel.
ACCGTTACCGTTACCGTTAC 100
AAATTTTTTTCCCCCCCCCC 85
...etc. If the primers are 20bp long, they should be pretty obvious.
- Convert SAM format from 1.4 to 1.3 (required for many programs)
$ reformat.sh in=reads.sam out=out.sam sam=1.3
- Removing N basecalls
You can use BBDuk or Reformat with "qtrim=rl trimq=1". That will only trim trailing and leading bases with Q-score below 1, which means Q0, which means N (in either fasta or fastq format). The BBMap package automatically changes q-scores of Ns that are above 0 to 0 and called bases with q-scores below 2 to 2, since occasionally some Illumina software versions produces odd things like a handful of Q0 called bases or Ns with Q>0, neither of which make any sense in the Phred scale.
- Sampling reads
$ reformat.sh in=reads.fq out=sampled.fq sample=3000
To sample 10% of the reads:
reformat.sh in1=reads1.fq in2=reads2.fq out1=sampled1.fq out2=sampled2.fq samplerate=0.1
or more concisely:
reformat.sh in=reads#.fq out=sampled#.fq samplerate=0.1
and for exact sampling:
reformat.sh in=reads#.fq out=sampled#.fq samplereadstarget=100k
- Changing fasta headers
Remove anything after the first space in fasta header.
reformat.sh in=sequences.fasta out=renamed.fasta trd
"trd" stands for "trim read description" and will truncate everything after the first whitespace.
- Extract reads from a sam file
$ reformat.sh in=reads.sam out=reads.fastq
- Verify pairing and optionally de-interleave the reads
$ reformat.sh in=reads.fastq verifypairing
- Verify pairing if the reads are in separate files
$ reformat.sh in1=r1.fq in2=r2.fq vpair
If that completes successfully and says the reads were correctly paired, then you can simply de-interleave reads into two files like this:
$ reformat.sh in=reads.fastq out1=r1.fastq out2=r2.fastq
- Base quality histograms
$ reformat.sh in=reads.fq qchist=qchist.txt
That stands for "quality count histogram".
- Filter SAM/BAM file by read length
$ reformat.sh in=x.sam out=y.sam minlength=50 maxlength=200
- Filter SAM/BAM file to detect/filter spliced reads
$ reformat.sh in=mapped.bam out=filtered.bam maxdellen=50
You can set maxdellen
to whatever length deletion event you consider the minimum to signify splicing, which depends on the organism.
repair.sh
- "Re-pair" out-of-order reads from paired-end data files
$ repair.sh in1=r1.fq.gz in2=r2.fq.gz out1=fixed1.fq.gz out2=fixed2.fq.gz outsingle=singletons.fq.gz
bbmerge.sh
BBMerge now has a new flag - outa
or outadapter
. This allows you to automatically detect the adapter sequence of reads with short insert sizes, in case you don't know what adapters were used. It works like this:
$ bbmerge.sh in=reads.fq outa=adapters.fa reads=1m
Of course, it will only work for paired reads! The output fasta file will look like this:
>Read1_adapter
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
>Read2_adapter
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
If you have multiplexed things with different barcodes in the adapters, the part with the barcode will show up as Ns, like this:
GATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
Note: For BBMerge with micro-RNA, you need to add the flag mininsert=17. The default is 35, which is too long for micro-RNA libraries.
- Identifying adapters
If you have paired reads, and enough of the reads have inserts shorter than read length, you can identify adapter sequences with BBMerge, like this (they will be printed to adapters.fa):
$ bbmerge.sh in1=r1.fq in2=r2.fq outa=adapters.fa
BBDuk.sh
Note: BBDuk is strictly deterministic on a per-read basis, however it does by default reorder the reads when run multi-threaded. You can add the flag ordered
to keep output reads in the same order as input reads
- Order of operations for bbduk.sh
Ref thread: http://seqanswers.com/forums/showpost.php?p=153217&postcount=43
filter by minAvgQuality
force-trim
kmer-analyze (trim, filter, or mask)
trim by overlap
quality-trim
- Finding reads with a specific sequence at the beginning of read
$ bbduk.sh -Xmx1g in=reads.fq outm=matched.fq outu=unmatched.fq restrictleft=25 k=25 literal=AAAAACCCCCTTTTTGGGGGAAAAA
In this case, all reads starting with "AAAAACCCCCTTTTTGGGGGAAAAA" will end up in "matched.fq" and all other reads will end up in "unmatched.fq". Specifically, the command means "look for 25-mers in the leftmost 25 bp of the read", which will require an exact prefix match, though you can relax that if you want.
So you could bin all the reads with your known sequence, then look at the remaining reads to see what they have in common. You can do the same thing with the tail of the read using "restrictright" instead, though you can't use both restrictions at the same time.
$ bbduk.sh in=reads.fq outm=matched.fq literal=NNNNNNCCCCGGGGGTTTTTAAAAA k=25 copyundefined
With the "copyundefined" flag, a copy of each reference sequence will be made representing every valid combination of defined letter. So instead of increasing memory or time use by 6^75, it only increases them by 4^6 or 4096 which is completely reasonable, but it only allows substitutions at predefined locations. You can use the "copyundefined", "hdist", and "qhdist" flags together for a lot of flexibility - for example, hdist=2 qhdist=1 and 3 Ns in the reference would allow a hamming distance of 6 with much lower resource requirements than hdist=6. Just be sure to give BBDuk as much memory as possible.
- Removing illumina adapters (if exact adapters not known)
If you're not sure which adapters are used, you can add ref=truseq.fa.gz,truseq_rna.fa.gz,nextera.fa.gz
and get them all (this will increase the amount of overtrimming, though it should still be negligible).
- Removing illumina control sequences/phiX reads
bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered cardinality
- Identify certain reads that contain a specific sequence
$ bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq literal=ACGTACGTACGTACGTAC k=18 mm=f hdist=2
Make sure "k" is set to the exact length of the sequence. hdist
controls the number of substitutions allowed. outm
gets the reads that match. By default this also looks for the reverse-complement; you can disable that with rcomp=f
.
- Extract sequences that share kmers with your sequences with BBDuk
$ bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31
This will print to C all the sequences in A that share 100% of their 31-mers with sequences in B.
- Extract sequences that contain N's with bbduk
bbduk.sh in=reads.fq out=readsWithoutNs.fq outm=readsWithNs.fq maxns=0
If you have, say, 100bp reads and only want to separate reads containing all 100 Ns, change that to maxns=99
.
General notes for bbduk.sh
BBDuk can operate in one of 4 kmer-matching modes: Right-trimming (ktrim=r), left-trimming (ktrim=l), masking (ktrim=n), and filtering (default). But it can only do one at a time because all kmers are stored in a single table. It can still do non-kmer-based operations such as quality trimming at the same time.
BBDuk2 can do all 4 kmer operations at once and is designed for integration into automated pipelines where you do contaminant removal and adapter-trimming in a single pass to minimize filesystem I/O. Personally, I never use BBDuk2 from the command line. Both have identical capabilities and functionality otherwise, but the syntax is different.