Question

Best Aligner for Short-Read (50 bp) Alignment of Non-human Sequences from Human Samples

1

Entering edit mode

10.7 years ago

melonpwn ▴ 10

Hello all,

I am currently involved in a project where I have human blood and plasma samples, and I wish to examine the nonhuman DNA sequences in these samples. These samples were all sequenced as 50 bp reads.

Originally, I used a combination of Bowtie2 and BLAST -- Bowtie2 for computational subtraction of human reads and BLAST for alignment of the leftover reads.

BLAST is just too slow for the number of reads I have, so I then decided to use SNAP. SNAP is very fast, but it doesn't seem to be suited for 50 bp reads. Namely, it indexes the genome at seeds of size ~20 and they recommend that the seed size be less than a quarter of the read size. In my case, that would mean seeds of less than 12.5. SNAP only allows seeds as low as 16, so that's out of the question -- also lower seed length decreases performance of the aligner.

What alignment tools out there would be best suited for my purposes? I have investigated things like RINS, PathSeq, READSCAN as well as alignment tools like Bowtie, BWA, etc., but I'm not too sure they are suited for 50 bp reads. Would anyone happen to have any advice they could offer?

Thank you all.

sequencing alignment genome blast • 5.9k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by melonpwn ▴ 10

Ram · Answer 1 · 2014-11-14

1

Entering edit mode

10.7 years ago

GouthamAtla 12k

You can try using SURPI pipeline from Dr.Chiu lab.

Here is a post describing few issues in implementing the pipeline.

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by GouthamAtla 12k

Ram · Answer 2 · 2014-11-16

0

Entering edit mode

10.7 years ago

Brian Bushnell 20k

I suggest you try BBMap, which is fast, easy to use, and works well with 50bp reads at default settings. It has much higher sensitivity than SNAP, which is useful when aligning unknown reads to a potentially different strain, species, or even a different higher taxonomic order. You can also use it for the initial subtraction. For example, to split the reads:

bbmap.sh ref=hg19.fasta in=reads.fq outm=human.fq outu=nonhuman.fq

To map the nonhuman reads:

bbmap.sh ref=refseq_microbial.fasta in=nonhuman.fq out=mapped.sam

You could alternately do both of those steps at once with BBSplit, which is included in the BBMap package. You can get usage information by running the shellscripts with no arguments (or ask if you encounter any problems).

ADD COMMENT • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by Brian Bushnell 20k

0

Entering edit mode

Are there any filtering capabilities associated with BBMap? For example, BLAST and SMALT have a percent identity option, and BLAST has other options such as e-value.

Basically, is there a way I can control sensitivity of the nonhuman alignments, or perhaps only consider alignments above a certain threshold?

ADD REPLY • link 10.7 years ago by melonpwn ▴ 10

0

Entering edit mode

BBMap has various filtering options such as idfilter (don't allow mappings below X% identity), strictmaxindel (don't allow indels longer than this), perfectmode (only allow perfect alignments), semiperfectmode (only allow alignments that are perfect, but Ns in the reference are allowed), kfilter (only allow sites with at least k consecutive bases matching), minhits (require at least this many seed matches), ambig (controls behavior of reads that map ambiguously to multiple locations), po (only allow paired alignments), and various others. You can adjust the seed length from 8 to 15. It's also possible to map to multiple references simultaneously to filter out reads that map better to another reference, with BBSplit, which uses BBMap. For example, if you were interested in a human parasite, you could map with BBSplit using ref=parasite.fasta,human.fasta as the reference files, and get two output sets, one with reads that better match the parasite, and one with reads that better match human.

ADD REPLY • link updated 3.5 years ago by Ram 45k • written 10.7 years ago by Brian Bushnell 20k

0

Entering edit mode

Hi Brian,

When I run bbmap.sh with ref, in, and out arguments, I get the following error:

java -Djava.library.path=/gpfs/runtime/opt/bbmap/33.92/bin/jni/ -ea -Xmx24g -cp /gpfs/runtime/opt/bbmap/33.92/bin/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 -Xmx24g ref=all_viruses.fna in=SRR1024346_FUMBT.fastq out = mapped.sam
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
        at align2.AbstractMapper.preparse0(AbstractMapper.java:711)
        at align2.AbstractMapper.&lt;init&gt;(AbstractMapper.java:53)
        at align2.BBMap.&lt;init&gt;(BBMap.java:42)
        at align2.BBMap.main(BBMap.java:30)

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 10.7 years ago by melonpwn ▴ 10

0

Entering edit mode

That's probably due to the spaces between out and =. Command should be:

bbmap.sh -Xmx24g ref=all_viruses.fna in=SRR1024346_FUMBT.fastq out=mapped.sam

I'll try to make the error message more clear.

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 10.7 years ago by Brian Bushnell 20k

0

Entering edit mode

Hi Brian,

What if I wanted to map to multiple references at the same time? For example, I wanted to use a bacteria reference genome set, but I kept getting errors of this nature:

Exception in thread "Thread-26" java.lang.OutOfMemoryError: Java heap space
        at align2.IndexMaker4$BlockMaker$CountThread.run(IndexMaker4.java:280)

So then I split the reference up into multiple fasta files. How can I align to all of them at the same time?

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 10.7 years ago by melonpwn ▴ 10

0

Entering edit mode

If you do not include a -Xmx parameter, BBMap will (try to) automatically determine how much memory is available and use that much. If you do include it, BBMap will use that much memory; if your reference is too big to fit, it will crash with a java.lang.OutOfMemoryError: Java heap space error. So, if you are using a computer with a large amount of memory, either increase the value of -Xmx to around 85% of the physical memory of that computer, or exclude it entirely to allow autodetection.

BBMap requires approximately 6 bytes per reference base for large genomes. You can get a very precise estimate of how much memory BBMap will need by running stats.sh ref=genome.fasta k=13 (since the exact amount of memory depends on the kmer length). -Xmx24g is specifically for processing of the human genome, which is about 3Gbp, as 3Gbp*6bytes/bp = 24GB. But very small genomes, like a single bacteria, somewhat more memory is needed per bp of reference.

You can reduce the amount of memory needed to approximately half (around 3 bytes per reference base) by using the flag usemodulo, with very slightly lower sensitivity.

If you do not have enough memory for BBMap, I suggest Bowtie2 or bwa-mem - both of those use less memory than BBMap - closer to 1 byte per bp. However, they have lower sensitivity, and cannot detect long indels (or intronic splices), so whether they are a good choice depends on your use-case.

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 10.7 years ago by Brian Bushnell 20k

0

Entering edit mode

Great, thanks for all you replies so far. I am a little bit confused about how to use BBSplit.

If I have, say, 8 genomes, would the command be something like...

bbsplit.sh ref=1.fa, 2.fa, ... , 8.fa in=reads.fq, basename=o%.sam? Do I have to specify build numbers if I'm going to be using fasta files directly as the reference? Will this require enough memory to load all the reference genomes at once? In which case, is there a "usemodulo" flag?

I ask because right now, I'm aligning using BBMap to each genome separately. But then I get reads that mapped to multiple genomes, and I'm not sure how to deal with that.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by melonpwn ▴ 10

0

Entering edit mode

BBSplit does need to load all references simultaneously. And yes, a sample command would be:

bbsplit.sh ref=1.fa,2.fa in=reads.fq basename=o%.sam

Note that there are no spaces in the list of references. All of the references will be indexed under a single build number (if you don't specify one, the default is 1). And you can use the 'usemodulo' flag with BBSplit, but that only reduces the total memory usage by about 50%.

To index and run in separate steps, which is a little more straightforward, the command would be (including the optional usemodulo and build flags):

bbsplit.sh ref=1.fa,2.fa usemodulo build=1
bbsplit.sh in=reads.fq basename=o%.sam usemodulo build=1

So, "build=1" in the first line tells it where to store the index, and "build=1" in the second line tells it which index to use. "usemodulo" has to be specified both when building the index and when mapping, if you want to use it.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by Brian Bushnell 20k

0

Entering edit mode

hello Brain, I downloaded BBMap in windows(i do not anything about Linux and it is so difficult for me to understand it) for align reads (fastq data format). please help me Run BBMap in windows,I see you an expert of this software

Your attention would be really appreciated.

ADD REPLY • link 9.0 years ago by Edalat ▴ 30

2

Entering edit mode

Please take a look at this thread to see how to use BBMap. You will need Java. @Brian is the author of BBMap Suite.

ADD REPLY • link 9.0 years ago by GenoMax 153k