Question

BLASR/Bowtie2 short alignments

0

Entering edit mode

9.4 years ago

Coryza ▴ 430

I'm struggling with the fact that I want to align thousands of ~4KB PacBio sequences to short primer sequences (~20B). (Just seeing which sequences contain the primer sequence). I've tried bowtie2 and blasr so far but both come up with empty results. Trying to change parameters influencing this resulted in nothing so far so that's why I need your help ;) Anyone knows what parameters to use for this case?

Parameters I used now:

bowtie2 -f -S --local -N 1 -L 10 --no-sq --no-unal -p 12 -x <Index> -U <PacBio>
blasr <PacBio> <Reference> -noSplitSubreads -sam -minPctIdentity 90 -minMatch 10 -nproc 6

Bowtie2 BLasr • 3.1k views

ADD COMMENT • link updated 24 months ago by Ram 44k • written 9.4 years ago by Coryza ▴ 430

0

Entering edit mode

How about aligning the primers against the PacBio sequences. That might produce better results. I recall that there are often problems when you try to soft-clip things this much.

ADD REPLY • link 9.4 years ago by Devon Ryan 104k

0

Entering edit mode

I've tried the other way around for blasr, no results. Furthermore: If it would have worked - how then retrieve all hits? Because most likely in that case it will match > 10K reads.

ADD REPLY • link 9.4 years ago by Coryza ▴ 430

Ram · Answer 1 · 2015-07-08

I suggest you try BBDuk for this:

bbduk.sh in=reads.fq outm=matched.fq outu=unmatched.fq ref=primers.fa edist=3 k=20 -da

edist=3 will allow up to 3 edits between the read and the reference (I'm assuming these are raw uncorrected reads). K should be set to whatever the shortest sequence is in your set of primers.

If you want to use mapping instead, then map the primers to the reads as Devon suggested; then you can use pileup.sh (also in the BBMap package) to generate coverage stats from a sam file:

pileup.sh in=file.sam covstats=covstats.txt nzo

The nzo flag will make it only print sequences that have nonzero coverage. So, you'll get one line per read with coverage. The first column will be the read name; so, you can filter the reads using these names with for example filterbyname.sh:

filterbyname.sh in=reads.fq names=names.txt out=filtered.fq include=t

The names file should be just the first column of the covstats output.

score 0 · Answer 2 · 2015-07-08

0

Entering edit mode

9.4 years ago

raghus606 ▴ 10

You could give it a try with BLAT, either way (Primer vs Reads or Reads vs Primer) should be fine, you will only need to filter the data differently.

ADD COMMENT • link 9.4 years ago by raghus606 ▴ 10

0

Entering edit mode

Blat is probably not going to work. From UCSC:

Blat of DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more.

ADD REPLY • link 9.4 years ago by GenoMax 147k

0

Entering edit mode

I have used BLAT for aligning long Pacbio reads to reference genomes, I have also obtained good results when I was aligning 150 b.p illumina reads to particular portions of the genome

ADD REPLY • link 9.4 years ago by raghus606 ▴ 10