BLASR/Bowtie2 short alignments
2
0
Entering edit mode
9.4 years ago
Coryza ▴ 430

I'm struggling with the fact that I want to align thousands of ~4KB PacBio sequences to short primer sequences (~20B). (Just seeing which sequences contain the primer sequence). I've tried bowtie2 and blasr so far but both come up with empty results. Trying to change parameters influencing this resulted in nothing so far so that's why I need your help ;) Anyone knows what parameters to use for this case?

Parameters I used now:

bowtie2 -f -S --local -N 1 -L 10 --no-sq --no-unal -p 12 -x <Index> -U <PacBio>
blasr <PacBio> <Reference> -noSplitSubreads -sam -minPctIdentity 90 -minMatch 10 -nproc 6
Bowtie2 BLasr • 3.1k views
ADD COMMENT
0
Entering edit mode

How about aligning the primers against the PacBio sequences. That might produce better results. I recall that there are often problems when you try to soft-clip things this much.

ADD REPLY
0
Entering edit mode

I've tried the other way around for blasr, no results. Furthermore: If it would have worked - how then retrieve all hits? Because most likely in that case it will match > 10K reads.

ADD REPLY
0
Entering edit mode
9.4 years ago

I suggest you try BBDuk for this:

bbduk.sh in=reads.fq outm=matched.fq outu=unmatched.fq ref=primers.fa edist=3 k=20 -da

edist=3 will allow up to 3 edits between the read and the reference (I'm assuming these are raw uncorrected reads). K should be set to whatever the shortest sequence is in your set of primers.

If you want to use mapping instead, then map the primers to the reads as Devon suggested; then you can use pileup.sh (also in the BBMap package) to generate coverage stats from a sam file:

pileup.sh in=file.sam covstats=covstats.txt nzo

The nzo flag will make it only print sequences that have nonzero coverage. So, you'll get one line per read with coverage. The first column will be the read name; so, you can filter the reads using these names with for example filterbyname.sh:

filterbyname.sh in=reads.fq names=names.txt out=filtered.fq include=t

The names file should be just the first column of the covstats output.

ADD COMMENT
0
Entering edit mode
9.4 years ago
raghus606 ▴ 10

You could give it a try with BLAT, either way (Primer vs Reads or Reads vs Primer) should be fine, you will only need to filter the data differently.

ADD COMMENT
0
Entering edit mode

Blat is probably not going to work. From UCSC:

Blat of DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more.
ADD REPLY
0
Entering edit mode

I have used BLAT for aligning long Pacbio reads to reference genomes, I have also obtained good results when I was aligning 150 b.p illumina reads to particular portions of the genome

ADD REPLY

Login before adding your answer.

Traffic: 2462 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6