Question

Align metagenomic shotgun sequencing reads to primers

0

Entering edit mode

8.9 years ago

Crystal ▴ 70

Hi All,

I'm trying to use BWA to align metagenomics reads (~100bp) to six primers (~15bp).

All the steps work fine but when I convert sam file to readable file, there was no results at all.

Is it appropriate to use BWA to align reads to super short primer sequences or is there other way I can do this?

Thanks

bwa metagenomics alignment • 3.7k views

ADD COMMENT • link updated 8.9 years ago by apt.university ▴ 70 • written 8.9 years ago by Crystal ▴ 70

0

Entering edit mode

Post the commands you are running. And what are you trying to accomplish? What makes sense to me is assemble reads into contigs, then map primers into contigs.

ADD REPLY • link 8.9 years ago by h.mon 35k

0

Entering edit mode

bwa aln index input1p.fastq > 1p.sai; bwa aln index input2p.fastq > 2p.sai; bwa sampe -n 1000 -N 1000 index 1p.sai 2p.sai input1p.fastq input2p.fastq >1.sam;

Above is the code I used, those are just basic bwa code. After that I used a java script to convert sam file into readable file. What I was trying to do was to check if some of our reads matched to those primers. I don't know if bwa can deal with long contigs. Our reads length has already been much longer than primer length.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by Crystal ▴ 70

score 0 · Answer 1 · 2016-01-06

You should map the primers into your reads, not the opposite. If you have only six primers, you could try just to use grep to find the primers on the reads files.

Anyway, I would assemble the reads and then map the primers into the contigs, not against the reads. I have used bowtie for this, but bwa should work as well - but I think bwa mem is better (certainly simpler) than bwa sampe.

Ram · Answer 2 · 2016-01-06

0

Entering edit mode

8.9 years ago

apt.university ▴ 70

BWA is a software package for mapping reads against a large reference genome. BWA is fast but not as sensitive as other programs, ex., blast or needle. To achieve great speed, BWA requires the reads to show low-divergence from the reference (see program description here) and uses some heuristics (tricks) that assume that there is a long stretch of DNA that matches exactly between the ref and the reads. This is called a seed. Depending on the version of BWA you are using, this can be between 19 to 32 bases by default. In your case, you are most likely not getting any hits before your references are shorter than the seed length.

If you are looking for exact matches, then you could use grep, as suggested by @h.mon. This would require putting the sequence and the id on the same line, so that you can retrieve the hits.

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by apt.university ▴ 70

0

Entering edit mode

If you are looking for exact matches, then you could use grep, as suggested by @h.mon. This would require putting the sequence and the id on the same line, so that you can retrieve the hits.

No, just use the 'B' (before) and 'A' (after) flags to return the lines that precede and follow the match.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by harold.smith.tarheel ★ 5.0k

0

Entering edit mode

@harold.smith.tarheel: -B (or -A) would be ideal if the sequence if formatted as one line. If the sequences are formatted on 2 lines (60 or 80 bases per line, as is often the case with Fasta files), -B 2 (or -A 2) would return DNA sequence from the previous record.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by apt.university ▴ 70

0

Entering edit mode

The code from the OP shows that the input is FASTQ (not FASTA), which specifies a four-line format (identifier, sequence, optional, quality) for each sequence/read. So 'grep -B 1 -A 2' would return the entire entry for the matching sequence.

Note, this assumes that the 15bp primer string is not present in the other lines (unlikely but not impossible), so it might be preferable to add a secondary criterion, such as the initial character(s) of the identifier line (which is usually identical for every read).

Edit: Actually, Sanger PHRED scores do not include 'T', so secondary filtering is unnecessary. The primer string would match only the sequence line (unless the quality scores are old Illumina PHRED+64, or the primer contains only A/C/G).

ADD REPLY • link 8.9 years ago by harold.smith.tarheel ★ 5.0k

0

Entering edit mode

Well, thank you so much for your help. So now I just need to use command like "grep -B 1 -A 2 'primer sequence' file.fastq"?

ADD REPLY • link 8.9 years ago by Crystal ▴ 70