Question

Metabarcoding amplicon size too long for paired-end sequencing

1

6.7 years ago

lvogel ▴ 30

I am supposed to analyze some metabarcoding reads. However, the forward and reverse reads cannot be merged because they do not overlap. I was told that this is because the amplicon is too long, so the forward and reverse reads don't extend far enough to overlap adequately. My question is: what would be the problem with using the forward reads alone, as if they were single-end? My first thought is that they are too short to be a legitimate barcode sequence for identifying taxa, but I'm not sure. The marker is COI. I gather from Meusnier et al. 2008 that a 95% success rate of species identification was obtained with 250-bp mini-barcodes, and my forward reads are that long. But this particular region by itself has not been tested for specificity. How does this affect its reliability? Thank you for your input.

next-gen metabarcoding • 1.8k views

ADD COMMENT • link updated 6.7 years ago by luckylion07 ▴ 40 • written 6.7 years ago by lvogel ▴ 30

score 3 · Accepted Answer · 2018-04-29

3

6.7 years ago

luckylion07 ▴ 40

Using just the forward read is a good idea. Just watch out: depending on your library preparation method, read 1 might not always correspond to the forward direction; instead, forward and reverse orientations can be mixed roughly 50:50. In that case, trimming the forward primer from both read 1 and read 2 (e.g. with Cutadapt) and then reverse-complementing read 2 can help. If you are concerned about the reliability of identification from just one direction, you could also analyze read 2 the same way and compare the results, or fill in the missing bases between the reads with Ns to obtain a "full-length sequence". If you do so, however, make sure to apply strict filtering afterward to discard poor-quality reads, especially those with poor read 2 ends. You can also concatenate the sequences and "reformat" the sequences in your reference database to match them. But that is maybe a bit much effort; using only the forward direction should be sufficient in most cases, in my opinion =)
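To make the orientation and N-padding ideas concrete, here is a minimal Python sketch. It is not what Cutadapt does internally: the function names (`orient_pair`, `join_with_ns`) and the `insert_len` parameter are hypothetical, the primer match is exact only, and it assumes you know the expected length of the region spanned by the pair from your amplicon design.

```python
# Hypothetical helpers for illustration. Primer matching here is exact;
# Cutadapt also tolerates mismatches and degenerate bases, which this does not.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_pair(read1, read2, fwd_primer):
    """Return (forward_read, reverse_read) with the forward primer trimmed.

    Whichever read starts with the forward primer is treated as the
    forward-direction read. Returns None if neither read starts with it.
    """
    if read1.startswith(fwd_primer):
        return read1[len(fwd_primer):], read2
    if read2.startswith(fwd_primer):
        return read2[len(fwd_primer):], read1
    return None


def join_with_ns(forward_read, reverse_read, insert_len):
    """Join a non-overlapping pair into one sequence, padding the gap with Ns.

    insert_len is an assumption: the expected length of the region covered
    by the joined sequence, in the same trimmed/untrimmed state as the reads.
    """
    gap = insert_len - len(forward_read) - len(reverse_read)
    if gap < 0:
        raise ValueError("reads overlap; merge them instead of padding")
    # The reverse read is reverse-complemented so both halves share one strand.
    return forward_read + "N" * gap + reverse_complement(reverse_read)
```

For example, `orient_pair("TTTT", "GGACGT", "GG")` picks the second read as the forward one, and `join_with_ns("ACGT", "AAAA", 10)` pads the 2-base gap with Ns. Quality filtering would still need to happen separately.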

ADD COMMENT • link 6.7 years ago by luckylion07 ▴ 40

0

Thank you. Those are some really good ideas. I would guess, then, that my sequences being shorter than the full mini-barcode would be reflected in a higher BLAST e-value? But as long as it is still below the cutoff, it's fine.

ADD REPLY • link 6.7 years ago by lvogel ▴ 30

1

No worries, and yes =) If you use only read 1, you would also expect the full read to match the reference database, since it is amplicon data (with the exception of chimeras). Since all Illumina reads in a run are sequenced to the same length, you can also use standard metabarcoding pipelines to process read 1 or read 2. The same goes for concatenated sequences. When Ns are inserted, however, check how the algorithm deals with ambiguous bases (I would assume most have trouble with them).
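One way to see why Ns matter is to compute identity yourself on a toy case. The sketch below is only an illustration with a hypothetical function name (`identity_ignoring_n`); it assumes the query and reference are already aligned to the same length with no indels, and it does not reproduce how any particular aligner or classifier scores ambiguous bases.

```python
def identity_ignoring_n(query, reference):
    """Return (identity, compared_positions), skipping positions with an N.

    Assumes both sequences are pre-aligned and of equal length (no gaps).
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be the same length")
    compared = 0
    matches = 0
    for q, r in zip(query.upper(), reference.upper()):
        if q == "N" or r == "N":
            continue  # padded or ambiguous position: not counted either way
        compared += 1
        if q == r:
            matches += 1
    if compared == 0:
        return 0.0, 0
    return matches / compared, compared
```

A tool that instead counts each N as a mismatch would report a lower identity for the same pair, so it is worth checking which behaviour your pipeline uses before setting thresholds.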

ADD REPLY • link 6.7 years ago by luckylion07 ▴ 40