Should I merge, then trim PE reads before multiple sequence alignment?
2
0
4.8 years ago
lintonf • 0

Hi everyone!

I have a set of demultiplexed fungal ITS1 reads, and I'd like to run a multiple sequence alignment with these R1 and R2 reads using MUSCLE.

I've used MUSCLE with metagenomic shotgun data, not amplicon data, so I am not sure how to preprocess my sequences before the MSA.

Should I merge my R1 + R2 reads, trim them, then run them through an MSA? Or can I run the R1s and R2s through their respective MSAs (as raw reads) and see what my results are?

I would love your feedback, thank you!

amplicon alignment MSA ITS1 • 1.9k views
1
4.8 years ago
GenoMax 147k

You can adapter-trim the reads and then merge them (it sounds like your amplicon design will support merging) before doing an MSA. bbduk.sh followed by bbmerge.sh from the BBMap suite would be good options.
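A minimal sketch of that trim-then-merge workflow, assuming BBMap is installed and on your PATH; the input/output file names and the `adapters.fa` reference are placeholders you would replace with your own:

```shell
# 1) Adapter-trim the paired reads with bbduk.sh
#    (ktrim=r trims adapters on the right; tpe/tbo are standard options
#    for paired-end adapter trimming)
bbduk.sh in1=R1.fastq.gz in2=R2.fastq.gz \
  out1=trimmed_R1.fastq.gz out2=trimmed_R2.fastq.gz \
  ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

# 2) Merge the trimmed pairs with bbmerge.sh; unmerged reads are
#    written separately so nothing is silently lost
bbmerge.sh in1=trimmed_R1.fastq.gz in2=trimmed_R2.fastq.gz \
  out=merged.fastq.gz outu1=unmerged_R1.fastq.gz outu2=unmerged_R2.fastq.gz
```

The merged reads in `merged.fastq.gz` would then be the input to the MSA step.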

0

Other "classic" approaches include FLASH, PEAR (academic-only license), or PANDAseq (which reportedly incorporates the FLASH and PEAR algorithms). There's also an older Biostars post here with a good summary.

If you trim at all, I would use only very conservative settings before merging; ideally, the merging algorithm itself handles the per-position quality.

1
4.8 years ago
h.mon 35k

To answer your explicit question: you can follow GenoMax's answer, plus add a duplicate-removal step, since there is no sense in feeding thousands of identical sequences into a multiple sequence alignment. dedupe.sh from the BBTools / BBMap suite is a good option; VSEARCH also works.
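A short sketch of the duplicate-removal step with either tool, assuming the merged reads have been converted to FASTA; the file names are placeholders:

```shell
# Option A: BBMap's dedupe.sh removes exact duplicates
# (ac=f disables absorbing contained sequences, so only
# full-length identical sequences are collapsed)
dedupe.sh in=merged.fasta out=deduped.fasta ac=f

# Option B: VSEARCH full-length dereplication; --sizeout records
# the abundance of each unique sequence in its header
vsearch --derep_fulllength merged.fasta --output derep.fasta --sizeout
```

The VSEARCH route has the advantage of keeping abundance information, which matters if you later want per-sequence counts rather than just an alignment.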

However, as you are sequencing ITS, I believe your goal is taxonomic classification and quantification. If that is the case, I would advise you to follow one of the many established pipelines, such as DADA2 or QIIME2.
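For orientation only, a hypothetical outline of the QIIME 2 route mentioned above, which runs DADA2 under the hood; the manifest file, truncation lengths, and output names are placeholders that would need tuning for ITS data:

```shell
# Import demultiplexed paired-end reads described in a manifest file
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.tsv \
  --input-format PairedEndFastqManifestPhred33V2 \
  --output-path demux.qza

# Denoise, merge pairs, and dereplicate in one step with DADA2
# (trunc-len 0 = no fixed-length truncation, often used for
# variable-length ITS amplicons)
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 0 --p-trunc-len-r 0 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-denoising-stats stats.qza
```

This replaces the manual trim/merge/dedupe chain and yields representative sequences plus per-sample counts, which is usually what taxonomic classification and quantification need.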

