I'm trying to automate building a reference database for DADA2. To do this, I'm using esearch to download ~200k FASTA sequences from GenBank for a search term covering one gene. Many GenBank records are concatenations of several genes, and I'd like to trim away the unwanted genes to reduce false-positive DADA2 hits later on. I have many known versions of my candidate genes, so I can 'just' align my GenBank downloads against those, but...
I've thought about taking one of my known genes and using Biopython's pairwise alignment functions, but for 200k GenBank sequences that will take forever, and I'll mess it up somehow anyway. Is there a parallelisable tool that 'cleans up' FASTA files by trimming overhangs against one or several known sequences? My Google-fu is failing me!
(A few minutes later: the best pipeline I can find so far is MAFFT/MUSCLE with gaps turned off, followed by trimAl, in a wrapper script to parallelise? Curious to hear more!)
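
To make the align-and-trim idea above concrete, here is a minimal standard-library sketch, not a recommended tool. It runs a plain Smith-Waterman local alignment of each known gene against a downloaded sequence, keeps only the best-matching region, and spreads the work over processes. The function names (`local_span`, `trim_to_references`, `trim_all`), the scoring values, and the `min_score` cutoff are all hypothetical choices for illustration. It ignores reverse-complement hits, and pure Python will be far too slow for 200k real sequences; a compiled aligner (for example Biopython's `PairwiseAligner` in local mode) would have to replace `local_span` in practice.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def local_span(ref, seq, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (score, start, end), where seq[start:end] is the best-scoring
    region of `seq` against `ref`; (0, 0, 0) if nothing aligns.
    Scoring values are illustrative assumptions, not tuned defaults.
    """
    n = len(seq)
    prev_score = [0] * (n + 1)
    prev_start = list(range(n + 1))  # seq index where each path began
    best = (0, 0, 0)
    for r in ref:
        cur_score = [0] * (n + 1)
        cur_start = list(range(n + 1))
        for j in range(1, n + 1):
            diag = prev_score[j - 1] + (match if r == seq[j - 1] else mismatch)
            up = prev_score[j] + gap        # ref base aligned to a gap
            left = cur_score[j - 1] + gap   # seq base aligned to a gap
            score = max(0, diag, up, left)
            if score == 0:
                start = j                   # a new alignment may begin here
            elif score == diag:
                start = prev_start[j - 1]
            elif score == up:
                start = prev_start[j]
            else:
                start = cur_start[j - 1]
            cur_score[j], cur_start[j] = score, start
            if score > best[0]:
                best = (score, start, j)
        prev_score, prev_start = cur_score, cur_start
    return best


def trim_to_references(seq, refs, min_score=0):
    """Trim `seq` to the region best matching any reference.

    Returns None when no reference scores above `min_score`.
    """
    score, start, end = max(local_span(ref, seq) for ref in refs)
    return seq[start:end] if score > min_score else None


def trim_all(seqs, refs, workers=4, min_score=0):
    """Trim many sequences in parallel; result order matches `seqs`."""
    worker = partial(trim_to_references, refs=refs, min_score=min_score)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(worker, seqs, chunksize=256))


if __name__ == "__main__":
    known = ["ACGTACGTAC"]                      # toy reference gene
    downloads = ["TTTTTACGTACGTACGGGGG", "CCCC"]  # toy GenBank records
    print(trim_all(downloads, known, workers=2))
```

Sequences that come back as `None` have no region resembling any reference and can be dropped or inspected by hand. Choosing `min_score` relative to the reference length would be the obvious knob for rejecting weak partial hits.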