Entering edit mode
2.0 years ago
O.rka
▴
740
I'm looking for a command-line tool (preferably conda or pip installable) that can pull out amplicons from a query fasta based on a forward and reverse primer. I've found isPcr but it can't handle my IUPAC primers that have non-ATCG characters.
Does anyone know of a modern tool that can be used for this task?
Nice, seqkit always comes through! Weird, I'm trying to run the universal primers from this study: https://pubmed.ncbi.nlm.nih.gov/30824940/ with a basic yeast strain (https://www.ncbi.nlm.nih.gov/assembly/GCF_000146045.2) and its not working.
Looks like there are mismatches in that .fna file compared to the primer sequences. (Chiming in here because I'm working on something similar and hadn't known seqkit could do this either.) What's odd is that it seems to count the IUPAC as mismatches. The help text for
-m
saysmax mismatch when matching primers, no degenerate bases allowed
. Is that saying that if you enable mismatches, you can't simultaneously have IUPAC codes match?Edit: to be more precise I think there's one mismatch, in the reverse primer. I think this shows that the behavior is what I thought above (you get to specify mismatches, or use IUPAC, but not both):
Edit again: Actually there are two matches for each primer in chromosome 12 here. Note what seqkit says about that:
1. Only one (the longest) matching location is returned for every primer pair.
In this case it gives you the 12,572 nt spanning both of those, instead of one or the other of the separate 3,435 nt regions.Ok so this pretty tricky. I'm wondering what the best tool/method/parameters could be useful for pulling out 18S (partial) -> ITS1 -> 5.8S -> ITS2 -> 28S rRNA (partial) from a genome? These were said to be universal primers for euks but it looks like that might not be the case.
How did you identify how many matches there were from seqkit amplicon?
I honestly did that part in a stupid way: I took the match seqkit gave me for each primer sequence, and searched for that matching sequence in the .fna file in a text editor, noting the position. (Oh and I removed the line-wrapping in the file at the start as a matter of course, knowing 2-line FASTA is handy in general.)
In a less stupid way you could use BLAST maybe?
I don't know much about 16S/18S stuff, but I have a feeling there are probably tools out there that are better suited to what you're looking for. I think
seqkit amplicon
or other primer/adapter trimming tools are expecting to deal with a large number of short reads like from amplicon sequencing, rather than long contiguous sequences like chromosomes where you might have multiple matches. Maybe somebody else will have tips for that, though.Sorry, but I can't help you on that any further - what you are trying here is way more complicated than what I ever used
seqkit amplicon
for. Maybe you are lucky and @shenwei356 fancies an anwer.For 16S / 18S amplicon analysis, there might indeed be specialized tools around. If I need software inspiration, I usually check what tools are used in respective pipelines e.g. nf-core ampliseq or Biobakery. or look around on Github.