Question

What in silico PCR commandline tool do you use that can handle IUPAC characters?

1

Entering edit mode

24 months ago

O.rka ▴ 740

I'm looking for a command-line tool (preferably conda or pip installable) that can pull out amplicons from a query fasta based on a forward and reverse primer. I've found isPcr but it can't handle my IUPAC primers that have non-ATCG characters.

Does anyone know of a modern tool that can be used for this task?

pcr amplicon primer • 2.0k views

ADD COMMENT • link updated 8 months ago by Ram 44k • written 24 months ago by O.rka ▴ 740

score 2 · Answer 1 · 2022-12-20

2

Entering edit mode

24 months ago

Matthias Zepper 5.0k

Never used it with IUPAC bases, but according to the help text seqkit amplicon should allow for degenerate bases.

ADD COMMENT • link 24 months ago by Matthias Zepper 5.0k

1

Entering edit mode

Nice, seqkit always comes through! Weird, I'm trying to run the universal primers from this study: https://pubmed.ncbi.nlm.nih.gov/30824940/ with a basic yeast strain (https://www.ncbi.nlm.nih.gov/assembly/GCF_000146045.2) and its not working.

(base) seqkit amplicon -F GGCTACCACATCYAARGAAGGCAGCAG -R TCGGCAGGTGAGTYGTTRCACAYTCCT GCA_000146045.2_R64_genomic.fna
[INFO] 1 primer pair loaded

ADD REPLY • link 24 months ago by O.rka ▴ 740

2

Entering edit mode

Looks like there are mismatches in that .fna file compared to the primer sequences. (Chiming in here because I'm working on something similar and hadn't known seqkit could do this either.) What's odd is that it seems to count the IUPAC as mismatches. The help text for -m says max mismatch when matching primers, no degenerate bases allowed. Is that saying that if you enable mismatches, you can't simultaneously have IUPAC codes match?

$ seqkit amplicon -M -m 4 -F GGCTACCACATCYAARGAAGGCAGCAG -R TCGGCAGGTGAGTYGTTRCACAYTCCT GCA_000146045.2_R64_genomic.fna | grep "^>"
[INFO] 1 primer pair loaded
>BK006945.2 TPA_inf: Saccharomyces cerevisiae S288C chromosome XII, complete sequence mismatches=6(2+4)

Edit: to be more precise I think there's one mismatch, in the reverse primer. I think this shows that the behavior is what I thought above (you get to specify mismatches, or use IUPAC, but not both):

$ seqkit amplicon -M -F GGCTACCACATCYAARGAAGGCAGCAG -R TCGGCNGGTGAGTYGTTRCACAYTCCT GCA_000146045.2_R64_genomic.fna | grep "^>"
[INFO] 1 primer pair loaded
>BK006945.2 TPA_inf: Saccharomyces cerevisiae S288C chromosome XII, complete sequence mismatches=0(0+0)

Edit again: Actually there are two matches for each primer in chromosome 12 here. Note what seqkit says about that: 1. Only one (the longest) matching location is returned for every primer pair. In this case it gives you the 12,572 nt spanning both of those, instead of one or the other of the separate 3,435 nt regions.

ADD REPLY • link 24 months ago by Jesse ▴ 850

0

Entering edit mode

Ok so this pretty tricky. I'm wondering what the best tool/method/parameters could be useful for pulling out 18S (partial) -> ITS1 -> 5.8S -> ITS2 -> 28S rRNA (partial) from a genome? These were said to be universal primers for euks but it looks like that might not be the case.

How did you identify how many matches there were from seqkit amplicon?

ADD REPLY • link 24 months ago by O.rka ▴ 740

1

Entering edit mode

I honestly did that part in a stupid way: I took the match seqkit gave me for each primer sequence, and searched for that matching sequence in the .fna file in a text editor, noting the position. (Oh and I removed the line-wrapping in the file at the start as a matter of course, knowing 2-line FASTA is handy in general.)

In a less stupid way you could use BLAST maybe?

$ cat primers.fa
>fwd
GGCTACCACATCYAARGAAGGCAGCAG
>rev
TCGGCAGGTGAGTYGTTRCACAYTCCT
$ blastn -task blastn-short -query primers.fa -subject GCA_000146045.2_R64_genomic.fna  -outfmt 6 | awk '$3>80 && $4>20'
fwd     BK006945.2      92.59   27      2       0       1       27      457330  457304  1e-08   48.4
fwd     BK006945.2      92.59   27      2       0       1       27      466467  466441  1e-08   48.4
rev     BK006945.2      85.19   27      4       0       1       27      453896  453922  3e-05   36.8
rev     BK006945.2      85.19   27      4       0       1       27      463033  463059  3e-05   36.8

I don't know much about 16S/18S stuff, but I have a feeling there are probably tools out there that are better suited to what you're looking for. I think seqkit amplicon or other primer/adapter trimming tools are expecting to deal with a large number of short reads like from amplicon sequencing, rather than long contiguous sequences like chromosomes where you might have multiple matches. Maybe somebody else will have tips for that, though.

ADD REPLY • link 24 months ago by Jesse ▴ 850

2

Entering edit mode

Sorry, but I can't help you on that any further - what you are trying here is way more complicated than what I ever used seqkit amplicon for. Maybe you are lucky and @shenwei356 fancies an anwer.

For 16S / 18S amplicon analysis, there might indeed be specialized tools around. If I need software inspiration, I usually check what tools are used in respective pipelines e.g. nf-core ampliseq or Biobakery. or look around on Github.

ADD REPLY • link 24 months ago by Matthias Zepper 5.0k