Question

Variant Caller for Bacterial Amplicon Sequenced with Nanopore

2

Entering edit mode

22 days ago

gtiegs ▴ 20

I am trying to perform variant calling on amplicon reads generated by nanopore sequencing, but I am having problems finding a tool that fits my experiment.

Sample preparation: To prepare my samples I extracted DNA from wildtype and mutant bacteria (pure cultures), performed PCR, then mixed the mutant and wildtype amplicons to 1:50 (mut:wt) for sequencing.

Goal: My goal is to see if I can pick out the known mutant SNPs/indels from the reference wildtype sequence.

Context:

My amplicon is 1.2kb long.
My PI chose Nanopore over Illumina sequencing because this region of the genome is highly repetitive.
I am doing all my analysis on an HPC cluster.

Progress: So far I have used minimap2 to align the sequences and create SAM files. I then used samtools to convert SAM to BAM, which I viewed on IGV and I was able to pick out some of the target mutations visually (~1-5% of aligned reads). When it comes to variant calling, I am using bcftools (configured for ont data), but have only produced empty vcf files (header but no SNPs).

Here is my script for variant calling with bcftools:

# Define reference and parameters
REF="/path/to/reference"  # Reference sequence
CONFIG="ont"       # bcftools profile for ONT data
PRIOR="0.01"        # Prior for SNPs 

# Run bcftools mpileup for all .bam files in the current directory
for BAM in *.bam; do
 BASENAME=$(basename "$BAM" .sorted.bam)  # Extract the base name
 VCF="bcftools_output/${BASENAME}_variants.vcf"
 bcftools mpileup -Ou -f $REF -X $CONFIG -d 1000 $BAM | \   # Max read depth increased from 250 to 1000 to account for higher coverage
 bcftools call -mv --ploidy 1 -P $PRIOR -Ov -o $VCF   # Set as haploid (1) and prior to 0.01 to account for low freq SNPs
 echo "Processed $BAM"
done

I have tried medaka (medaka_variant) and the EPI2ME wf-amplicon pipeline as well with no luck. Unfortunately, I don't have access to signal-level data so I don't believe Nanopolish or Longshot would be applicable. This tool from minimap2 looks promising, but it ignores alignments <50kb and my gene is only 1.2kb long. I'm sure there are ways to customize, but I am a beginner at programming and wouldn't know where to start.

Questions:

Are there any elements in my script or general workflow that might be causing my lack of results?
Should I be filtering my reads/alignment quality before variant calling?
Are there more parameters I should be considering when tailoring tools?
Are there tools I haven't mentioned that would be worth looking into?

Any opinions/guidance/suggestions would be greatly appreciated!

Bacterial-Genome Variant-Calling Nanopore Amplicon-Sequencing • 626 views

ADD COMMENT • link updated 20 days ago by colindaven 7.0k • written 22 days ago by gtiegs ▴ 20

1

Entering edit mode

I don't have access to signal-level data

Do you know how the data you have was generated? Was it basecalled with "high or super accuracy" models/with dorado?

ADD REPLY • link 22 days ago by GenoMax 148k

0

Entering edit mode

Thanks for your response! The service we used is called Plasmidsaurus. Here's what I could find on their website:

"We sequence each sample with Oxford Nanopore long reads to very high depth before generating a consensus/assembly using the latest basecalling and polishing software:

We construct an amplification-free long-read sequencing library using the newest v14 library prep chemistry, including minimal fragmentation of the input linear DNA in a sequence-independent manner We sequence the library with a primer-free protocol using the most accurate R10.4.1 flow cells (raw data is delivered in .fastq format). We use the re-assembled raw reads to generate a high-accuracy linear consensus sequence from the raw reads. For standard linear/PCR samples, we will also return a set of feature annotations."

ADD REPLY • link 22 days ago by gtiegs ▴ 20

0

Entering edit mode

That does not completely answer my question. Do you see anything in the headers of your fastq files that look like dna_r10.4.1_e8.2_400bps_hac@v5.0.0? This gives information about the model used for basecalling.

ADD REPLY • link 22 days ago by GenoMax 148k

0

Entering edit mode

The first line of my fastq file looks like this: @1_9TNZQD_1. 9TNZQD_1 is the name of the sequence run, and no other information is listed. Will I need this information in order to do my analysis?

ADD REPLY • link 22 days ago by gtiegs ▴ 20

0

Entering edit mode

You may want to ask Plasmidsarus what model is used for the generation of your data.

then mixed the mutant and wildtype amplicons to 1:50 (mut:wt) for sequencing.

Why was mutant sample set at such a low relative conc if that is where you want to identify the mutations. Are these expected to be rare? Perhaps that is why you are unable to call them reliably.

ADD REPLY • link 21 days ago by GenoMax 148k

0

Entering edit mode

This experimental design is usually done for validation studies, either validating the sequencing or the bioinformatics. I'm about to do the exact same experiment for HIV variant calling pipeline.

The PI could be off his chops and I can't read their mind. I wouldn't be surprised if have little to no experience with bioinformatics or sequencing...

ADD REPLY • link 21 days ago by Mark ★ 1.6k

score 0 · Answer 1 · 2024-11-30

bcftools is not the greatest for variant calling on ONT data. The current state of the art is Clair3 and then Medaka. I believe nowadays the basecalling model is embedded in the fastq file read headers which you should able to inspect (zcat data.fastq.gz | less).

My advice would be:

Do not remove duplicate reads (you haven't but needs mentioning) because you are performing amplicon sequencing.
Before mapping, filter short reads. If your amplicon is ~1.2kb, and assuming library prep was using ligation kit, filter to keep reads between 900-1300bp. Have a look at read length distribution fastcat tool from ont https://github.com/epi2me-labs/fastcat. If rapid barcoding (this is probably the case) kit was used, then you need to drop the lower limit to minimum 200 (do not go below 200bp).
Use the minimap2 ont preset (either -ax map-ont or -ax lr:hq)
Use Clair3 for variant calling, it will detect the right model to use.
Apply sane threshold filtering to the variants. eg min number of reads (or depth), min quality threshold.
Pray your PI is not a dumbass, that he knows what he's doing.

Did you inspect the outputs of wf-amplicon, specifically the BAM file generated? THeir pipelines are generally good overall.

Finally, at 2% it might not be possible to detect minority variants using nanopore sequencing due to the higher per read error rate.

My questions to you are:

Echoing GenoMax: what is the reasoning behind this experimental design?
Do you have replicates? (either biological or technical)
Do you have replicates with high ratios of wt/mut? eg 1:10, 1:20? This will help you set the threshold at which you can detect minority variants.
What was the depth of the sequencing experiment?

I would design my experiment in this fashion:

Different dilutions, eg 1:10, 1:20, 1:40, 1:50 to detect at which point I can accurate detect replicates
Perform sequencing in duplicates (this is especially important for the low level minority variant detection eg 1:50
Sequence the neat plasmids. Yes yes I am sure your PI thinks he knows what the plasmids are but you need to sequencing them using the same platform. This will help with seeing sequencing platfrom errors as well.

Well done on posting a clear concise question and providing as much information as possible. You're going to make a great bioinformatician!

score 0 · Answer 2 · 2024-12-01

0

Entering edit mode

20 days ago

colindaven 7.0k

Longshot is applicable, it does not require signal level data - https://github.com/pjedge/longshot. It is also simple to run.

Deepvariant works on ONT data but I don't know if WGS is required.

ADD COMMENT • link 20 days ago by colindaven 7.0k