I am trying to perform variant calling on amplicon reads generated by nanopore sequencing, but I am having problems finding a tool that fits my experiment.
Sample preparation: To prepare my samples I extracted DNA from wildtype and mutant bacteria (pure cultures), performed PCR, then mixed the mutant and wildtype amplicons to 1:50 (mut:wt) for sequencing.
Goal: My goal is to see if I can pick out the known mutant SNPs/indels from the reference wildtype sequence.
Context:
- My amplicon is 1.2kb long.
- My PI chose Nanopore over Illumina sequencing because this region of the genome is highly repetitive.
- I am doing all my analysis on an HPC cluster.
Progress: So far I have used minimap2 to align the sequences and create SAM files. I then used samtools to convert SAM to BAM, which I viewed on IGV and I was able to pick out some of the target mutations visually (~1-5% of aligned reads). When it comes to variant calling, I am using bcftools (configured for ont data), but have only produced empty vcf files (header but no SNPs).
Here is my script for variant calling with bcftools:
# Define reference and parameters
REF="/path/to/reference" # Reference sequence
CONFIG="ont" # bcftools profile for ONT data
PRIOR="0.01" # Prior for SNPs
# Run bcftools mpileup for all .bam files in the current directory
for BAM in *.bam; do
BASENAME=$(basename "$BAM" .sorted.bam) # Extract the base name
VCF="bcftools_output/${BASENAME}_variants.vcf"
bcftools mpileup -Ou -f $REF -X $CONFIG -d 1000 $BAM | \ # Max read depth increased from 250 to 1000 to account for higher coverage
bcftools call -mv --ploidy 1 -P $PRIOR -Ov -o $VCF # Set as haploid (1) and prior to 0.01 to account for low freq SNPs
echo "Processed $BAM"
done
I have tried medaka (medaka_variant) and the EPI2ME wf-amplicon pipeline as well with no luck. Unfortunately, I don't have access to signal-level data so I don't believe Nanopolish or Longshot would be applicable. This tool from minimap2 looks promising, but it ignores alignments <50kb and my gene is only 1.2kb long. I'm sure there are ways to customize, but I am a beginner at programming and wouldn't know where to start.
Questions:
- Are there any elements in my script or general workflow that might be causing my lack of results?
- Should I be filtering my reads/alignment quality before variant calling?
- Are there more parameters I should be considering when tailoring tools?
Are there tools I haven't mentioned that would be worth looking into?
Any opinions/guidance/suggestions would be greatly appreciated!
Do you know how the data you have was generated? Was it basecalled with "high or super accuracy" models/with
dorado
?Thanks for your response! The service we used is called Plasmidsaurus. Here's what I could find on their website:
"We sequence each sample with Oxford Nanopore long reads to very high depth before generating a consensus/assembly using the latest basecalling and polishing software:
We construct an amplification-free long-read sequencing library using the newest v14 library prep chemistry, including minimal fragmentation of the input linear DNA in a sequence-independent manner We sequence the library with a primer-free protocol using the most accurate R10.4.1 flow cells (raw data is delivered in .fastq format). We use the re-assembled raw reads to generate a high-accuracy linear consensus sequence from the raw reads. For standard linear/PCR samples, we will also return a set of feature annotations."
That does not completely answer my question. Do you see anything in the headers of your fastq files that look like
dna_r10.4.1_e8.2_400bps_hac@v5.0.0
? This gives information about the model used for basecalling.The first line of my fastq file looks like this: @1_9TNZQD_1. 9TNZQD_1 is the name of the sequence run, and no other information is listed. Will I need this information in order to do my analysis?
You may want to ask Plasmidsarus what model is used for the generation of your data.
Why was mutant sample set at such a low relative conc if that is where you want to identify the mutations. Are these expected to be rare? Perhaps that is why you are unable to call them reliably.
This experimental design is usually done for validation studies, either validating the sequencing or the bioinformatics. I'm about to do the exact same experiment for HIV variant calling pipeline.
The PI could be off his chops and I can't read their mind. I wouldn't be surprised if have little to no experience with bioinformatics or sequencing...