Hello everyone. First, I am sorry for this long and rather basic question. I am trying to build a SNP-calling pipeline for Oxford Nanopore MinION long reads. I need to test the pipeline, but there seems to be very little test data available. I only have the NA12878 data from this address: https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md
I downloaded the FAST5 data labeled "FAB43577" (listed as containing 427,215 reads and 2,776,702,333 bases). I basecalled it with Guppy v5.0.1 using the command:
guppy_basecaller -i /home/huk/Desktop/nanopore_data/na12878_fast5/data2/UCSC/FAB43577-3574887596_Multi -s /home/huk/Desktop/nanopore_data/na12878_fast5/data2/guppy_out -c dna_r9.4.1_450bps_fast.cfg --trim_barcodes --trim_strategy dna --num_callers 1 --cpu_threads_per_caller 12
Then I merged all FASTQ files in the "pass" folder of the Guppy output with the "cat" command to obtain a single FASTQ, which I aligned with minimap2:
minimap2 -ax map-ont -t 12 /home/huk/Desktop/references/hg38/hg38.mmi /home/huk/Desktop/nanopore_data/na12878_fast5/data2/guppy_out/pass/all_data.fastq --MD > /home/huk/Desktop/nanopore_data/na12878_fast5/data2/minimap_output/mapped_12878_2.md.sam
I converted the SAM file to a BAM file with samtools, and sorted and indexed it as well. Then I used longshot for variant calling on chr20 only, with the command:
longshot --bam /home/huk/Desktop/nanopore_data/na12878_fast5/data2/minimap_output/mapped_12878_2_sorted_md.bam --ref /home/huk/Desktop/references/hg38/hg38.fa -F -r chr20 --out /home/huk/Desktop/nanopore_data/na12878_fast5/data2/vcf_output/longshot_result.vcf
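For reference, the SAM-to-BAM conversion, sorting, and indexing step described above can be sketched as follows. This is only a minimal sketch, not necessarily the exact commands that were run: the function name `sam_to_sorted_bam`, the default thread count, and the file names in the usage comment are placeholders chosen for illustration.

```shell
# Sketch: turn a SAM file into a coordinate-sorted, indexed BAM.
# Arguments: $1 = input SAM, $2 = output BAM, $3 = sort threads (default 4, arbitrary)
sam_to_sorted_bam() {
    sam_in=$1
    bam_out=$2
    threads=${3:-4}
    # samtools sort reads SAM directly; -O bam writes BAM to the -o path
    samtools sort -@ "$threads" -O bam -o "$bam_out" "$sam_in" || return 1
    # the index has to be built on the sorted file, so indexing comes last
    samtools index "$bam_out"
}

# Hypothetical usage (file names are placeholders):
#   sam_to_sorted_bam mapped.sam mapped_sorted.bam 12
```

Sorting before indexing matters here, because `samtools index` expects a coordinate-sorted BAM.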
My final VCF has 827 variants (without filtering). I downloaded the high-confidence VCF file for NA12878 from this link: https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh38/supplementaryFiles/
In this VCF, chr20 has about 67,957 SNPs. I compared the variants, and only 8 of them are common to both VCFs.
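One simple way to do this kind of comparison is to reduce each record to a CHROM:POS:REF:ALT key and count the keys found in both files. The sketch below is an illustration only, not necessarily how the comparison above was done: the function names are made up, it assumes uncompressed VCF input (a `.vcf.gz` file would need decompressing first), and it treats multi-allelic records as a single key without splitting them.

```shell
# Print one CHROM:POS:REF:ALT key per variant record, sorted and de-duplicated.
vcf_keys() {
    # skip header lines (starting with '#'); VCF columns are tab-separated
    awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$1" | LC_ALL=C sort -u
}

# Count the variant keys that appear in both VCF files.
count_shared_variants() {
    keys_a=$(mktemp)
    keys_b=$(mktemp)
    vcf_keys "$1" > "$keys_a"
    vcf_keys "$2" > "$keys_b"
    # comm -12 keeps only lines common to both sorted inputs
    LC_ALL=C comm -12 "$keys_a" "$keys_b" | wc -l | tr -d ' '
    rm -f "$keys_a" "$keys_b"
}

# Hypothetical usage (file names are placeholders):
#   count_shared_variants longshot_result.vcf giab_chr20.vcf
```

Matching on the exact key means that differences in chromosome naming (for example "chr20" versus "20") or in allele representation would show up as mismatches, so it is worth checking both files use the same conventions.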
I also used nanopolish index and nanopolish variants for variant calling, but the final VCF is completely empty (only the standard VCF headers and comments).
I am not sure why I get such a low number of variants. If anyone can give me a hint or tell me what I am doing wrong, I would be really grateful. I am completely stuck here. If you know of any other test data for variant calling with Oxford Nanopore MinION reads, that would be great too.
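One sanity check that can be made from the numbers quoted above is the approximate sequencing depth of this single run: total bases divided by genome size. The sketch below uses the base count listed for FAB43577; the genome size of roughly 3.1 Gb is an assumption (an approximate human genome length, not a figure from the data page), and `approx_depth` is a made-up helper name.

```shell
# Rough average depth = total sequenced bases / genome size.
# $1 = total bases, $2 = genome size in bases (assumed value, see above)
approx_depth() {
    awk -v bases="$1" -v genome="$2" 'BEGIN { printf "%.2f\n", bases / genome }'
}

# 2,776,702,333 bases (from the data page) over an assumed ~3.1 Gb genome
approx_depth 2776702333 3100000000   # prints 0.90
```

This is an upper bound on usable depth, since only the "pass" reads that actually map contribute. With an average below 1x, most positions would be covered by at most one read, which is worth keeping in mind when interpreting the variant counts.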
Thank you in advance.
Maybe this is off-topic, but if you are testing a pipeline for SNP calling with ONT reads, you should use the high-accuracy mode (dna_r9.4.1_450bps_hac.cfg) instead of the fast mode (dna_r9.4.1_450bps_fast.cfg).
Actually, the high-accuracy mode takes too long. But I also tested the pipeline on the basecalled FASTQ reads that I downloaded from the same web page. The results are the same.