Question

Count SNP per read

0

Entering edit mode

17 months ago

Saran ▴ 50

Hello,

I have 6 samples of near-whole length sequenced hiv fasta files. The samples cover three different time points and there is a gene +/- sample per time-point. The samples have different numbers of reads of around ~8000 bp. I have performed a multi-seq alignment and variant calling yet I would like to get the number of SNPs per sequence/read as each one represents a separate HIV sequence. Is this possible?

Thanks, Sara

VCF SNP RNA-Seq MSA • 981 views

ADD COMMENT • link updated 17 months ago by colindaven 7.0k • written 17 months ago by Saran ▴ 50

score 0 · Answer 1 · 2023-06-19

I am not aware of a way to do this using VCF, but perhaps read-backed phasing would help here.

One route is to write a script in biopython etc to read through your bam files and check for SNPs using the NM tag (number of mismatches) and parsing the CIGAR string to get the SNP positions.

Another route might not be the most efficient answer, but it is simple (provided you don't have millions of reads, then you would get into trouble with your HPC filesystem manager).

You can write a bash script to do a single alignment per read (eg split your fastq into single reads beforehand, eg using the unix split command). I believe you can use freebayes to call SNPs on single reads. However, they might not be high confidence SNPs, so, especially if you have Nanopore data, you'll need to exclude dodgy regions like homopolymers, indels, and maybe repeat motifs. See if they match the SNPs you are already finding from your all-read alignments.