Count SNP per read
1
0
Entering edit mode
17 months ago
Saran ▴ 50

Hello,

I have 6 samples of near-whole length sequenced hiv fasta files. The samples cover three different time points and there is a gene +/- sample per time-point. The samples have different numbers of reads of around ~8000 bp. I have performed a multi-seq alignment and variant calling yet I would like to get the number of SNPs per sequence/read as each one represents a separate HIV sequence. Is this possible?

Thanks, Sara

VCF SNP RNA-Seq MSA • 981 views
ADD COMMENT
0
Entering edit mode
17 months ago

I am not aware of a way to do this using VCF, but perhaps read-backed phasing would help here.

One route is to write a script in biopython etc to read through your bam files and check for SNPs using the NM tag (number of mismatches) and parsing the CIGAR string to get the SNP positions.

Another route might not be the most efficient answer, but it is simple (provided you don't have millions of reads, then you would get into trouble with your HPC filesystem manager).

You can write a bash script to do a single alignment per read (eg split your fastq into single reads beforehand, eg using the unix split command). I believe you can use freebayes to call SNPs on single reads. However, they might not be high confidence SNPs, so, especially if you have Nanopore data, you'll need to exclude dodgy regions like homopolymers, indels, and maybe repeat motifs. See if they match the SNPs you are already finding from your all-read alignments.

ADD COMMENT

Login before adding your answer.

Traffic: 2510 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6