I am working on a research project to define novel variants in a compact gene of interest from patient samples using long-read, single-molecule Oxford Nanopore amplicon sequencing. A ~6.1 kb fragment was cleanly isolated by PCR and sent to Plasmidsaurus for amplicon sequencing. The summary files show SNPs that are clearly heterozygous, indicating two alleles for each patient. I would now like to phase the variants to determine with certainty whether each pair of SNPs lies in trans (on opposing alleles) or in cis (on the same allele).
As background, I am a molecular biologist self-trained in command-line tools and competent with Illumina short-read RNA-Seq and WGS. However, I do not have experience with single-molecule long-read sequencing.
If a knowledgeable expert in the community could direct me to an optimal pipeline, starting from raw .fastq reads, for QC, trimming (if necessary?), alignment to the reference PCR amplicon sequence, and phasing, that would be fantastic. With some direction on which tools to use I can probably figure out the command line, although any pointers on critical command-line options would be appreciated. I will post coding problems below if issues arise.
I’ve considered a partial solution: using a grep text search to extract FASTQ reads that a) contain the amplicon and b) can be separated based on their SNPs. But I know a more elegant solution must exist. My other option is to clone each allele into a plasmid and sequence enough clones to recover each allele separately.
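For reference, here is a rough Python sketch of that grep-style idea, not a recommended pipeline. It assumes each SNP can be located by an exact match to short flanking sequences on either side, which will miss many reads given Nanopore error rates, so an alignment-based approach is likely more robust. The function names and any flank sequences passed in are hypothetical placeholders, not taken from my data.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse-complement a DNA string (non-ACGT characters are left as-is)."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fastq(path):
    """Yield (read_name, sequence) from a plain 4-line-per-record FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            seq = fh.readline().strip()
            fh.readline()  # '+' separator line
            fh.readline()  # quality line (unused in this sketch)
            yield header[1:].split()[0], seq


def call_allele(seq, left, right):
    """Return the single base found between exact matches of the left and
    right flanks, trying both strands; None if the flanks are not found."""
    for strand in (seq, revcomp(seq)):
        i = strand.find(left)
        while i != -1:
            j = i + len(left)  # index of the SNP base
            if strand[j + 1 : j + 1 + len(right)] == right:
                return strand[j]
            i = strand.find(left, i + 1)
    return None


def tally_haplotypes(reads, snps):
    """Count allele combinations over reads that cover every SNP.

    reads: iterable of (name, sequence)
    snps:  list of (left_flank, right_flank) tuples, in amplicon order
    """
    counts = Counter()
    for _name, seq in reads:
        alleles = tuple(call_allele(seq, left, right) for left, right in snps)
        if None not in alleles:
            counts[alleles] += 1
    return counts
```

Calling `tally_haplotypes(read_fastq("reads.fastq"), snps)` (the file name is a placeholder) returns a count per allele combination. For two heterozygous SNPs, two dominant combinations would be expected, and which bases pair together indicates cis versus trans; low-count combinations are most likely sequencing errors or PCR chimeras.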
Thanks in advance, --EK
This is what I have come up with so far: mapping the PCR amplicons to a specific chromosome and then phasing the alleles.