Hi there,
My data consist of multiplexed MiSeq reads generated from amplicons of ~500 bp loci of interest (7 loci total) from 60 populations (each population being a pool of ~400 individual worms of a parasitic worm species). The sequences are barcoded by population, and each population's read set (fastq files) has been mapped with bowtie2 to the 7 reference amplicon loci (i.e. I have 60 BAM files, one for each population).
Though these loci are only ~500 bp long, both intra- and inter-population levels of genetic polymorphism are extremely high in this worm, with roughly 20-50 sites within each population showing variants at varying frequencies.
I would like to segregate out the unique "haplotypes" that exist at each locus within each population. Note that I do not mean to associate haplotypes across loci (impossible, as these are pooled data), only to segregate out the unique haplotypes that exist at each single ~500 bp locus. The picture below gives a good idea of how the .bam alignments look and the level of polymorphism that exists within them. Each SNV tends to associate with many other SNVs across the locus, which is why I call them haplotypes. But really I just want to segregate out and count groups of unique amplicons, and ideally report their frequency in each of the alignments.
Unfortunately, I'm having a hard time finding any tools or methods that can accomplish this task. Most tools that call haplotypes deal with diploid data (not populations), with relatively low numbers of variant sites across much larger regions (linking SNPs across whole-genome sequence data).
I can't seem to find a tool that can essentially segregate out all the unique amplicons that exist above a certain frequency threshold within large, deep-sequenced amplicon datasets. I have looked at the tools within GATK, VCF/BCF tools, and standalone tools, including Illumina's own suggested tools as recommended on BaseSpace, which is strange because deep re-sequencing of amplicons is something that Illumina advertises on their website.
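To make the goal concrete, here is a minimal Python sketch of the kind of counting I have in mind. It is only an illustration under some assumptions: the reads have already been pulled out of a BAM file as strings in reference coordinates, each read spans the whole locus (e.g. merged read pairs), and indels are ignored. The function name `haplotype_frequencies`, its parameters, and the toy reads are all hypothetical, not part of any existing tool.

```python
from collections import Counter


def haplotype_frequencies(reads, variant_sites, min_freq=0.01):
    """Count distinct allele combinations across the given variant sites.

    reads         -- iterable of strings, each aligned to the locus in
                     reference coordinates (hypothetical, pre-extracted input)
    variant_sites -- 0-based positions of the polymorphic sites
    min_freq      -- drop haplotypes rarer than this fraction of usable reads
    """
    if not variant_sites:
        raise ValueError("variant_sites must not be empty")

    last_site = max(variant_sites)
    counts = Counter()
    for seq in reads:
        # Skip reads that do not cover every variant site.
        if len(seq) <= last_site:
            continue
        hap = "".join(seq[i] for i in variant_sites)
        # Skip reads with an ambiguous base at any variant site.
        if "N" in hap:
            continue
        counts[hap] += 1

    total = sum(counts.values())
    if total == 0:
        return {}

    # Keep haplotypes at or above the threshold, most common first.
    return {
        hap: n / total
        for hap, n in counts.most_common()
        if n / total >= min_freq
    }


# Toy example: 10 reads from one population, two variant sites.
toy_reads = ["ACGTAC"] * 6 + ["ATGTGC"] * 3 + ["ACGTGC"] * 1
freqs = haplotype_frequencies(toy_reads, variant_sites=[1, 4], min_freq=0.2)
print(freqs)
```

In this toy case the haplotype seen in only 1 of 10 reads falls below the 0.2 threshold and is dropped. With real data the threshold would need to sit above the expected sequencing and PCR error rate, which this sketch does not model.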
Any help or guidance as to how to proceed would be very much appreciated. Perhaps I'm just looking in the wrong places. However I'm a bit lost as to how to proceed!
Thanks so much.
Hi arezansoff, I have a very similar question. I'm working with two genes amplified from pooled DNA (several individuals per population), for a total of 80 populations. So I start with 80 .fastq files. My goal is to obtain the haplotypes of those genes in each population. Did you find a good way to do that? Regards, Ademir
You may want to post this as a new question. Original poster in this thread has not been seen on Biostars for a long time.