Question

Detecting known, low frequency SNPs in bacterial populations

0

Entering edit mode

5.8 years ago

jleehan ▴ 120

Hi all,

I'm using NGS data to observe how a community of congenic bacteria evolve over time. Basically, I'm starting by inputting several different alleles of one gene into the same medium and seeing how that population changes over time by sequencing the DNA of that gene.

As far as I can tell, all the SNP calling pipelines appear to be more for genome wide discovery than something like what I am doing.

Currently, I am using VarScan and LoFreq to determine the allele frequencies in this community. The agreement between the two pipelines is decent, but as of now VarScan doesn't seem to be able to detect samples that are present in less than 20% of the population (despite what was reported in this paper and all the other programs are reported to be even worse) and I would like to be able to use at least two different programs to verify the presence of these alleles.

Does anyone know of a better low frequency SNP calling pipeline or perhaps a program that allows you to call SNPs based on known alleles (this would actually be the ideal, but I haven't seen anything like this anywhere that I've read)?

Thanks!

P.S. I'm relatively new to NGS work so please bear with me if I've made some egregious oversight here.

snp next-gen • 1.7k views

ADD COMMENT • link updated 5.8 years ago by Chris Miller 22k • written 5.8 years ago by jleehan ▴ 120

0

Entering edit mode

What parameters did you use for VarScan?

ADD REPLY • link 5.8 years ago by Asaf 10k

0

Entering edit mode

I used the most basic parameters as I have never used VarScan before:

VarScan mpileup2snp $i -o ${SAMPLE}.snp

That's the format I'm using

ADD REPLY • link 5.8 years ago by jleehan ▴ 120

0

Entering edit mode

Weird, it should report variants with > 0.01 frequency. On the other hand, it requires at least 2 variant reads out of at least 8 reads that cover the position so if the coverage is low you might lose it. Another option to check is the --strand-filter which might filter out low count reads even further. Take a look at the parameters: http://varscan.sourceforge.net/using-varscan.html

ADD REPLY • link 5.8 years ago by Asaf 10k

0

Entering edit mode

It should be fairly high coverage, considering I am only sequencing a 505 and 564 bp region and each multiplex is giving me about 1.7M reads, but I will check look into the parameters more

ADD REPLY • link 5.8 years ago by jleehan ▴ 120

score 1 · Answer 1 · 2019-01-28

1

Entering edit mode

5.8 years ago

Chris Miller 22k

As noted above, the parameters for varscan matter a lot. Relaxing them substantially will probably be helpful
If you have known positions, you can just use bam-readcount to query those directly, then apply whatever post-hoc filtering or statistics you prefer.

ADD COMMENT • link 5.8 years ago by Chris Miller 22k

0

Entering edit mode

Hi Chris, thanks for your input!

I haven't heard of this bam-readcount program before, but it sounds very exciting. I can't seem to find online what the exact output of that program would be. Do you happen to know how that would fit into a VarScan pipeline? My current pipeline (not including qc steps) is: bcl2fastq > bowtie2 > samtools index > view > sort > mpileup > VarScan

Essentially I'm asking: what would I need to do with whatever output bam-readcount provides to continue along my pipeline?

ADD REPLY • link 5.8 years ago by jleehan ▴ 120

0

Entering edit mode

The output is specified in the readme here: https://github.com/genome/bam-readcount It gives you counts of each base at that position, along with some extra qc information (on quality, strand bias, etc)

You will have to look at those numbers and formulate a way to set a minimum level of detection that is sensible given your experimental design. (is one variant read enough? If not, how many? what's the error rate? can you combine multiple SNP sites to increase power, etc)

ADD REPLY • link 5.8 years ago by Chris Miller 22k