LoFreq is a fast and sensitive variant-caller for inferring single-nucleotide variants (SNVs) from high-throughput sequencing data. It is designed to robustly call low-frequency variants by exploiting base-call quality values. LoFreq has been used to call rare variants in viral and bacterial sequencing datasets and can be used to study mitochondrial heteroplasmy and rare somatic mutations in heterogeneous tumors.
LoFreq makes full use of base-call qualities, which are usually ignored by other methods or only used for filtering. It is very sensitive; most notably, it is able to predict variants below the average base-call quality. Each SNV call is assigned a p-value which allows for rigorous control of false positive rates. Even though it uses no approximations or heuristics it is very efficient due to several run-time optimizations. LoFreq is generic and fast enough to be applied to high-coverage data but also longer genomes. It takes a minute to analyse a Dengue genome sequencing data of nearly 4000X coverage, roughly one hour to call SNVs on a 600X coverage E.coli genome and 1.5 hours to run on a 100X coverage human exome data.
ADD COMMENT
• link
updated 17 months ago by
Ram
44k
•
written 12.1 years ago by
Andreas
★
2.5k
0
Entering edit mode
Hi Andreas, congrats on the paper, that ROC is very convincing. In the paper, you use SNVer's individual mode. Have you compared the metrics for pooled variant calling between LoFreq and SNVer?
I ask because I'm still looking for the best way to call low frequency variants in pools. Another question, is (how?) do you handle multiple pools? e.g. 50 case pools and 50 control pools, each with a given ploidy?
LoFreq makes no assumptions on ploidy and the number of individuals in a pool. The main advantage here is that you can therefore apply it to anything, even - or especially - if you don't know the number of individuals in your sample. In fact, you might even want to use LoFreq's results to estimate the pool size. Examples for this would be RNA virus quasispecies, heteroplasmy, bacterial subpopulations, (heterogenous) cancer samples etc. The advantage of methods specialized in pooled data with known number of individuals is that they can incorporate the number of individuals as prior. You can certainly run LoFreq on this kind of data as well and it will call variants correctly, but I'm not sure how much more sensitive (if at all) methods with corresponding priors built-in are and this type of comparison wouldn't really compare like with like. Go ahead and give it a go. I'd be interested to hear about the results and I'm also happy to help where needed.
Hi Andreas,
I was wondering if you could clarify something up for me.
In your algorithm for SNV calling, the sum of the tail of the pmf for your probability tells us there is a variant at a particular base position if the P-value that arises from it is below a certain threshold. what's implied, if I understand correctly, is that the probability of there being as many or more errors as there are variant bases is unlikely and therefore we can conclude that at least one of the mismatches at that loci is an actual variant. However, once you've established that a variant or multiple variants do exist in at a base position, how do you go about setting them apart? What I mean is, if we are dealing with pooled variant calling and there are multiple legitimate variants at one loci (say one is germline, one somatic), does your algorithm call each one of these specifically?
Thanks
sorry for the late reply. I seemed to have overlooked the Biostar notification.
To answer your question: All possible variants are checked and their counts tested against the pmf. So yes, LoFreq can call multiple variants per position.
Hi Andreas. Congrats on the paper from me as well. I have a more practical question. For a large amount of data would it be acceptable/advisable to partition the genome using the '--regions=REGION_BED' option and then merge the result? For example, dividing the human genome into 50 approximately equal pieces. Downsides to this approach?
yes, sure you can do this and run a couple of instances of lofreq_snpcaller.py simultaneously, each pointing to a different region/bed-file (option -l). In those cases it's best to set a Bonferroni factor and pass it to the script (option --bonf). Otherwise it will be determined automatically based on the number of tests performed, i.e. different values would be used for each run.
We'll release a completely rewritten version soon that will be able to run in parallel.
Any advice on how I would determine the value to set it at? The docs describe the default as "a stringent seqlen*3". Is seqlen the size of my genome? Or perhaps I should pre-compute the genome wide position count meeting some minimum coverage level?
Both approaches are actually reasonable. In theory you should use the number of tests performed, which is the sequence length multiplied by three (for three alternate bases). That is the conservative default.
your understanding of the algorithm's core is almost correct, with the only exception of "therefore we can conclude that at least one of the mismatches at that loci is an actual variant.": We compute the pmf for each loci and then test each number of observed non-ref bases at that loci using the pmf. If you get a significant p-value (after multiple testing correction) those non-ref bases constitute "a" variant at this loci. This is completely agnostic of any ploidy level and pooling / known number of haplotypes. We only report an allele frequency.
LoFreq was initially designed for RNA virus samples which have by definition an unknown number of haplotypes (quasispecies), i.e. you expect lots of low frequency variants. At the moment LoFreq makes no attempt to "tell those apart", which would amount to calling haplotypes. LoFreq will test all possible non-reference base at a position and give you the allele-frequency if deemed significant. We make no distinction between pooled and non-pooled samples.
Its somatic SNV calling logic is an extension to this generic idea, in which SNVs are called in both samples and the ones unique to one sample (usually tumour) are used for a subsequent binomial test, which checks whether the allele frequency in the other (e.g. normal) sample, can explain the number of bases in the tumour sample.
I hope that answered your question. If not, feel free to mail me to discuss this further.
Andreas
ADD COMMENT
• link
updated 3.2 years ago by
Ram
44k
•
written 11.7 years ago by
Andreas
★
2.5k
Hi Andreas, congrats on the paper, that ROC is very convincing. In the paper, you use SNVer's individual mode. Have you compared the metrics for pooled variant calling between LoFreq and SNVer?
I ask because I'm still looking for the best way to call low frequency variants in pools. Another question, is (how?) do you handle multiple pools? e.g. 50 case pools and 50 control pools, each with a given ploidy?
thanks.
Thanks Brent.
LoFreq makes no assumptions on ploidy and the number of individuals in a pool. The main advantage here is that you can therefore apply it to anything, even - or especially - if you don't know the number of individuals in your sample. In fact, you might even want to use LoFreq's results to estimate the pool size. Examples for this would be RNA virus quasispecies, heteroplasmy, bacterial subpopulations, (heterogenous) cancer samples etc. The advantage of methods specialized in pooled data with known number of individuals is that they can incorporate the number of individuals as prior. You can certainly run LoFreq on this kind of data as well and it will call variants correctly, but I'm not sure how much more sensitive (if at all) methods with corresponding priors built-in are and this type of comparison wouldn't really compare like with like. Go ahead and give it a go. I'd be interested to hear about the results and I'm also happy to help where needed.
Andreas
Hi Andreas, I was wondering if you could clarify something up for me. In your algorithm for SNV calling, the sum of the tail of the pmf for your probability tells us there is a variant at a particular base position if the P-value that arises from it is below a certain threshold. what's implied, if I understand correctly, is that the probability of there being as many or more errors as there are variant bases is unlikely and therefore we can conclude that at least one of the mismatches at that loci is an actual variant. However, once you've established that a variant or multiple variants do exist in at a base position, how do you go about setting them apart? What I mean is, if we are dealing with pooled variant calling and there are multiple legitimate variants at one loci (say one is germline, one somatic), does your algorithm call each one of these specifically? Thanks
Hi Myk,
sorry for the late reply. I seemed to have overlooked the Biostar notification.
To answer your question: All possible variants are checked and their counts tested against the pmf. So yes, LoFreq can call multiple variants per position.
Andreas
Hi Andreas. Congrats on the paper from me as well. I have a more practical question. For a large amount of data would it be acceptable/advisable to partition the genome using the '--regions=REGION_BED' option and then merge the result? For example, dividing the human genome into 50 approximately equal pieces. Downsides to this approach?
Hi Malachi,
yes, sure you can do this and run a couple of instances of lofreq_snpcaller.py simultaneously, each pointing to a different region/bed-file (option -l). In those cases it's best to set a Bonferroni factor and pass it to the script (option --bonf). Otherwise it will be determined automatically based on the number of tests performed, i.e. different values would be used for each run.
We'll release a completely rewritten version soon that will be able to run in parallel.
Andreas
Any advice on how I would determine the value to set it at? The docs describe the default as "a stringent seqlen*3". Is seqlen the size of my genome? Or perhaps I should pre-compute the genome wide position count meeting some minimum coverage level?
Both approaches are actually reasonable. In theory you should use the number of tests performed, which is the sequence length multiplied by three (for three alternate bases). That is the conservative default.