I have been testing some bioinformatics pipelines for finding variants in SARS-CoV-2 sequencing samples.
This is one of them, and this is another one.
I used ART to simulate sequencing reads from a copy of the SARS-CoV-2 reference sequence into which I had introduced some indels and SNPs.
In some cases, both pipelines fail to find all the variants. For example, both sometimes miss insertions of 6 nucleotides.
Could I be doing something wrong, or is this to be expected? Can you recommend any other pipeline?
Instead of running the entire pipeline, start by generating an alignment with bwa, then call variants with freebayes.
You can do that in just a few lines of code (just typing this out so typos may be present):
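# build the bwa index for the reference (only needed once)
bwa index reference.fa
# align the paired-end reads and sort the alignments by coordinate
bwa mem reference.fa read1.fq read2.fq | samtools sort > output.bam
samtools index output.bam
# call variants against the same reference
freebayes -f reference.fa output.bam > output.vcf
# evaluate the vcf here
# apply some filtering on vcf (perhaps with vcftools) and evaluate again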
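For the evaluation step, simulated data has the advantage that you know exactly which variants should be found. Here is a minimal sketch, assuming you have written the variants you introduced to a VCF file (called `truth.vcf` here; that name and the `missed_variants` function are made up for illustration). It lists every truth variant that has no exactly matching record in the calls. Indels can be written in more than one way, so exact matching may report an indel as missed when it was actually called at a shifted position; normalizing both files first (for example with `bcftools norm`) should reduce that.

```shell
# List variants that are in the truth VCF but not in the called VCF.
# Usage: missed_variants truth.vcf calls.vcf
# Records are compared by an exact CHROM:POS:REF:ALT key, so differently
# represented indels or multi-allelic ALT fields will not match.
missed_variants() {
    awk -F'\t' '
        /^#/ { next }                                  # skip VCF header lines
        { key = $1 ":" $2 ":" $4 ":" $5 }              # CHROM:POS:REF:ALT
        FILENAME == ARGV[1] { called[key] = 1; next }  # first file: the calls
        !(key in called) { print key }                 # second file: the truth
    ' "$2" "$1"
}
```

Running `missed_variants truth.vcf output.vcf` after each change to the caller or the filters shows which simulated variants are still not recovered.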
Now start looking at how to tune freebayes and how to further filter the resulting VCF file.
It will give you a sense of what the parameters do and where the tradeoffs are. Then swap the variant caller for a different one.
When a variant is missed, find out where: is it never called at all, or is it called and then lost at the filtering step?
You will gain a much deeper understanding this way.
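To make the "where is it lost" question concrete, here is a rough sketch in plain shell and awk. The function names (`filter_qual`, `has_pos`, `trace_variant`) and the QUAL threshold are hypothetical, and a real filter (vcftools, bcftools) will usually look at more than QUAL; the point is only to show how to check a single position before and after filtering.

```shell
# Keep header lines and records whose QUAL (column 6) is at least MIN.
# Usage: filter_qual MIN in.vcf > filtered.vcf
filter_qual() {
    awk -F'\t' -v min="$1" '/^#/ || ($6 != "." && $6 + 0 >= min + 0)' "$2"
}

# Succeed if the VCF contains a record at position POS (column 2).
# Usage: has_pos POS file.vcf
has_pos() {
    awk -F'\t' -v p="$1" '!/^#/ && $2 == p { found = 1 } END { exit !found }' "$2"
}

# Report at which stage the variant at POS disappears.
# Usage: trace_variant POS raw.vcf filtered.vcf
trace_variant() {
    if ! has_pos "$1" "$2"; then
        echo "not called"
    elif ! has_pos "$1" "$3"; then
        echo "lost at filtering"
    else
        echo "kept"
    fi
}
```

For example, after `filter_qual 20 output.vcf > filtered.vcf`, calling `trace_variant 200 output.vcf filtered.vcf` tells you whether the variant simulated at position 200 was never called, was removed by the filter, or survived.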
Variant calling on SARS-CoV-2 is much simpler than variant calling on human genomes: the genome is far more densely packed with information, there are very few insertions and deletions, and the coverage is usually quite high.
Indeed, I only executed the bwa step (which is the same as yours) and then a pileup and VarScan. But it looks like freebayes works far better. It works perfectly on the tests where the other ones failed. For now, it just seems it could have problems with large repeated insertions. I will run some more tests.
Usually one applies several pre- and post-variant calling filters, to avoid false variants. Inspect the pipelines you are using (their documentation or their source code) to find such filters.
Are you simulating amplicon data? Did you notice the second pipeline is geared towards amplicon sequencing of SARS-CoV-2?
I think it's not exclusively for amplicon data: "Optionally, if the data has been obtained through amplicon (...)"