Question

SARS-CoV-2 variant calling

4.4 years ago

juanjo75es ▴ 130

I have been testing some bioinformatics pipelines for finding variants in SARS-CoV-2 sequencing samples. This is one of them, and this is another. I used ART to simulate sequencing data from the SARS-CoV-2 reference sequence, to which I had added some indels and SNPs. Both pipelines fail to find all the variants in some cases. For example, both sometimes fail to find insertions of 6 nucleotides. Am I doing something wrong, or is this what should be expected? Can you recommend any other pipeline?

sequencing variants sars-cov-2 • 2.3k views

ADD COMMENT • link updated 4.4 years ago by Istvan Albert 103k • written 4.4 years ago by juanjo75es ▴ 130

Usually one applies several pre- and post-variant calling filters, to avoid false variants. Inspect the pipelines you are using (their documentation or their source code) to find such filters.

Are you simulating amplicon data? Did you notice the second pipeline is geared towards amplicon sequencing of SARS-CoV-2?

ADD REPLY • link 4.4 years ago by h.mon 35k

I think it's not exclusively for amplicon data: "Optionally, if the data has been obtained through amplicon (...)"

ADD REPLY • link 4.4 years ago by juanjo75es ▴ 130

score 2 · Answer 1 · 2021-03-26

Instead of running the entire pipeline, start by generating an alignment with bwa, then call variants with freebayes.

You can do that in just a few lines of code (just typing this out so typos may be present):

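# index the reference once, then align the paired reads and sort the alignments
bwa index reference.fa
bwa mem reference.fa read1.fq read2.fq | samtools sort -o output.bam -
samtools index output.bam
# call variants against the same reference
freebayes -f reference.fa output.bam > output.vcf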

# evaluate the vcf here

# apply some filtering on vcf (perhaps with vcftools) and evaluate again
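Since the reads were simulated, the list of inserted variants is known, so the "evaluate" step can be a direct comparison against it. Here is a minimal sketch, assuming the simulated variants are available as a VCF (the name `truth.vcf` and the helper functions `vcf_keys` and `compare_vcfs` are hypothetical, not part of any tool). It matches records on exact CHROM, POS, REF and ALT, so an indel that the caller reports at a shifted position or in a different representation would be counted as missed; normalizing both files first (for example with `bcftools norm`) may be needed.

```shell
# Reduce a VCF to one "CHROM:POS:REF:ALT" key per record, sorted for comm(1).
vcf_keys() {
    grep -v '^#' "$1" | awk -F'\t' '{ print $1 ":" $2 ":" $4 ":" $5 }' | LC_ALL=C sort -u
}

# Report how many truth variants were found, and list the ones that were missed.
# Arguments: truth VCF, called VCF.
compare_vcfs() {
    truth_keys=$(mktemp)
    call_keys=$(mktemp)
    vcf_keys "$1" > "$truth_keys"
    vcf_keys "$2" > "$call_keys"
    # comm -12: keys present in both files; comm -23: keys only in the truth set.
    echo "found: $(LC_ALL=C comm -12 "$truth_keys" "$call_keys" | wc -l | tr -d ' ')"
    echo "missed: $(LC_ALL=C comm -23 "$truth_keys" "$call_keys" | wc -l | tr -d ' ')"
    LC_ALL=C comm -23 "$truth_keys" "$call_keys" | sed 's/^/  missed /'
    rm -f "$truth_keys" "$call_keys"
}
```

It could be run as `compare_vcfs truth.vcf output.vcf`, once on the raw calls and again after each filtering change.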

Now start looking at both how to tune freebayes and how to further filter the resulting VCF file.
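As one illustration of a post-calling filter, here is a rough sketch that keeps only records at or above a minimum QUAL, using plain awk. The function name `filter_by_qual` and the threshold of 20 in the usage line are placeholders chosen for the example, not recommended values; dedicated tools such as vcftools or bcftools offer far more filtering options.

```shell
# Print the VCF header plus every record whose QUAL (column 6) is >= the threshold.
# Arguments: input VCF, minimum QUAL. Records with a missing QUAL (".") are dropped.
filter_by_qual() {
    awk -F'\t' -v min="$2" '/^#/ || ($6 != "." && $6 + 0 >= min + 0)' "$1"
}
```

For example, `filter_by_qual output.vcf 20 > filtered.vcf`. Raising and lowering the threshold while re-evaluating shows where real variants start to disappear.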

It will give you a sense of what the parameters do and where the tradeoffs are. Then swap the variant caller for a different one.

For each missed variant, ask: was it never called at all, or was it called and then lost at the filtering step?
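To see at which stage a simulated variant disappears, each one can be looked up in both the raw and the filtered calls. This is a sketch under the assumption that three VCF files exist (the names `truth.vcf`, `output.vcf` and `filtered.vcf`, and the function `classify_variants`, are hypothetical) and that records can be matched on exact CHROM, POS, REF and ALT. The three file paths must be distinct.

```shell
# For every truth variant print "CHROM:POS:REF:ALT<TAB>status", where status is
# kept, lost_at_filtering, or never_called.
# Arguments: truth VCF, raw calls VCF, filtered calls VCF.
classify_variants() {
    awk -F'\t' '
        /^#/ { next }                                   # skip header lines
        { key = $1 ":" $2 ":" $4 ":" $5 }
        FILENAME == ARGV[1] { in_raw[key] = 1; next }       # first file: raw calls
        FILENAME == ARGV[2] { in_filtered[key] = 1; next }  # second file: filtered calls
        {                                               # third file: truth set
            if (key in in_filtered)   status = "kept"
            else if (key in in_raw)   status = "lost_at_filtering"
            else                      status = "never_called"
            print key "\t" status
        }
    ' "$2" "$3" "$1"
}
```

Running `classify_variants truth.vcf output.vcf filtered.vcf` gives one line per simulated variant. A `never_called` result points at the aligner or caller settings, while `lost_at_filtering` points at the filters.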

You will gain a much deeper understanding this way.

Variant calling on SARS-CoV-2 is much simpler than variant calling on human genomes: the genome is far more densely packed with information, there are very few insertions and deletions, and the coverages are usually quite high.