Best tool for variant calling
2
9
Entering edit mode
6.5 years ago

Hi,

I am attempting to reanalyze some old data. Recently, an updated genome was released for the organism I work with. I had to map the reads to the new genome and what not. So what I have now is a bam file but now I want to do some variant calling so I can generate a vcf.

I know there is a lot of programs out there to do this but I am attempting to find the best one. I have tried using Platypus but the vcf file that it generates never contains any genome data. That is probably my fault but I can't seem to figure it out and there isn't much support. I tried VarDict but when I go to look at the vcf file, I get an error of "failure to parse tbx_vcf". I can really seem to find why I am getting that error. I have tried Freebayes and that has worked for me, but I have an issue with indels. Freebayes seems to call SNPs instead of acknowledging indels. I know there is GATK, but it seem like such a tedious piece of software. It looks like you have to go through 15 steps to just get anything worthwhile. I could be looking at the wrong examples. What I have notice about GATK is they try to make it so simple that it ends up just making things complicated.

Could anyone maybe give some suggestions or maybe some insight? Any help is very appreciated

Thanks!!

SNP genome alignment next-gen • 19k views
ADD COMMENT
0
Entering edit mode

I doubt there is an objectively best variant caller for each situation. You also seem to put a lot of focus on ease of use, so that's apparently also important....

ADD REPLY
0
Entering edit mode

I'm not so worried about ease of use. I try to trouble shoot and read about the different software but sometimes there just isn't enough support out there to help figure out your issue. I just need something that is going to work really.

ADD REPLY
0
Entering edit mode

Google Brain Team recently released DeepVariant. We implemented a reproducible version that was submitted to NF-Core (https://lifebit.page.link/cmRv).

We also made it available in Lifebit (https://lifebit.page.link/fiPH) if you want to try it with example parameters.

In practice, DeepVariant first builds images based on the BAM file, then it uses a DeepLearning image recognition approach to obtain the variants and eventually it converts the output of the prediction in the standard VCF format.

ADD REPLY
1
Entering edit mode

Just leaving this here for matters of completeness.

ADD REPLY
0
Entering edit mode

Has DeepVariant been benchmarked against the Gold Standards in clinical genetics, though? This question remained unanswered here, too: Why we ran DeepVariant as a Nextflow pipeline over cloud.

ADD REPLY
0
Entering edit mode

Google Brain Team recently released DeepVariant

Its was ~1 year ago...

ADD REPLY
1
Entering edit mode

Hi, @Kevin Blighe I'm just adding a small contribution here, because I am now trying to understand how DeepVariant works: couldn't find any reference about its benchmarking in clinical genetics, but I came across this comparison study: DeepVariant x GATK4 x SpeedSeq.

ADD REPLY
0
Entering edit mode

Thanks. Information moves quickly and my answer below is already somewhat out of date due to the fact that the GATK to which I was referring was GATK v3. GATK v4 may have improved a lot.

ADD REPLY
20
Entering edit mode
6.5 years ago

Edit September 10, 2019:

Although I have used samtools / bcftools mpileup, I have doubts surrounding its false positive rate; although, this can be controlled by setting a minimum total position read depth for each variant call (e.g. 30 reads). mpileup is neither fine-tuned to identify indels. Were I to implement a clinical pipeline in the future for calling germline variants, I would likely run both GATK and the method described below with mpileup, and then cross compare.

-----------------

Original answer

While the GATK is popular, from my experience from analysing both clinical and research NGS data, GATK is not the most sensitive for detecting germline single nucleotide variants (SNVs). Sensitivity here is gauged by comparing NGS results to those of Sanger sequencing.

As you've mentioned, there are many variant callers out there and each has advantages, including the GATK. A fool would be the person who would come here and say that this or that particular program / pipeline is the best, as they are each good in different scenarios.

My 'go to' pipeline is the 'old faithful':

  1. BWA mem for alignment (if read lengths > 70bp)
  2. Remove low MAPQ reads
  3. Mark and remove PCR / optical duplicates
  4. Produce random BAM subsets per sample (usually 3 subsets at 25%, 50%, and 75% 'randomly' selected reads)
  5. Call variants with BCFtools mpileup (main BAM and 3 subsets together) piped into BCFtools call

The final command would be something like:

bcftools mpileup \
  --redo-BAQ \
  --min-BQ 30 \
  --per-sample-mF \
  --output-tags DP,AD \
  -f "${Ref_FASTA}" \
  --BCF Full.bam 75pcReads.bam 50pcReads.bam 25pcReads.bam | \
bcftools call \
  --multiallelic-caller \
  --variants-only \
  -Ob > Variants.bcf

This has greater sensitivity for SNVs when compared to other pipelines that I've tested, which includes GATK and Illumina's BaseSpace pipeline. Granted, there are many programs and pipelines out there, and one would ideally have to test all of them, and on an expanded number of samples.

For somatic variant calling, I use Lancet.

Kevin

ADD COMMENT
0
Entering edit mode

I appreciate the help! That was the answer I was looking for really. I have worked with samtools and that works great for me. I will give mpileup a shot!

ADD REPLY
0
Entering edit mode

Yes, don't worry about it. Sometimes people look at you a bit weird when you say that you're using SAMtools for variant calling, but the truth is that it has improved over the years and has some great minds behind its development. What I mention in my answer is just my experience of having processed DNA-seq data in research, clinical, and private settings for 6-7 years. The only other caveat that I'll mention is that, if you are looking for somatic variants, then you may consider another option.

Finally, once you call your variants, further filtering can be done with a Perl executable called vcfutils.pl varFilter, which comes bundled with BCFtools (look in the misc folder under BCFtools root).

If you want to add more tags to your final file, too, then that is also possible with BCFtools plugin. See here: A: How to use bcftools to calculate AF INFO field from AC and AN in VCF?

ADD REPLY
0
Entering edit mode

I concurr. This is my experience too. What version of samtools are you using?

ADD REPLY
0
Entering edit mode

Version has varied over the years, currently, 1.3.1. I believe we are supposed to use bcftools mpileup now. Funnily enough, private companies, caught up in the notion of 'licensed software means better software', seem convinced that GATK is the supreme caller.

Look at it this way, the pipeline that I mentioned above proved better (sensitivity and specificity to a gold standard) in a clinical setting when compared to NextGene by Sophia Genetics (>£20,000 license fee per year) and GATK (unknown license fee per year).

ADD REPLY
0
Entering edit mode

What is the purpose of passing the BAM and subsets thereof?

Would that be necessary if I have, for instance, 10 BAMs from different samples?

Thanks

ADD REPLY
0
Entering edit mode

In my work in the clinical setting, I noted that GATK frequently missed true-positive variants that had been validated with Sanger sequencing. The way around this was to sample reads from a BAM file and created BAM subsets for each sample. Calling variants on each of these then allowed us to 'recover' all true-positive variants.

After ditching GATK, we found that bcftools mpileup could actually easily recover these 'missed' variants by GATK, however, we decided to continue to perform the sampling of reads on each sample, just in case.

If you have 10 samples, you can just pass them all to the same mpileup command. Variants will then reported in the output VCF/BCF for all samples.

ADD REPLY
0
Entering edit mode

Regarding this point 4: Produce random BAM subsets per sample (usually 3 subsets at 25%, 50%, and 75% 'randomly' selected reads) Could you please explain the procedure or provide the command used?

Thanks Najeeb

ADD REPLY
2
Entering edit mode

Sure thing. Here are the steps. Assume that you are starting with an aligned, sorted (and QC'd BAM) called Aligned_Sorted_PCRDuped_FiltMAPQ.bam, which is indexed by SAMtools:

echo "#######################################################" ;
echo "#Analysis step 6:  Downsampling / random read sampling#" ;
echo "#######################################################" ;
java -jar picard.jar DownsampleSam INPUT=Aligned_Sorted_PCRDuped_FiltMAPQ.bam \
  OUTPUT=Aligned_Sorted_PCRDuped_FiltMAPQ_75pcReads.bam \
  RANDOM_SEED=50 PROBABILITY=0.75 VALIDATION_STRINGENCY=SILENT ;
samtools index Aligned_Sorted_PCRDuped_FiltMAPQ_75pcReads.bam ;

java -jar picard.jar DownsampleSam INPUT=Aligned_Sorted_PCRDuped_FiltMAPQ.bam \
  OUTPUT=Aligned_Sorted_PCRDuped_FiltMAPQ_50pcReads.bam \
  RANDOM_SEED=50 PROBABILITY=0.5 VALIDATION_STRINGENCY=SILENT ;
samtools index Aligned_Sorted_PCRDuped_FiltMAPQ_50pcReads.bam ;

java -jar picard.jar DownsampleSam INPUT=Aligned_Sorted_PCRDuped_FiltMAPQ.bam \
  OUTPUT=Aligned_Sorted_PCRDuped_FiltMAPQ_25pcReads.bam \
  RANDOM_SEED=50 PROBABILITY=0.25 VALIDATION_STRINGENCY=SILENT ;
samtools index Aligned_Sorted_PCRDuped_FiltMAPQ_25pcReads.bam ;

echo "Done." ;
echo -e "\n\n" ;

echo "###################################" ;
echo "#Analysis step 7:  Variant calling#" ;
echo "###################################" ;
bcftools mpileup --redo-BAQ --min-BQ 30 --per-sample-mF --output-tags DP,AD -f "${Ref_FASTA}" \
  --BCF Aligned_Sorted_PCRDuped_FiltMAPQ.bam \
  Aligned_Sorted_PCRDuped_FiltMAPQ_75pcReads.bam \
  Aligned_Sorted_PCRDuped_FiltMAPQ_50pcReads.bam \
  Aligned_Sorted_PCRDuped_FiltMAPQ_25pcReads.bam | \
  bcftools call --multiallelic-caller --variants-only -Ob > Aligned_Sorted_PCRDuped_FiltMAPQ.bcf ;

Please review all parameters that I use here, along with their values.

If you want to add more information to the final BCF, take a look here: A: How to use bcftools to calculate AF INFO field from AC and AN in VCF?

ADD REPLY
0
Entering edit mode

Hi, can you show your script for whole pipeline (from fastq files)?

ADD REPLY
1
Entering edit mode

Hey, the entire pipeline is here: https://github.com/kevinblighe (see ClinicalGradeDNAseq).

You may want to spend a few hours tidying up the code to adapt it to your own pipeline.

ADD REPLY
0
Entering edit mode

Hi Kevin,

Do you suggest the same pipeline across PANEL and exome data as well?

Thanks Najeeb

ADD REPLY
0
Entering edit mode

For single nucleotide germline variants, yes. Not for indels.

ADD REPLY
1
Entering edit mode

If the pipeline is too difficult to implement, then do not worry too much about it. It was designed from within a clinical environment. Try one of the more generic pipelines like GATK if you are based in research.

ADD REPLY
0
Entering edit mode

Hello!

  • Do you explicitly remove the duplicates? As far as I understand, if duplicates were marked, mpileup skips them.

  • If the flag --min-MQ is included, I guess step 2 is not necessary, right?

ADD REPLY
1
Entering edit mode

I prefer to mark the duplicates and then completely remove (expunge) them from the BAM. This helps to avoid later ambiguity about whether another downstream program is counting them or not.

Again, the same for --min-MQ: I prefer to expunge these reads.

ADD REPLY
0
Entering edit mode

Thanks!

do you filter secondary and supplementary alignments? or are those assumed to be removed by filtering by mapping quality?

ADD REPLY
1
Entering edit mode

Good question... I do not do anything with those. I assume that they will be removed but cannot confirm that.

ADD REPLY
0
Entering edit mode

Thanks again!

just one more question to understand --per-sample-mF.

I read that m corresponds to the min number of gapped reads for indel candidates (default: 1) and F corresponds to the min fraction of gapped reads (default: 0.002).

Does this mean that --per-sample-mF is applicable only to indel discovery?

ADD REPLY
1
Entering edit mode

Hey serpalma.v. Yes, that is the interpretation, i.e., --per-sample-mF is only for indel candidates. Of course, mpileup is not the greatest for indel detection. I would use pindel for indels; mpileup is great for single nucleotide variants.

ADD REPLY
0
Entering edit mode

Kevin, thank you for your complete response! I hope this message finds you well.

I was wondering if you have experience in variant calling for polyploids? I have a sequenced (WGS data) bi-parental population of 95 sugarcane individuals (including both parents). I already aligned them (Bowtie2 and Picard to generate sorted bams) to their reference genome (a specific cultivar) with high percentage of alignment but now I need to call for variants. I am using NGSEP (https://github.com/NGSEP/NGSEPcore) (using MultisampleVariantsDetector instruction) to call for SNVs, small and large indels, STRs, inversions, and CNVs. However, it is taking a lot of time and I would like to have an alternative, e.g. I launched just 20 individuals last Thursday, September 19th and it has not ended yet.

Any help is really appreciated, thank you!

ADD REPLY
0
Entering edit mode

It might be worth creating your own question. Seem rather specific.

ADD REPLY
1
Entering edit mode

Agreed. Please open a new question.

ADD REPLY
0
Entering edit mode
ADD REPLY
2
Entering edit mode
6.5 years ago

For me, GATK Best Practises are the standard

Example here to call somatic mutations

Note that, If you want to have the most reliable results as possible you will have to go through these 15 steps because each step is significant.

ADD COMMENT
0
Entering edit mode

I get it. That issue I have is that it is significantly harder gather all the data and have it in the correct format to use GATK. It starts to get difficult when you work with a "unique" species.

That example is kind of what I am talking about. It is a great picture of a workflow but doesn't tell you anything about the software that workflow uses.

ADD REPLY

Login before adding your answer.

Traffic: 2775 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6