Question

Building reference dbSNP file using WGS samples

1

Entering edit mode

13 months ago

analyst ▴ 50

Dear scientific community,

I have to call variants from WGS samples of citrus. I used GATK pipeline for post processing of aligned reads but reference dbSNP file is not available for citrus sinensis. I am using bootstraping method. Removed duplicates and called variants using FreeBayes to build reference vcf file to be used for BQSR step later. I got merged vcf file of 84.4GB size using FreeBayes. Is it normal to have such a large file? What I mean is normally we get vcf file from dbSNP of around 1GB size. How can I reduce its size or is it okay to go with it?

Your suggestions will be highly appreciated.

Thanks!

NGS WGS Variant-calling non-model Bootstrapping • 2.4k views

ADD COMMENT • link 11 months ago by analyst ▴ 50

score 2 · Answer 1 · 2023-10-16

2

Entering edit mode

13 months ago

Brian Bushnell 20k

It's hard to say whether you should reduce the size, since that's based as much on the length of the VCF fields as the actual number of variants...

But, in my testing BBTools' callvariants.sh had a substantially lower false positive rate than FreeBayes. You may want to look at some of the low-scoring or low-AF variants to see if they seem real to you. If not, filter them out... or try using CallVariants to generate your VCF instead.

ADD COMMENT • link 13 months ago by Brian Bushnell 20k

0

Entering edit mode

Thank you so much Brian Bushnell. I will try callvariants.sh

ADD REPLY • link 13 months ago by analyst ▴ 50

score 2 · Answer 2 · 2023-10-17

2

Entering edit mode

13 months ago

colindaven 7.0k

First, is dbSNP actually required for GATK ? Can you avoid this step, since dbSNP is a established resource and not something you can just reproduce that easily.

You can do some extensive quality filtering to reduce the size of your file. Eg, by the QUAL field (read the Freebayes github docs), reduce to biallelic SNPs only perhaps, eyeball the vcf to find more low quality SNPs. Consider whether you need indels, and which sizes.

I think these measures might help you to reduce the size of your VCF, but I'm unsure if they will meet your needs of reproducing dbSNP.

ADD COMMENT • link 13 months ago by colindaven 7.0k

0

Entering edit mode

Thank you so much for answering colindaven. Actually I meant reference vcf file like those available in dbSNP database. GATK requires reference variation file for base quality score recalibration. variation file for citrus is not available in dbSNP yet so I need to build variation file first that will be further used as a reference for BQSR step in GATK.

ADD REPLY • link 13 months ago by analyst ▴ 50

1

Entering edit mode

If you are interested in quality score recalibration, I recommend this protocol. Simplified version:

#Map reads
bbmap.sh in=reads.fq out=mapped.sam ref=ref.fa

#Call variants (set ploidy to whatever is appropriate for the organism)
callvariants.sh in=mapped.sam ref=ref.fa out=vars.vcf ploidy=2

#Generate recalibration matrices cognizant of actual variations
calctruequality.sh in=mapped.sam vcf=vars.vcf

#Recalibrate
bbduk.sh in=mapped.sam out=recal.sam recalibrate

#Re-call variants
callvariants.sh in=recal.sam out=varsRecal.vcf ref=ref.fa ploidy=2

The full recommend protocol including preprocessing and error-correction is in bbmap/pipelines/variantPipeline.sh but this is the short version for just quality-score recalibration. Since modern Illumina sequencers give almost all bases the same quality score (to keep the gzipped file size down, which was clearly decided by senior management who have zero bioinformatics experience), this step is quite useful for variant-calling.

I wrote some tools to calculate the accuracy of quality-score recalibration, because I wanted it tweak it to be as accurate as possible... if you run:

reformat.sh in=mapped.sam qhist=qhist.txt qahist=qahist.txt

...you get tsv output designed to be easy to plot in a spreadsheet, so you can compare the before and after quality-score accuracy. If you're not sure which tool to use for quality-score recalibration I encourage you to compare BBTools and GATK and pick whichever gives better results.

ADD REPLY • link 13 months ago by Brian Bushnell 20k

0

Entering edit mode

Thank you so much for helping Brian Bushnell!

bbmap.sh in=reads.fq out=mapped.sam ref=ref.fa

Do I need to align reads using bbmap.sh again as I have already aligned through bwa and deduplicated through MarkDuplicates. Can I use these deduplicated mapped reads to call variants for this step:

callvariants.sh in=mapped.sam ref=ref.fa out=vars.vcf ploidy=2

And is there any threading option to increase processing speed for bbmap bash scripts?

ADD REPLY • link 13 months ago by analyst ▴ 50

1

Entering edit mode

All BBTools are multithreaded and will automatically detect the number of hardware threads available on your system so there is no reason to tell them how many threads to use unless you are sharing a node with someone else. For example, if you use Slurm and request 8 cores, I recommend you use the "t=8" flag to ensure BBTools uses 8 threads. It's not really necessary since Slurm will already restrict you to 8 cores, but it makes things more efficient if you don't use way more threads than you have hardware support for. Also, sometimes systems have hyperthreading which is generally not useful at all in integer-dominant programs (basically all of bioinformatics) so if you have 8 cores with hyperthreading, the system will report 16 cores, so BBTools will use 16 threads, and that's fine. It's just that setting "t=8" will use 8 threads instead, which is the actual number of cores, and that will typically be slightly faster and use slightly less memory.

TL;DR: Threads are automatic, don't worry about it. Everything in BBTools is multithreaded and handles it automatically. It's all fast even if you use a single thread, though. Especially CallVariants.

Since you've already aligned your reads with bwa, you don't need to align them again. You will get a better result for variant-calling (particularly long indels) from reads aligned with BBMap, but you don't have to.

Also, it's worth noting that if you want to save time, particularly on Novaseq platforms which have a huge number of duplicates due to their chemistry, I recommend deduplication prior to mapping:

clumpify.sh in=reads.fq.gz out=deduped.fq.gz dedupe passes=3

or

clumpify.sh in=r1.fq.gz in2=r2.fq.gz out=d1.fq.gz out2=d2.fq.gz dedupe passes=3

Personally I find paired reads interleaved in a single file incredibly convenient, but I realize that most of the world keeps them in twin files. Anyway, BBTools handles either format and strictly speaking twin files is up to twice as fast, but I still prefer interleaved.

Since you've already deduplicated your reads it's fine, but in the future, you can avoid wasting time mapping identical reads by doing this ahead of time. It also makes the resulting file much smaller (gzip compression is more efficient on Clumpified files) and overall makes mapping faster too (due to cache locality, since Clumpify puts nearby reads next to each other in the file, you have fewer memory random accesses).

ADD REPLY • link 12 months ago by Brian Bushnell 20k

0

Entering edit mode

Thank you so much for being so helping. I will follow your valuable suggestions.

ADD REPLY • link 12 months ago by analyst ▴ 50

0

Entering edit mode

Dear @Brian Bushnell I used callvariants.sh for variant calling I would must say its a highly efficient variant caller. I run it on 70 samples using this command:

bash /home//Apps/bbmap/callvariants.sh list=file ref=GCF_022201045.2_DVS_A1.0_genomic.fna vcf=reference.vcf ploidy=2

It outputs name of first sample in vcf file right after FORMAT Field. My question is that is it considering all input samples for variant calling and outputting name of first sample in the list or it is considering only first file for variant calling out of list of samples?

Thanks!

ADD REPLY • link 12 months ago by analyst ▴ 50

1

Entering edit mode

If you want them tracked as individual samples, you need to add the "multisample" flag. Otherwise they will all be treated as a single sample that happened to have multiple files.

ADD REPLY • link 12 months ago by Brian Bushnell 20k

0

Entering edit mode

Thanks Brian, I don't want them tracked as individual samples but a single vcf file that can be used as a reference vcf to find base quality score recalibration of all samples.

Will this command work in that case?

bash /home//Apps/bbmap/callvariants.sh list=file ref=GCF_022201045.2_DVS_A1.0_genomic.fna vcf=reference.vcf ploidy=2

calctruequality.sh in=sample1.bam vcf=reference.vcf

calctruequality.sh in=sample2.bam vcf=reference.vcf

or should I make individual vcf file for each sample?

bash /home//Apps/bbmap/callvariants.sh in=sample1.bam ref=GCF_022201045.2_DVS_A1.0_genomic.fna vcf=sample1.vcf ploidy=2

bash /home//Apps/bbmap/callvariants.sh in=sample2.bam ref=GCF_022201045.2_DVS_A1.0_genomic.fna vcf=sample2.vcf ploidy=2

ADD REPLY • link 12 months ago by analyst ▴ 50

2

Entering edit mode

The best approach depends on:

Sample depth;
Whether each file is from a different genetic sample;
Whether each sample is from a different Illumina run/lane.

Base quality recalibration needs to be done on a per-lane basis to be accurate. And there needs to be sufficient depth to call most of the variants (let's say, minimum of 3x per ploidy, ideally >10x). So if you have a diploid with at least 6x average coverage per sample, you would do this for each sample (assuming paired reads interleaved in a single file):

#Adapter-trim reads; adapter sequence will ruin your recalibration
bbduk.sh in=reads.fq out=trimmed.fq ktrim=r k=21 mink=9 hdist=1 ref=adapters tpe tbo

#Align glocally with no soft-clipping
#The histograms are optional but often nice to know how accurate the quality was before recalibration
bbmap.sh in=trimmed.fq ref=citrus.fa out=mapped.sam qhist=qhistBefore.txt qahist=qahistBefore.txt

#Call variants initially
#We will skip this step since CalcTrueQuality can do variant-calling itself if you don't give it a vcf
#callvariants.sh in=mapped.sam ref=citrus.fa out=vars.vcf ploidy=2

#Generate recalibration matrices
#This only needs to be done once per lane but it's simplest to do it once per sample
#You don't need to set an output file; it writes to the "ref/" subdirectory so don't run 2 instances concurrently in the same directory
calctruequality.sh in=mapped.sam ref=citrus.fa ploidy=2 callvars

#Recalibrate the sample
#Again, the histograms are optional but nice to look at
bbduk.sh in=mapped.sam out=recal.sam recalibrate ordered histogramsafter qhist=qhistAfter.txt qahist=qahistAfter.txt

#Then call variants for real if you want
recallvariants.sh in=recal.sam ref=citrus.fa out=recal.vcf

#Or you can recalibrate the raw reads, redo preprocessing, remap them, etc. but unless you have severe quality accuracy problems that's not usually needed.

Now if you really want to recalibrate all of them from a single vcf you can, but it would be much less accurate if the samples are genetically distinct, and the recalibration matrix still needs to be made on a per-lane basis. To recalibrate all of them from a single VCF you would need to make the VCF like this:

#This is basically what you already did but note the "rarity=0.05" flag, to retain variants that don't occur in most samples
callvariants.sh ref=citrus.fa out=all.vcf ploidy=2 rarity=0.05 in=mapped1.sam,mapped2.sam,...

#Then you still need to run these steps once per sample.
#technically for this you can run it once per lane with all the samples from that lane, but in practice that's a pain
calctruequality.sh in=mapped.sam ref=citrus.fa vcf=all.vcf

#Recalibrate each sample
bbduk.sh in=mapped.sam out=recal.sam recalibrate ordered

ADD REPLY • link 12 months ago by Brian Bushnell 20k

0

Entering edit mode

Average depth is 18
Each file represents different sample of same specie.
To find about run/lane, I found this kind of header information in reads

Sample 1 forward read

@E100069507L1C001R0020000611/1

Sample 1 reverse read

@E100069507L1C001R0020000611/2

Sample 2 forward read

@E100069507L1C001R0020001366/1

Sample 2 reverse read

@E100069507L1C001R0020001366/2

ADD REPLY • link 11 months ago by analyst ▴ 50

0

Entering edit mode

I got this error while extracting snps out of vcf file generated through callvariants.sh.

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 134: Duplicate allele added to VariantContext: C, for input source: merged_filtered.vcf

I observed that C is present both in REF and ALT filelds in vcf file.

enter image description here

Since both REF and ALT alleles are same i.e., C (not variant) then why is it in vcf file?

Looking forward to your valuable answer.

Thanks

ADD REPLY • link 11 months ago by analyst ▴ 50

0

Entering edit mode

I've never seen that before so I'm not quite sure how to replicate it. At a minimum, could you give the exact command you used to generate the corrupted file? And if possible, would I be able to access the data so I can try to replicate this occurrence?

ADD REPLY • link 11 months ago by Brian Bushnell 20k

0

Entering edit mode

Dear Brian I realized that problem is with callvariants.sh script because when I called variants using freebayes I did not find aforementioned error. Kindly upgrade the script.

Many thanks for your help!

ADD REPLY • link 11 months ago by analyst ▴ 50