Hello, it's not clear to me if the multisample mode is actually a join calling, or if it's simply the equivalent of bcftools norm + bcftools merge. I am specifically interested in NOT doing join calling. I am also wondering if base recalibration is still necessary? I find tutorials where it's done, but the doc of callvariants doesn't say anything about it. I am aware quality recalibration in itself is not trivial and there are probably pros and cons in doing it.
Sorry the questions are rather general, I hope it's acceptable. I know some people here use bbtools, so I hope to get an answer.
Thanks for any feedback you could give me.
Alex
Should all aligned files be supplied to
calctruequality.sh
at the same time? Then each file is recalibrated one at a time?I was about to ask the same question.
Oh actually from what I understand from the doc, if the sam come from the demultiplexing of the same sequencing it advices to supply everything at once.
Specifically, if the samples come from the same lane you can run them together with calctruequality.sh, however, in that case, variant-calling will be done by calctruequality on the merged samples unless you supply a VCF. So if you want to run calctruequality on multiple samples at once, it's best to first call variants on them, merge the vcfs (or use callvariants' multisample flag), then supply that vcf to calctruequality instead of giving it the callvars flag. In other words:
But I think it's simpler to run calctruequality once per sample.
Can you explain a bit more as to what you mean?
In your original answer you were calculating quality first before calling variants.Why has the order of commands changed above? This discussion is timely since it is going to help with something I am doing now.
I edited the post's code section for clarity. Calctruequality does not understand lanes or samples, just alignments. So if you give it multiple sam files and set the 'callvars' flag, it will call variations assuming all input reads are from the same sample, which can wipe out some alleles found in only a minority of samples. I may add support for CallVariants' rarity flag to help circumvent this, but for now, if you want to recalibrate quality scores, you should either run calctruequality on one sample at a time (in which case you can use the callvars flag), or else make a vcf prior to calctruequality which contains all the variants for all samples. This initial vcf just tells it which locations to ignore when counting mismatches and can be discarded after running calctruequality; it's just part of the boot-strapping process. That said, it should, hopefully, be very similar to final VCF.
In the interest of making sure this is done right, if we run
calctruequality
for multiple samples as separate processes, are these all going to try and update the same matrix file inref
? Should separate dirs be created for each sample run usingpath=/ref/sample
option?Edit:
I ran a couple of additional `calctruequality` jobs on additional samples but they appear to have not touched the matrix files created in `ref/qual` earlier. Jobs completed and wrote logs.I do see the files getting overwritten. Perhaps I was impatient.Every time you run calctruequality, it should overwrite the matrix files in /ref/qual/
I just tested it, and it does overwrite for me... but yes, you can add "path=sample_a" (for both calctruequality and bbduk) and it will write or read the quality matrices from /sample_a/ref/qual/*.
Awesome, thanks Brian! A last question if you don't mind. I know some tools try by themselves to "normalize" variants. I am aware variant normalisation is actually a very complicated concept. What about your tool? Does it "try to normalise" or do you still advice to use bcftools norm? (Or maybe there is a bbtools utility for that)? Thanks so much for your answers. Bbtools is such an awesome suite of utilities, I really love it.
BBMap already should be producing normalized cigar strings, and CallVariants calls variants directly from the cigar strings with no attempts to impute or alter them; so, if you're calling from reads aligned with a different aligner, you may want to normalize the vcf with another tool; I have not written one. You CAN use the realign flag in CallVariants, though.
You should probably use normalization on the output CallVariants and all other variant callers output in any situation when you are merging them into a single VCF or planning plan to compare them directly to each other.
bbduk returns an error resulting in a very small output bam (a few ko)
I tried by adding the ref flag as suggested but it doesn't help.
The reads were mapped with bbmap.sh.
Any idea? Thanks
EDIT: I tried this solution calctruequality in bbmap
no more error message but the bam file still shrinks, from 22 Gb to 70 Mo.
The key is this error message:
I'll see if I can fix this quickly... it looks like it should not be occurring. Probably it would go away if MD tags were present in the reads. You can add the BBMap flag "mdtag" during mapping; I'll probably make that the default.
By the way, what version of BBTools are you using?
Thanks for your answer. I can share the files with you if this helps. It's the version 38.00
EDIT: trying the 39.05, will report back by tomorrow if it works.
Just checked, and it works fine for me. This is an old bug that was fixed a while ago, so 39.05 should work fine for you too. Also, you don't need to add a "ref" flag for bbduk in this case; that will just waste time and memory.
With 39.05, no bug. However, the bam files still shrink, for example, from 22 Go to 1,6 Go. If I understand well, bbduk is also used to remove reads that are contaminants, could it be there is a bug and instead of recalibrating the bases, it's actually ditching reads? Here is the output from the console from one of my bam:
Why is it counting "Contaminants"? I know bbduk can be used for contaminant filtering, but I understand that in recalibration mode, it just rewrites the input bam with the recalibrated base score. Or is it a "two in one"? And if this is the case, is there a way to toggle off contaminant filtering?
Right, it's evicting all reads that match the reference. You need to remove the "ref=../ref.fsa" flag. If there is a reference, the default behavior of BBDuk is to send everything matching the reference to "outu" and everything not matching the reference to "out".
Thanks so much Brian. Your software suite is fantastic; the more I use it, the more I love it. It's incredible all the problems you figured out and how to solve them. I have a very last question: in multisample mode in callvariants.sh, how is the QUAL score reported? Is it simply the highest? You don't recompute it?
I am sorry for how my questions are spread out over several days. I have some personal health issues preventing me from dedicating myself fully to my job here. Thanks again Brian.
The qual score is the max of all of the sample qual scores, and the pass filter is pass if any sample passes.
Oh ok, so mmm, I will work file by file since I want to be sure for every file. I have to know if sample A passes but sample B doesn't. Thanks, Brian