Hello,
My situation is as follows:
I have two groups of reads/individuals that differ in terms of indels (one group has the indels, the other doesn't). I have already mapped them and generated BAM files, and now I am struggling to get any form of usable VCF for my purposes. I am somewhat used to FST analysis of Pool-Seq reads and to comparing populations with regard to SNPs and indels. In that case I usually went with a straightforward mpileup and sync and then used R for the FST analysis and IGV for visualisation of the differences. But here I have several hundred individuals to look at.
What would you suggest? So far I came up with something like this:
#!/bin/bash
(...)
module load BCFtools/1.3.1
module load SAMtools/1.3.1
# Collect all BAM files in the current directory into one space-separated string
BAM_FILES=$(ls *.bam | tr '\n' ' ')
# Run mpileup on all bam files and pipe the output to BCFtools
samtools mpileup -g -B $BAM_FILES | bcftools view -Ou - > all_samples.bcf
Now this resulted in a BCF file which 1. seems to be suspiciously small and 2. I don't really know how to use. I know I can assess the quality of the variant calls with a tool like bcftools stats to generate summary statistics for the BCF file, such as the number of variants, transition/transversion ratios, and other quality metrics. But is this useful here? In the end I just need a straightforward way to assess the indels of the one group of individuals among my samples. Is pooling an option here too? Or am I maybe completely wrong? Thanks
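What I had in mind for the indel part, roughly (untested, and assuming the calling step itself works), was something like:

#!/bin/bash
# Untested sketch: once a proper call set exists, keep only the indels
# and summarise them. The file name is just the one from my script above.
bcftools view -v indels all_samples.bcf -Oz -o indels_only.vcf.gz
bcftools index -t indels_only.vcf.gz
bcftools stats indels_only.vcf.gz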
See: https://samtools.github.io/bcftools/howtos/variant-calling.html
The -O option controls what kind of file you get. If you want regular VCF files then take a look at the inline help; -Oz is what you likely want to use. A rough sketch follows below.
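A minimal sketch of that idea, assuming a reasonably current bcftools (the mpileup step moved from samtools to bcftools in newer releases) and with ref.fa and sample.bam as placeholder names:

# Sketch only: call variants for one sample and write compressed VCF (-Oz)
bcftools mpileup -f ref.fa sample.bam | bcftools call -mv -Oz -o sample.vcf.gz
bcftools index -t sample.vcf.gz   # produces sample.vcf.gz.tbi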
Now it didn't work at all... Could you recommend a script approach and give me some code advice? That would be very helpful :) Thanks
What do you think about this?
GenoMax is way more experienced than I am and they're helping you so I'd wait a bit longer. BTW, this might be a good time to learn snakemake - your workflow is growing and loops will only get you so far. Snakemake's dry-run option can be super useful.
For example, a simple rule for the code you wrote above would be something like the sketch below (not tested, test with snakemake -np <sample_name>.vcf.gz.tbi).
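Roughly along these lines (a sketch; ref.fa stands in for whatever reference you are actually using):

# Untested sketch of a per-sample calling rule; adjust paths and options.
rule call_sample:
    input:
        bam="{sample}.bam",
        ref="ref.fa"   # placeholder reference
    output:
        vcf="{sample}.vcf.gz",
        tbi="{sample}.vcf.gz.tbi"
    shell:
        "bcftools mpileup -f {input.ref} {input.bam} | "
        "bcftools call -mv -Oz -o {output.vcf} && "
        "bcftools index -t {output.vcf}"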
Save that to Snakefile and then you can simply run the snakemake -np command above. Once the dry run looks good, run snakemake (without -np) in the SLURM script. You can also invoke snakemake with the --slurm option to have snakemake submit the job for you, but let's take things one step at a time.

Thanks! I'm definitely gonna give it a try. Would you also recommend this for short-term success on this task too? I definitely want to go deeper into all of this, and snakemake looks pretty helpful in the long run, but I would like to get a little further with this script too. Right now I'm getting these errors for all of the samples:
This is the script I am using:
You are using an ancient version of bcftools (from 2016). The current version is 1.17. Please ask your system admins to upgrade, if that is the only module available. Version numbers on samtools go 1.2, 1.3, 1.4, ..., 1.10, 1.11 etc., so if you have a module that says samtools/1.17, that is the latest.

It is also not efficient to use a for loop inside a SLURM job. You should use a for loop outside to submit multiple jobs in parallel for all samples.
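The submit-from-outside pattern could look something like this (a sketch; call_one_sample.sh is a hypothetical per-sample script that takes a BAM file as its first argument):

#!/bin/bash
# Sketch: one SLURM job per sample instead of a serial loop inside one job.
# call_one_sample.sh is hypothetical - it would run mpileup/call for "$1".
for bam in *.bam; do
    sbatch --job-name="call_${bam%.bam}" call_one_sample.sh "${bam}"
done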
I agree with GenoMax - you were using array jobs and now you've taken 2 steps backward - a series of jobs submitted using a shell loop is the middle ground.
Thanks a lot GenoMax and Ram! After a short look into that manual https://samtools.github.io/bcftools/howtos/variant-calling.html I went with a pretty straightforward way, as they suggested, and just did this in a SLURM job without array jobs:

It resulted in a pretty small (2 KB) .bcf file, while most BAM files are between 500 and 1000 KB. I assume that is possible, depending on the number of variants (?), but it still feels suspicious. Now I am wondering if that script might be too simple and misses something. What do you guys think?
Read my script - change your bcftools to output VCF, not BCF. File size (as long as it's >0) is not an indicator of anything significant.
*.bam is too broad; ensure you use a narrower glob - something that matches a smaller subset of files.

Ok, but in general this approach would work? And why would I not want to use all the .bam and .bam.bai files? Why would I subset them?
If the directory you are working in contains only the BAM files you need, then *.bam would be ok. Most people working with human samples call variants individually, so this feels odd.

If you have several hundred sample files, getting a 2 KB BCF file indeed sounds suspicious. Are you actually seeing calls in that file?
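One quick way to check, sketched with the file name from your script (both are standard bcftools commands):

# Count non-header records; 0 means no calls at all
bcftools view all_samples.bcf | grep -vc '^#'
# Or skim the summary numbers
bcftools stats all_samples.bcf | grep '^SN'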
The approach works, yes, but you're no longer calling per-sample variants.

*.bam is dangerous for a beginner - the chances of something going wrong are high and the chances of you noticing when it does go wrong are low. That's a dangerous combination. The narrower the glob, the more control you have. Here's an example from a relatively recent mistake I made.

I was using snakemake and I needed to replace all files that had .. in their name with just . - remove a . from xyz..bam, for example. My snakemake file had the line for f in *..*; do mv ${f} ${f/../.}; done. I tested it and it went fine and I thought it was all great. This line was the least complicated of 50+ commands and concepts. The other concepts had (in my mind) a higher probability of causing failure, so when I started getting "file not found" errors pointing to a possibility of HPC file system latency, I thought one of the major tool commands was failing or snakemake was not playing well with my HPC. Things would work sometimes, other times they would not. There was no single factor I could see that predicted failure. Not even job submission order mattered.

It took a while for me to realize what was going on. I was running a bunch of snakemake jobs in parallel in the same directory, and the mv command was renaming all the .. files it encountered. This meant that the first job to reach the mv line was altering files being worked on by other jobs, and the other jobs were complaining because their files were disappearing. All because my glob wasn't narrow enough. When I added {wildcards.sample} to the glob (for f in {wildcards.sample}*..*; do ...), everything started running smoothly.

That's why I say that broad globs can trip you up when you least expect it. Go with narrow, predictable globs.
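Spelled out, the fix amounts to this (a sketch; sampleA is a made-up name standing in for whatever {wildcards.sample} expands to):

# Narrow glob: only touch this sample's files, e.g. sampleA.xyz..bam -> sampleA.xyz.bam
for f in sampleA*..*; do
    mv "${f}" "${f/../.}"
done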