Hi everyone!
I generated 4 sorted .bam files from paired end data (1 plant per file). When I visualize this in IGV i can clearly see the alignments and the SNPs compared to the reference genome. (you can see in the picture attached:) )
I now want to find SNPS that are linked to a trait of interest in my plants, normally I use TASSEL5 to visualize the .vcf file and do statistics on the SNPs. However, it does not seem to be able to get the 4 bam files into 1 vcf file.
When I use a python script and load the generated vcf file into TASSEL5 it says Error: Index: 0, Size: 0
First I thought that there might be no variants, but since i saw the variants in IGV already i know that they are present. Anyone tips?
Major thanks in advance!
Nilo
Ps: I now use this code:
import subprocess
# List of input BAM files
bam_files = ["sample1_sorted.bam", "sample2_sorted.bam", "sample3_sorted.bam", "sample4_sorted.bam"]
# Output VCF file name
output_vcf = "output.vcf"
# Define a function to perform variant calling using bcftools
def call_variants(bam_files, output_vcf):
# Create a command to call variants using bcftools
bcftools_cmd = [
"bcftools",
"mpileup",
"-Ou",
"-f", "reference.fasta", # Replace with your reference genome file
] + bam_files + [
"|",
"bcftools", "call",
"-mv",
"--skip-variants", "indels",
"-o", output_vcf,
]
# Run the bcftools command
subprocess.run(" ".join(bcftools_cmd), shell=True, check=True)
if __name__ == "__main__":
# Call variants and generate the VCF file
call_variants(bam_files, output_vcf)
before looking in IGV, can you just see the variant in the VCF file ?
no i cant... But, i tried it with more paired end sequences and now it works... Maybe there was too little overlap afterall...
only problem now is that i see many variants in the bam files, but very very very few end up in the VCF file, any idea how that comes?
See this picture, how comes that the A and the TT which all seem to be structurally present in the sorted.bam file do not end up in the VCF file?? The blue and cyan blocks above are the variantlocations accoring to the VCF file, screenshot taken from IGV...
Anyone an idea?
What you are seeing are not true SNPs but differences between the reads and the reference. If a SNP has a lot of support (i.e. lots of reads with the same base), then the variant caller will more likely call it as a variant. But there is more to calling variants then comparing reads to the reference. Read depth, read balance, mapping scores, per base scores etc etc are all factored in.
You should also see what bcftools is doing, go read the manual on how it calls variants. You are running it on default settings, what are the defaults? Do they make sense for your experience?
Do not look at the BAM file directly, you're just eyeballing things.