Hi.
I'm pre-processing a bam file according to this script https://github.com/Yonsei-TGIL/Mosaic-Reference-Standards/blob/master/1.A.pipe_Align_Preprocess.sh.
The authors of this paper (https://www.nature.com/articles/s41597-022-01133-8) shared preprocessed BAM files for some of the samples, for example SetB_M3_2.preprocessed.bam. For the other samples, they provided only the FASTQ files, so I am generating and preprocessing the BAM files for those myself.
I'm at the base recalibration step, but I'm running into issues. I need to specify --known-sites $dbSNP, for which I downloaded 00-All.vcf.gz from https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/. This VCF contains only the main contigs, so I thought of filtering the input BAM file and the reference genome down to the main chromosomes as well.
But the authors' BAM files also contain the additional contigs. I want to run paired-sample variant calling using SetB_M3_2.preprocessed.bam as the tumor sample and the BAM I'm generating and preprocessing as the control. For this, the two BAM files should have the same contigs.
I'd really appreciate any help with this.
Thank you in advance.
Are you sure that the chromosome notation is the same? Which reference genome did the authors use?
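One quick way to check this is to compare the @SQ lines of the two BAM headers. Below is a minimal Python sketch, assuming you have already dumped each header to text (for example with `samtools view -H`). The function names are made up for illustration and are not part of any library.

```python
def contigs_from_sam_header(header_text):
    """Return {contig_name: length} parsed from the @SQ lines of a SAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each @SQ field looks like TAG:VALUE, separated by tabs.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def compare_contigs(a, b):
    """Return (names only in a, names only in b, shared names with different lengths)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    length_mismatch = sorted(n for n in set(a) & set(b) if a[n] != b[n])
    return only_a, only_b, length_mismatch
```

If all three lists come back empty, the two files share the same contig names and lengths. Note that this sketch does not compare contig order, which some tools also care about.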
This is from their bam file: @PG ID:bwa-12AC583 PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 4 -M /data/resource/reference/human/NCBI/GRCh38_GATK/BWAIndex/genome.fa /data/project/RefStand/1.raw_Mosaic/MergeAll/FP-M1-1_R1.fq.gz /data/project/RefStand/1.raw_Mosaic/MergeAll/FP-M1-1_R2.fq.gz
I'm using GRCh38_full_analysis_set_plus_decoy_hla.fa, which names chromosomes chr1, chr2, ..., etc.
I had noticed that 00-All.vcf.gz names chromosomes 1, 2, ..., etc., so I renamed them.
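For reference, the renaming amounts to something like the Python sketch below. It assumes the VCF uses bare names (1, 2, ..., X, Y) and MT for the mitochondrial contig, and that the reference uses chr1, ..., chrM. Both the mapping and the function names are illustrative assumptions, so check them against your own files. It only covers the primary chromosomes, because alternate and decoy contigs do not follow a simple prefix rule.

```python
import re


def add_chr_prefix(name):
    """Map a bare contig name to a chr-prefixed one (assumed mapping: MT -> chrM)."""
    if name.startswith("chr"):
        return name
    if name == "MT":
        return "chrM"
    return "chr" + name


def rename_vcf_line(line):
    """Rename the contig in one uncompressed VCF line, leaving everything else untouched."""
    if line.startswith("##contig=<ID="):
        # Header contig declaration: rewrite only the ID value.
        return re.sub(
            r"^(##contig=<ID=)([^,>]+)",
            lambda m: m.group(1) + add_chr_prefix(m.group(2)),
            line,
        )
    if line.startswith("#"):
        # Other header lines are passed through unchanged.
        return line
    # Data line: CHROM is the first tab-separated column.
    chrom, sep, rest = line.partition("\t")
    return add_chr_prefix(chrom) + sep + rest
```

In practice, `bcftools annotate --rename-chrs` with a two-column mapping file does the same job directly on compressed VCFs, and the output needs to be re-indexed afterwards.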
Wouldn't it be better to use an already converted dbSNP file built for the same genome? You could get it from the Broad resource bundle: https://console.cloud.google.com/storage/browser/_details/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.gz;tab=live_object