Hi,
I am looking to create a panel of normals for a somatic variant caller. For normals, I have been provided with a set of WES bam files that have been preprocessed according to GATK best practices. However, they have been aligned to another reference genome than my case samples - and I cannot find out which.
My thought process in the code below was as follows: i) sort bam file; ii) convert back to fastq; iii) realign to a new reference (the same as my case samples); iv) subset to the intervals of my case samples (panels).
Here is the code:
#!/bin/bash
###STANDARD_SLURM_HEADER
module load samtools
module load bedtools
module load bwa/0.7.17
#############################
### PATHS AND VARIABLES #####
#############################
reference="/SOME_PATH/human_ref/hg19.fa"
regions="/SOME_PATH/gene_panel_intervals.chr.bed.gz"
bam_loc="/SOME_PATH/control_bams"
bam_file=$(ls ${bam_loc}/*.recalibrated.bam | sed -n ${SLURM_ARRAY_TASK_ID}p)
sample=$(echo ${bam_file} | sed 's@.*/@@' | sed "s/\..*//" )
#############################
### SORT BY READ NAME #######
#############################
samtools sort -n -l 1 -@ 4 -o ${bam_loc}/${sample}_s.bam ${bam_file}
#############################
### CONVERT BACK TO FASTQ ###
#############################
bedtools bamtofastq -i ${bam_loc}/${sample}_s.bam \
-fq ${bam_loc}/${sample}_e1.fq \
-fq2 ${bam_loc}/${sample}_e2.fq
#############################
### REALIGN TO REFERENCE ####
#############################
bwa mem \
-t 4 \
-M \
${reference} \
${bam_loc}/${sample}_e1.fq \
${bam_loc}/${sample}_e2.fq \
| \
samtools sort -@ 4 -o ${bam_loc}/${sample}_a.bam -
#############################
### SUBSET BAM TO BED FILE ##
#############################
samtools view -L ${regions} -b ${bam_loc}/${sample}_a.bam > ${bam_loc}/${sample}_pon.bam
#############################
### FINAL CLEANUP ###########
#############################
rm ${bam_loc}/${sample}_s.bam
rm ${bam_loc}/${sample}_a.bam
rm ${bam_loc}/${sample}_e1.fq
rm ${bam_loc}/${sample}_e2.fq
This runs without an error. However, I then proceeded to validate my files as such:
# validate bam file
java -jar ${PICARD} ValidateSamFile \
I=122722_pon.bam \
MODE=SUMMARY
## HISTOGRAM java.lang.String
Error Type Count
ERROR:INVALID_VERSION_NUMBER 1
ERROR:MATE_NOT_FOUND 261784
ERROR:MISSING_READ_GROUP 1
WARNING:RECORD_MISSING_READ_GROUP 2653476
These represent two different errors: missing read groups, and "MATE_NOT_FOUND". I am unsure on how to proceed from here.
i) The read group error should be relatively straightforward, by using the header from the fastq files to build the RG line, right? Like so:
R1_file=$(ls ${fastq_loc}/*R1* | sed -n ${SLURM_ARRAY_TASK_ID}p)
R2_file=$(echo ${R1_file} | sed s/R1/R2/)
sample=$(echo ${R1_file} | sed s/_S.*// | sed s/.*A00.*_// )
lib=$(echo $sample"_lib")
rgid=$(gzip ${R1_file} -cd | head -1 | sed 's/\s.*$//')
rg=$(echo "@RG\tID:$rgid\tSM:$sample\tLB:$lib\tPL:ILLUMINA")
ii) The "MATE_NOT_FOUND" error is stumping me. From what I can see online, this happens if I run bwa mem with paired reads, but either the file does not contain paired reads, or they are not recognized as such. My PI is telling me that the WES BAMs were created with paired-end reads. How can I check this assumption?
Any help or guidance would be greatly appreciated. Thanks.
Christian
Worked like a charm. Thanks.
I am trying to use this in 2024 and it does not seem to work. Is it because the option names have changed?
if you have an error message, can you post it?
you may have to additionally 1) run "bwa index ref.fa" 2) change the grep ^@RG to grep "^@RG" in some shells