After alot of hairpulling over why strelka wouldn't run on my bam files, I found that for 18 contig labels were wrong, and they're off by one (picture below). I've figured out which labels should be changed and to what, but I have no idea how to get there. I tried using sed, first changing the labels in question to a unique name (since if i just use sed normally if would keep replacing the label i just changed), but i get the error: failed to read the header for '-'. and the resulting inter.bam is empty.
samtools view -H ${names[${SLURM_ARRAY_TASK_ID}]}_final.bam | sed -e 's/SN:chrUn_KI270741v1/SN:bebop1/' | sed -e 's/SN:chrUn_KI270742v1/SN:bebop2/' | \ sed -e 's/SN:chrUn_KI270743v1/SN:bebop3/' | sed -e 's/SN:chrUn_KI270744v1/SN:bebop4/' | \ sed -e 's/SN:chrUn_KI270745v1/SN:bebop5/' | sed -e 's/SN:chrUn_KI270746v1/SN:bebop6/' | \ sed -e 's/SN:chrUn_KI270747v1/SN:bebop7/' | sed -e 's/SN:chrUn_KI270748v1/SN:bebop8/' | \sed -e 's/SN:chrUn_KI270749v1/SN:bebop9/' | sed -e 's/SN:chrUn_KI270750v1/SN:bebop10/' | \ sed -e 's/SN:chrUn_KI270751v1/SN:bebop11/' | sed -e 's/SN:chrUn_KI270752v1/SN:bebop12/' | \ sed -e 's/SN:chrUn_KI270753v1/SN:bebop13/' | sed -e 's/SN:chrUn_KI270754v1/SN:bebop14/' | \ sed -e 's/SN:chrUn_KI270755v1/SN:bebop15/' | sed -e 's/SN:chrUn_KI270756v1/SN:bebop16/' | \ sed -e 's/SN:chrUn_KI270757v1/SN:bebop17/' | sed -e 's/SN:chrY_KI270740v1_random/SN:bebop18/'| \ samtools reheader - ${names[${SLURM_ARRAY_TASK_ID}]}_final.bam > ${names[${SLURM_ARRAY_TASK_ID}]}_inter.bam
I've also just tried plugging in a single file (instead of the slurm array), and i get [mii] aborted by user!
Any ideas on how to solve this? also apologies for the long line of code, for some reason I can't get it to be as a block!
I have no idea what this is all about but my recommendation is to rerun alignment if BAMs are odd. Don't do custom manipulation. bwa mem, duplicate marking, sort, then strelka. No custom things should be needed.
I would point out that it seems you are trying to get around a much bigger problem...
why do you have a fasta with incorrect chromosome names? and how did it happen?
In general, little to no modification should be done to them after being downloaded from UCSC/NCBI etc...
You are now taking your bam file and applying incorrect chromosome names (i.e. they differ from the TRUE reference - they seem to be correct in your BAM and wrong in your FASTA). Although this is specifically in non-standard chromosomes (and likely outside of what your study is focused on) if you published results with these swaps this could be grounds for a future correction/retraction of the paper(s)...
that's a very good question, I was just given these BAMs (ie I didn't do upstream processing). The fasta has the correct chromosome names, I redownloaded it just to be sure, but I will look into seeing why the contig labels are off
https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chrUn_KI270741v1
shows a length of 157,432
and https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&position=chrUN_ki270757v1
shows a length of 71,251
-
so it looks like the fasta is incorrect. if i'm reading your table correctly and the first column is the fasta chr name and the second column the fasta chr length
EDIT: also confirmed on a copy of the UCSC genome i have locally
oh my goodness thank you so much for catching that, you're right, i must've done something to mess up the original fasta! I've corrected it and now everything is running smoothly, thank you again for your help, it's very much appreciated and saved me alot of time! :)