Hi there I have a BAM file which needs to be edited and re-headed. Now, I'm aware of how to do so the problem is that for some reason the sed
command I'm using does not catch the sequence I have to remove... Below, an example of the BAM header
@HD VN:1.5 SO:coordinate
@SQ SN:chr12:0-133265294#0 LN:133265294
@SQ SN:chr12:0-17769809#0 LN:17769809
@SQ SN:chr12:0-34719407#0 LN:34719407
@SQ SN:chr12:0-86846879#0 LN:86846879
@SQ SN:chr12:17859256-133275309#0 LN:115416053
@SQ SN:chr12:37246182-133275309#0 LN:96029127
@SQ SN:chr12:86858727-133275309#0 LN:46416582
As you can see, the @SQ
lines are in the format of SN:<chr_number>:<region>#0
what I really need is to get rid of the :<region>#0
to run an evaluation against the GIAB benchmark set, which uses the chr_number
format.
What I've been running is the following
samtools view -H HG002_chr.bam | \
sed -E 's/:[[:digit:]]+-[[:digit:]]+#0//' | \
samtools reheader - HG002_chr.bam > test.bam
Specifically, for some reason, I'm getting an error related to the header
containing duplicate sequences. Does anyone knows a way around it, granted the command I'm running is the correct one? Thanks in advance!
if you are running against GIAB then perhaps what you need to change is the VCF file chromosome naming, and you can do that in multiple ways
use the current BAM file to make the VCF then rename the chromosomes in the VCF
Thanks, I appreciate the suggestion. Indeed, this is what I did at first; however,
hap.py
then complains about regions in the VCF pointing to contigs in the reference which have the same header because they are fragments of the original input reference. This is the result of extracting the reference from a pangenome graph and, unfortunately, there is not really a way around it; it doesn't preserve contigs names and identity and when I rename them to match the formatchr_number
this is the result.Alternatively, I could edit the GIAB truth set VCF but this would require specific knowledge of what-is-what in the two VCF file for making calls against the same exact regions to benchmark the effectiveness of the alignment tools used.