We want to remove the alternative contigs from both reference fasta and aligned BAM. The procedure we chose for BAM is to use Samtools View to keep only alignments chr1-22 and X,Y; we then: * extracted the header with samtools view -H, * deleted all the unused contigs * renamed the .txt file to .sam * used reheader BAM according to instructions.
The contents of the sam used for reheader is given below:
@HD VN:1.5 SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@SQ SN:chr3 LN:198295559
@SQ SN:chr4 LN:190214555
@SQ SN:chr5 LN:181538259
@SQ SN:chr6 LN:170805979
@SQ SN:chr7 LN:159345973
@SQ SN:chr8 LN:145138636
@SQ SN:chr9 LN:138394717
@SQ SN:chr10 LN:133797422
@SQ SN:chr11 LN:135086622
@SQ SN:chr12 LN:133275309
@SQ SN:chr13 LN:114364328
@SQ SN:chr14 LN:107043718
@SQ SN:chr15 LN:101991189
@SQ SN:chr16 LN:90338345
@SQ SN:chr17 LN:83257441
@SQ SN:chr18 LN:80373285
@SQ SN:chr19 LN:58617616
@SQ SN:chr20 LN:64444167
@SQ SN:chr21 LN:46709983
@SQ SN:chr22 LN:50818468
@SQ SN:chrX LN:156040895
@SQ SN:chrY LN:57227415
@RG ID:1 PL:Illumina HiSeq SM:Plasma
@PG ID:bwa PN:bwa CL:/opt/bwa-0.7.13/bwa mem -M -R @RG\tID:1\tPL:Illumina HiSeq\tSM:Plasma -t 8 Homo_sapiens_assembly38.fasta /sbgenomics/Projects/8e3d501e-ff08-4ae4-b3ec-cc73e54a9b9a/ERR855951_SarcomacfDNA_1.fastq.gz /sbgenomics/Projects/8e3d501e-ff08-4ae4-b3ec-cc73e54a9b9a/ERR855951_SarcomacfDNA_2.fastq.gz VN:0.7.13-r1126
@PG ID:SAMBLASTER CL:samblaster -i /dev/stdin -o /dev/stdout -M VN:0.1.22
@PG ID:GATK ApplyBQSR VN:4.beta.2 CL:ApplyBQSR --output ERR855951_SarcomacfDNA.recalibrated.bam --bqsr_recal_file /sbgenomics/Projects/8e3d501e-ff08-4ae4-b3ec-cc73e54a9b9a/workspace/5ef73ca8-e545-4e5c-9b91-beb69b529d41/ctdna-align_WES___BWA___GATK_4_0__PLASMA_GATK_BaseRecalibrator/ERR855951_SarcomacfDNA.recal_data.grp --input /sbgenomics/Projects/8e3d501e-ff08-4ae4-b3ec-cc73e54a9b9a/workspace/5ef73ca8-e545-4e5c-9b91-beb69b529d41/ctdna-align_WES___BWA___GATK_4_0__PLASMA_BWA_MEM_Bundle_0_7_13/ERR855951_SarcomacfDNA.bam --createOutputBamIndex true --preserve_qscores_less_than 6 --useOriginalQualities false --quantize_quals 0 --round_down_quantized false --emit_original_quals false --globalQScorePrior -1.0 --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --disableToolDefaultReadFilters false PN:GATK ApplyBQSR
After this any tool using the BAM fails, for example Samtools reports a truncated file. The only way I can access it if I used Samtools in the header-only mode, where again I get the same header. What could be the problem here?
Thanks for the answer. Is there a use-case where samtools reheader works correctly?
Please use
ADD COMMENT/ADD REPLY
when responding to existing posts to keep threads logically organized.This comment should have gone under @Devon's answer.
Sure, changing things like "chr1" to "1", or removing chromosomes from the end (or adding chromosomes there). I consider
samtools reheader
to be a very advanced function that largely requires the user be familiar with how BAM files are internally structured.I am asking because we have tried removing chromosomes from the end, as far as I can tell. Your advice works, but I am trying to understand what went wrong.