Truncated BAM Error
6.7 years ago
vivekruhela ▴ 20

Hi,

I am currently working on next-gen sequencing data and have recently completed my preprocessing pipeline. There are some points on which I would like opinions about whether I am going in the right direction:

  1. While converting from SAM to BAM I keep only properly paired reads. My command line is as follows:

    samtools view -S -@ 30 -M -f 0x02 -b input_sam -o input_bam

where -f 0x02 keeps only properly paired reads. I have checked how many reads this drops (reads flagged 0x04, 0x08, etc.). It is a very small number: the original SAM file is 26 GB, and a second SAM file containing all reads without 0x02 (i.e. the dropped reads) is 152 MB. So is it OK to discard all reads other than the properly paired ones?

  2. My post-processing steps are as follows:

    Sam to Bam conversion and take only properly paired reads

    Bam Validation

    Sorting of the BAM file with sort order "queryname" (because fixmate requires a queryname-sorted BAM file)

    fixmate using samtools

    Sorting again with sort order "coordinate" (because samtools rmdup requires a coordinate-sorted BAM file)

    Remove duplicates

    Indel Realignment

    BQSR

Now the problem: everything up to indel realignment is OK, but after base quality score recalibration (BQSR) I always get a truncated file with a missing EOF marker (I checked this by running samtools view -c file.bam at every stage). This affects the later stages of my pipeline. Surprisingly, GATK still runs on the truncated BAM file, with an error at the last line. I don't know what I am missing, and I am looking for advice.
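
For reference, the -f 0x02 filter in point 1 is a bitwise test on the SAM FLAG field: a read is kept only if every bit in the mask is set. Here is a minimal Python sketch of that test. The function name and the example flag values are illustrative assumptions and are not taken from the data in this post.

```python
# Bit values defined for the FLAG field in the SAM specification.
PROPER_PAIR = 0x2    # each segment properly aligned according to the aligner
READ_UNMAPPED = 0x4  # this read is unmapped
MATE_UNMAPPED = 0x8  # the mate is unmapped


def passes_required_flags(flag: int, required: int = PROPER_PAIR) -> bool:
    """Mimic `samtools view -f`: keep a read only if all required bits are set."""
    return (flag & required) == required


# Hypothetical FLAG values, chosen only to illustrate the test.
example_flags = [99, 147, 73, 133, 65]
kept = [f for f in example_flags if passes_required_flags(f)]
dropped = [f for f in example_flags if not passes_required_flags(f)]

print(kept, dropped)  # [99, 147] [73, 133, 65]
```

Reads such as 73 (mate unmapped) or 133 (read unmapped) never carry the 0x2 bit, which is why they end up in the "missing reads" file.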

EDIT: Sorry for the incomplete post. I have completed it by adding the last two post-processing steps. My apologies.

EDIT2: I am posting the warnings and errors (I only recently found them).

While computing the recalibration table (the .table file) I get the following warning:

WARN  04:28:08,395 IndexDictionaryUtils - Track knownSites doesn't have a
sequence dictionary built in,skipping dictionary validation

While writing the recalibrated BAM I get the following warning:

Failed to write core dump. Core dumps have been disabled. To enable core dumping,
try "ulimit -c unlimited" before starting Java again'

When using that recalibrated BAM file for variant calling with GATK HaplotypeCaller I get the following error:

ERROR MESSAGE: File out_recalibrated_bam.bai is malformed: Premature end-of-file while
reading BAM index file out_recalibrated_bam.bai It's likely that this file is truncated or corrupt -- 
Please try re-indexing the corresponding BAM file.

Any ideas about these error messages?

Thanks

R next-gen sequencing software error • 6.1k views
What version of samtools?

I am using Samtools-1.7

Can you check to see if the solutions provided in this thread help: How to systematically check if a bam file is truncated

Thanks for the reply. I have checked the link you sent. It helps when you don't know which BAM file is truncated or missing its EOF marker. I have already checked both ways, with 'samtools view -c' and 'samtools quickcheck', so I know which file is truncated. My question is: all the post-processing steps work fine except the last one. Why, and how do I correct the error? I also ran 'tail out.bam | hexdump -C' to look for the 28-byte EOF block and unfortunately did not find it. So how do I deal with this error? Thanks.
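
The same end-of-file check can be scripted. Below is a minimal Python sketch that compares the last 28 bytes of a file with the empty BGZF block that the SAM/BAM specification defines as the EOF marker. The function name and the temporary demo files are assumptions made for illustration. The marker only shows that the writer closed the file cleanly; it does not prove that the rest of the file is intact.

```python
import os
import tempfile

# The 28-byte empty BGZF block used as the BAM end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"
    "42430200" "1b00" "0300" "00000000" "00000000"
)


def has_bgzf_eof(path: str) -> bool:
    """Return True if the file ends with the BGZF EOF marker block."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF


# Demo on two stand-in files (not real BAM files): one that ends with the
# marker and one that was cut off before the marker was written.
with tempfile.TemporaryDirectory() as tmp:
    complete = os.path.join(tmp, "complete.bin")
    truncated = os.path.join(tmp, "truncated.bin")
    with open(complete, "wb") as out:
        out.write(b"payload" + BGZF_EOF)
    with open(truncated, "wb") as out:
        out.write(b"payload" + BGZF_EOF[:8])
    complete_ok = has_bgzf_eof(complete)
    truncated_ok = has_bgzf_eof(truncated)

print(complete_ok, truncated_ok)  # True False
```

On the command line, 'tail -c 28 out.bam | hexdump -C' prints exactly the bytes this function compares.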

samtools view -S -@ 30 -M -f 0x02 -b input_sam -o input_bam
  

You need -h, otherwise your BAM won't have the headers.

Do you mean that all of the errors are due to this (i.e. my BAM files don't have a header)? Please clarify.

I tried adding -h as well. The same error occurs. Up to indel realignment the BAM file is OK, but after recalibration the EOF marker is missing and the file is truncated. Here is the exact error message from samtools view -c recalibrated_out.bam:

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read] Read block operation failed with error -1 after 8 of 32 bytes
[main_samview] truncated file.

6.7 years ago
vivekruhela ▴ 20

After a lot of research, my problem is finally solved. The cause of the error is as follows:

I am using "The Genome Analysis Toolkit (GATK) v3.8-0-ge9d806836"

The command lines are as follows. For the table file:

java -Xms32g -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fastq -I indel_realignment_outfile.bam -knownSites All_20170710.vcf.gz -o outfile.table

For recalibration of the BAM:

java -Xms32g -jar GenomeAnalysisTK.jar -T PrintReads -R reference.fastq -I indel_realignment_outfile.bam -BQSR outfile.table -o outfile_recalibrated.bam

GATK 3.8 has a memory allocation bug caused by an old version of the Intel GKL. The Intel GKL was updated in version 3.8-1 and in the GATK 4.0 releases, so they do not have this bug.

So I removed -Xms32g and now it works fine.

It is odd that the bug bit you with only one file. Thanks for posting the answer to provide closure.

I was also thinking the same.

I also have this problem, but it still doesn't work even after I remove -Xms32g. The resulting BAM is still malformed.
