Hi,
I am currently working on next-gen sequencing data and have recently completed my preprocessing pipeline. There are a few points I would like opinions on, to know whether I am going in the right direction:
While converting from SAM to BAM I keep only properly paired reads. My command line is as follows:
samtools view -S -@ 30 -M -f 0x02 -b input_sam -o input_bam
Here -f 0x02 keeps only properly paired reads. I checked how many reads this drops (e.g. those flagged 0x04, 0x08, etc.) and it is a very small fraction: the original SAM file is 26 GB, while a second SAM file holding all the reads without 0x02 (i.e. the dropped reads) is 152 MB. So is it OK to discard all reads that are not properly paired?
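File sizes are only a rough proxy for how many reads are dropped, so counting records by flag is more direct. A minimal sketch, assuming a plain-text SAM file; the function name count_proper_pairs is made up for illustration, and it tests the 0x2 bit of the FLAG column with awk rather than calling samtools:

```shell
# Count alignment records with and without the "properly paired" bit (0x2)
# in a SAM text file. Header lines (starting with "@") are skipped.
count_proper_pairs() {
    awk -F '\t' '
        /^@/ { next }                                   # skip header lines
        { if (int($2 / 2) % 2 == 1) proper++; else other++ }  # bit 0x2 of FLAG
        END { printf "proper=%d other=%d\n", proper, other }
    ' "$1"
}
```

Calling count_proper_pairs input.sam prints both counts on one line, which gives the dropped fraction in reads rather than in bytes.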
My post processing steps are as follows:
SAM to BAM conversion, keeping only properly paired reads
BAM validation
Sorting of the BAM file with sort order "queryname" (because fixmate requires a name-sorted BAM file)
fixmate using samtools
Sorting again with sort order "coordinate" (because samtools rmdup requires a coordinate-sorted BAM file)
Remove duplicates
Indel Realignment
BQSR
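The samtools part of the steps above could be sketched roughly as follows, with a samtools quickcheck after some stages so that a truncated file is caught where it first appears. This is only a sketch under assumptions: the run and postprocess helpers, the intermediate file names, and the THREADS and DRY_RUN variables are all made up for illustration, and the GATK steps (indel realignment, BQSR) are left out. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
# Run a command, or just print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: $1 = input SAM, $2 = prefix for output files.
postprocess() {
    in_sam=$1
    prefix=$2
    threads=${THREADS:-4}   # placeholder thread count

    # SAM -> BAM, keeping only properly paired reads
    run samtools view -@ "$threads" -b -f 0x2 -o "$prefix.pp.bam" "$in_sam" || return 1
    run samtools quickcheck "$prefix.pp.bam" || return 1
    # Name sort, then fixmate
    run samtools sort -n -@ "$threads" -o "$prefix.nsort.bam" "$prefix.pp.bam" || return 1
    run samtools fixmate "$prefix.nsort.bam" "$prefix.fixmate.bam" || return 1
    # Coordinate sort, then duplicate removal
    run samtools sort -@ "$threads" -o "$prefix.csort.bam" "$prefix.fixmate.bam" || return 1
    run samtools rmdup "$prefix.csort.bam" "$prefix.rmdup.bam" || return 1
    # Check and index the result before handing it to GATK
    run samtools quickcheck "$prefix.rmdup.bam" || return 1
    run samtools index "$prefix.rmdup.bam" || return 1
}
```

Because every step returns on failure, the function stops at the first stage that produces a bad file instead of passing it downstream.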
Everything up to indel realignment is OK, but after base quality score recalibration (BQSR) I always get a truncated file with a missing EOF marker (I checked this with the command samtools view -c file.bam at every stage). Because of this, the later stages of my pipeline are affected. Surprisingly, GATK still works with the truncated BAM file, with an error at the last line. I don't know what I am missing. I am also looking for advice on getting better performance.
EDIT: Sorry for the incomplete post. I have completed it by adding the last two post-processing steps. My apologies.
EDIT2: I am posting the warnings and errors (I only recently found them).
While calculating the recalibration scores and writing the .table file, I get the following warning:
WARN 04:28:08,395 IndexDictionaryUtils - Track knownSites doesn't have a
sequence dictionary built in,skipping dictionary validation
While producing the recalibrated BAM, I get the following warning:
Failed to write core dump. Core dumps have been disabled. To enable core dumping,
try "ulimit -c unlimited" before starting Java again
While using that recalibrated BAM file for variant calling with GATK HaplotypeCaller, I get the following error:
ERROR MESSAGE: File out_recalibrated_bam.bai is malformed: Premature end-of-file while
reading BAM index file out_recalibrated_bam.bai It's likely that this file is truncated or corrupt --
Please try re-indexing the corresponding BAM file.
Any ideas about these error messages?
Thanks
What version of samtools?
I am using Samtools-1.7
Can you check to see if the solutions provided in this thread help: How to systematically check if a bam file is truncated
Thanks for the reply. I have checked the link you sent. It helps when we don't know which BAM file is truncated or missing its EOF marker. I have already checked both ways, with 'samtools view -c' and 'samtools quickcheck', so I know which file is truncated. My question is: all the post-processing steps work fine except the last one. Why, and how can I correct the error? I also ran 'tail out.bam | hexdump -C' to look for the 28-byte EOF marker and unfortunately did not find it. How should I deal with this error? Thanks.
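Since plain tail works on lines, which is unreliable for a binary file, reading exactly the last 28 bytes may be a more precise way to do the same check. A minimal sketch, assuming the standard empty BGZF block from the SAM/BAM specification as the expected marker; the function name has_bgzf_eof and the variable BGZF_EOF_HEX are made up for illustration:

```shell
# Expected final 28 bytes of a complete BAM file (the empty BGZF block
# defined in the SAM/BAM specification), written as lowercase hex.
BGZF_EOF_HEX="1f8b0804""00000000""00ff0600""42430200""1b000300""00000000""00000000"

# Return 0 if the file ends with the BGZF EOF block, 1 otherwise.
has_bgzf_eof() {
    # Dump the last 28 bytes as hex and strip spaces and newlines.
    tail_hex=$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')
    [ "$tail_hex" = "$BGZF_EOF_HEX" ]
}
```

For example, has_bgzf_eof recalibrated_out.bam || echo "EOF marker missing" reports a file whose writer did not finish cleanly.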
You need -h, otherwise your BAM won't have the headers.
Do you mean that all of the errors are due to this (i.e. my BAM files don't have a header)? Please clarify.
I tried adding -h as well, and the same error occurs. Up to indel realignment the BAM file is OK, but after recalibration the EOF marker is missing and the file is truncated. Here is the exact error message from samtools view -c recalibrated_out.bam:
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
[E::bgzf_read] Read block operation failed with error -1 after 8 of 32 bytes
[main_samview] truncated file.