Hello all, I had WGS carried out, which resulted in a 57 GB BAM file. I split it by chromosome with the command 'bamtools split -in file.bam -reference' and was surprised to find an 'unmapped.bam' of 28.1 GB (nearly half the size of the total BAM file). What is unmapped.bam? Is it not usable in any way?
I indexed unmapped.bam and then ran 'samtools flagstat' on it, which produced this (if it helps):
360510943 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00% : N/A)
360510943 + 0 paired in sequencing
180165653 + 0 read1
180345290 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Any explanation would be appreciated. Thank you.
Run samtools flagstat on the original BAM file.
There are around 180 million read pairs which did not map to the reference genome.
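As a rough check on that figure, here is a minimal Python sketch. The counts are copied from the flagstat output of unmapped.bam above; the variable names are only illustrative.

```python
# Counts copied from the flagstat output of unmapped.bam above.
total_reads = 360_510_943
read1 = 180_165_653
read2 = 180_345_290

# read1 and read2 together should account for every read in the file.
assert read1 + read2 == total_reads

# The read1 and read2 counts differ slightly, so this is only an
# approximate pair count (total reads divided by two).
approx_pairs_millions = total_reads / 2 / 1e6
print(f"about {approx_pairs_millions:.1f} million unmapped read pairs")
```

The read1 and read2 counts are not identical, so "180 million pairs" is an approximation rather than an exact pair count.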
Thank you for your help. Is that something I need to be concerned about? Why don't they map to the reference genome? I used hg19. The flagstat output for the original BAM file is:
839667458 + 0 in total (QC-passed reads + QC-failed reads)
23306152 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
466993833 + 0 mapped (55.62% : N/A)
816361306 + 0 paired in sequencing
408091264 + 0 read1
408270042 + 0 read2
430841954 + 0 properly paired (52.78% : N/A)
431524999 + 0 with itself and mate mapped
12162682 + 0 singletons (1.49% : N/A)
213835 + 0 with mate mapped to a different chr
213835 + 0 with mate mapped to a different chr (mapQ>=5)
Thank you for your help.
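The two flagstat outputs can be cross-checked against each other. The sketch below is a hedged illustration in Python: it parses a few lines copied from the output for the original BAM file, computes how many records are unmapped, and compares that with the total reported for unmapped.bam. The helper name `parse_flagstat` is hypothetical, not part of samtools.

```python
import re

# Lines copied from the flagstat output of the original BAM file above.
FLAGSTAT = """\
839667458 + 0 in total (QC-passed reads + QC-failed reads)
23306152 + 0 secondary
466993833 + 0 mapped (55.62% : N/A)
12162682 + 0 singletons (1.49% : N/A)
"""

def parse_flagstat(text):
    """Return {label: QC-passed count} for each flagstat line (hypothetical helper)."""
    counts = {}
    for line in text.splitlines():
        # "<passed> + <failed> <label> (optional details)"
        m = re.match(r"(\d+) \+ (\d+) (.+?)(?: \(.*)?$", line)
        if m:
            counts[m.group(3)] = int(m.group(1))
    return counts

counts = parse_flagstat(FLAGSTAT)
total = counts["in total"]
mapped = counts["mapped"]

# Records in the original BAM that are not flagged as mapped.
unmapped = total - mapped
print(f"unmapped records: {unmapped} ({100 * unmapped / total:.2f}% of total)")

# Total reported by flagstat for unmapped.bam earlier in the thread.
unmapped_in_split_file = 360_510_943

# Unmapped records that did not end up in unmapped.bam.
elsewhere = unmapped - unmapped_in_split_file
print(f"unmapped records outside unmapped.bam: {elsewhere}")
print(f"equals singleton count: {elsewhere == counts['singletons']}")
```

With these numbers the difference matches the singleton count exactly. One plausible reading, stated here as an assumption, is that an unmapped read whose mate did map is stored under the mate's chromosome and so lands in that chromosome's file after the split, while unmapped.bam holds only pairs in which neither read mapped.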
Why they didn't map to the reference genome is a different question. To understand why there are so many unmapped reads, you need to provide more details: which genome it is, what type of data it is, the tools and commands used, etc.