Question

What is the very large unmapped.bam within my full WGS bam?

7.3 years ago

ycsm ▴ 10

Hello all, I had WGS carried out, which resulted in a 57 GB BAM file. I split it by chromosome with 'bamtools split -in file.bam -reference' and was surprised to find an 'unmapped.bam' of 28.1 GB (nearly half the size of the total BAM file). What is the unmapped.bam? Is it not usable in any way?

I indexed unmapped.bam and then ran 'samtools flagstat' on it, which produced this (if it helps):

360510943 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

0 + 0 mapped (0.00% : N/A)

360510943 + 0 paired in sequencing

180165653 + 0 read1

180345290 + 0 read2

0 + 0 properly paired (0.00% : N/A)

0 + 0 with itself and mate mapped

0 + 0 singletons (0.00% : N/A)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

Any explanation would be appreciated. Thank you.

sequencing genome gene • 1.7k views

ADD COMMENT • link 7.3 years ago by ycsm ▴ 10

Run samtools flagstat on the original BAM file.

There are around 180 million read pairs which did not map to the reference genome.

ADD REPLY • link 7.3 years ago by GouthamAtla 12k
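
The "around 180 million read pairs" figure can be checked directly from the flagstat report for unmapped.bam. The following is a minimal Python sketch, assuming the relevant flagstat lines are pasted in as a string; `parse_flagstat` is a hypothetical helper written for this illustration, not part of samtools or bamtools.

```python
import re

# flagstat lines for unmapped.bam, copied from the question above
FLAGSTAT_UNMAPPED = """\
360510943 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 mapped (0.00% : N/A)
360510943 + 0 paired in sequencing
180165653 + 0 read1
180345290 + 0 read2
"""

def parse_flagstat(text):
    """Return {label: QC-passed count} for each 'N + M label' line."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"(\d+) \+ (\d+) (.+)", line)
        if m:
            # Drop a trailing parenthetical such as "(0.00% : N/A)"
            label = re.sub(r"\s*\(.*\)$", "", m.group(3))
            counts[label] = int(m.group(1))
    return counts

stats = parse_flagstat(FLAGSTAT_UNMAPPED)
total_reads = stats["in total"]

# Every record in this file is either read1 or read2 of a pair
assert stats["read1"] + stats["read2"] == total_reads

# Two reads per pair, so roughly total / 2 pairs
approx_pairs_millions = round(total_reads / 2 / 1e6, 1)
# read1 and read2 counts differ slightly, so a few reads have no partner here
read_imbalance = stats["read2"] - stats["read1"]

print(f"unmapped reads: {total_reads}")
print(f"approx. pairs (millions): {approx_pairs_millions}")
print(f"read2 minus read1: {read_imbalance}")
```

None of the reads in this file are flagged as mapped, which is why every mapping-related line in the report is zero.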

Thank you for your help. Is that something I need to be concerned about? Why don't they map to the reference genome? I used hg19. The flagstat on the original BAM file is:

839667458 + 0 in total (QC-passed reads + QC-failed reads)

23306152 + 0 secondary

0 + 0 supplementary

0 + 0 duplicates

466993833 + 0 mapped (55.62% : N/A)

816361306 + 0 paired in sequencing

408091264 + 0 read1

408270042 + 0 read2

430841954 + 0 properly paired (52.78% : N/A)

431524999 + 0 with itself and mate mapped

12162682 + 0 singletons (1.49% : N/A)

213835 + 0 with mate mapped to a different chr

213835 + 0 with mate mapped to a different chr (mapQ>=5)

Thank you for your help.

ADD REPLY • link 7.3 years ago by ycsm ▴ 10
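
The two flagstat reports in this thread can be reconciled with simple arithmetic. The sketch below uses only counts copied from the reports. It rests on one assumption, labeled in the code: each "singleton" (a mapped read whose mate is unmapped) has an unmapped mate that carries the mapped read's reference name, as the SAM convention recommends, so that 'bamtools split -reference' would place it in a chromosome file rather than in unmapped.bam. The variable names are illustrative.

```python
# Counts copied from the flagstat reports in this thread (QC-passed column)
total = 839_667_458           # original BAM: in total
secondary = 23_306_152        # original BAM: secondary
mapped = 466_993_833          # original BAM: mapped (includes secondary)
paired = 816_361_306          # original BAM: paired in sequencing
both_mapped = 431_524_999     # original BAM: with itself and mate mapped
singletons = 12_162_682       # original BAM: singletons
unmapped_bam_total = 360_510_943  # unmapped.bam: in total

# Primary records = everything except secondary alignments
primary = total - secondary
assert primary == paired

# Mapped primary reads are either in a fully mapped pair or singletons
primary_mapped = mapped - secondary
assert primary_mapped == both_mapped + singletons

# Unmapped records in the original BAM
unmapped = total - mapped

# ASSUMPTION: every singleton has exactly one unmapped mate in the file,
# and that mate is stored under the mapped read's reference name.
unmapped_with_mapped_mate = singletons
both_unmapped = unmapped - unmapped_with_mapped_mate

# Under that assumption, the remainder is exactly the content of unmapped.bam
assert both_unmapped == unmapped_bam_total

print(f"unmapped reads in original BAM: {unmapped}")
print(f"reads in pairs where neither mate mapped: {both_unmapped}")
print(f"unmapped share of primary reads: {unmapped / primary:.2%}")
```

All three checks hold for the numbers posted here, which suggests that unmapped.bam contains the pairs in which neither read aligned. It also shows that roughly 46% of the primary reads are unmapped; the 55.62% "mapped" figure in the report counts secondary alignments as well.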

Why they didn't map to the reference genome is a different question. To understand why there are so many unmapped reads, you need to provide more details: which genome it is, what type of data it is, and which tools and commands were used.

ADD REPLY • link 7.3 years ago by GouthamAtla 12k
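
One place to look for "the tools and the command used" is the BAM header: the SAM format defines @PG records with tags such as ID, PN (program name), VN (version) and CL (command line), and 'samtools view -H file.bam' prints the header as text. The Python sketch below parses such text. The header shown is invented for illustration (the "aligner" program and its command line are hypothetical), and a real file may have no @PG records at all if the tools did not write them.

```python
# Hypothetical header text, shaped like the output of `samtools view -H`.
# The @PG values are invented for illustration only.
EXAMPLE_HEADER = "\n".join([
    "@HD\tVN:1.4\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:249250621",
    "@PG\tID:aligner\tPN:aligner\tVN:0.0\tCL:aligner ref.fa r1.fq r2.fq",
])

def program_records(header_text):
    """Return one dict of TAG -> value for each @PG line in the header."""
    records = []
    for line in header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        # Fields are tab-separated "TAG:value"; split on the first colon only
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        records.append(fields)
    return records

for rec in program_records(EXAMPLE_HEADER):
    print(rec.get("PN", "?"), rec.get("VN", "?"), "|", rec.get("CL", "?"))
```

Posting the @PG lines together with the reference build and the library type gives answerers most of the details requested above.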