How to remove singletons in a BAM file using samtools
1
2
Entering edit mode
6.0 years ago
neranjan ▴ 70

Dear All,

I want to remove the singletons from the aligned bam file.

Downloaded fastq files using

fastq-dump --split-files SRR1517848

Then aligned the paired end fastq files, using BWA by:

bwa mem hg19.fa R_1.fa R_2.fa -o SRR1517848.sam

Then converted it to BAM format and I used samtools to remove the singletons using : according to the paper, article

samtools view -@ 8 -F 0x04 -b SRR1517848.sam > SRR1517848.bam

Then I looked into the stats using samtools flagstat command by;

samtools flagstat SRR1517848.bam

which gave;

   Before filtering 
    4614120 + 0 in total (QC-passed reads + QC-failed reads)
    4549753 + 0 mapped (98.61% : N/A)
    4609656 + 0 paired in sequencing
    2304828 + 0 read1
    2304828 + 0 read2
    4461356 + 0 properly paired (96.78% : N/A)
    4517788 + 0 with itself and mate mapped
    27501 + 0 singletons (0.60% : N/A)
    39296 + 0 with mate mapped to a different chr
    33576 + 0 with mate mapped to a different chr (mapQ>=5)

And after filtering

4549753 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
4464 + 0 supplementary
0 + 0 duplicates
4549753 + 0 mapped (100.00% : N/A)
4545289 + 0 paired in sequencing
2281318 + 0 read1
2263971 + 0 read2
4461356 + 0 properly paired (98.15% : N/A)
4517788 + 0 with itself and mate mapped
27501 + 0 singletons (0.61% : N/A)
39296 + 0 with mate mapped to a different chr
33576 + 0 with mate mapped to a different chr (mapQ>=5)

So my Questions:

1) So the command to remove the singletons by -F 0x04 have only removed unmapped reads ?

2) In both before and after stats the singletons remain, and what are those?

Before: 27501 + 0 singletons (0.60% : N/A)

After : 27501 + 0 singletons (0.61% : N/A)

3) Is there a way to remove singletons in variant detection GATK pipeline ? what benefits does it have?

Thanks

BWA alignment samtools singletons • 6.0k views
ADD COMMENT
5
Entering edit mode
6.0 years ago
rizoic ▴ 250

As per the SAM flag guide the flag that you are filtering for i.e. 0x04 is for unmapped reads. This explains the removal of unmapped reads.

In case you want to filter for singletons i.e. reads for which the mate is unmapped you an select that as the condition. This gives a SAM flag value of 0x08. You can hence change your command to

samtools view -@ 8 -F 0x08 -b SRR1517848.sam > SRR1517848.bam

This should remove the singleton reads.

ADD COMMENT
3
Entering edit mode

I'd also add the header to the bam file (by adding the -h parameter), to avoid problems in downstream analysis.

ADD REPLY
0
Entering edit mode

Thanks when I used the above command it gave singletons removed reads:

4549662 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
4373 + 0 supplementary
0 + 0 duplicates
4522161 + 0 mapped (99.40% : N/A)
4545289 + 0 paired in sequencing
2263971 + 0 read1
2281318 + 0 read2
4461356 + 0 properly paired (98.15% : N/A)
4517788 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
39296 + 0 with mate mapped to a different chr
33576 + 0 with mate mapped to a different chr (mapQ>=5)

As for SAM flag guide when I use 0x08 I am removing mate unmapped (0x8) reads, and with a warning: Flag(s) and 0x8 cannot be set when read is not paired

eg: R1 and R2 are mate pairs, where:
R1 = mapped (mate is unmapped)
R2 = unmapped .
So with this I am removing the mate unmapped reads only ?
i.e so I am removing R2 reads and keeping the R1 ?

2) My Second question is How important is to remove the singletons in GATK variant detection pipe line? Is this a step that I need to be doing? or can I avoid this?

ADD REPLY
0
Entering edit mode

Thanks when I used the above command it gave singletons removed reads:

4549662 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
4373 + 0 supplementary
0 + 0 duplicates
4522161 + 0 mapped (99.40% : N/A)
4545289 + 0 paired in sequencing
2263971 + 0 read1
2281318 + 0 read2
4461356 + 0 properly paired (98.15% : N/A)
4517788 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
39296 + 0 with mate mapped to a different chr
33576 + 0 with mate mapped to a different chr (mapQ>=5)

As for SAM flag guide when I use 0x08 I am removing mate unmapped (0x8) reads, and with a warning: Flag(s) and 0x8 cannot be set when read is not paired

eg: R1 and R2 are mate pairs, where:
R1 = mapped (mate is unmapped)
R2 = unmapped .
So with this I am removing the mate unmapped reads only ?
i.e so I am removing R2 reads and keeping the R1 ?

2) My Second question is How important is to remove the singletons in GATK variant detection pipe line? Is this a step that I need to be doing? or can I avoid this?

ADD REPLY
1
Entering edit mode

In your e.g. R1 is the singleton read since its mate is unmapped and thus R1 read will be removed when you filter for the flag 0x08.

ADD REPLY
0
Entering edit mode

Thanks for the clarification. So why is it important to remove the singleton ?
(So my thinking is::: since its mate can not be mapped to the chromosome , the mapped read can not be accurately determined. )

ADD REPLY

Login before adding your answer.

Traffic: 1915 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6