Question

Extracting supplementary reads from bam file

0

Entering edit mode

3 months ago

g1ang ▴ 20

Hi everyone, bioinformatics noob here :)

I'm trying to extract supplementary reads (Illumina paired-end 150bp) from a series of bam files. Below is the samtools flagstat output for one of the bam files:

784306555 + 0 in total (QC-passed reads + QC-failed reads)
776501797 + 0 primary
0 + 0 secondary
7804758 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
781629426 + 0 mapped (99.66% : N/A)
773824668 + 0 primary mapped (99.66% : N/A)
776002786 + 0 paired in sequencing
388001393 + 0 read1
388001393 + 0 read2
752413378 + 0 properly paired (96.96% : N/A)
772405530 + 0 with itself and mate mapped
1419127 + 0 singletons (0.18% : N/A)
16842836 + 0 with mate mapped to a different chr
10571871 + 0 with mate mapped to a different chr (mapQ>=5)

I used the following command to extract paired-end fastq files:

samtools fastq -1 -2 -s -0 -c -n -@ ${sample_name}.sorted.bam

Then I read that samtools fastq actually ignores secondary and supplementary reads. So I tried using the command below to extract supplementary reads but I'm retrieving far less than the reported number in flagstat. Also, some of the reads that are extracted are 150bp and so they aren't split hits either. \

samtools view -b -h -@ -f 0x800 ${sample_name}.sorted.bam | samtools sort -m -@ -n -T - | bedtools bamtofastq -i -fq -fq2

These bam files were created using the -Y flag in bwa mem so the supplementary reads were soft clipped and so according to this post (Converting BAM to Fastq - losing reads), I might already be extracting the supplementary reads in the first samtools fastq command?

Could anyone confirm this or provide an alternative way to extract supplementary reads. Thank you for your help!

samtools bwa-mem • 666 views

ADD COMMENT • link updated 3 months ago by cmdcolin ★ 4.0k • written 3 months ago by g1ang ▴ 20

0

Entering edit mode

samtools fastq -1 -2 -s -0 -c -n -@ ${sample_name}.sorted.bam

that's wrong. unless sorted.bam was sorted on QUERY name (NOT coordinate).

The input to this program must be collated by name. Run 'samtools collate' or 'samtools sort -n' to achieve this.

ADD REPLY • link 3 months ago by Pierre Lindenbaum 164k

1

Entering edit mode

hi thanks for replying. i did sort the bam file by name using samtools collate prior to samtools fastq

ADD REPLY • link 3 months ago by g1ang ▴ 20

score 3 · Accepted Answer · 2024-08-22

3

Entering edit mode

3 months ago

cmdcolin ★ 4.0k

one thing that is important is to know "why" you are doing this. do you feel like you are missing the supplementary reads when you are converting to fastq? if so, then you are likely operating under a faulty assumption. there is a reason that secondary and supplementary reads are not included when converting to fastq....it's because they are just "artifacts" of the alignment. the hint of this is that all supplementary and secondary share the same QNAME. they share the same query sequence! only that one copy of that query sequence should be converted back to FASTQ, and the primary alignment is all that is needed to do so. the primary alignment contains the full SEQ record.

ADD COMMENT • link 3 months ago by cmdcolin ★ 4.0k

1

Entering edit mode

Hi, thank you for your response! I was under the assumption that for chimeric reads, the primary alignment would be stored separately from the shorter split hits and as such, would need to be extracted separately. However, it sounds like I'm already extracting the full SEQ record, which makes things a lot easier :) P.S. I'm helping to recover reads from bam files after raw sequences were lost in a mass deletion.

ADD REPLY • link 3 months ago by g1ang ▴ 20

0

Entering edit mode

yes...the primary read contains the full SEQ field that can be translated back to FASTQ. best luck with recovery! i think it should be ok, maybe except for adapter trimming if it was done, then it should basically be the same as the original fastq

ADD REPLY • link 3 months ago by cmdcolin ★ 4.0k