Hi everyone, bioinformatics noob here :)
I'm trying to extract supplementary reads (Illumina paired-end 150bp) from a series of bam files. Below is the samtools flagstat output for one of the bam files:
784306555 + 0 in total (QC-passed reads + QC-failed reads)
776501797 + 0 primary
0 + 0 secondary
7804758 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
781629426 + 0 mapped (99.66% : N/A)
773824668 + 0 primary mapped (99.66% : N/A)
776002786 + 0 paired in sequencing
388001393 + 0 read1
388001393 + 0 read2
752413378 + 0 properly paired (96.96% : N/A)
772405530 + 0 with itself and mate mapped
1419127 + 0 singletons (0.18% : N/A)
16842836 + 0 with mate mapped to a different chr
10571871 + 0 with mate mapped to a different chr (mapQ>=5)
I used the following command to extract paired-end fastq files:
samtools fastq -1 -2 -s -0 -c -n -@ ${sample_name}.sorted.bam
Then I read that samtools fastq actually ignores secondary and supplementary reads. So I tried using the command below to extract supplementary reads but I'm retrieving far less than the reported number in flagstat. Also, some of the reads that are extracted are 150bp and so they aren't split hits either. \
samtools view -b -h -@ -f 0x800 ${sample_name}.sorted.bam | samtools sort -m -@ -n -T - | bedtools bamtofastq -i -fq -fq2
These bam files were created using the -Y flag in bwa mem so the supplementary reads were soft clipped and so according to this post (Converting BAM to Fastq - losing reads), I might already be extracting the supplementary reads in the first samtools fastq command?
Could anyone confirm this or provide an alternative way to extract supplementary reads. Thank you for your help!
that's wrong. unless sorted.bam was sorted on QUERY name (NOT coordinate).
hi thanks for replying. i did sort the bam file by name using samtools collate prior to samtools fastq