I'm working with paired-end wgs data I downloaded from TCGA.
I'm trying to extract the reads that align to a specific region, extract only those reads to two fastq files, one for each pair. Unfortunately, I am getting a different number of reads in both fastq files because some of the reads in file.1.fq
are not in file.2.fq
.
Based on the answer to this question (Why number of #read1 and #read2 is different in samtools flagstat output?) I've tried subsetting my bam file as follows:
samtools view -bh -F 2048 TCGA_bam.bam chr1:200500500-202500000 chr2:200500500-202500000 > subset_bam.bam
when running samtools flagstat subset_bam.bam
on the output file, the number of reads for read1 and read2 do not match. However, I've managed to make them match by 1) sorting by read name and 2) running samtools fixmate.
I've also tried different flags in samtools view including
-f 3 -F 3080
-f 2 -F 3080
-f 2 -F 3080
-F 3072
(there may be a few others). However, in each case, after converting subset_bam.bam
with samtools fastq
the number of reads in the fastq's derived from subset_bam.bam
do not match. For example:
$ echo $(cat subset.1.fastq | wc -l)/4 | bc
20057
$ echo $(cat subset.2.fastq | wc -l)/4 | bc
19991
I've also extracted the reads from the original bam file into two fastq's. However, at this stage they have the same number of reads:
$ echo $(cat full_file.1.fq | wc -l)/4 | bc
223500494
$ echo $(cat full_file.2.fq | wc -l)/4 | bc
223500494
So this does not appear to be an issue with the input data, but rather is related to subsetting.
How can I extract reads that fall into a specific region, convert to fastq and ensure I have the same number of reads in each fastq?
you should use -F 2304 , not just -F 2048 , to remove secondary alignments. https://broadinstitute.github.io/picard/explain-flags.html
Hi Pierre, thanks for the suggestion. I've just tried it again with
-F 2304
and unfortunately I'm getting the same issue: 30205 for my first fq file and 30071 for my second file. Any idea what is going on?