I am working with paired-end Illumina Hiseq data. My goal is to create a new set of fastq files containing only paired reads that do not map to an E. coli reference.
I created an alignment to an E. coli reference, then extracted unmapped paired reads to a new file:
samtools view -u -f 12 ecoli_sam.sam > unmapped.bam
I sorted to maintain the order of paired reads
samtools sort -n unmapped.bam sorted_unmapped
And converted to fastq
bamToFastq -useTags -bam sorted_unmapped.bam -fq1 unmappedR1.fastq -fq2 unmappedR2.fastq
When I view the resulting fastq files, the first file looks normal (unmappedR1.fastq), but the paired file (unmappedR2.fastq) has only the read headers but not sequence. An example is below.
@HISEQ:242:H32LYADXX:2:1101:1272:81236/2
+
@HISEQ:242:H32LYADXX:2:1101:1273:64750/2
+
Can someone tell me where I'm going wrong?
could you run the following:
and post the results?
Sure, here's the output:
That is weird. You are supposed to have at least 11 fields: https://samtools.github.io/hts-specs/SAMv1.pdf
Try using 13 as the SAM flag, this specifies that the read and its mate are unmapped and that they are paired.
I just gave this a try, and unfortunately using the 13 flag still gives the same incomplete fastq file for the paired reads.
Maybe the error is that you are passing in a SAM file to tools which require a BAM file? Your first filtering command will produce SAM output, but you've given it a BAM extension. Maybe try filtering again using the -b flag to ensure a BAM format is output?
Hm, still doesn't seem to be solving the problem. I tried using a BAM file as input rather than a SAM file and changing the -u flag to a -b flag, but I'm still getting the same result.
EDIT: my mistake these are unmapped reads, ignore this comment
Have you done any type of filtering beforehand, such as by mapping quality? If so you'll end up with paired reads where one of the mates has been removed, but the mates flag is still set to being paired. Run the samtools fixmate command on your file and then try your steps again.