I am trying to include filtering reads in my pipeline. I do this by aligning the reads to a reference sequence with bwa mem
(afterwards I run modifying headers with sed
and sorting reads by name with samtools sort -n
as advised by samtools
manual) and outputting the reads that didn't map with samtools fastq
. When I use simple test data, there shouldn't me any reads matching reference and that's what I observe with samtools flagstat
:
5000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
However while I run samtools fastq
I only get 2500 sequences. I checked if it's not the case that the headers are not unique and maybe samtools
is collapsing them it doesn't seem so. Data was generated with InSilicoSeq
.
The command I ran on this dataset are as follows:
bwa mem -C -t 4 ref.idx input.fq | sed -r "s/^([^@].*)\t([^\t]*)$/\\1\tZI:Z:\\2/" | samtools sort -n | tee >(samtools flagstat --threads 4 - > output.stats) >(samtools fastq -f 4 -T ZI --threads 4 - > output_filtered.fq) >> /dev/null
I wonder if for some reason samtools
is missinterpreting flags or am I missing something.
+1 for use of
tee
and process substitution :)This 2500 sequences, how did you calculate this? Was this printed from samtools fastq to stderr?
Thanks :) I just grepped my filtered fastq for the header (@plasmid) and checked number of printed lines with
wc -l
, which returned 2500 read headers (which makes sense since output file is 10 000 lines, considering a single read being represented by 4 lines in fastq).Odd, I did quickly run your command on a dummy file and I get the same number of seqs in fq as in the original bam.
Did you also run sorting by name? Apparently, when I run the command without
-n
option forsamtools sort
, I get all the reads I should get.I also tried this approach on paired-end data with appropriate flags (
samtools fastq -f 13 -T ZI --threads 4 -1 output_filtered_R1.fq -2 output_filtered_R2.fq
) and it returns exactly the number of reads it should.What happens if you omit the sed command? Why are you modifying the headers like that anyway? Why not just name the references what you want them to be named? Or if you have to change the headers, extract them with samtools view -H and modify them then add them back to the rest of the .sam?
Removing sed from the pipeline doesn't change the outcome - only when I remove
-n
argument fromsamtools sort
I get the number of reads I'm supposed to according tosamtools flagstat
Did you find out what happened?
Yes - the problem was the dummy data I was generating, the reads headers specifically - it seems that for R1/R2 reads this information was getting lost in BAM file while sorting by name and as a result, twice fewer reads were returned.