Question

Samtools fastq returns less unmapped than flagstat

2

Entering edit mode

6.0 years ago

ciemanek ▴ 140

I am trying to include filtering reads in my pipeline. I do this by aligning the reads to a reference sequence with bwa mem (afterwards I run modifying headers with sed and sorting reads by name with samtools sort -n as advised by samtools manual) and outputting the reads that didn't map with samtools fastq. When I use simple test data, there shouldn't me any reads matching reference and that's what I observe with samtools flagstat:

5000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

However while I run samtools fastq I only get 2500 sequences. I checked if it's not the case that the headers are not unique and maybe samtools is collapsing them it doesn't seem so. Data was generated with InSilicoSeq.

The command I ran on this dataset are as follows:

bwa mem -C -t 4 ref.idx input.fq | sed -r "s/^([^@].*)\t([^\t]*)$/\\1\tZI:Z:\\2/" | samtools sort -n | tee >(samtools flagstat --threads 4 - > output.stats) >(samtools fastq -f 4 -T ZI --threads 4 - > output_filtered.fq) >> /dev/null

I wonder if for some reason samtools is missinterpreting flags or am I missing something.

samtools ngs filtering alignment • 2.2k views

ADD COMMENT • link updated 5.9 years ago by Biostar 20 • written 6.0 years ago by ciemanek ▴ 140

0

Entering edit mode

+1 for use of tee and process substitution :)

This 2500 sequences, how did you calculate this? Was this printed from samtools fastq to stderr?

ADD REPLY • link 6.0 years ago by ATpoint 89k

0

Entering edit mode

Thanks :) I just grepped my filtered fastq for the header (@plasmid) and checked number of printed lines with wc -l, which returned 2500 read headers (which makes sense since output file is 10 000 lines, considering a single read being represented by 4 lines in fastq).

ADD REPLY • link 6.0 years ago by ciemanek ▴ 140

0

Entering edit mode

Odd, I did quickly run your command on a dummy file and I get the same number of seqs in fq as in the original bam.

ADD REPLY • link 6.0 years ago by ATpoint 89k

0

Entering edit mode

Did you also run sorting by name? Apparently, when I run the command without -n option for samtools sort, I get all the reads I should get.

ADD REPLY • link 6.0 years ago by ciemanek ▴ 140

0

Entering edit mode

I also tried this approach on paired-end data with appropriate flags (samtools fastq -f 13 -T ZI --threads 4 -1 output_filtered_R1.fq -2 output_filtered_R2.fq) and it returns exactly the number of reads it should.

ADD REPLY • link 6.0 years ago by ciemanek ▴ 140

0

Entering edit mode

What happens if you omit the sed command? Why are you modifying the headers like that anyway? Why not just name the references what you want them to be named? Or if you have to change the headers, extract them with samtools view -H and modify them then add them back to the rest of the .sam?

ADD REPLY • link 6.0 years ago by swbarnes2 15k

0

Entering edit mode

Removing sed from the pipeline doesn't change the outcome - only when I remove -n argument from samtools sort I get the number of reads I'm supposed to according to samtools flagstat

ADD REPLY • link 6.0 years ago by ciemanek ▴ 140

0

Entering edit mode

Did you find out what happened?

ADD REPLY • link 5.9 years ago by ATpoint 89k

0

Entering edit mode

Yes - the problem was the dummy data I was generating, the reads headers specifically - it seems that for R1/R2 reads this information was getting lost in BAM file while sorting by name and as a result, twice fewer reads were returned.

ADD REPLY • link 5.9 years ago by ciemanek ▴ 140