A bit of a confusing one here. First I mapped my fastq reads using BWA, converted my sam files to bam, sorted and indexed them. Next, I obtained some stats and I wanted to use two different programs so I ran the following:
bamtools stats -in sample_sorted.bam -insert > sample_sorted_bamtools.stats.txt
samtools flagstat -t 16 "sample_sorted.bam" > "sample_sorted_samtools.stats.txt" # quotes needed for this, don't know why but wouldn't work otherwise
However, the stat outputs for mapped reads from each are different for some reason. Same with singletons. Below is an example for sample1_sorted.bam
BAMTOOLS
Total reads: 29075080
Mapped reads: 28157243 (96.8432%)
Forward strand: 14993199 (51.5672%)
Reverse strand: 14081881 (48.4328%)
Failed QC: 0 (0%)
Duplicates: 0 (0%)
Paired-end reads: 29075080 (100%)
'Proper-pairs': 26900935 (92.5223%)
Both pairs mapped: 28148319 (96.8125%)
Read 1: 14538803
Read 2: 14536277
Singletons: 8924 (0.030693%)
Average insert size (absolute value): 878.48
Median insert size (absolute value): 230
SAMTOOLS
29075080 + 0 in total (QC-passed reads + QC-failed reads)
155590 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
28157243 + 0 mapped (96.84%:N/A)
28919490 + 0 paired in sequencing
14459745 + 0 read1
14459745 + 0 read2
26852874 + 0 properly paired (92.85%:N/A)
27993140 + 0 with itself and mate mapped
8513 + 0 singletons (0.03%:N/A)
1026542 + 0 with mate mapped to a different chr
308644 + 0 with mate mapped to a different chr (mapQ>=5)
Then I used sambamba to separate mapped and unmapped reads from my fastq files with the following commands:
# For all R1 and R2 reads do the following:
# Get all unmapped reads
sambamba view -f bam -F 'unmapped and mate_is_unmapped' -t 16 sample_sorted.bam
sambamba sort -t 16 -o sample_unmapped_sorted.bam
# Get all mapped reads
sambamba view -f bam -F 'not (unmapped or mate_is_unmapped) and paired' -t 16 sample_sorted.bam
sambamba sort -t 16 -o sample_mapped_sorted.bam
I then used samtools view -c to see how many reads are in each of the sorted bamfiles to check that they equal the initial amount of reads (which should be 29075080 for this sample).
samtools view -c sample_mapped_sorted.bam
# output = 28148319
samtools view -c sample_unmapped_sorted.bam
# output = 909324
Total of mapped and unmapped reads in bamfiles = 29,057,643 which is 17,437 reads short. Furthermore, if I add the singletons from the bamtools stat file (8924 reads) and the samtools stat file (8513 reads), they equal 17,437. Why is it just not showing as 17,437 in both stat outputs? Why are there discrepancies in the two stat files, why would singletons from both stat files equal the amount of missing reads from my mapped and unmapped files, and why am I missing reads? Is this an issue of sambamba? Did I use an incorrect flag for sambamba?
Thank you for your help.
If I want to extract all mapped reads, can I just change it to 'paired and proper_pair' instead? If I leave it out, I will get all reads including unmapped but I need those as separate files. Or did you mean simply just remove the "and paired" part of the line?
What I meant was that the complement of 'unmapped and mate_is_unmapped' is not 'not (unmapped or mate_is_unmapped) and paired', therefore the sum of both read sets won't be equal to the total number of reads. If you just remove the 'and paired' part of the line you'd still be considering reads from the 'unmapped and mate_is_unmapped' set, as the complement of 'unmapped and mate_is_unmapped' is simply 'not (unmapped and mate_is_unmapped)'.
If you want mapped reads on one side and unmapped reads on other you can do it in a much simpler way:
Thanks I did try that but samtools is slow, hence why I was opting to use sambamba. Thank you though for the explanation!
If speed is an issue you can use threads (
-@
option) in order to improve samtools' performance:I tried it before with samtools -@, still wayyyy too slow for some reason.