Hi, I have a question between the 2 different ways of counting unique mapped reads. I used paired-end sequencing and mapped the reads to hg19. My library size is ~300bp and am using 2*151bp paired end sequencing. - my total reads from the below calculation is 1473870
samtools view -c myfile.sam
my total mapped read is 723134. I understand that this includes multi-mapped reads
samtools view -S -F0x4 -c myfile.sam
when I try to calculate the number of unique reads using the command below, I get 338787.
samtools view -F 0x4 myfile.sam | cut -f 1 | sort | uniq | wc -l
I also compared the another way of calculating uniquely mapped reads. Here, I get 392467.
samtools view -S -q 1 -c myfile.sam
I understand that the flag 1 is for paired end reads but not quite sure which number (338787 vs 392467) is the right uniquely mapped reads, which I define it as the number of reads that are mapped the most accurately and are mapped only once to the reference genome.
Thanks a lot for the help!
I realize this is an old thread, but in case anyone else lands here like I did:
1) It looks like the -S option is not needed (anymore?), samtools auto-detects the input format.
2) -q 1 is not a flag for paired end reads, that would be -f 1. -q 1 returns any matches with a MAPQ score >= 1. This means different things for different alignment programs, so use with caution...
I get the impression that -c does not take unique sequences into account, from the command line help:
This sounds to me like it returns a count of all records, even if there are duplicates.