I have a two fastq files per sample (one for the forward stand and the other for the backward), an accepted_hits.bam, and an unmapped.bam. I am wanting to calculate the percent of mapped and unmapped vs. the total number of reads, but I am perplexed by the numbers I am currently getting.
That is, when I calculate the total number of reads like so:
FORWARD = wc -l sample_1.fq | cut -d ' ' -f1
REVERSE = wc -l sample_2.fq | cut -d ' ' -f1
TOT_READS = ((FORWARD + REVERSE ) / 4) * 2
And the total reads in accepted_hits.bam and unmapped.bam like so:
ACCEPTED = samtools view accepted_hits.bam | wc -l | cut -d ' ' -f1
UNMAPPED = samtools view unmapped_hits.bam | wc -l | cut -d ' ' -f1
I find that ACCEPTED + UNMAPPED is not equal to TOT_READS.
Can someone explain this to me?
UPDATE: Hi. Sorry for the delay. So depending on the files used I get different outcomes. On some I get that ACCEPTED + UNMAPPED = TOT_READS, which is not what I was expecting (See below); and with others the ACCEPTED is lower than TOT_READS, but not by half (ACCEPTED is within 80-90% of TOT_READS.
If I run: samtools view acceptedhits.bam | cut -f1 | sort | uniq | wc -l
And then run: cat sample.fq | wc -l result_from_ previous_command / 4
The two values are exactly the same.
Thanks
Can you provide the numbers you get for FORWARD, REVERSE, TOT_READS, ACCEPTED, and UNMAPPED?
If you're off by twofold it could be because the mapped reads are counted as pairs (bam) while you've counted each read in each pair individually (fasta).
Never off by two fold.