Trying To Calculate Percent Of Mapped And Unmapped Vs.Total Reads
3
0
Entering edit mode
12.0 years ago

I have a two fastq files per sample (one for the forward stand and the other for the backward), an accepted_hits.bam, and an unmapped.bam. I am wanting to calculate the percent of mapped and unmapped vs. the total number of reads, but I am perplexed by the numbers I am currently getting.

That is, when I calculate the total number of reads like so:

FORWARD = wc -l sample_1.fq | cut -d ' ' -f1 

REVERSE = wc -l sample_2.fq | cut -d ' ' -f1 

TOT_READS = ((FORWARD + REVERSE ) / 4) * 2

And the total reads in accepted_hits.bam and unmapped.bam like so:

ACCEPTED = samtools view accepted_hits.bam | wc -l | cut -d ' ' -f1

UNMAPPED = samtools view unmapped_hits.bam | wc -l | cut -d ' ' -f1

I find that ACCEPTED + UNMAPPED is not equal to TOT_READS.

Can someone explain this to me?

UPDATE: Hi. Sorry for the delay. So depending on the files used I get different outcomes. On some I get that ACCEPTED + UNMAPPED = TOT_READS, which is not what I was expecting (See below); and with others the ACCEPTED is lower than TOT_READS, but not by half (ACCEPTED is within 80-90% of TOT_READS.

If I run: samtools view acceptedhits.bam | cut -f1 | sort | uniq | wc -l

And then run: cat sample.fq | wc -l result_from_ previous_command / 4

The two values are exactly the same.

Thanks

tophat2 • 8.0k views
ADD COMMENT
1
Entering edit mode

Can you provide the numbers you get for FORWARD, REVERSE, TOT_READS, ACCEPTED, and UNMAPPED?

If you're off by twofold it could be because the mapped reads are counted as pairs (bam) while you've counted each read in each pair individually (fasta).

ADD REPLY
0
Entering edit mode

Never off by two fold.

ADD REPLY
5
Entering edit mode
12.0 years ago

(Edit: sorry, I've never seen an output from tophat.)

what do you want to achieve with the following line ?

  wc -l sample_1.fq | cut -d ' ' -f1

there is no 'field' in a fastq file and each record usually takes 4 lines. So you'd rather use:

  wc -l sample_1.fq | sed 's%$%/4%' | bc

to get the number of mapped reads, you should use the option F in samtools

samtools view -F 4 your.bam | wc -l

but you could just use

samtools flagstat <in.bam>
ADD COMMENT
0
Entering edit mode

I think erichpeterson is using reads mapped with bowtie2/tophat. Tophat splits the mapped and unmapped reads into separate files. The flagstat output would be super helpful, however.

ADD REPLY
0
Entering edit mode

OK, I've never seen an output from tophat.

ADD REPLY
2
Entering edit mode
12.0 years ago

It's probably because of multi-mapping reads. A read that maps to several locations will be reported several times in the BAM file (and incidentally samtools flagstat). You need to count the unique read identifiers in the BAM file, or use another tool such as Picard CollectAlignmentMetrics to get the number of reads mapped 1+ times.

ADD COMMENT
0
Entering edit mode
12.0 years ago
JC 13k

As Mikael pointed, you have a problem with multimapped reads. Another solution is using samtool flagstat in your BAMs.

ADD COMMENT

Login before adding your answer.

Traffic: 2443 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6