I have never done this kind of analysis. However, I have been using bwa for a year now. Although I know about samtools, I often use Python or C++ to analyze my SAM files directly; it's just faster to count everything I want in one pass rather than running samtools several times.
If you use bwa, then everything you need should be in the SAM file. There should be a line for every read, both mapped and unmapped. You don't need the FASTQ file to figure out how many reads you have in total or which ones are mapped.
It makes a lot of sense to just use the samtools commands that I suggested, if only to take out the unknown factor of whether your C code is correct.
Note that to count which reads are mapped, you have to look at the second column of the SAM line (the FLAG field); the value there should NOT have the third bit (0x4, "read unmapped") set. In C, this would be something like:
if (!(flag&4)) mapped_count++;
You should ignore all lines that start with '@' (header lines) and count the rest. Then your fraction of mapped reads is
((float)mapped_count) / total_reads
If you are not doing something like this, then your counts might not be right.
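If it helps, here is a minimal, self-contained sketch of that loop. It assumes an uncompressed SAM file (reads.sam is just a placeholder name) with lines shorter than the buffer, and note that it counts every non-header record, so any secondary alignments bwa emits would be counted as reads too:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    FILE *fp = fopen("reads.sam", "r");        /* placeholder file name */
    if (!fp) { perror("reads.sam"); return 1; }

    char line[65536];
    long total_reads = 0, mapped_count = 0;

    while (fgets(line, sizeof line, fp)) {
        if (line[0] == '@') continue;          /* skip header lines */
        total_reads++;
        strtok(line, "\t");                    /* column 1: QNAME */
        char *flag_str = strtok(NULL, "\t");   /* column 2: FLAG  */
        if (!flag_str) continue;
        int flag = atoi(flag_str);
        if (!(flag & 4)) mapped_count++;       /* 0x4 = unmapped bit */
    }
    fclose(fp);

    printf("total: %ld  mapped: %ld  fraction: %.4f\n",
           total_reads, mapped_count,
           total_reads ? (float)mapped_count / total_reads : 0.0f);
    return 0;
}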
This is not a direct answer to your question, but you can do some basic counts using samtools:
number of mapped reads: samtools view -S -F0x4 -c reads.sam
number of unmapped reads: samtools view -S -f0x4 -c reads.sam
number of uniquely mapped reads: samtools view -S -F0x4 -c -q 1 reads.sam
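If you also want the total read count for the denominator, dropping the flag filter counts every record in the file (again with reads.sam as a placeholder; secondary alignments are included in this count):
total number of records: samtools view -S -c reads.sam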
Thanks, George. I will try samtools to count reads. Actually, I counted the FASTQ file and the SAM file with C. The FASTQ line count * 1/4 would be the FASTQ read count, and the SAM line count * 1/2 would be the mapped read count, right?
Hi, George. I used gatk to dedup, indel realign, and do table recalibration. After that, I summarized the unique reads of the BAM file using samtools; it's about 118M. Does that sound good? Also, I found that although I used gatk to deduplicate, the output still has 10% duplicates according to samtools. So what's the problem here? Thanks
Also, typically bwa will output reads that mapped to multiple positions. When you do the mapping, do you use the whole human genome or just the part that you think you have captured? If you use the whole genome, is it possible some of the reads fall outside the 2Mb region?
I used the whole genome to do the mapping. Yes, definitely some reads fall outside the 2Mb region. Often we capture most of the region we want, plus some other regions more or less at random.
A capture efficiency of 40-60% is not unexpected, at least in my experience.
Thanks, Davis! After filtering reads by library size and deduplicating, I got an efficiency of about 22%. Is this acceptable?
Can you post the samtools flagstat output?
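In case it's useful, that's just (with reads.bam standing in for your actual file name):
samtools flagstat reads.bam
which prints counts for total, mapped, duplicate, paired, etc. reads in one pass.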
Thanks Rm! I am trying samtools. I used C to count reads myself~