Number Of Reads In A Sam File Which Are Assigned To Each Chromosome?
6
5
Entering edit mode
13.1 years ago
KCC ★ 4.1k

Given a sam file which was produced by using bwa to align a fastq of reads to a reference genome, how can I enumerate the number of reads which were assigned to each chromosome of the reference genome?

alignment sam • 31k views
ADD COMMENT
13
Entering edit mode
13.1 years ago

Your question refers to SAM format, but in the event that you may also have a sorted/indexed BAM, you can simply do the following to get this very quickly.

samtools idxstats aln.bam | cut -f 1,3
ADD COMMENT
6
Entering edit mode
10.0 years ago

The solution based on samtools idxstats aln.bam should be used with caution. This is because AFAIK the numbers reported by samtools idxstats (& flagstat) represent the number of alignments of reads that are mapped to chromosomes, not the (non-redundant) number of reads, as stated in the documentation. I stumbled across this by observing that the total numbers reported by samtools idxstats (& flagstat) exceeded the total number of reads in my input fastq files. See further discussion on this issue in the comments on this post: http://left.subtree.org/2012/04/13/counting-the-number-of-reads-in-a-bam-file/

ADD COMMENT
2
Entering edit mode

Incidentally, I wrote a program that calculates read coverage from a sam file (number of reads mapped to each sequence, average depth, median depth, standard deviation, fraction of bases covered, etc).

pileup.sh in=mapped.sam out=coverage.txt

It has a toggle for including or ignoring secondary alignments; secondary=false or secondary=true (default).

ADD REPLY
0
Entering edit mode

When generating your SAM file make sure it has the header section before using pileup.sh.

samtools view -h file.bam > file.sam

ADD REPLY
0
Entering edit mode

Example pipe with filtering:

cat file.bam | samtools view -h -F 0x4 -F 0x100 | samtools idxstats -  |  cut -f 1,3
ADD REPLY
0
Entering edit mode

Brian, I'd be interested to read the source code of your program, but your link is broken. Could I ask you to provide a new link?

ADD REPLY
0
Entering edit mode

Sometimes Sourceforge goes down for a few hours for maintenance, or something. I put the latest version (35.14) on my Google drive; please let me know if you still have trouble accessing it.

https://drive.google.com/file/d/0B3llHR93L14wTzZSMjZvdjdmYzg/view?usp=sharing

The source code for pileup.sh is mainly in /bbmap/current/jgi/CoveragePileup.java

ADD REPLY
1
Entering edit mode
13.1 years ago

based on the first column only:

awk ' $1 !~ /@/ {print $1}' toy.sam | uniq -c

if you want to count only aligned reads you can also use awk to count only aligned reads.

ADD COMMENT
1
Entering edit mode

Won't $1 denote the query name? $3 is the chrom

ADD REPLY
0
Entering edit mode

Oops that's right

ADD REPLY
0
Entering edit mode

Oops, you're correct.

ADD REPLY
1
Entering edit mode
9.4 years ago
Kamil ★ 2.3k

Here's how I count the number per sequence of primary alignments with proper read pairs:

ADD COMMENT
1
Entering edit mode
3.6 years ago

Now there is a more convenient way with seqkit bam:

seqkit bam -C mybam.bam
ADD COMMENT
0
Entering edit mode
7.2 years ago

This is the Samtools Solution + bash

Enjoy

PROCESSORS=10
BAM=blah.bam | blah.sam

samtools view --threads $PROCESSORS $BAM | cut -f 3 | sort | uniq -c | sort -nr | sed -e 's/^ *//;s/ /\t/' | awk 'OFS="\t" {print $2,$1}' | sort -n -k1,1 > Chr_freqs.txt
ADD COMMENT

Login before adding your answer.

Traffic: 1952 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6