Number Of Reads In A Sam File Which Are Assigned To Each Chromosome?

5

Entering edit mode

13.4 years ago

KCC ★ 4.1k

Given a sam file which was produced by using bwa to align a fastq of reads to a reference genome, how can I enumerate the number of reads which were assigned to each chromosome of the reference genome?

alignment sam • 32k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 13.4 years ago by KCC ★ 4.1k

13

Entering edit mode

13.4 years ago

Aaronquinlan 12k

Your question refers to SAM format, but in the event that you may also have a sorted/indexed BAM, you can simply do the following to get this very quickly.

samtools idxstats aln.bam | cut -f 1,3

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 13.4 years ago by Aaronquinlan 12k

6

Entering edit mode

10.3 years ago

Casey Bergman 18k

The solution based on samtools idxstats aln.bam should be used with caution. This is because AFAIK the numbers reported by samtools idxstats (& flagstat) represent the number of alignments of reads that are mapped to chromosomes, not the (non-redundant) number of reads, as stated in the documentation. I stumbled across this by observing that the total numbers reported by samtools idxstats (& flagstat) exceeded the total number of reads in my input fastq files. See further discussion on this issue in the comments on this post: http://left.subtree.org/2012/04/13/counting-the-number-of-reads-in-a-bam-file/

ADD COMMENT • link 10.3 years ago by Casey Bergman 18k

2

Entering edit mode

Incidentally, I wrote a program that calculates read coverage from a sam file (number of reads mapped to each sequence, average depth, median depth, standard deviation, fraction of bases covered, etc).

pileup.sh in=mapped.sam out=coverage.txt

It has a toggle for including or ignoring secondary alignments; secondary=false or secondary=true (default).

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 10.3 years ago by Brian Bushnell 20k

0

Entering edit mode

When generating your SAM file make sure it has the header section before using pileup.sh.

samtools view -h file.bam > file.sam

ADD REPLY • link 10.2 years ago by victorcmason • 0

0

Entering edit mode

Example pipe with filtering:

cat file.bam | samtools view -h -F 0x4 -F 0x100 | samtools idxstats -  |  cut -f 1,3

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 5.7 years ago by shouldsee.gem • 0

0

Entering edit mode

Brian, I'd be interested to read the source code of your program, but your link is broken. Could I ask you to provide a new link?

ADD REPLY • link 9.7 years ago by Kamil ★ 2.3k

0

Entering edit mode

Sometimes Sourceforge goes down for a few hours for maintenance, or something. I put the latest version (35.14) on my Google drive; please let me know if you still have trouble accessing it.

https://drive.google.com/file/d/0B3llHR93L14wTzZSMjZvdjdmYzg/view?usp=sharing

The source code for pileup.sh is mainly in /bbmap/current/jgi/CoveragePileup.java

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 9.7 years ago by Brian Bushnell 20k

1

Entering edit mode

13.4 years ago

Zev.Kronenberg 12k

based on the first column only:

awk ' $1 !~ /@/ {print $1}' toy.sam | uniq -c

if you want to count only aligned reads you can also use awk to count only aligned reads.

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 13.4 years ago by Zev.Kronenberg 12k

1

Entering edit mode

Won't $1 denote the query name? $3 is the chrom

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 13.4 years ago by Aaronquinlan 12k

0

Entering edit mode

Oops that's right

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 13.4 years ago by Zev.Kronenberg 12k

0

Entering edit mode

Oops, you're correct.

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 13.4 years ago by Zev.Kronenberg 12k

1

Entering edit mode

9.7 years ago

Kamil ★ 2.3k

Here's how I count the number per sequence of primary alignments with proper read pairs:

	#!/usr/bin/env bash
	# reads_per_seq
	#
	# Run 8 jobs in parallel on many BAM files:
	#
	# parallel -j8 reads_per_seq ::: *.bam

	set -eou pipefail

	if [[ $# -ne 1 \|\| "$1" != *.bam ]]; then
	echo "usage: $(basename $0) file.bam"
	echo
	echo " Produce a tab-delimited file.seq with number per sequence of"
	echo " primary alignments with proper read pairs."
	echo
	exit 1
	fi

	bam="$1"

	# Read more about samtools options: http://www.htslib.org/doc/samtools.html
	#
	# Learn more about samtools flags: https://www.samformat.info/sam-format-flag
	#
	samtools view -f 2 -F 104 "$bam" \
	\| cut -f3 \| sort \| uniq -c \| sort -k1nr \
	\| awk 'OFS="\t" {print $1,$2}' \
	> "${bam%.bam}.seq"

view raw reads_per_seq.sh hosted with ❤ by GitHub

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 9.7 years ago by Kamil ★ 2.3k

1

Entering edit mode

3.9 years ago

olle.nordesjo ▴ 20

Now there is a more convenient way with seqkit bam:

seqkit bam -C mybam.bam

ADD COMMENT • link 3.9 years ago by olle.nordesjo ▴ 20

0

Entering edit mode

7.5 years ago

Simply Bioinformatics ▴ 200

This is the Samtools Solution + bash

Enjoy

PROCESSORS=10
BAM=blah.bam | blah.sam

samtools view --threads $PROCESSORS $BAM | cut -f 3 | sort | uniq -c | sort -nr | sed -e 's/^ *//;s/ /\t/' | awk 'OFS="\t" {print $2,$1}' | sort -n -k1,1 > Chr_freqs.txt

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 7.5 years ago by Simply Bioinformatics ▴ 200

Login before adding your answer.