Question

Counting mapped reads from a SAM file

6

Entering edit mode

10.2 years ago

kobipe3 ▴ 90

I am using BWA to align RNA-seq data from a miRNA experiment.

I want to compare myself to another article - so I am using the same techniques its authors used.

That is using fastq-mcf to clip the adapters, and BWA (Burrows Wheeler Aligner) to align the reads.

The reads are not aligned to the entire genome. Instead they are aligned to the mature miRNA forms, downloaded as fasta files from miRBase (example below).

After doing the alignment I get a SAM file (example below).

I want to get the count of each miRNA from that SAM file.

The way I thought of doing the count, is to count the number of rows that map to each miRNA in the SAM file. I would only take the rows in which FLAG (second column) & 4 == 0. That is - the read is mapped. The miRNA identity will be computed from the RNAME (third column).

Is this the right way to do the counting?

To begin with, I thought of using htseq-count, but the SAM file doesn't really contain the coordinates of the miRNAs in the reference sequence, so htseq cannot work.

Samples lines from the fasta file of the mature miRNA forms:

>mmu-let-7g-5p MIMAT0000121 Mus musculus let-7g-5p
TGAGGTAGTAGTTTGTACAGTT
ACTGTACAGGCCACTGCCTTGC
>mmu-let-7g-3p MIMAT0004519 Mus musculus let-7g-3p

Sample lines from the SAM assembly file:

HISEQ:47:C6FUWANXX:1:1101:2080:1992     4       *       0       0       *       *       0       0       TTGGCAATGGTAGAACTCACACCGAAA     <BBBFFFFFFFFFFFFFFFFFFFFFFF
HISEQ:47:C6FUWANXX:1:1101:8020:1999     0       mmu-miR-182-5p  2       37      24M     *       0       0       TTGGCAATGGTAGAACTCACACCG        <BBBFFBFF<FBFFF/</F<F<FF        XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:
24
HISEQ:47:C6FUWANXX:1:1101:8660:1996     4       *       0       0       *       *       0       0       CGGATCCGTCTGAGCTTGGCTTT <BBBFBFFFFFFFFF<FBBBFFB
HISEQ:47:C6FUWANXX:1:1101:16161:2000    4       *       0       0       *       *       0       0       TTTCCGTAGTGTAGTGGTTATCACGTTCGCCTC       <BBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
HISEQ:47:C6FUWANXX:1:1101:19163:2000    0       mmu-miR-182-5p  2       37      23M     *       0       0       TTGGCAATGGTAGAACTCACACC <BBBFFFFFFFFFFFFFFFFFFF XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:23
HISEQ:47:C6FUWANXX:1:1101:20578:2000    0       mmu-miR-182-5p  2       37      22M     *       0       0       TTGGCAATGGTAGAACTCACAC  <BBBFFFFFFFFFFFFFFFFFF  XT:A:U  NM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:22
HISEQ:47:C6FUWANXX:1:1101:1200:2043     4       *       0       0       *       *       0       0       CACAAATGATGAACCTTTTGACGGGCGGACAGAAACTCTGTGCTGAATGTC     BBBBBFFFFFFFFFFFFFFB<<FFFFFFFFFFFFFFFFFFFFFFFFFFFFF
HISEQ:47:C6FUWANXX:1:1101:1212:2073     4       mmu-miR-92a-3p  1       25      22M     *       0       0       TATTGCACTTGTCCCGGCCTGT  BBBBBFFFFFFFFFFFFFFFFF  XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:21C0
HISEQ:47:C6FUWANXX:1:1101:1139:2100     4       mmu-miR-92a-3p  1       25      22M     *       0       0       TATTGCACTTGTCCCGGCCTGT  BBBBBFFFFFFFFFFFF/FFFF  XT:A:U  NM:i:1  X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:21C0

RNA-Seq • 22k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by kobipe3 ▴ 90

3

Entering edit mode

10.2 years ago

Brian Bushnell 20k

You can also do this with pileup from the BBMap package, which accepts unsorted sam files:

pileup.sh in=mapped.sam out=coverage.txt rpkm=rpkm.txt

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 10.2 years ago by Brian Bushnell 20k

2

Entering edit mode

10.2 years ago

michael.ante ★ 4.0k

Since you mapped to the miRNA-nome, a simple samtools idxstats call should be enough:

samtools index foo.bam
samtools idxstats foo.bam

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by michael.ante ★ 4.0k

Ram · Accepted Answer · 2015-06-03

6

Entering edit mode

10.2 years ago

Devon Ryan 105k

You probably want to ignore unmapped and secondary alignments:

samtools view -F 260 foo.bam | cut -f 3 | sort | uniq -c | awk '{printf("%s\t%s\n", $2, $1)}' > counts.txt

or something along those lines.

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.2 years ago by Devon Ryan 105k

1

Entering edit mode

Devon keeps beating me to it today. Using datamash and assuming it's a sorted bam (else add '-s' to the datamash command) ...

samtools view -F260 <yourbam> | cut -f3 | datamash -g 1 count 1 > counts

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 10.2 years ago by george.ry ★ 1.2k

1

Entering edit mode

I'm an hour ahead, so I have a bit of an advantage :P

ADD REPLY • link 10.2 years ago by Devon Ryan 105k

0

Entering edit mode

Hi Devon, I have small RNAs reads from 18 - 30 nt. How can I count number of mapped reads according to their length? Thanks!

ADD REPLY • link 8.7 years ago by SpreeFU • 0

0

Entering edit mode

Please post things like this as new questions in the future. However, one possible method would be:

samtools view -F 260 foo.bam | awk '{print length($10)}' | sort | uniq -c

I know you can do arrays in awk and that that'd be faster than sorting, but for a one off sort of thing this is quick enough.

ADD REPLY • link 8.7 years ago by Devon Ryan 105k

0

Entering edit mode

Thanks a lot Devon! I'll post another relative question in a new one.

ADD REPLY • link 8.7 years ago by SpreeFU • 0

0

Entering edit mode

Shouldn't samtools flagstat give the same result? I just run this and see that it gives exact counts in the third row from 'flagstat' output.e.g. "919853 + 0 mapped (96.88%:-nan%)"

ADD REPLY • link 8.0 years ago by bioinfo ▴ 840

0

Entering edit mode

When I run the above command, I get an error:

[E::sam_parse1] SEQ and QUAL are of different length 
[W::sam_read1] Parse error at line 3054 
[main_samview] truncated file.

ADD REPLY • link 6.8 years ago by glady ▴ 320