Question

Counting Short-Read Gene Alignments With Bowtie/Tophat.

8

Entering edit mode

14.2 years ago

Chris Warth ▴ 110

Let me start by saying I am very new to this so I may be asking naive questions.

I am aligning short-reads against the mm9 mouse genome using tophat (which uses bowtie, of course). What I would like to get is a count of how many times each known gene is hit by one of the reads. I have seen suggestions that one only need to sort and count the RNAME field of the resulting SAM file to get this info.

  cat accepted_hits.sam | cut -f 3 | sort | uniq -c

However because I am aligning to the complete mouse genome, my SAM file has RNAME values like "chr1", "chr10", etc.

So how do people normally get a hit count per gene in experiments like this? It seems like there are two ways I can proceed.

1) align against a known gene table downloaded from ucsc. That way the RNAME field in each alignment will identify the name of the gene it aligned to. The problem is this will not discover novel splice sites (the purpose of using tophat), nor will it allow me to discover novel gene expression.

2) align against the complete genome as usual, but then postprocess to find overlap with between the alignment location and any known genes.

Is that how people do this, and are there any existing tools that can help turn a SAM file (or BAM) into a gene expression count?

Thank you in advance.

tophat alignment gene • 8.0k views

ADD COMMENT • link updated 14.2 years ago by Ryan Dale 5.0k • written 14.2 years ago by Chris Warth ▴ 110

0

Entering edit mode

I'm missing something here. The SAM/BAM files only describes the alignments (short reads vs ref). If you want to find some novel splice sites, don't you have to 'call' the new variants (with, e.g. samtools ) ?

ADD REPLY • link 14.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Being a novice, I'm not sure I understand the question. Tophat outputs a "junctions.bed" file with splice sites that it has discovered without reference to a pre-existing gene map (although you can provide a file of known genes to help it avoid spurious junctions.)

ADD REPLY • link 14.2 years ago by Chris Warth ▴ 110

0

Entering edit mode

ok, I didn't know Tophat

ADD REPLY • link 14.2 years ago by Pierre Lindenbaum 164k

score 5 · Answer 1 · 2010-10-01

To some extent, you can do either, but I think 2) is better. You've identified two problems with option 1) - missing novel genes and novel splice variants. There is also the potential problem of any redundancy in your reference sequences (e.g. from multiple transcripts per gene) causing mapping scores to be reduced. Now that the technology has improved so that "short" reads are relatively long, there isn't so much benefit to be had from mapping directly to transcriptome sequences, especially now that we have tools like TopHat.

You will most likely find that if you plot mean gene coverage by percentile of gene length for all your genes, that there will be a 3' bias (deeper at the 3' end). Some people only count reads in the 3' region for this reason. You may, or may not find this, depending on the lab methods used.

Counting reads per gene and finding expression levels are two separate problems. For the former, you can use any of the many overlap-finders e.g. BEDTools. For the latter there are packages such as DEGSeq and the (unfortunately) very similarly named DESeq.

Importantly, for any differential gene expression, you should also have an experimental design including biological replicates, but that's another story.

score 1 · Answer 2 · 2010-10-02

1

Entering edit mode

14.2 years ago

Ryan Dale 5.0k

Another option for finding overlaps is HTSeq, by the author of DESeq. You can either use the off-the-shelf script htseq-count that comes with it, or you can roll your own to get more detailed output (provided you know some Python).

ADD COMMENT • link 14.2 years ago by Ryan Dale 5.0k

0

Entering edit mode

Thanks for the pointer. I thought I was going to have to code my own overlap-finder in python so rolling my own script to work with HTSeq shouldn't be a problem.

ADD REPLY • link 14.2 years ago by Chris Warth ▴ 110