Counting Short-Read Gene Alignments With Bowtie/Tophat.
2
8
Entering edit mode
14.2 years ago
Chris Warth ▴ 110

Let me start by saying I am very new to this so I may be asking naive questions.

I am aligning short-reads against the mm9 mouse genome using tophat (which uses bowtie, of course). What I would like to get is a count of how many times each known gene is hit by one of the reads. I have seen suggestions that one only need to sort and count the RNAME field of the resulting SAM file to get this info.

  cat accepted_hits.sam | cut -f 3 | sort | uniq -c

However because I am aligning to the complete mouse genome, my SAM file has RNAME values like "chr1", "chr10", etc.

So how do people normally get a hit count per gene in experiments like this? It seems like there are two ways I can proceed.

1) align against a known gene table downloaded from ucsc. That way the RNAME field in each alignment will identify the name of the gene it aligned to. The problem is this will not discover novel splice sites (the purpose of using tophat), nor will it allow me to discover novel gene expression.

2) align against the complete genome as usual, but then postprocess to find overlap with between the alignment location and any known genes.

Is that how people do this, and are there any existing tools that can help turn a SAM file (or BAM) into a gene expression count?

Thank you in advance.

tophat alignment gene • 8.0k views
ADD COMMENT
0
Entering edit mode

I'm missing something here. The SAM/BAM files only describes the alignments (short reads vs ref). If you want to find some novel splice sites, don't you have to 'call' the new variants (with, e.g. samtools ) ?

ADD REPLY
0
Entering edit mode

Being a novice, I'm not sure I understand the question. Tophat outputs a "junctions.bed" file with splice sites that it has discovered without reference to a pre-existing gene map (although you can provide a file of known genes to help it avoid spurious junctions.)

ADD REPLY
0
Entering edit mode

ok, I didn't know Tophat

ADD REPLY
5
Entering edit mode
14.2 years ago

To some extent, you can do either, but I think 2) is better. You've identified two problems with option 1) - missing novel genes and novel splice variants. There is also the potential problem of any redundancy in your reference sequences (e.g. from multiple transcripts per gene) causing mapping scores to be reduced. Now that the technology has improved so that "short" reads are relatively long, there isn't so much benefit to be had from mapping directly to transcriptome sequences, especially now that we have tools like TopHat.

You will most likely find that if you plot mean gene coverage by percentile of gene length for all your genes, that there will be a 3' bias (deeper at the 3' end). Some people only count reads in the 3' region for this reason. You may, or may not find this, depending on the lab methods used.

Counting reads per gene and finding expression levels are two separate problems. For the former, you can use any of the many overlap-finders e.g. BEDTools. For the latter there are packages such as DEGSeq and the (unfortunately) very similarly named DESeq.

Importantly, for any differential gene expression, you should also have an experimental design including biological replicates, but that's another story.

ADD COMMENT
0
Entering edit mode

Thank you very much. I'll look into using BEDTools as an overlap finder.

ADD REPLY
0
Entering edit mode

which packages do you think is better, DEGSeq or DEseq?

ADD REPLY
1
Entering edit mode
14.2 years ago
Ryan Dale 5.0k

Another option for finding overlaps is HTSeq, by the author of DESeq. You can either use the off-the-shelf script htseq-count that comes with it, or you can roll your own to get more detailed output (provided you know some Python).

ADD COMMENT
0
Entering edit mode

Thanks for the pointer. I thought I was going to have to code my own overlap-finder in python so rolling my own script to work with HTSeq shouldn't be a problem.

ADD REPLY

Login before adding your answer.

Traffic: 2463 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6