Let me start by saying I am very new to this so I may be asking naive questions.
I am aligning short-reads against the mm9 mouse genome using tophat (which uses bowtie, of course). What I would like to get is a count of how many times each known gene is hit by one of the reads. I have seen suggestions that one only need to sort and count the RNAME field of the resulting SAM file to get this info.
cat accepted_hits.sam | cut -f 3 | sort | uniq -c
However because I am aligning to the complete mouse genome, my SAM file has RNAME values like "chr1", "chr10", etc.
So how do people normally get a hit count per gene in experiments like this? It seems like there are two ways I can proceed.
1) align against a known gene table downloaded from ucsc. That way the RNAME field in each alignment will identify the name of the gene it aligned to. The problem is this will not discover novel splice sites (the purpose of using tophat), nor will it allow me to discover novel gene expression.
2) align against the complete genome as usual, but then postprocess to find overlap with between the alignment location and any known genes.
Is that how people do this, and are there any existing tools that can help turn a SAM file (or BAM) into a gene expression count?
Thank you in advance.
I'm missing something here. The SAM/BAM files only describes the alignments (short reads vs ref). If you want to find some novel splice sites, don't you have to 'call' the new variants (with, e.g. samtools ) ?
Being a novice, I'm not sure I understand the question. Tophat outputs a "junctions.bed" file with splice sites that it has discovered without reference to a pre-existing gene map (although you can provide a file of known genes to help it avoid spurious junctions.)
ok, I didn't know Tophat