One way to do this is to get gene annotations that contain exon information. For example, GENCODE is one such source.
If you get the annotations in GTF format, you can use gtf2bed to convert them to BED format:
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz \
| gunzip -c - \
| gtf2bed - \
> genes.bed
If you are new to the command line, this block is made up of three commands wrapped into one command via a pipe character (|
). The wget
command downloads the compressed annotation file, which is in turn passed to gunzip
, which extracts the GTF file. That GTF file is passed to gtf2bed
to convert to BED format.
Next, you can use bam2bed to convert your BAM file to a BED file:
$ bam2bed < reads.bam > reads.bed
The file genes.bed
has lots of genes, each with lots of exons.
If your goal is to (as a basic learning exercise) to count all reads around all exons regardless of the gene, you can construct the (disjoint) genomic space of each exon by using grep
and bedops.
For example, for gene annotations labeled exon_number 2
:
$ grep -w 'exon_number 2' genes.bed | bedops --merge - > exon_number_2.bed
This filters the file genes.bed
for annotations with the desired label. The bedops
command merges the genomic regions for those exons, and the result is written to a file called exon_number_2.bed
.
You can repeat this for other exons. Or use other grep
commands to filter for a specific gene before getting specific exons, etc.
Once you have the genomic space of each exon, you can use bedmap to count the number of reads that overlap each of those exons.
To go back to the second-exon example file:
$ bedmap --echo --count --delim '\t' exon_number_2.bed reads.bed > exon_number_2_with_read_counts.bed
In the fourth column of exon_number_2_with_read_counts.bed
, you have the number of reads that overlap each of the second-exons.
Edit: On re-reading your question, you want the space in between exons, not over them. In that case, you must pre-process the BED file to get a gene from all exons, and then generate the intronic space for each gene from those exons.
You won't find the "exon" (or any other annotation) in BAM files since they contain only alignment information. You would use an interval file (BED) or genome annotation (GTF) that would contain information about the features you are interested in (e.g. exon). You would then use a tool such as bedtools/bedops/samtools etc. with your BAM file to extract the read count information you need.