Question

Identifying Exons and number of reads between exons from a BAM file

2

Entering edit mode

8.5 years ago

startup_biostar ▴ 20

I have a big BAM file from which I need to identify the number of reads between Exon-1 and Exon-3 exon-21 and exon-24. What is the best tool to do that?

Firstly how would I identify Exon from a BAM file. When I view the BAM file using SAMTools I get

HISEQ-KLK:1383:BHFJ7USAXX:1:1213:11121:31110 16 chr11 3155263 255 75M * 0 0 CACACGGGAGGAATCGGGTCCATGTCCACCCACAGGAAGGTTAAAAGCCCACTCATCTACGGATGAGAAAATCAT IIIIIIIIIIIIIIIIIIIIIIIIHHHHHIHIIHHIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIDDDDD NH:i:1 HI:i:1 AS:i:71 nM:i:1

I don't see a notation 'EXON' in this. How should I proceed with solving this problem

Learning Bioinformatics - if it is a silly question - apologies.

RNA-Seq ChIP-Seq BAM EXONS • 3.9k views

ADD COMMENT • link updated 8.5 years ago by Alex Reynolds 36k • written 8.5 years ago by startup_biostar ▴ 20

0

Entering edit mode

You won't find the "exon" (or any other annotation) in BAM files since they contain only alignment information. You would use an interval file (BED) or genome annotation (GTF) that would contain information about the features you are interested in (e.g. exon). You would then use a tool such as bedtools/bedops/samtools etc. with your BAM file to extract the read count information you need.

ADD REPLY • link 8.5 years ago by GenoMax 147k

score 2 · Answer 1 · 2016-05-20

One way to do this is to get gene annotations that contain exon information. For example, GENCODE is one such source.

If you get the annotations in GTF format, you can use gtf2bed to convert them to BED format:

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    > genes.bed

If you are new to the command line, this block is made up of three commands wrapped into one command via a pipe character (|). The wget command downloads the compressed annotation file, which is in turn passed to gunzip, which extracts the GTF file. That GTF file is passed to gtf2bed to convert to BED format.

Next, you can use bam2bed to convert your BAM file to a BED file:

$ bam2bed < reads.bam > reads.bed

The file genes.bed has lots of genes, each with lots of exons.

If your goal is to (as a basic learning exercise) to count all reads around all exons regardless of the gene, you can construct the (disjoint) genomic space of each exon by using grep and bedops.

For example, for gene annotations labeled exon_number 2:

$ grep -w 'exon_number 2' genes.bed | bedops --merge - > exon_number_2.bed

This filters the file genes.bed for annotations with the desired label. The bedops command merges the genomic regions for those exons, and the result is written to a file called exon_number_2.bed.

You can repeat this for other exons. Or use other grep commands to filter for a specific gene before getting specific exons, etc.

Once you have the genomic space of each exon, you can use bedmap to count the number of reads that overlap each of those exons.

To go back to the second-exon example file:

$ bedmap --echo --count --delim '\t' exon_number_2.bed reads.bed > exon_number_2_with_read_counts.bed

In the fourth column of exon_number_2_with_read_counts.bed, you have the number of reads that overlap each of the second-exons.

Edit: On re-reading your question, you want the space in between exons, not over them. In that case, you must pre-process the BED file to get a gene from all exons, and then generate the intronic space for each gene from those exons.