Question

Basic Bam File Annotation

21

Entering edit mode

13.4 years ago

Jeremy Leipzig 22k

What are the options for obtaining basic annotation information from BAM alignments of NGS data. The stuff everyone wants to know:

% intergenic
% genic
% intronic
% exonic
% within 5kbp 5' UTR
% within repeats
% within CpG islands

Please provide the basic command line syntax with your answer.

bedtools bam htseq annotation rseqc • 13k views

ADD COMMENT • link 3.5 years ago by Jeremy Leipzig 22k

1

Entering edit mode

I would add to the list: coding, 5' UTR, 3' UTR, synonymous & non-synonymous

ADD REPLY • link 13.4 years ago by Casey Bergman 18k

0

Entering edit mode

i will also accept an expansion of the list of stuff everyone wants to know

ADD REPLY • link 13.4 years ago by Jeremy Leipzig 22k

0

Entering edit mode

Uh, is this from RNASeq data?

ADD REPLY • link 13.4 years ago by Ketil 4.1k

0

Entering edit mode

anything, DNA-Seq, RNA-Seq, ChIP-Seq, exome, whatever

ADD REPLY • link 13.4 years ago by Jeremy Leipzig 22k

0

Entering edit mode

https://github.com/databio/GenomicDistributions might be worth adding here

ADD REPLY • link 4.3 years ago by Jeremy Leipzig 22k

score 17 · Answer 1 · 2011-06-23

17

Entering edit mode

13.4 years ago

Aaronquinlan 12k

I have been tinkering with a new tool for BEDTools (it's currently alpha, so it is in the development repo) that attempts to do this. I pushed it to the public repository, if you'd like to play with it. The tool is called "tagBam" and works as follows. You supply a BAM file, and a list of annotation files and associated labels. For each alignment, it will search for overlaps with each annotation file. Each time there is an overlap, the label you supply will be appended to a custom "YB" tag that will be added to the BAM alignment. For example:

$ tagBam -i aln.bam -files exons.bed introns.bed cpg.bed utrs.bed \
                    -tags exonic intonic cpg utr \
                    > aln.tagged.bam

For alignments that have overlaps, you should see new BAM tags like "YB:Z:exonic", "YB:Z:cpg;utr"

I should emphasize that this is experimental, but I am hoping to make it available in the next release. As always, comments and suggestions are welcome.

Another option is to write a custom script using existing interfaces such as pysam, the BioPerl BAM interface, Picard, samtools C-API, bamtools C++-API, etc. These solutions are nice because you will inevitably run up against a nuanced rule for the annotations that can't be addressed in a one-stop-shop. In particular, if you are a Python person, the pybedtools suite or the HTSeq suite are good options.

ADD COMMENT • link 13.4 years ago by Aaronquinlan 12k

0

Entering edit mode

Awesome! I have been having a lot of trouble not as much with computing these annotations but storing them in a coherent non-adhoc format that I can recall a few weeks later ;-) Putting it back right back into the BAM file is a phenomenal idea - I need to adopt this for other things as well.

ADD REPLY • link 13.4 years ago by Istvan Albert 101k

0

Entering edit mode

Right. I plan to add an option such that one can define there own tag in addition to the default "YB" tag.

ADD REPLY • link 13.4 years ago by Aaronquinlan 12k

0

Entering edit mode

Wish I could edit my comment..."their", of course.

ADD REPLY • link 13.4 years ago by Aaronquinlan 12k

0

Entering edit mode

I agree, very cool. I guess you could just tabulate the YB tag as it streams to STDOUT to avoid an extra BAM around. One thing: what if some annos are stranded (genes) and some are not (cpg's)?

ADD REPLY • link 13.4 years ago by brentp 24k

0

Entering edit mode

Nice catch Brent. Currently, there is a "-s" option to enforce that overlaps are on the same strand. However, this rule applies to all annotation files. I'll give some thought to how to best handle that case.

ADD REPLY • link 13.4 years ago by Aaronquinlan 12k

0

Entering edit mode

Added ability to override the default "YB" tag with the -tag option. -tags is now -labels

ADD REPLY • link 13.4 years ago by Aaronquinlan 12k

score 6 · Answer 2 · 2011-06-24

As Aaron mentioned, HTseq can be really useful (and importantly, flexible) for this sort of thing. As an example, this classify_reads.py script will count up

total reads in each of the combinatorial overlapping classes (e.g., exon only, intron only, and intron+exon which you can get from overlapping genes or multiple isoforms)
the total reads falling in each class (regardless of what other classes overlap, so "intron_all" and "exon_all")
the total reads
the total unannotated reads
total sequence space in each class

You can use the --exclude or --include options to specify what features in the GFF/GTF to look for, e.g.,

classify_reads.py --bam $BAM --gff $GFF \
                  --include intron exon repeat_region > report.txt

score 4 · Answer 3 · 2011-06-24

Picard's RnaSeqMetrics:

wget <ftp://genome-ftp.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz> .
gunzip refFlat.txt.gz
java -jar picard-tools-1.48/CollectRnaSeqMetrics.jar INPUT=myBamfile.bam REF_FLAT=refFlat.txt OUTPUT=myBamfile.bam.rnaSeqMetric STRAND=NONE VALIDATION_STRINGENCY=LENIENT

output:

## net.sf.picard.metrics.StringHeader
# net.sf.picard.analysis.CollectRnaSeqMetrics REF_FLAT=refFlat.txt STRAND_SPECIFICITY=NONE INPUT=myBamfile.bam OUTPUT=myBamfile.bam.rnaSeqMetric VALIDATION_STRINGENCY=LENIENT    ASSUME_SORTED=true STOP_AFTER=0 TMP_DIR=/tmp/leipzig VERBOSITY=IN
O QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## net.sf.picard.metrics.StringHeader
# Started on: Mon Jun 20 15:18:21 EDT 2011

## METRICS CLASS    net.sf.picard.analysis.RnaSeqMetrics
PF_ALIGNED_BASES    RIBOSOMAL_BASES CODING_BASES    UTR_BASES   INTRONIC_BASES  INTERGENIC_BASES    CORRECT_STRAND_READS    INCORRECT_STRAND_READS  PCT_RIBOSOMAL_BASES PCT_CODING_BASES    PCT_UTR_BASES   PCT_INTRONIC_BASES  PCT_INTERGENIC_BASES    PCT_MRNA_BASES  PCT_CO
RRECT_STRAND_READS
974654759   0   19072146    8005286 407874405   539702922   0   0   0   0.019568    0.008213    0.418481    0.553738    0.027782    0

score 0 · Answer 4 · 2011-12-24

Hi I tried classify_reads.py and got an error. please help

python classify_reads.py --bam my.bam --gff refFlat.gtf --include intron exon repeat_region > report.txt

Traceback (most recent call last): File "classify_reads.py", line 191, in [?] main() File "classify_reads.py", line 182, in main chroms=args.assembly) File "classify_reads.py", line 107, in classify for iv, step_set in gaos[read.iv].steps(): File "_HTSeq.pyx", line 523, in HTSeq._HTSeq.GenomicArray.getitem (src/_HTSeq.c:8927) KeyError: None

Joseph

score 0 · Answer 5 · 2021-06-06

0

Entering edit mode

3.5 years ago

Jeremy Leipzig 22k

http://rseqc.sourceforge.net/#read-distribution-py enter image description here

ADD COMMENT • link 3.5 years ago by Jeremy Leipzig 22k