Basic Bam File Annotation
5
21
Entering edit mode
13.4 years ago

What are the options for obtaining basic annotation information from BAM alignments of NGS data. The stuff everyone wants to know:

% intergenic
% genic
% intronic
% exonic
% within 5kbp 5' UTR
% within repeats
% within CpG islands

Please provide the basic command line syntax with your answer.

bedtools bam htseq annotation rseqc • 13k views
ADD COMMENT
1
Entering edit mode

I would add to the list: coding, 5' UTR, 3' UTR, synonymous & non-synonymous

ADD REPLY
0
Entering edit mode

i will also accept an expansion of the list of stuff everyone wants to know

ADD REPLY
0
Entering edit mode

Uh, is this from RNASeq data?

ADD REPLY
0
Entering edit mode

anything, DNA-Seq, RNA-Seq, ChIP-Seq, exome, whatever

ADD REPLY
0
Entering edit mode
ADD REPLY
17
Entering edit mode
13.4 years ago

I have been tinkering with a new tool for BEDTools (it's currently alpha, so it is in the development repo) that attempts to do this. I pushed it to the public repository, if you'd like to play with it. The tool is called "tagBam" and works as follows. You supply a BAM file, and a list of annotation files and associated labels. For each alignment, it will search for overlaps with each annotation file. Each time there is an overlap, the label you supply will be appended to a custom "YB" tag that will be added to the BAM alignment. For example:

$ tagBam -i aln.bam -files exons.bed introns.bed cpg.bed utrs.bed \
                    -tags exonic intonic cpg utr \
                    > aln.tagged.bam

For alignments that have overlaps, you should see new BAM tags like "YB:Z:exonic", "YB:Z:cpg;utr"

I should emphasize that this is experimental, but I am hoping to make it available in the next release. As always, comments and suggestions are welcome.

Another option is to write a custom script using existing interfaces such as pysam, the BioPerl BAM interface, Picard, samtools C-API, bamtools C++-API, etc. These solutions are nice because you will inevitably run up against a nuanced rule for the annotations that can't be addressed in a one-stop-shop. In particular, if you are a Python person, the pybedtools suite or the HTSeq suite are good options.

ADD COMMENT
0
Entering edit mode

Awesome! I have been having a lot of trouble not as much with computing these annotations but storing them in a coherent non-adhoc format that I can recall a few weeks later ;-) Putting it back right back into the BAM file is a phenomenal idea - I need to adopt this for other things as well.

ADD REPLY
0
Entering edit mode

Right. I plan to add an option such that one can define there own tag in addition to the default "YB" tag.

ADD REPLY
0
Entering edit mode

Wish I could edit my comment..."their", of course.

ADD REPLY
0
Entering edit mode

I agree, very cool. I guess you could just tabulate the YB tag as it streams to STDOUT to avoid an extra BAM around. One thing: what if some annos are stranded (genes) and some are not (cpg's)?

ADD REPLY
0
Entering edit mode

Nice catch Brent. Currently, there is a "-s" option to enforce that overlaps are on the same strand. However, this rule applies to all annotation files. I'll give some thought to how to best handle that case.

ADD REPLY
0
Entering edit mode

Added ability to override the default "YB" tag with the -tag option. -tags is now -labels

ADD REPLY
6
Entering edit mode
13.4 years ago
Ryan Dale 5.0k

As Aaron mentioned, HTseq can be really useful (and importantly, flexible) for this sort of thing. As an example, this classify_reads.py script will count up

  • total reads in each of the combinatorial overlapping classes (e.g., exon only, intron only, and intron+exon which you can get from overlapping genes or multiple isoforms)
  • the total reads falling in each class (regardless of what other classes overlap, so "intron_all" and "exon_all")
  • the total reads
  • the total unannotated reads
  • total sequence space in each class

You can use the --exclude or --include options to specify what features in the GFF/GTF to look for, e.g.,

classify_reads.py --bam $BAM --gff $GFF \
                  --include intron exon repeat_region > report.txt
ADD COMMENT
0
Entering edit mode

Nice. This is a solid example of how to effectively use HTseq.

ADD REPLY
4
Entering edit mode
13.4 years ago

Picard's RnaSeqMetrics:

wget <ftp://genome-ftp.cse.ucsc.edu/goldenPath/hg19/database/refFlat.txt.gz> .
gunzip refFlat.txt.gz
java -jar picard-tools-1.48/CollectRnaSeqMetrics.jar INPUT=myBamfile.bam REF_FLAT=refFlat.txt OUTPUT=myBamfile.bam.rnaSeqMetric STRAND=NONE VALIDATION_STRINGENCY=LENIENT

output:

## net.sf.picard.metrics.StringHeader
# net.sf.picard.analysis.CollectRnaSeqMetrics REF_FLAT=refFlat.txt STRAND_SPECIFICITY=NONE INPUT=myBamfile.bam OUTPUT=myBamfile.bam.rnaSeqMetric VALIDATION_STRINGENCY=LENIENT    ASSUME_SORTED=true STOP_AFTER=0 TMP_DIR=/tmp/leipzig VERBOSITY=IN
O QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## net.sf.picard.metrics.StringHeader
# Started on: Mon Jun 20 15:18:21 EDT 2011

## METRICS CLASS    net.sf.picard.analysis.RnaSeqMetrics
PF_ALIGNED_BASES    RIBOSOMAL_BASES CODING_BASES    UTR_BASES   INTRONIC_BASES  INTERGENIC_BASES    CORRECT_STRAND_READS    INCORRECT_STRAND_READS  PCT_RIBOSOMAL_BASES PCT_CODING_BASES    PCT_UTR_BASES   PCT_INTRONIC_BASES  PCT_INTERGENIC_BASES    PCT_MRNA_BASES  PCT_CO
RRECT_STRAND_READS
974654759   0   19072146    8005286 407874405   539702922   0   0   0   0.019568    0.008213    0.418481    0.553738    0.027782    0
ADD COMMENT
0
Entering edit mode

Didn't know about CollectRnaSeqMetrics -- which made me go check out the rest of Picard. Lots of great stuff in there.

ADD REPLY
0
Entering edit mode

oh man, thanks for that ftp link. anyone getting their reference from ensembl will want to add 'chr' to the chromosome column. not having it will result in java.lang.NullPointerException at net.sf.picard.annotation.RefFlatReader.makeTranscriptFromRefFlatLine(RefFlatReader.java:173).

ADD REPLY
0
Entering edit mode
12.9 years ago
Joseph D • 0

Hi I tried classify_reads.py and got an error. please help

python classify_reads.py --bam my.bam --gff refFlat.gtf --include intron exon repeat_region > report.txt

Traceback (most recent call last): File "classify_reads.py", line 191, in [?] main() File "classify_reads.py", line 182, in main chroms=args.assembly) File "classify_reads.py", line 107, in classify for iv, step_set in gaos[read.iv].steps(): File "_HTSeq.pyx", line 523, in HTSeq._HTSeq.GenomicArray.getitem (src/_HTSeq.c:8927) KeyError: None

Joseph

ADD COMMENT
0
Entering edit mode
ADD COMMENT

Login before adding your answer.

Traffic: 1904 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6