Dear community,
I apologise for this question regarding probably the most annoying thing in our workflows. Annotation.
Yes, I have searched and googled but I have not found a satisfying answer without re-inventing the wheel again and again and again... it has been years since the last post, therefore I would like your input on how you do this.
Question: What is a quick, easy and comprehensive way to annotate of a large set of Genomic Intervals
Input
R GRanges object or Bed file
chr1 1234567 1234890 +
chr2 2345678 2345899 -
... ... ... ...
Output
For each interval:
- Gene ID / Transcript ID: In which gene / transcript does overlap: not nearby feature!
- Gene Type: ncRNA, protein coding, ...
- Gene Region: intron, exon, intergenic, 5' UTR, 3' UTR, CDS, ..
- Decision logic: ncRNAs > microRNAs > protein coding, exon > intron, ... etc.
Other info would be also very nice.
Here some reasons why I am not very happy about certain ways:
1. ChIP seq tools
Chip Seq annotation packages, they all annotate the nearby gene, but the annotation can be from a gene which is > 10 kb away, and has nothing to do with the genomic region they annotate. I think this is sometimes missleading. This packages include
- R package ChIPpeakAnno: does not use strand information for annotation, does wrong annotation, should not be used anymore
- R package ChIPseeker: fixes ChIPpeakAnno problems, but does annotate only nearest features, no annotation of genomic region. no apparent decision logic, takes the first hit of TxDb if multiple features are hit, which is fine if the TxDb is tuned (see TxDb).
- HOMER: does annotate nearest features, do not know of any decision logic
- PeakAnalyzer: does annotate nearest features, do not know of any decision logic
2. R TxDb
Absolutely a valid option, but the standard ones do not contain gene type and decision logic. Basically, every user has to build their TxDb (so basically that the first hit they receive is already in the right decision order so the processing can be fast) and I personally think creating TxDbs is really not well documented if documented at all. If you have a good documentation, please let me know!
3. Bed tools / Bedops
Same as TxDb, absolutely valid option. Fast, but you have to download and prepare everything yourself. Re-inventing the wheel.
Here is a nice way to do Bedops Annotating Genomic Intervals for one annotation type. Multiple annotation types and logic has to be implemented. Not complicated but probably a time-saver would be nice. If you have good scripts or efficient workflow, please share it.
4. DAS services
Haven't used them, probably slow and not ideal for annotation of ten thousands of intervals.
I would appreciate your comments and help on this. Probably we can collect some time-savers here. Thank you.
A while back, we published the tool we used at the lab where I was for the task you described: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026715
We implemented a hierarchical reporting where exonic variants would trump intronic ones which would themselves trump downstream ones. You can have a look if you want :-)
Hi Gabriel, I did not know this. Looks like a great tools, I will try it and will add feedback to the list. Thanks.
First, could you precise what you call decision logic?
I am having a similar issue. I have generated a de novo annotation from RNA Seq data, which I want to compare to a reference annotation (working on mice here).
Both annotations are quite similar but the UTRs are sometimes more precise in the de novo annotation. It also has no Ensembl IDs, but some feature annotation (5', 3', CDS, intron, exon) and a shared id across the elements of a given transcript. What I can do is grab the reference annotation features and find their intersection with the de novo annotation; this allows me to assign Ensembl IDs.
I have huge issues with overlapping genes however using Bedtools/Bedops. I keep only Ensembl transcripts with a specific GO terms, which allows me to simplify a lot my reference annotation. If I presume that the CDS stay roughly the same, I can then match both annotations and keep the updated UTRs. There is no overlap between the transcripts of different genes in this case, which helps.
Of course, because I am a masochist I want some more information on my genes. What I do is use Biomart to query all possible information on the genes matching my GO terms, as well as the
Mus_musculus.gene_info
file. I get this way the type of each gene (coding or pseudogene). I also use the alias field of thegene_info
file which is often useful for finding synonyms.I agree that TxDb lacks proper documentation; I almost started building my own transcript database... Bedtools was the solution for me, but for you the fact that some intervals may overlap many different features is an issue.
I roughly have several databases/dataframes: old and new annotations of each transcripts plus transcript ID indexed by the gene Ensembl ID, lots of information for each gene Ensembl ID, gene name synonyms matching a gene Ensembl ID, features anem and positions associated to a given transcript ID.
You would need similar databases, and to sort the transcripts so they are accessible. Some kind of self balancing binary tree structure, but I have not looked into this.
Decision logic for me is when you have overlapping features, which can either be from the same gene - e.g. multiple transcripts (coding region, utr, intron), what is done. At the moment this issue is rather ignored because it is very annoying.
I solved it with an R package and TxDb, annotating everything first and then applying this logic. However, this slows it down significantly. Someone pointed out that when you build it using GTF you apply your logic to your GTF and then build your TxDb. This is a pain in the neck, that is the reason why I am looking for easy solutions.
I'm not sure what you mean here by multiple annotation types, can you clarify? BEDOPS operations work based on overlaps and are agnostic about the annotations going in, so long as they are 1) BED-formatted and 2) sorted.
BEDOPS tools are easy to join with pipes, so maybe using Unix I/O piping would help you out. Would like to help but I don't know what you mean.
Thanks for the comment. Let's assume we want to annotate in following order: coding region (cds) and other exonic regions, 5' UTRs, 3' UTRs, introns.
Now we have the situation that one interval spans over two features (second feature has different splice forms)
Interval1:
According to the logic above we want to have following output1 (exon beats intron, ncRNA beats protein coding)
How would I simply do this in Bedops, please?
I can imagine getting this information with Bedops
by first annotating everything: (would this be the output? the order would indicate which gene it belongs to)
and then applying the logic ourselves with some simple scripts.
Or is there a better simpler way to achieve output1? Thanks!
bedmap doesn't know about the particulars of non-BED annotation formats, so it has no way to apply logic.
But it is easy to pipe results to downstream scripts written in awk, Perl, Python, etc. that read from a Unix standard input stream and implement your custom logic:
This design choice is deliberate. BEDOPS subscribes to the core philosophies put forth by original Unix engineer Doug McIlroy:
Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
Text goes into a BEDOPS tool and text comes out. Anything that isn't within BEDOPS functional purview should be easy to manage with another tool somewhere downstream. One tool that does everything is not The Unix Way.