I want to create a gene data set (as big as possible), so I am using several gene annotations. However, genes in different annotations overlap (the same gene appears in more than one annotation). To reduce bias, I intersect the annotations and, where genes overlap, keep only one of them.
Question:
To make sure such overlaps are detected, I was thinking of extending the gene coordinates. Is this necessary? If so, how large should the extension be (5 bp / 100 bp)?
Example:
I want to create an lncRNA data set (in later steps it will be used to search for genomic features).
Input:
- GENCODE lncRNA annotation (version 18 - 04/09/2013);
- Cabili lncRNA annotation (Cabili et al., 2011 (CSHLP)).
Workflow:
- Extract GENCODE gene start/end coordinates;
- Extract Cabili gene start/end coordinates;
- Extend Cabili coordinates (-/+ n bp);
- Use BedTools intersect;
- If genes intersect, keep the GENCODE gene (as it's the newer annotation, though this choice is really subjective).
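The workflow above can be sketched in plain Python to make the logic explicit. This is only a minimal illustration, not a replacement for BedTools: the function names, the toy coordinates, and the gene names are all made up, intervals are assumed to be BED-style (0-based, half-open), and strand is ignored, which may not be appropriate for lncRNAs.

```python
from collections import defaultdict

# Each interval is (chrom, start, end, name), BED-style: 0-based, half-open.
# All names and coordinates below are invented toy data.

def extend(interval, pad):
    """Return the interval widened by `pad` bp on both sides (start clamped at 0)."""
    chrom, start, end, name = interval
    return (chrom, max(0, start - pad), end + pad, name)

def overlaps(a, b):
    """True if two intervals on the same chromosome share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def merge_annotations(gencode, cabili, pad=0):
    """Keep every GENCODE gene; add a Cabili gene only if, after padding,
    it does not overlap any GENCODE gene."""
    by_chrom = defaultdict(list)
    for g in gencode:
        by_chrom[g[0]].append(g)

    merged = list(gencode)
    for c in cabili:
        padded = extend(c, pad)
        if not any(overlaps(padded, g) for g in by_chrom[c[0]]):
            merged.append(c)  # store the original, unpadded coordinates
    return sorted(merged, key=lambda iv: (iv[0], iv[1], iv[2]))

gencode = [("chr1", 1000, 2000, "GENCODE_A"), ("chr1", 5000, 6000, "GENCODE_B")]
cabili = [
    ("chr1", 1500, 2500, "CABILI_1"),  # overlaps GENCODE_A directly
    ("chr1", 6050, 7000, "CABILI_2"),  # starts 50 bp downstream of GENCODE_B
    ("chr2", 100, 900, "CABILI_3"),    # no GENCODE gene on this chromosome
]

no_pad = merge_annotations(gencode, cabili, pad=0)
pad_100 = merge_annotations(gencode, cabili, pad=100)

print([iv[3] for iv in no_pad])   # CABILI_2 is kept without padding
print([iv[3] for iv in pad_100])  # CABILI_2 is dropped with 100 bp padding
```

The padded coordinates are used only for the overlap test, so the genes that end up in the merged set keep their annotated boundaries. On real annotations the nested loop would be slow; BedTools (for example its `slop` and `window` subcommands) handles the same idea far more efficiently.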
I do realize that the answer depends on the situation and on how reliable the annotations are, but I still hope someone can suggest something.
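One way to see how sensitive the result is to the choice of extension is to count, for several candidate paddings, how many genes from one annotation would be treated as duplicates of the other. The sketch below does this on invented toy coordinates; the function names and data are hypothetical, intervals are assumed to be 0-based and half-open, and strand is ignored.

```python
# Intervals are (chrom, start, end), BED-style: 0-based, half-open.
# All coordinates below are invented toy data.

def nearest_distance(query, targets):
    """Signed distance in bp from `query` to the closest target on the same
    chromosome: negative if they overlap, 0 if they are directly adjacent,
    positive if there is a gap. Returns None if the chromosome has no targets."""
    chrom, start, end = query
    distances = [
        max(t_start - end, start - t_end)
        for t_chrom, t_start, t_end in targets
        if t_chrom == chrom
    ]
    return min(distances) if distances else None

def dropped_per_padding(queries, targets, paddings):
    """For each padding, count the queries that would overlap a target once
    extended by that many bp on both sides (i.e. distance < padding)."""
    distances = [nearest_distance(q, targets) for q in queries]
    return {
        pad: sum(1 for d in distances if d is not None and d < pad)
        for pad in paddings
    }

targets = [("chr1", 1000, 2000), ("chr1", 5000, 6000)]
queries = [
    ("chr1", 1500, 2500),  # overlaps the first target
    ("chr1", 6003, 7000),  # 3 bp gap after the second target
    ("chr1", 6050, 7000),  # 50 bp gap after the second target
    ("chr2", 100, 900),    # no target on this chromosome
]

counts = dropped_per_padding(queries, targets, paddings=[0, 5, 100])
print(counts)
```

If the count barely changes between 0 bp and 100 bp on the real annotations, the choice of extension matters little; a jump at some padding points to a set of near-adjacent gene pairs that are worth inspecting by hand.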
What do you plan on doing with this dataset?
I updated my question: "in the following steps it will be used to search for genomic features"
You should think about what exactly you want to do with these features. Is it for RNA-seq? For wet-lab work (primers, probes, ...)? For phylogenetic studies? Your strategy for merging the features might differ depending on the purpose. There probably isn't one single method of merging these annotations that is good for all purposes.
It should simply be an enrichment analysis for any feature (e.g., sequence motif, chromatin modification, repeat count).