I would like to use bedtools merge to collapse all the features sharing the same gene_id in my BED file (which contains the annotation of various genes; the name column, the 4th one, also corresponds to the gene_id). Due to splicing, elements may be quite distant.
chr14 49894259 49895806 ENSMUST00000053290 0.000000 - mm10_ensGene exon . gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
chr14 49894873 49894876 ENSMUST00000053290 0.000000 - mm10_ensGene stop_codon . gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
chr14 49894876 49895800 ENSMUST00000053290 0.000000 - mm10_ensGene CDS 0 gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
chr14 49895797 49895800 ENSMUST00000053290 0.000000 - mm10_ensGene start_codon . gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
chr14 49901908 49901941 ENSMUST00000053290 0.000000 - mm10_ensGene exon . gene_id "ENSMUST00000053290"; transcript_id "ENSMUST00000053290";
What are you really trying to achieve in the end? For example, do you need coding gene lengths because you want to calculate transcript-size-corrected gene expression values from RNA sequencing (aka [FRC]PKM)? In that case, have a look at this post.
Nice link. I wanted to collapse my annotation into a more convenient form in order to intersect a de novo annotation with a reference annotation that has gene_ids (see How To Get Annotation For Bed File From Another Bed File). I do this to assign the official Ensembl IDs to my new annotation. I don't really need the whole transcript. However, some parts of my annotation have rather insufficient coverage; your script will help me a lot to identify them.
Are you starting out with a GTF or a bed file? I think I have done something similar starting out with a GTF and using bedtools merge.
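If it helps, here is a minimal Python sketch of the collapsing step on its own, independent of bedtools. It assumes the layout shown in the question (chrom, start, end, name, score, strand as the first six whitespace-separated columns, with the name column holding the gene_id) and reports one interval per chrom/name/strand, spanning from the smallest start to the largest end. The function name `collapse_by_name` is made up for illustration, and the score of the collapsed record is simply set to ".".

```python
def collapse_by_name(lines):
    """Collapse BED features sharing (chrom, name, strand) into one span.

    Assumes at least six whitespace-separated columns per line:
    chrom, start, end, name, score, strand. Extra columns are ignored.
    """
    spans = {}  # (chrom, name, strand) -> (min_start, max_end)
    for line in lines:
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name, _score, strand = line.split(None, 6)[:6]
        start, end = int(start), int(end)
        key = (chrom, name, strand)
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, start), max(old_end, end))
        else:
            spans[key] = (start, end)
    # One record per key, in order of first appearance.
    return [
        (chrom, s, e, name, ".", strand)
        for (chrom, name, strand), (s, e) in spans.items()
    ]


if __name__ == "__main__":
    example = [
        'chr14 49894259 49895806 ENSMUST00000053290 0.000000 - mm10_ensGene exon .',
        'chr14 49894876 49895800 ENSMUST00000053290 0.000000 - mm10_ensGene CDS 0',
        'chr14 49901908 49901941 ENSMUST00000053290 0.000000 - mm10_ensGene exon .',
    ]
    for record in collapse_by_name(example):
        print("\t".join(str(field) for field in record))
```

Note that this gives the full genomic span of each ID, introns included, so it is only appropriate if you want one interval per gene rather than the union of its exons.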