Adding gene and CDS feature types to an Ensembl GTF annotation that has only transcript and exon features
7 months ago

Hi,

I would like to add gene and CDS features to my Ensembl GTF annotation file, which currently contains only transcript and exon features. I would appreciate your help. The current file looks like this:

chr01   ensembl transcript      16321180        16321695        .       -       .       gene_id "annotation_36"; transcript_id "annotation_45"; biotype "protein_coding"; translation_coords "16321671:16321695:1:16321180:16321403:224";
chr01   ensembl exon    16321671        16321695        .       -       .       gene_id "annotation_36"; transcript_id "annotation_45"; exon_number "1";
chr01   ensembl exon    16321583        16321634        .       -       .       gene_id "annotation_36"; transcript_id "annotation_45"; exon_number "2";
chr01   ensembl exon    16321440        16321531        .       -       .       gene_id "annotation_36"; transcript_id "annotation_45"; exon_number "3";
chr01   ensembl exon    16321180        16321403        .       -       .       gene_id "annotation_36"; transcript_id "annotation_45"; exon_number "4";
chr01   ensembl transcript      16321064        16322597        .       -       .       gene_id "annotation_36"; transcript_id "annotation_46"; biotype "protein_coding"; translation_coords "16321955:16322597:1:16321064:16321864:800";
chr01   ensembl exon    16321955        16322597        .       -       .       gene_id "annotation_36"; transcript_id "annotation_46"; exon_number "1";
chr01   ensembl exon    16321064        16321864        .       -       .       gene_id "annotation_36"; transcript_id "annotation_46"; exon_number "2";
chr01   ensembl transcript      16321064        16322597        .       -       .       gene_id "annotation_36"; transcript_id "annotation_47"; biotype "protein_coding"; translation_coords "16322232:16322597:1:16321064:16321684:621";
chr01   ensembl exon    16322232        16322597        .       -       .       gene_id "annotation_36"; transcript_id "annotation_47"; exon_number "1";
chr01   ensembl exon    16321955        16322197        .       -       .       gene_id "annotation_36"; transcript_id "annotation_47"; exon_number "2";
chr01   ensembl exon    16321718        16321864        .       -       .       gene_id "annotation_36"; transcript_id "annotation_47"; exon_number "3";
chr01   ensembl exon    16321064        16321684        .       -       .       gene_id "annotation_36"; transcript_id "annotation_47"; exon_number "4";

Expected output after adding the gene feature:

chr01   ensembl gene      16321180        16321684        .       -       .       gene_id "annotation_36";
chr01   ensembl transcript      16321180        16321695        .       -       .       gene_id "annotation_36"; transcript_id "annotation_45"; biotype "protein_coding"; translation_coords "16321671:16321695:1:16321180:16321403:224";
chr01   ensembl exon    16321671        16321695        .       -       .       gene_id "annotation_36"; transcript_id "annotation_45"; exon_number "1";
chr01   ensembl exon    16321583        16321634        .       -       .       gene_id "annotation_36"; transcript_id "annotation_45"; exon_number "2";
chr01   ensembl exon    16321440        16321531        .       -       .       gene_id "annotation_36"; transcript_id "annotation_45"; exon_number "3";
chr01   ensembl exon    16321180        16321403        .       -       .       gene_id "annotation_36"; transcript_id "annotation_45"; exon_number "4";
chr01   ensembl transcript      16321064        16322597        .       -       .       gene_id "annotation_36"; transcript_id "annotation_46"; biotype "protein_coding"; translation_coords "16321955:16322597:1:16321064:16321864:800";
chr01   ensembl exon    16321955        16322597        .       -       .       gene_id "annotation_36"; transcript_id "annotation_46"; exon_number "1";
chr01   ensembl exon    16321064        16321864        .       -       .       gene_id "annotation_36"; transcript_id "annotation_46"; exon_number "2";
chr01   ensembl transcript      16321064        16322597        .       -       .       gene_id "annotation_36"; transcript_id "annotation_47"; biotype "protein_coding"; translation_coords "16322232:16322597:1:16321064:16321684:621";
chr01   ensembl exon    16322232        16322597        .       -       .       gene_id "annotation_36"; transcript_id "annotation_47"; exon_number "1";
chr01   ensembl exon    16321955        16322197        .       -       .       gene_id "annotation_36"; transcript_id "annotation_47"; exon_number "2";
chr01   ensembl exon    16321718        16321864        .       -       .       gene_id "annotation_36"; transcript_id "annotation_47"; exon_number "3";
chr01   ensembl exon    16321064        16321684        .       -       .       gene_id "annotation_36"; transcript_id "annotation_47"; exon_number "4";
Tags: GTF, GFF, transcript, CDS, exon
7 months ago
Pratik ★ 1.1k

As a note, you can also do this kind of operation in R using GenomicFeatures and its associated packages. I'm not sure what your end objective is here.

Regardless, this should help you accomplish this in bash:

awk '$3 == "gene" || $3 == "transcript" || $3 == "CDS"' /home/pmehta/Downloads/Homo_sapiens.GRCh38.111.chr.gtf

Replace /home/pmehta/Downloads/Homo_sapiens.GRCh38.111.chr.gtf with the path to your .gtf file.

This extracts every line whose third column ($3) is exactly gene, transcript, or CDS.
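Note that a filter like this can only return rows that already exist in the file. If the file has transcript and exon rows but no gene rows, as in the question, the gene rows would have to be derived. Here is a minimal sketch of one way to do that, under these assumptions: the file is tab-delimited, every row carries a gene_id attribute, and a gene should span from the smallest transcript start to the largest transcript end for that gene_id. The function name add_gene_rows is made up for illustration.

```shell
# Sketch: derive one "gene" row per gene_id from the "transcript" rows.
# add_gene_rows is a hypothetical helper; it expects a tab-delimited GTF path.
add_gene_rows() {
  awk -F'\t' -v OFS='\t' '
    # Return the gene_id value found in the attribute column, or "" if absent.
    function gene_of(attr) {
      if (match(attr, /gene_id "[^"]+"/))
        return substr(attr, RSTART + 9, RLENGTH - 10)
      return ""
    }
    # Pass 1: record the span covered by the transcripts of each gene.
    NR == FNR {
      if ($3 == "transcript") {
        g = gene_of($9)
        if (g == "") next
        if (!(g in lo) || $4 + 0 < lo[g]) lo[g] = $4 + 0
        if (!(g in hi) || $5 + 0 > hi[g]) hi[g] = $5 + 0
      }
      next
    }
    # Pass 2: print a gene row just before the first row of each gene,
    # then print the original row unchanged.
    {
      g = gene_of($9)
      if (g in lo && !(g in seen)) {
        seen[g] = 1
        print $1, $2, "gene", lo[g], hi[g], ".", $7, ".", "gene_id \"" g "\";"
      }
      print
    }
  ' "$1" "$1"
}
```

Usage would be something like add_gene_rows annotation.gtf > annotation.with_genes.gtf (both file names are placeholders). The file is read twice because the gene row has to be printed before its transcripts, and the full span is only known after all of them have been seen. For the example in the question, this rule would give a gene row spanning 16321064 to 16322597 for annotation_36, so check that this is the definition of a gene span you want.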

I suspect you used grep with gene, which matches anywhere on the line and therefore also hits gene_id in column 9.
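To illustrate the difference, here is a small sketch using two made-up rows in the same layout as the question (tab-delimited, with no gene rows). The coordinates and IDs are invented for the demonstration.

```shell
# Two illustrative rows: one transcript and one exon, no gene rows.
sample=$(printf 'chr01\tensembl\ttranscript\t100\t200\t.\t-\t.\tgene_id "g1"; transcript_id "t1";\nchr01\tensembl\texon\t100\t200\t.\t-\t.\tgene_id "g1"; transcript_id "t1"; exon_number "1";\n')

# grep searches the whole line, so "gene" also matches the gene_id attribute.
grep_hits=$(printf '%s\n' "$sample" | grep -c 'gene')

# awk compares column 3 only, so it finds nothing when no gene rows exist.
awk_hits=$(printf '%s\n' "$sample" | awk -F'\t' '$3 == "gene" { n++ } END { print n + 0 }')

echo "grep: $grep_hits, awk: $awk_hits"
```

Here grep counts both rows while the column-3 test counts none, which is why an exact comparison on $3 is the safer way to select feature types.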
