2.9 years ago
simplitia
Hi, I noticed that the GENCODE hg19 GTF files (https://www.gencodegenes.org/human/release_33lift37.html) have a strange annotation: an underscore followed by a number is appended to the end of each gene ID. Is there a way to safely remove all of these underscore suffixes? The file annotations look something like this:
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5_4"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2_4"; remap_status "full_contig"; remap_num_mappings 1; remap_target_status "overlap";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level 1; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000362751.1_1"; remap_num_mappings 1; remap_status "full_contig"; remap_target_status "overlap";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1_1"; level 2; transcript_support_level 1; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000362751.1_1"; remap_original_location "chr1:+:11869-12227"; remap_status "full_contig";
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1_1"; level 2; transcript_support_level 1; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000362751.1_1"; remap_original_location "chr1:+:12613-12721"; remap_status "full_contig";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000456328.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1_1"; level 2; transcript_support_level 1; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000362751.1_1"; remap_original_location "chr1:+:13221-14409"; remap_status "full_contig";
chr1 HAVANA transcript 12010 13670 . + . gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000450305.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37102"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000002844.2_1"; remap_num_mappings 1; remap_status "full_contig"; remap_target_status "overlap";
chr1 HAVANA exon 12010 12057 . + . gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000450305.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; exon_number 1; exon_id "ENSE00001948541.1_1"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37102"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000002844.2_1"; remap_original_location "chr1:+:12010-12057"; remap_status "full_contig";
chr1 HAVANA exon 12179 12227 . + . gene_id "ENSG00000223972.5_4"; transcript_id "ENST00000450305.2_1"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; exon_number 2; exon_id "ENSE00001671638.2_1"; level 2; transcript_support_level "NA"; hgnc_id "HGNC:37102"; ont "PGO:0000005"; ont "PGO:0000019"; tag "basic"; havana_gene "OTTHUMG00000000961.2_4"; havana_transcript "OTTHUMT00000002844.2_1"; remap_original_location "chr1:+:12179-12227"; remap_status "full_contig";
Thanks; it didn't really work right, so I ended up using a sed command instead. The important switch here was /g, which replaces every match on a line. What I'm still a bit worried about is that some obscure gene name containing an underscore followed by a number, such as GENENAME_2, will get mangled by this.
To remove only the "_", try this (ENSG00000223972.5_4 becomes ENSG00000223972.54 in the gene_id column):
head gencode.v37lift37.basic.annotation.gtf | awk -v FS="\"" -v OFS="\"" '{sub("_","",$2)}1'
To remove only the "_number", try this (ENSG00000223972.5_4 becomes ENSG00000223972.5 in the gene_id column):
head gencode.v37lift37.annotation.gtf | awk -F "gene_id" -v OFS="gene_id" '{sub("_[0-9]+","",$2)}1'
Thanks again for your help. Yes, what I want is to eliminate any number after an underscore, so
ENSG00000223972.5_4 becomes ENSG00000223972.5
sed does this correctly, but the new awk command you sent collapses it and gives ENSG00000223972.54
instead. It's actually a bit weird that no one else seems to be bothered by this, since the underscore breaks a lot of downstream programs; maybe it's because few people use hg19? What would be really useful is a way to replace only when the term starts with a space followed by ENS, OTT or ENST.
That way I think it would be a safer route in case some other important nomenclature uses this pattern.
Updated the code. Please try the second one. The problem is not the editing itself but restoring the original GTF format: with generic tools, that is hard to do without going through multiple replacements. Please post the expected output next time.
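One way to get the "only touch ENS/OTT identifiers" behaviour asked for above is to anchor the pattern on the quoted value itself. This is only a sketch, not the command either poster actually ran: the function name strip_lift_suffix is made up for illustration, and it assumes every ID of interest looks like "ENS..." or "OTT..." followed by letters, digits, a dot, a version and then _number.

```shell
# Hypothetical helper: strip the trailing _N only from quoted values that
# look like versioned Ensembl/Havana IDs (ENS... or OTT...).
strip_lift_suffix() {
  # \1 is the ID up to and including its version; the _N part is dropped.
  sed -E 's/"((ENS|OTT)[A-Z]+[0-9]+\.[0-9]+)_[0-9]+"/"\1"/g'
}

# Quick check on a shortened attribute string; GENENAME_2 is left alone.
printf '%s\n' 'gene_id "ENSG00000223972.5_4"; gene_name "GENENAME_2"; havana_gene "OTTHUMG00000000961.2_4";' | strip_lift_suffix
# prints: gene_id "ENSG00000223972.5"; gene_name "GENENAME_2"; havana_gene "OTTHUMG00000000961.2";
```

Because sed edits the line in place, the tab-separated GTF layout is preserved, so there is no need to rebuild fields afterwards. For a whole file, something like strip_lift_suffix < in.gtf > out.gtf should work (in.gtf and out.gtf are placeholder names).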
Thanks, that is great. Do you know if it's possible to use OR statements to include the havana_gene and hgnc_id attributes as well? That way I don't have to run it multiple times, once for each annotation.
Can you post a single-line input and output example? I haven't seen versioning for hgnc_id. Try the following and post if there are any issues:
It works, thanks. I really appreciate this.
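The command that was posted at this point did not survive in the thread. For anyone landing here, the OR idea can be sketched with an alternation over attribute keys in a single sed pass. This is an illustration under assumptions, not the original answer: the function name strip_by_key and the list of keys are my own choices, and hgnc_id is deliberately left out since, as noted above, it does not carry a version suffix.

```shell
# Hypothetical helper: strip a trailing _N from the values of selected
# attribute keys only, all in one pass.
strip_by_key() {
  # (key1|key2|...) is the OR; [^"]+ keeps the match inside one quoted value.
  sed -E 's/((gene_id|transcript_id|exon_id|havana_gene|havana_transcript) "[^"]+)_[0-9]+"/\1"/g'
}

# Quick check: listed keys are cleaned, gene_name and hgnc_id are untouched.
printf '%s\n' 'gene_id "ENSG00000223972.5_4"; gene_name "GENENAME_2"; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2_4";' | strip_by_key
# prints: gene_id "ENSG00000223972.5"; gene_name "GENENAME_2"; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
```

Keying on the attribute name rather than on the ID prefix means a gene_name such as GENENAME_2 is never touched, at the cost of having to list every attribute you want cleaned.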