2.5 years ago
Biologist
I have a GTF file, sample.gtf, that looks like this:
GL000009.2 ENSEMBL exon 56140 58376 . - . transcript_id "transc_00000026"; gene_id "ENSG00000278704.1"; gene_name "ENSG00000278704.1"; exon_number "1"; inf "known"; true_gene_id "XLOC_000032";
GL000009.2 ENSEMBL transcript 56140 58376 . - . transcript_id "transc_00000026"; gene_id "ENSG00000278704.1"; gene_name "ENSG00000278704.1"; oId "ENST00000618686.1"; tss_id "TSS35"; inf "known"; true_gene_id "XLOC_000032";
GL000009.2 Cufflinks exon 59669 59932 . + . transcript_id "transc_00000028"; gene_id "XLOC_000023"; gene_name "XLOC_000023"; exon_number "1"; inf "unknown"; true_gene_id "XLOC_000023";
GL000009.2 Cufflinks transcript 59669 61563 . + . transcript_id "transc_00000028"; gene_id "XLOC_000023"; gene_name "XLOC_000023"; oId "TCONS_00000027"; class_code "u"; tss_id "TSS25"; inf "unknown"; true_gene_id "XLOC_000023";
I converted the GTF to BED using gtf2bed:
gtf2bed < sample.gtf > sample.bed
The resulting BED file looks like this:
GL000009.2 56139 58376 XLOC_000032 . - ENSEMBL exon . transcript_id "transc_00000026"; gene_id "ENSG00000278704.1"; gene_name "ENSG00000278704.1"; exon_number "1"; inf "known"; true_gene_id "XLOC_000032";
GL000009.2 56139 58376 XLOC_000032 . - ENSEMBL transcript . transcript_id "transc_00000026"; gene_id "ENSG00000278704.1"; gene_name "ENSG00000278704.1"; oId "ENST00000618686.1"; tss_id "TSS35"; inf "known"; true_gene_id "XLOC_000032";
GL000009.2 59668 59932 XLOC_000023 . + Cufflinks exon . transcript_id "transc_00000028"; gene_id "XLOC_000023"; gene_name "XLOC_000023"; exon_number "1"; inf "unknown"; true_gene_id "XLOC_000023";
GL000009.2 59668 61563 XLOC_000023 . + Cufflinks transcript . transcript_id "transc_00000028"; gene_id "XLOC_000023"; gene_name "XLOC_000023"; oId "TCONS_00000027"; class_code "u"; tss_id "TSS25"; inf "unknown"; true_gene_id "XLOC_000023";
Why does the 4th column of the BED file not contain gene_id? It looks like gtf2bed is taking true_gene_id instead. I want the output to look like this:
GL000009.2 56139 58376 ENSG00000278704.1 . - ENSEMBL exon . transcript_id "transc_00000026"; gene_id "ENSG00000278704.1"; gene_name "ENSG00000278704.1"; exon_number "1"; inf "known"; true_gene_id "XLOC_000032";
GL000009.2 56139 58376 ENSG00000278704.1 . - ENSEMBL transcript . transcript_id "transc_00000026"; gene_id "ENSG00000278704.1"; gene_name "ENSG00000278704.1"; oId "ENST00000618686.1"; tss_id "TSS35"; inf "known"; true_gene_id "XLOC_000032";
GL000009.2 59668 59932 XLOC_000023 . + Cufflinks exon . transcript_id "transc_00000028"; gene_id "XLOC_000023"; gene_name "XLOC_000023"; exon_number "1"; inf "unknown"; true_gene_id "XLOC_000023";
GL000009.2 59668 61563 XLOC_000023 . + Cufflinks transcript . transcript_id "transc_00000028"; gene_id "XLOC_000023"; gene_name "XLOC_000023"; oId "TCONS_00000027"; class_code "u"; tss_id "TSS25"; inf "unknown"; true_gene_id "XLOC_000023";
How can I get the desired output?
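One way to sidestep the question of which attribute the converter picks is to do the conversion directly and match the attribute key exactly. The following is a minimal Python sketch, not a replacement for gtf2bed: it assumes the GTF is tab-delimited with nine columns, that every attribute value is double-quoted as in the sample above, and that the desired column order is the one shown in the BED output above. The function names parse_attributes and gtf_line_to_bed are made up for this example. A plausible but unverified explanation for the behaviour seen here is that the key is matched as a substring, so true_gene_id also counts as a hit for gene_id; the sketch avoids that by building a dictionary keyed on whole attribute names.

```python
import re

# One GTF attribute of the form: key "value";
# \S+ captures the whole key token, so "true_gene_id" is never
# confused with "gene_id".
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(attr_field):
    """Return the GTF attributes as a dict keyed by exact attribute name."""
    return dict(ATTR_RE.findall(attr_field))


def gtf_line_to_bed(line, id_key="gene_id"):
    """Convert one tab-delimited GTF record to a BED-like line.

    Column order follows the output shown in the question:
    chrom, start (0-based), end, id, score, strand, source, feature,
    frame, attributes.
    """
    (chrom, source, feature, start, end,
     score, strand, frame, attrs) = line.rstrip("\n").split("\t", 8)
    name = parse_attributes(attrs)[id_key]
    # GTF starts are 1-based, BED starts are 0-based; ends are unchanged.
    bed_start = str(int(start) - 1)
    return "\t".join([chrom, bed_start, end, name, score, strand,
                      source, feature, frame, attrs])


def convert(in_handle, out_handle, id_key="gene_id"):
    """Convert every record from in_handle, skipping comments and blanks."""
    for line in in_handle:
        if line.startswith("#") or not line.strip():
            continue
        out_handle.write(gtf_line_to_bed(line, id_key) + "\n")
```

To run it on a file, something like `with open("sample.gtf") as src, open("sample.bed", "w") as dst: convert(src, dst)` should work. The output is not sorted, so pipe it through a BED sorter if downstream tools expect sorted input.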
I got a bit frustrated with gtf2bed and started using the GTFtools Python package:
http://www.genemine.org/gtftools.php