Hello,
I have a GTF file with only exon features. Is there a way to extract the gene coordinates, or should I write a script?
- INPUT: GTF file.
- OUTPUT: the gene coordinates, whatever the format is.
Thanks.
The question is exactly what you want in terms of coordinates for a gene. I'm guessing that you just want the 5'-most and 3'-most positions along with the strand and chromosome, but perhaps you have something else in mind.
Presuming you do want what I mentioned, you could easily do this in R with GenomicFeatures.
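library(GenomicFeatures)
# Build a transcript database from the GTF file.
# (Newer GenomicFeatures releases name this function makeTxDbFromGFF.)
txdb <- makeTranscriptDbFromGFF("some_file.gtf", format="gtf")
# One range per gene, spanning all of its exons.
genes <- genes(txdb)
# write.table() takes col.names, not colnames.
write.table(as.data.frame(genes)[,-4], file="Just_genes.txt", col.names=FALSE, sep="\t")
The -4 just removes the width column.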
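If you'd rather write a small script, here is a minimal Python sketch of the same idea: for each gene, take the smallest exon start and the largest exon end. It assumes a tab-separated GTF whose ninth column contains `gene_id "..."`. The function name `gene_spans` and the toy records are made up for illustration; they are not from the original file.

```python
import re

def gene_spans(gtf_lines):
    """Collapse exon records into one (start, end) span per (chrom, gene_id, strand)."""
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features contribute to the span
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match is None:
            continue  # no gene_id attribute on this record
        # Key on chromosome and strand too, so a gene_id reused elsewhere is not merged.
        key = (fields[0], match.group(1), fields[6])
        start, end = int(fields[3]), int(fields[4])
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, start), max(old_end, end))
        else:
            spans[key] = (start, end)
    return spans

# Made-up toy records (assumption: real input would come from open("some_file.gtf")).
example = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\tsrc\texon\t300\t450\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr2\tsrc\texon\t50\t80\t.\t-\t.\tgene_id "G2"; transcript_id "T2";\n',
]
for (chrom, gene, strand), (start, end) in gene_spans(example).items():
    print(chrom, start, end, gene, strand, sep="\t")
```

The coordinates stay 1-based and inclusive, as in the GTF. Subtract 1 from the start if you need BED-style intervals.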
Hi @Devon Ryan, I have a GFF file in this format:
chr11 Gnomon gene 24482947 24484914 . - . ID=gene26171;Name=LOC103966979;Name=gene26171
chr11 Gnomon mRNA 24482947 24484914 . - . Parent=gene26171;ID=rna33198
chr11 Gnomon five_prime_UTR 24484810 24484914 . - . ID=five_prime_UTR:rna33198:1;Parent=rna33198
chr11 Gnomon start_codon 24484807 24484809 . - 0 ID=start_codon:rna33198:1;Parent=rna33198
chr11 Gnomon exon 24484587 24484914 . - . ID=exon:rna33198:1;Parent=rna33198
chr11 Gnomon exon 24484138 24484445 . - . ID=exon:rna33198:2;Parent=rna33198
chr11 Gnomon exon 24482947 24483988 . - . ID=exon:rna33198:3;Parent=rna33198
chr11 Gnomon CDS 24484587 24484809 . - 0 Parent=rna33198;ID=CDS:rna33198:1
chr11 Gnomon CDS 24484138 24484445 . - 2 Parent=rna33198;ID=CDS:rna33198:2
chr11 Gnomon CDS 24483413 24483988 . - 0 Parent=rna33198;ID=CDS:rna33198:3
chr11 Gnomon gene 21571688 21575140 . - . Name=LOC103939934;ID=gene39438;Name=gene39438
chr11 Gnomon mRNA 21571688 21575140 . - . ID=rna49862;Parent=gene39438
chr11 Gnomon five_prime_UTR 21575032 21575140 . - . ID=five_prime_UTR:rna49862:1;Parent=rna49862
chr11 Gnomon five_prime_UTR 21574449 21574449 . - . Parent=rna49862;ID=five_prime_UTR:rna49862:2
chr11 Gnomon exon 21575032 21575140 . - . ID=exon:rna49862:1;Parent=rna49862
chr11 Gnomon exon 21574389 21574449 . - . Parent=rna49862;ID=exon:rna49862:2
chr11 Gnomon exon 21572908 21572989 . - . ID=exon:rna49862:3;Parent=rna49862
chr11 Gnomon exon 21572290 21572417 . - . ID=exon:rna49862:4;Parent=rna49862
chr11 Gnomon exon 21571688 21572198 . - . ID=exon:rna49862:5;Parent=rna49862
chr11 Gnomon start_codon 21574446 21574448 . - 0 Parent=rna49862;ID=start_codon:rna49862:1
chr11 Gnomon CDS 21574389 21574448 . - 0 ID=CDS:rna49862:1;Parent=rna49862
chr11 Gnomon CDS 21572908 21572989 . - 0 Parent=rna49862;ID=CDS:rna49862:2
chr11 Gnomon CDS 21572290 21572417 . - 2 Parent=rna49862;ID=CDS:rna49862:3
chr11 Gnomon CDS 21571866 21572198 . - 0 Parent=rna49862;ID=CDS:rna49862:4
and I have these gene IDs:
LOC103966979
LOC103939934
and I want to extract their transcript info in this format:
chr11 Gnomon mRNA 24482947 24484914 . - . ID=LOC103966979
chr11 Gnomon five_prime_UTR 24484810 24484914 . - . ID=five_prime_UTR:rna33198:1;Parent=LOC103966979
chr11 Gnomon start_codon 24484807 24484809 . - 0 ID=start_codon:rna33198:1;Parent=LOC103966979
chr11 Gnomon CDS 24484587 24484809 . - 0 ID=CDS:rna33198:1;Parent=LOC103966979
chr11 Gnomon CDS 24484138 24484445 . - 2 ID=CDS:rna33198:2;Parent=LOC103966979
chr11 Gnomon CDS 24483413 24483988 . - 0 ID=CDS:rna33198:3;Parent=LOC103966979
chr11 Gnomon mRNA 21571688 21575140 . - . ID=LOC103939934
chr11 Gnomon five_prime_UTR 21575032 21575140 . - . ID=five_prime_UTR:rna49862:1;Parent=LOC103939934
chr11 Gnomon five_prime_UTR 21574449 21574449 . - . ID=five_prime_UTR:rna49862:2;Parent=LOC103939934
chr11 Gnomon start_codon 21574446 21574448 . - 0 ID=start_codon:rna49862:1;Parent=LOC103939934
chr11 Gnomon CDS 21574389 21574448 . - 0 ID=CDS:rna49862:1;Parent=LOC103939934
chr11 Gnomon CDS 21572908 21572989 . - 0 ID=CDS:rna49862:2;Parent=LOC103939934
chr11 Gnomon CDS 21572290 21572417 . - 2 ID=CDS:rna49862:3;Parent=LOC103939934
chr11 Gnomon CDS 21571866 21572198 . - 0 ID=CDS:rna49862:4;Parent=LOC103939934
Thanks for the advice.
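One possible way to get that output is a short Python script; this is only a sketch under a few assumptions. It assumes each gene line appears before its mRNA and each mRNA before its child features (as in the sample above), that columns are separated by whitespace, and that exon lines should be dropped because the desired output has none. The helper names `parse_attrs`, `first`, and `extract_by_gene_name` are hypothetical.

```python
def parse_attrs(text):
    """Return the attribute column as (key, value) pairs, keeping duplicate keys."""
    pairs = []
    for item in text.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            pairs.append((key.strip(), value.strip()))
    return pairs

def first(pairs, key):
    """Return the first value stored under key, or None if the key is absent."""
    for k, v in pairs:
        if k == key:
            return v
    return None

def extract_by_gene_name(gff_lines, wanted_names):
    wanted = set(wanted_names)
    gene_to_name = {}  # gene ID (e.g. gene26171) -> wanted Name (e.g. LOC103966979)
    rna_to_name = {}   # mRNA ID (e.g. rna33198) -> wanted Name of its gene
    out = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split(None, 8)  # assumption: no empty columns, so whitespace split is safe
        if len(cols) < 9:
            continue
        attrs = parse_attrs(cols[8])
        feature = cols[2]
        feat_id = first(attrs, "ID")
        parent = first(attrs, "Parent")
        if feature == "gene":
            # A gene line may carry several Name= entries; keep the one that was requested.
            hits = [v for k, v in attrs if k == "Name" and v in wanted]
            if hits:
                gene_to_name[feat_id] = hits[0]
        elif feature == "mRNA":
            if parent in gene_to_name:
                name = gene_to_name[parent]
                rna_to_name[feat_id] = name
                out.append("\t".join(cols[:8] + ["ID=" + name]))
        elif feature != "exon" and parent in rna_to_name:
            new_attrs = "ID=%s;Parent=%s" % (feat_id, rna_to_name[parent])
            out.append("\t".join(cols[:8] + [new_attrs]))
    return out

# A few records copied from the sample above.
sample = """chr11 Gnomon gene 24482947 24484914 . - . ID=gene26171;Name=LOC103966979;Name=gene26171
chr11 Gnomon mRNA 24482947 24484914 . - . Parent=gene26171;ID=rna33198
chr11 Gnomon exon 24484587 24484914 . - . ID=exon:rna33198:1;Parent=rna33198
chr11 Gnomon CDS 24484587 24484809 . - 0 Parent=rna33198;ID=CDS:rna33198:1
""".splitlines()

result = extract_by_gene_name(sample, ["LOC103966979"])
print("\n".join(result))
```

For a real file, pass an open file handle instead of `sample` and the full list of gene names. The output is tab-separated.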
Using gtf2bed:
$ gtf2bed < foo.gtf | cut -f1-3 > foo_coords.bed3
If you want strand information:
$ gtf2bed < foo.gtf | cut -f1-6 > foo_coords.bed6
Using awk and sqlite:
curl -sL "https://rseqflow.googlecode.com/files/mouse_refseq_anno.gtf" |\
awk -F '\t' 'BEGIN {printf("create temp table T(chrom,start,end,gene); begin transaction;\n");}
  $3=="exon" {n=split($9,a,/[ ;]+/); for(i=1;i+1< n;i++) if(a[i]=="gene_id") printf("insert into T(chrom,start,end,gene) values (\"%s\",%s,%s,%s);\n",$1,$4,$5,a[i+1]);}
  END {printf("commit; select chrom,gene,min(start),max(end) from T group by chrom,gene;\n");}' |\
sqlite3 tmp.db
(...)
chrY|Rbm31y|12688110|17402718
chrY|Rbmy1a1|2830680|3783271
chrY|Sly|55213720|75222053
Please make it clearer by showing your input file and desired output.
Done, I cannot be more clear.