Question

What are agi and bgi in the Ensembl GTF file?

2

7.3 years ago

lakhujanivijay 5.9k

I am trying to use the new Tuxedo pipeline for my RNA-seq data.

I have downloaded the Oryza sativa indica GTF file from Ensembl and have pasted a few lines below:

#!genome-build ASM465v1
#!genome-version ASM465v1
#!genome-date 2005-01
#!genome-build-accession GCA_000004655.2
#!genebuild-last-updated 2010-07
1       agi     gene    13717   13879   .       +       .       gene_id "EPlOING00000043550"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA";
1       agi     transcript      13717   13879   .       +       .       gene_id "EPlOING00000043550"; transcript_id "EPlOINT00000043550"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA"; transcript_name "SNORA23"; transcript_source "agi"; transcript_biotype "snoRNA";
1       agi     exon    13717   13879   .       +       .       gene_id "EPlOING00000043550"; transcript_id "EPlOINT00000043550"; exon_number "1"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA"; transcript_name "SNORA23"; transcript_source "agi"; transcript_biotype "snoRNA"; exon_id "EPlOINE00000043550";
1       bgi     gene    18113   20165   .       +       .       gene_id "BGIOSGA002568"; gene_source "bgi"; gene_biotype "protein_coding";
1       bgi     transcript      18113   20165   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       bgi     exon    18113   19150   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; exon_id "BGIOSGA002568-TA.1";
1       bgi     CDS     18113   19150   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; protein_id "BGIOSGA002568-PA"; protein_version "1";
1       bgi     start_codon     18113   18115   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       bgi     exon    19344   20165   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; exon_id "BGIOSGA002568-TA.2";
1       bgi     CDS     19344   20162   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; protein_id "BGIOSGA002568-PA"; protein_version "1";
1       bgi     stop_codon      20163   20165   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       agi     gene    21086   21198   .       -       .       gene_id "EPlOING00000001909"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA";
1       agi     transcript      21086   21198   .       -       .       gene_id "EPlOING00000001909"; transcript_id "EPlOINT00000001909"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA"; transcript_name "MIR408"; transcript_source "agi"; transcript_biotype "miRNA";
1       agi     exon    21086   21198   .       -       .       gene_id "EPlOING00000001909"; transcript_id "EPlOINT00000001909"; exon_number "1"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA"; transcript_name "MIR408"; transcript_source "agi"; transcript_biotype "miRNA"; exon_id "EPlOINE0000000

The number of coding genes (40,745) matches the output of the following command:

awk -F "\t" '$3=="gene"{print }' Oryza_indica.ASM465v1.38.gtf | grep bgi | wc -l

I want to know what bgi and agi mean in the 2nd column. Should I keep only the bgi entries? I know that they represent different sources, e.g. bgi is the Beijing Genomics Institute. However, keeping both may be an issue.
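As a hedged alternative to piping through grep, the per-source gene count can be taken from column 2 directly, so that a match elsewhere on the line (for example in gene_source or in a gene name) cannot affect the total. The sketch below is not part of the original post: the function name count_genes_by_source is made up for illustration, and it assumes a tab-delimited GTF with the source in column 2 and the feature type in column 3, as in the lines pasted above.

```shell
# Count "gene" features per source (column 2) in a GTF file.
# Usage: count_genes_by_source FILE
# count_genes_by_source is a hypothetical helper name, not an existing tool.
count_genes_by_source() {
    awk -F '\t' '
        /^#/ { next }                 # skip header lines such as #!genome-build
        $3 == "gene" { n[$2]++ }      # tally by the exact value of column 2
        END { for (s in n) print s "\t" n[s] }
    ' "$1" | sort
}
```

Calling `count_genes_by_source Oryza_indica.ASM465v1.38.gtf` would print one line per source with its gene count, which makes it easy to see at a glance whether sources other than agi and bgi are present.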

bgi agi gtf ensembl RNA-Seq • 2.3k views

updated 7.3 years ago by Emily 24k • written 7.3 years ago by lakhujanivijay 5.9k

0

In the snippet posted above, the entries do not appear to overlap or match. Is that the case for the entire file, and are all agi entries non-coding? If the entries are unique then you may need to keep them. Does that suddenly double the gene number?

7.3 years ago by GenoMax 152k

0

Yes, the entries do not overlap.

command (extracting biotype information for agi entries)

awk -F "\t" '$3=="gene"{print $9 }' Oryza_indica.ASM465v1.38.gtf | grep agi | awk -F ";" '{print $4}' | sort | uniq

output

 gene_biotype "antisense"
 gene_biotype "miRNA"
 gene_biotype "misc_RNA"
 gene_biotype "ncRNA"
 gene_biotype "P_RNA"
 gene_biotype "ribozyme"
 gene_biotype "RNase_MRP_RNA"
 gene_biotype "rRNA"
 gene_biotype "snoRNA"
 gene_biotype "snRNA"
 gene_biotype "SRP_RNA"
 gene_biotype "telomerase_RNA"
 gene_biotype "tmRNA"
 gene_biotype "tRNA"
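The pipeline above picks the biotype by position (the fourth ;-separated field). That works for the agi lines shown in the question, which carry gene_id, gene_name, gene_source and gene_biotype in that order, but in the bgi gene line pasted there gene_name is absent and gene_biotype is the third field. A minimal sketch that looks the attribute up by name instead, assuming a tab-delimited GTF with attributes in column 9; the function name biotypes_for_source is hypothetical.

```shell
# List the distinct gene_biotype values for genes from one source.
# Usage: biotypes_for_source FILE SOURCE
# biotypes_for_source is a hypothetical helper name, not an existing tool.
biotypes_for_source() {
    awk -F '\t' -v src="$2" '
        $3 == "gene" && $2 == src {
            # locate gene_biotype "..." wherever it sits in column 9
            if (match($9, /gene_biotype "[^"]*"/)) {
                # skip the 14-character prefix (key, space, opening quote)
                # and drop the closing quote
                print substr($9, RSTART + 14, RLENGTH - 15)
            }
        }
    ' "$1" | sort -u
}
```

For example, `biotypes_for_source Oryza_indica.ASM465v1.38.gtf agi` lists the bare biotype names, one per line, regardless of whether a given gene line has a gene_name attribute.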

7.3 years ago by lakhujanivijay 5.9k

score 4 · Accepted Answer · 2018-03-26

4

7.3 years ago

Emily 24k

The indica rice genome has two sources of annotation: BGI (Beijing Genomics Institute) for coding genes and AGI (Arizona Genomics Institute) for non-coding genes.

7.3 years ago by Emily 24k

0

Thanks Emily_Ensembl

That was helpful. Can you help me understand the stats here?

Non coding genes    48,978
Small non coding genes  43,562
Long non coding genes   240
Misc non coding genes   5,176

The output of the command below does not match the stats at this page for non-coding genes.

command

$ awk -F "\t" '$3=="gene"{print}' Oryza_indica.ASM465v1.38.gtf | grep agi | wc -l
47693

7.3 years ago by lakhujanivijay 5.9k

0

There are also other sources of ncRNA genes:

tRNAs are generated using tRNAscan
Rfam for many types of ncRNAs
some from ENA
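One way to see where the remaining non-coding genes come from is to tabulate gene features by source and biotype together. This is a rough sketch under the same assumptions as the commands in this thread (tab-delimited GTF, source in column 2, gene_biotype in column 9). The function name genes_by_source_and_biotype is hypothetical, and whether the resulting categories line up with those used on the statistics page is not something this thread confirms.

```shell
# Tabulate gene counts per (source, gene_biotype) pair.
# Usage: genes_by_source_and_biotype FILE
# genes_by_source_and_biotype is a hypothetical helper name.
genes_by_source_and_biotype() {
    awk -F '\t' '
        $3 == "gene" {
            bt = "unknown"            # fallback when no gene_biotype is present
            if (match($9, /gene_biotype "[^"]*"/)) {
                bt = substr($9, RSTART + 14, RLENGTH - 15)
            }
            n[$2 "\t" bt]++           # key is source, a tab, then biotype
        }
        END { for (k in n) print k "\t" n[k] }
    ' "$1" | sort
}
```

The output has three tab-separated columns (source, biotype, count), so any source other than agi that contributes ncRNA genes shows up as its own rows.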

7.3 years ago by Emily 24k

0

Thanks Emily, that was of immense help! Everything is clear now. :D

7.3 years ago by lakhujanivijay 5.9k