I am trying to use the new Tuxedo pipeline for my RNA-seq data.
I have downloaded the Oryza sativa indica GTF file from Ensembl and have pasted a few lines below:
#!genome-build ASM465v1
#!genome-version ASM465v1
#!genome-date 2005-01
#!genome-build-accession GCA_000004655.2
#!genebuild-last-updated 2010-07
1 agi gene 13717 13879 . + . gene_id "EPlOING00000043550"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA";
1 agi transcript 13717 13879 . + . gene_id "EPlOING00000043550"; transcript_id "EPlOINT00000043550"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA"; transcript_name "SNORA23"; transcript_source "agi"; transcript_biotype "snoRNA";
1 agi exon 13717 13879 . + . gene_id "EPlOING00000043550"; transcript_id "EPlOINT00000043550"; exon_number "1"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA"; transcript_name "SNORA23"; transcript_source "agi"; transcript_biotype "snoRNA"; exon_id "EPlOINE00000043550";
1 bgi gene 18113 20165 . + . gene_id "BGIOSGA002568"; gene_source "bgi"; gene_biotype "protein_coding";
1 bgi transcript 18113 20165 . + . gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1 bgi exon 18113 19150 . + . gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; exon_id "BGIOSGA002568-TA.1";
1 bgi CDS 18113 19150 . + 0 gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; protein_id "BGIOSGA002568-PA"; protein_version "1";
1 bgi start_codon 18113 18115 . + 0 gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1 bgi exon 19344 20165 . + . gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; exon_id "BGIOSGA002568-TA.2";
1 bgi CDS 19344 20162 . + 0 gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; protein_id "BGIOSGA002568-PA"; protein_version "1";
1 bgi stop_codon 20163 20165 . + 0 gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1 agi gene 21086 21198 . - . gene_id "EPlOING00000001909"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA";
1 agi transcript 21086 21198 . - . gene_id "EPlOING00000001909"; transcript_id "EPlOINT00000001909"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA"; transcript_name "MIR408"; transcript_source "agi"; transcript_biotype "miRNA";
1 agi exon 21086 21198 . - . gene_id "EPlOING00000001909"; transcript_id "EPlOINT00000001909"; exon_number "1"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA"; transcript_name "MIR408"; transcript_source "agi"; transcript_biotype "miRNA"; exon_id "EPlOINE0000000
The number of coding genes (40,745) matches the output of the following command:
awk -F "\t" '$3=="gene"{print }' Oryza_indica.ASM465v1.38.gtf | grep bgi | wc -l
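A per-source tally can make the same check without grepping for one source at a time. The following is only a minimal sketch: it assumes the GTF is tab-separated with the source in column 2 and the feature type in column 3, and the function name count_genes_by_source is made up for illustration.

```shell
# Count "gene" records per source (column 2) in a tab-separated GTF file.
# Header lines starting with '#' are skipped.
# count_genes_by_source is a hypothetical helper name, not part of any tool.
count_genes_by_source() {
    awk -F '\t' '!/^#/ && $3 == "gene" { n[$2]++ }
                 END { for (s in n) print s "\t" n[s] }' "$1" | sort
}

# Usage (file name taken from the post above):
# count_genes_by_source Oryza_indica.ASM465v1.38.gtf
```

Unlike piping through grep, this compares column 2 exactly, so a line is not counted just because the string bgi appears somewhere in its attributes.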
I want to know what bgi and agi mean in the 2nd column. Should I keep only the bgi entries? I know that they represent different sources, i.e. bgi is the Beijing Genomics Institute. However, keeping both may be an issue.
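Whether keeping both sources is a problem depends largely on whether agi and bgi genes occupy the same coordinates. A rough way to check, sketched here under the assumption of a tab-separated GTF and ignoring strand, is to sort the gene records by position and flag any gene that starts before an earlier gene from a different source has ended. The function name find_cross_source_overlaps is hypothetical.

```shell
# Report "gene" records that overlap an earlier gene from a different source.
# Assumes a tab-separated GTF; strand is ignored. A sketch, not a substitute
# for a dedicated interval tool.
find_cross_source_overlaps() {
    awk -F '\t' -v OFS='\t' '!/^#/ && $3 == "gene" { print $1, $4, $5, $2 }' "$1" |
    sort -t "$(printf '\t')" -k1,1 -k2,2n |
    awk -F '\t' -v OFS='\t' '
        # Reset the per-source bookkeeping when the sequence name changes.
        ($1 "") != (chrom "") { chrom = $1; split("", maxend) }
        {
            # Genes arrive sorted by start, so an earlier gene overlaps this
            # one exactly when its end is at or beyond this start.
            for (s in maxend)
                if (s != $4 && maxend[s] + 0 >= $2 + 0)
                    print $1, $2, $3, $4, "overlaps_" s
            if (!($4 in maxend) || $3 + 0 > maxend[$4] + 0) maxend[$4] = $3 + 0
        }'
}

# Usage (file name taken from the post above):
# find_cross_source_overlaps Oryza_indica.ASM465v1.38.gtf
```

Empty output means no gene from one source overlaps a gene from the other, in which case the two sets describe distinct loci.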
In the snippet posted above, the entries do not appear to overlap/match (is that the case for the entire file, and are all entries non-coding entities?). If the entities are unique then you may need to keep them. Does that suddenly double the gene number?
Yes, entries do not overlap
command (extracting biotype information for